Multi-Perspective LLM Annotations for Valid Analyses in Subjective Tasks

cs.CL · Navya Mehrotra, Adam Visokay, Kristina Gligorić · Mar 22, 2026
Plain-language introduction

LLM annotations encode some human perspectives better than others, especially in subjective tasks where demographic background shapes judgments. This paper introduces Perspective-Driven Inference (PDI), a statistical framework that treats the distribution of group-specific annotations as a vector estimand $\theta^* = (\theta^*_{g_1}, \dots, \theta^*_{g_K})$ and adaptively allocates limited human labels to groups where LLM proxies are least reliable. The core contribution is an error-predictor-driven sampling rule that improves estimation accuracy for harder-to-model demographics while maintaining valid frequentist coverage.
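
To make the vector estimand concrete, here is a minimal sketch of a PPI-style rectified estimate computed separately for each group. It assumes the per-group quantity of interest is a mean rating and that human labels are collected uniformly at random within each group; the function name, the synthetic data, and the bias values are illustrative assumptions, not taken from the paper.

```python
import random
from statistics import mean

def rectified_group_means(llm, human, groups):
    """PPI-style point estimate of the mean rating per group.

    llm:    LLM ratings, one per (item, group) row
    human:  human ratings, or None where no human label was collected
    groups: group names, aligned with llm and human
    Each group needs at least one human label, otherwise mean() raises.
    """
    estimates = {}
    for g in sorted(set(groups)):
        rows = [i for i, name in enumerate(groups) if name == g]
        labeled = [i for i in rows if human[i] is not None]
        proxy_mean = mean(llm[i] for i in rows)                # cheap, possibly biased
        rectifier = mean(human[i] - llm[i] for i in labeled)   # estimated LLM bias
        estimates[g] = proxy_mean + rectifier
    return estimates

# Synthetic data (assumption): the LLM is unbiased for group "a"
# and overrates by 1.0 for group "b".
random.seed(0)
llm, human, groups = [], [], []
for g, true_mean, llm_bias in [("a", 3.0, 0.0), ("b", 2.0, 1.0)]:
    for i in range(2000):
        y = random.gauss(true_mean, 1.0)
        f = y + llm_bias + random.gauss(0.0, 0.5)
        llm.append(f)
        human.append(y if i < 200 else None)  # only 200 human labels per group
        groups.append(g)

estimates = rectified_group_means(llm, human, groups)
llm_only_b = mean(f for f, g in zip(llm, groups) if g == "b")
print(estimates, llm_only_b)
```

In this toy setup the LLM-only mean for group "b" sits near 3.0 while the rectified estimate moves back toward the true mean of 2.0, which is the kind of group-specific correction the review describes.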

Critical review
Verdict
Bottom line

The paper presents a well-motivated extension of Prediction-Powered Inference (PPI) to multi-perspective settings where ground truth is inherently plural. The method is statistically sound: it uses inverse probability weighting to correct for the bias that adaptive sampling introduces. Empirically, the paper shows that persona prompting, the dominant baseline for demographic simulation, systematically fails for the groups it purports to represent. The evaluation is limited to two subjective rating tasks from a single dataset, and the framework does not address intersectional identities (e.g., older women vs. older men), which limits its applicability to real-world demographic complexity.

“Existing methods for correcting LLM annotation error assume a single ground truth. However, this assumption fails in subjective tasks where disagreement across demographic groups is meaningful.”
paper · Abstract
“Persona prompting is the most common method for capturing demographic perspectives with LLMs, yet it fails precisely for the groups it is meant to represent, and in several cases leaves coverage at zero.”
paper · Section 4.4
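
The role of inverse probability weighting can be illustrated with a small simulation. The sketch below assumes a mean estimand and known per-row labeling probabilities; the function name, the "hard"/"easy" split, and all numeric values are hypothetical and are not the paper's estimator. It compares an IPW rectifier with a naive unweighted rectifier when harder rows are labeled more often.

```python
import random

def ipw_rectified_mean(llm, human, probs, sampled):
    """Mean estimate with an inverse-probability-weighted rectifier.

    sampled[i] is True when a human label was collected for row i,
    which happened with known probability probs[i] > 0.
    human[i] is only read for rows where sampled[i] is True.
    """
    n = len(llm)
    proxy_mean = sum(llm) / n
    correction = sum(
        (human[i] - llm[i]) / probs[i] for i in range(n) if sampled[i]
    ) / n
    return proxy_mean + correction

# Simulation (assumption): 30% of rows are "hard" for the LLM (bias +1.0)
# and are labeled with probability 0.5; the rest with probability 0.05.
random.seed(1)
n = 20000
llm, human, probs, sampled = [], [], [], []
for _ in range(n):
    hard = random.random() < 0.3
    y = random.gauss(2.0, 1.0)                      # true mean is 2.0
    f = y + (1.0 if hard else 0.0) + random.gauss(0.0, 0.5)
    p = 0.5 if hard else 0.05
    llm.append(f)
    human.append(y)
    probs.append(p)
    sampled.append(random.random() < p)

ipw_estimate = ipw_rectified_mean(llm, human, probs, sampled)

# Naive rectifier: unweighted mean residual over the labeled rows only.
labeled = [i for i in range(n) if sampled[i]]
naive_estimate = sum(llm) / n + sum(human[i] - llm[i] for i in labeled) / len(labeled)
print(ipw_estimate, naive_estimate)
```

Because hard rows are overrepresented among the labeled rows, the naive rectifier overcorrects, while weighting each residual by 1/probs[i] restores an unbiased correction.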
What holds up

The theoretical grounding in PPI++ is solid, and the extension to vector estimands with group-specific rectification is novel and well-executed. The empirical finding that persona prompting is unreliable, and actually worse than few-shot prompting for offensiveness detection, is an important practical contribution that challenges standard practice. The synthetic robustness experiments convincingly demonstrate that PDI provides the greatest benefits when group disparities are large and minority groups are small (under 10%), which are precisely the conditions where uniform sampling fails most severely.

“For age 35–49, the persona variant achieves 0% coverage... In contrast, both PPI and PDI maintain 95.0% average coverage across all three age groups.”
paper · Table 3
“The precision advantage is largest when the human annotation budget is small, when the LLM makes more errors on the minority group, and when the minority group is small.”
paper · Appendix H
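
For readers unfamiliar with the coverage figures quoted above, the following sketch shows how empirical coverage is typically computed: the fraction of repeated trials whose confidence interval contains the target value. The intervals are hand-written, hypothetical numbers chosen for illustration and are not results from the paper.

```python
def empirical_coverage(intervals, target):
    """Fraction of (lower, upper) intervals that contain the target value."""
    hits = sum(1 for lower, upper in intervals if lower <= target <= upper)
    return hits / len(intervals)

target = 3.0  # hypothetical human-derived group mean

# Hypothetical LLM-only intervals: narrow, but centered on a biased estimate.
llm_only_intervals = [(3.40, 3.60), (3.50, 3.70), (3.45, 3.65), (3.38, 3.58)]

# Hypothetical rectified intervals: wider, centered near the target.
rectified_intervals = [(2.70, 3.30), (2.90, 3.50), (2.60, 3.10), (3.05, 3.60)]

llm_only_cov = empirical_coverage(llm_only_intervals, target)
rectified_cov = empirical_coverage(rectified_intervals, target)
print(llm_only_cov, rectified_cov)
```

The example shows why a biased proxy can reach 0% coverage even with tight intervals: every interval is precise about the wrong value.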
Main concerns

The evaluation scope is narrow: only two tasks (politeness and offensiveness) from a single dataset (POPQUORN), which limits generalization to other subjective domains such as toxicity or empathy. Critically, the stratification treats demographic axes independently: gender, age, and education are modeled separately, so the framework cannot capture intersectional effects, where being an older woman may produce a perspective distinct from the sum of separate age and gender effects. The error predictor relies solely on demographic features rather than text features, which likely leaves performance gains unrealized, and the burn-in phase requires costly uniform sampling that may be prohibitive in very small budget regimes.

“we stratify by gender, age, and education independently, but these axes correlate in practice. An annotator who is older and identifies as a particular gender occupies an intersection that our per-axis treatment does not capture.”
paper · Limitations
“The error predictor is a gradient-boosted tree trained on demographic features alone; richer features, including linguistic properties of the text, could improve predictions and allocate the budget more efficiently.”
paper · Limitations
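
The difference between per-axis and intersectional stratification can be sketched with a toy annotator table. The records, axis values, and function name below are hypothetical and serve only to show how strata multiply and shrink once axes are crossed.

```python
from collections import Counter

# Hypothetical annotator records; the values are illustrative only.
annotators = [
    {"gender": "woman", "age": "50+"},
    {"gender": "woman", "age": "18-34"},
    {"gender": "man", "age": "50+"},
    {"gender": "man", "age": "18-34"},
    {"gender": "man", "age": "18-34"},
    {"gender": "woman", "age": "35-49"},
]

def stratum_sizes(records, axes):
    """Count annotators per stratum; a stratum is a tuple of axis values."""
    return Counter(tuple(r[a] for a in axes) for r in records)

# Per-axis treatment: one set of strata for each axis, considered separately.
per_axis = {axis: stratum_sizes(annotators, [axis]) for axis in ("gender", "age")}

# Intersectional treatment: one stratum per combination of axis values.
intersections = stratum_sizes(annotators, ["gender", "age"])

print(per_axis)
print(intersections)
```

Crossing the axes yields cells with very few annotators, or none at all, which suggests why an intersectional extension would face much tighter per-stratum budgets than the per-axis treatment.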
Evidence and comparison

The evidence supports the claim that PDI improves inference for targeted underrepresented groups, but not that it dominates uniform PPI overall. On aggregate metrics averaged across all groups, where a lower Δ is better, PDI (Δ = 7.10) does not consistently outperform uniform PPI (Δ = 6.35) for politeness. This is consistent with the authors' framing of PDI as a targeted subgroup-improvement method rather than a universal alternative. The comparison to LLM-only methods is stark and valid: for Age 50+ on offensiveness, all three LLM-only variants exceed 24% delta while PDI achieves 5.24%, demonstrating the necessity of human rectification for valid inference on poorly modeled demographics.

“Averaged across age groups, the LLM-only methods are comparable to PPI and non-rectified baselines... For Age 50+, all three LLM-only variants exceed 24% delta, with persona reaching 28.40%. PDI achieves the lowest delta for Age 50+ at 5.24%.”
paper · Section 4.1
“On aggregate metrics, PDI does not consistently outperform uniform PPI. But PDI is designed for settings where the main concern is not average performance across all groups, but better inference for those groups that are hardest for LLM proxies to capture.”
paper · Section 5
Reproducibility

The study uses the publicly available POPQUORN dataset and reports detailed prompting strategies in Appendix A. Key hyperparameters are specified: 20 trials per condition, a burn-in phase followed by batch-wise adaptive sampling with smoothing, and XGBoost for the error predictor. The paper evaluates eight diverse models (GPT-5.2, Claude variants, Gemini, Llama, Mistral) via OpenRouter with default parameters. However, no code repository for the paper is mentioned, exact random seeds are not provided, and the IPW estimator implementation details (Appendix B) rely on non-standard bootstrap procedures that would require careful verification to reproduce independently.

“Models were accessed using Open Router, using default parameters such as temperature and max token outputs.”
paper · Appendix A
“Code available at https://github.com/aangelopoulos/ppi_py”
Angelopoulos et al., PPI++ · arXiv:2311.01453
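
Since no implementation of the adaptive sampling step is referenced, the following is only a rough sketch of what batch-wise allocation with smoothing might look like. It assumes smoothing means mixing error-proportional shares with a uniform share; the function names, the smoothing weight, and the predicted-error values (which stand in for the output of the error predictor) are all hypothetical.

```python
def smoothed_allocation(predicted_error, smoothing=0.2):
    """Turn per-group predicted LLM error into sampling shares.

    Shares are a mix of error-proportional allocation and a uniform
    allocation, so every group keeps a nonzero labeling probability.
    """
    k = len(predicted_error)
    total = sum(predicted_error.values())
    shares = {}
    for group, err in predicted_error.items():
        proportional = err / total if total > 0 else 1.0 / k
        shares[group] = (1 - smoothing) * proportional + smoothing / k
    return shares

def batch_counts(shares, batch_size):
    """Largest-remainder rounding so the counts sum to batch_size."""
    raw = {g: s * batch_size for g, s in shares.items()}
    counts = {g: int(r) for g, r in raw.items()}
    leftover = batch_size - sum(counts.values())
    by_remainder = sorted(raw, key=lambda g: raw[g] - counts[g], reverse=True)
    for g in by_remainder[:leftover]:
        counts[g] += 1
    return counts

# Hypothetical predicted errors, e.g. estimated after a uniform burn-in phase.
predicted_error = {"18-34": 0.05, "35-49": 0.10, "50+": 0.25}
shares = smoothed_allocation(predicted_error, smoothing=0.2)
counts = batch_counts(shares, batch_size=30)
print(shares, counts)
```

Keeping a uniform component in the shares matters for the inverse probability weights: a group whose labeling probability dropped to zero could no longer be rectified at all.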
Abstract

Large language models are increasingly used to annotate texts, but their outputs reflect some human perspectives better than others. Existing methods for correcting LLM annotation error assume a single ground truth. However, this assumption fails in subjective tasks where disagreement across demographic groups is meaningful. Here we introduce Perspective-Driven Inference, a method that treats the distribution of annotations across groups as the quantity of interest, and estimates it using a small human annotation budget. We contribute an adaptive sampling strategy that concentrates human annotation effort on groups where LLM proxies are least accurate. We evaluate on politeness and offensiveness rating tasks, showing targeted improvements for harder-to-model demographic groups relative to uniform sampling baselines, while maintaining coverage.
