Multi-Perspective LLM Annotations for Valid Analyses in Subjective Tasks
LLM annotations encode some human perspectives better than others, especially in subjective tasks where demographic background shapes judgments. This paper introduces Perspective-Driven Inference (PDI), a statistical framework that treats the distribution of group-specific annotations as a vector estimand $\theta^* = (\theta^*_{g_1}, \dots, \theta^*_{g_K})$ and adaptively allocates limited human labels to groups where LLM proxies are least reliable. The core contribution is an error-predictor-driven sampling rule that improves estimation accuracy for harder-to-model demographics while maintaining valid frequentist coverage.
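The allocation idea can be illustrated with a minimal sketch. It assumes the error predictor returns one predicted LLM error per group and that the sampling rule mixes an error-proportional allocation with a uniform one. The function name `allocation_probs` and the `smooth` parameter are hypothetical and are not taken from the paper, whose exact rule may differ.

```python
import numpy as np

def allocation_probs(pred_error, smooth=0.2):
    """Share of the next human-label batch assigned to each group.

    pred_error: predicted LLM error per group (hypothetical predictor output).
    smooth: weight on the uniform allocation (assumed form of smoothing).
    """
    err = np.asarray(pred_error, dtype=float)
    k = err.size
    uniform = np.full(k, 1.0 / k)
    total = err.sum()
    # Fall back to uniform if the predictor reports zero error everywhere.
    greedy = err / total if total > 0 else uniform
    # Mixing with uniform keeps every group's share at least smooth / k.
    return (1.0 - smooth) * greedy + smooth * uniform

# Three groups; the third is predicted to be hardest for the LLM.
shares = allocation_probs([0.05, 0.05, 0.30], smooth=0.2)
print(shares)  # approximately [0.167, 0.167, 0.667]
```

Keeping every share strictly positive matters for the inverse probability weighting step: a group that is never sampled would have an undefined weight.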
The paper presents a well-motivated extension of Prediction-Powered Inference (PPI) to multi-perspective settings where ground truth is inherently plural. The method is statistically sound: it uses inverse probability weighting to correct for the bias introduced by adaptive sampling. Empirically, the paper shows that persona prompting, the dominant baseline for demographic simulation, systematically fails for the groups it purports to represent. However, the evaluation is limited to two subjective rating tasks from a single dataset, and the framework does not address intersectional identities (e.g., older women vs. older men), which constrains its applicability to real-world demographic complexity.
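As an illustration of the inverse probability weighting correction mentioned above, the following is a minimal sketch of an IPW-rectified mean for a single group. It assumes each item is selected for human labeling independently with a known probability, and it uses the generic form "LLM prediction plus reweighted residual". The function name `ipw_ppi_mean` and its arguments are hypothetical; the paper's estimator (Appendix B) may differ in detail.

```python
import numpy as np

def ipw_ppi_mean(llm_pred, human_label, sampled, prob):
    """IPW-rectified estimate of one group's mean rating.

    llm_pred:    LLM rating for every item in the group.
    human_label: human rating (may be NaN where no label was collected).
    sampled:     boolean mask, True where a human label was collected.
    prob:        probability with which each item was selected (must be > 0).
    """
    llm_pred = np.asarray(llm_pred, dtype=float)
    human_label = np.asarray(human_label, dtype=float)
    sampled = np.asarray(sampled, dtype=bool)
    prob = np.asarray(prob, dtype=float)

    # Residuals are observed only on sampled items; others contribute zero.
    resid = np.zeros_like(llm_pred)
    resid[sampled] = human_label[sampled] - llm_pred[sampled]

    # Dividing by the selection probability undoes the oversampling of
    # items (or groups) that the adaptive rule favoured.
    per_item = llm_pred + resid / prob
    est = per_item.mean()
    # Standard error under the assumption of independent Bernoulli sampling.
    se = per_item.std(ddof=1) / np.sqrt(per_item.size)
    return est, se

# Toy check: the LLM is biased low by 0.5 and 30% of items get human labels.
rng = np.random.default_rng(0)
llm = rng.normal(size=20_000)
truth = llm + 0.5
prob = np.full(llm.size, 0.3)
sampled = rng.random(llm.size) < prob
est, se = ipw_ppi_mean(llm, truth, sampled, prob)
print(est, truth.mean(), se)
```

When every item is labeled with probability 1 the estimate reduces to the plain human mean, and when no item is labeled it reduces to the LLM-only mean, which makes the role of the rectification term easy to see.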
The theoretical grounding in PPI++ is solid, and the extension to vector estimands with group-specific rectification is novel and well-executed. The empirical finding that persona prompting is unreliable—and actually worse than few-shot prompting for offensiveness detection—is an important practical contribution that challenges standard practice. The synthetic robustness experiments convincingly demonstrate that PDI provides the greatest benefits when group disparities are large and minority groups are small (under 10%), which are precisely the conditions where uniform sampling fails most severely.
The evaluation scope is narrow: only two tasks (politeness and offensiveness) from a single dataset (POPQUORN), which limits generalization to other subjective domains such as toxicity or empathy. Critically, the stratification treats demographic axes independently: gender, age, and education are modeled separately, so the framework cannot capture intersectional effects, where the perspective of older women may differ from what the age and gender effects would predict when simply added together. The error predictor relies solely on demographic features rather than text features, which may leave accuracy gains from text-based signals unrealized. The burn-in phase also requires uniform sampling, whose cost may be prohibitive when the annotation budget is very small.
The evidence supports the claim that PDI improves inference for targeted underrepresented groups, but not that it dominates uniform PPI overall. On aggregate metrics averaged across all groups, PDI (Δ = 7.10) does not consistently outperform uniform PPI (Δ = 6.35) for politeness, which is consistent with the authors' framing of PDI as a targeted subgroup-improvement method rather than a universal alternative. The comparison to LLM-only methods is stark and valid: for Age 50+ on offensiveness, the LLM-only variants exceed Δ = 24% while PDI achieves Δ = 5.24%, demonstrating the necessity of human rectification for valid inference on poorly modeled demographics.
The study uses the publicly available POPQUORN dataset and reports detailed prompting strategies in Appendix A. Key hyperparameters are specified: 20 trials per condition, a burn-in phase followed by batch-wise adaptive sampling with smoothing, and XGBoost for the error predictor. The paper evaluates eight diverse models (GPT-5.2, Claude variants, Gemini, Llama, Mistral) via OpenRouter with default parameters. However, no code repository is mentioned, the exact random seeds are not provided, and the IPW estimator implementation (Appendix B) relies on non-standard bootstrap procedures that would require careful verification to reproduce independently.
Large language models are increasingly used to annotate texts, but their outputs reflect some human perspectives better than others. Existing methods for correcting LLM annotation error assume a single ground truth. However, this assumption fails in subjective tasks where disagreement across demographic groups is meaningful. Here we introduce Perspective-Driven Inference, a method that treats the distribution of annotations across groups as the quantity of interest, and estimates it using a small human annotation budget. We contribute an adaptive sampling strategy that concentrates human annotation effort on groups where LLM proxies are least accurate. We evaluate on politeness and offensiveness rating tasks, showing targeted improvements for harder-to-model demographic groups relative to uniform sampling baselines, while maintaining coverage.