Multiperspectivity as a Resource for Narrative Similarity Prediction

cs.CL Max Upravitelev, Veronika Solopova, Jing Yang, Charlott Jakob, Premtim Sahitaj, Ariana Sahitaj, Vera Schmitt · Mar 23, 2026
Plain-language introduction

Narrative similarity is inherently interpretive—different valid readings can yield divergent judgments, challenging benchmarks that encode single ground truths. This paper proposes embracing multiperspectivity by ensembling 31 LLM personas, ranging from literary critics to lay characters, to predict which of two stories is more similar to an anchor. The approach leverages Condorcet Jury Theorem-like dynamics to improve accuracy, achieving 0.705 on SemEval-2026 Task 4 while revealing that diverse practitioner perspectives yield better ensemble gains despite lower individual performance.

Critical review
Verdict
Bottom line

The paper presents a compelling empirical case that interpretive diversity improves narrative similarity prediction. The core finding, that practitioner personas reach 76.0% majority-vote accuracy despite a mean individual accuracy of only 71.0%, is well supported by the error correlation analysis: practitioners' errors are less correlated than lay personas' ($r = 0.388$ vs $0.461$, a relative reduction of roughly 16%). However, interpreting the gender-vocabulary correlations as evidence of "valid interpretations absent from the ground truth" (Section 6) overreaches. The authors acknowledge that correlation does not imply causation, yet they still frame the finding as revealing benchmark limitations rather than possible model biases or spurious associations.

“Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting”
paper · Abstract
“Error correlation (r) ... [P] 0.388 ... [L] 0.461”
paper · Table 5
“valid interpretations absent from the ground truth”
paper · Section 6
What holds up

The demonstration that ensemble diversity improves prediction quality is robust and consistent across model families. Accuracy improves monotonically with ensemble size from 66.0% at $E=1$ to 75.2% at $E=31$ (Table 2), and the oracle analysis confirms that 99.1% of items are solvable by at least one persona on Qwen3 (Table 3). The error diversity mechanism is validated by pairwise correlations and double-fault rates (Table 5), showing practitioners fail together less often ($P(\text{both wrong}) = 0.164$ vs $0.173$). The methodological choice to use sampling consistency over verbalized confidence as an uncertainty metric is empirically justified and aligns with established findings on LLM calibration.

“E=31 ... 75.2”
paper · Table 2
“K≥1 ... 99.1%±0.4%”
paper · Table 3
“sampling consistency was the substantially stronger predictor”
paper · Appendix A
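
To make the two diversity measures cited above concrete, here is a minimal Python sketch of how mean pairwise error correlation and the double-fault rate could be computed from a binary error matrix. The function names and the toy matrix are hypothetical illustrations, not the paper's code or data, and the paper's exact estimator may differ.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Assumes both sequences have non-zero variance (a persona that is
    always right or always wrong would cause a division by zero).
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def diversity_stats(errors):
    """errors[p][i] is 1 if persona p got item i wrong, else 0.

    Returns (mean pairwise error correlation, mean double-fault rate),
    both averaged over all unordered persona pairs.
    """
    pairs = list(combinations(errors, 2))
    mean_r = mean(pearson(a, b) for a, b in pairs)
    double_fault = mean(mean(x * y for x, y in zip(a, b)) for a, b in pairs)
    return mean_r, double_fault


def majority_accuracy(errors):
    """Fraction of items on which fewer than half of the personas err."""
    n_personas, n_items = len(errors), len(errors[0])
    correct = sum(
        1 for i in range(n_items)
        if sum(e[i] for e in errors) < n_personas / 2
    )
    return correct / n_items


# Hypothetical toy data: 3 personas, 6 items, each persona wrong twice.
toy_errors = [
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
]
mean_r, double_fault = diversity_stats(toy_errors)
ensemble_acc = majority_accuracy(toy_errors)
```

On this toy matrix each persona is right on 4 of 6 items, yet the majority vote is right on 5 of 6, because the personas only fail together on one item. That is the same qualitative pattern the review attributes to the practitioner ensemble.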
Main concerns

The optimized $k=8$ ensemble shows severe overfitting: it achieves 82.0% on the dev set but only 70.5% on test, an 11.5 percentage point gap compared to 5.6 for the full ensemble (Section 4). With only 200 dev items, subset optimization is unreliable, yet the paper underemphasizes this finding. The gender-vocabulary analysis (Table 6) shows small but significant negative correlations ($|r_{pb}| \leq 0.046$) for terms like "gender roles" and "female protagonist," but the authors admit this "may co-occur with particular narrative properties of the input texts rather than causally driving incorrect predictions" (Limitations). The paper also claims to compare against multi-agent debate but explicitly states "our experiments... do not evaluate multi-agent debate systems," making it impossible to assess whether simple voting is actually preferable to deliberative approaches.

“The 11.5 percentage point dev-test gap (vs. 5.6 for the full ensemble) suggests overfitting to the small dev set (200 items)”
paper · Section 4 Further Results
“the presence of gender-analytical terms may co-occur with particular narrative properties of the input texts rather than causally driving incorrect predictions”
paper · Limitations
“our experiments investigate majority voting... but do not evaluate multi-agent debate (MAD) systems”
paper · Limitations
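
The overfitting concern can be checked with back-of-the-envelope arithmetic. The sketch below uses only figures quoted in this review (82.0% dev, 70.5% test, 200 dev items, 31 personas, $k=8$). It assumes, hypothetically, that the search enumerated every size-8 subset, and it uses the simple binomial standard error as a rough noise estimate.

```python
from math import comb, sqrt

DEV_ITEMS = 200          # dev set size quoted in the review
N_PERSONAS, K = 31, 8    # full ensemble size and optimized subset size
dev_acc, test_acc = 0.820, 0.705

# Dev-test gap in percentage points.
gap_pp = round((dev_acc - test_acc) * 100, 1)

# Number of candidate subsets, ASSUMING all size-8 subsets were scored.
n_subsets = comb(N_PERSONAS, K)

# Binomial standard error of a single accuracy estimate on 200 items,
# in percentage points (ignores the extra bias from picking the best subset).
se_pp = sqrt(dev_acc * (1 - dev_acc) / DEV_ITEMS) * 100

print(f"gap: {gap_pp} pp, subsets: {n_subsets:,}, dev SE: {se_pp:.1f} pp")
```

Under these assumptions, millions of candidate subsets are ranked on a dev set where a single accuracy estimate already carries a standard error of a few percentage points. Picking the maximum over that many noisy estimates is likely to be optimistic, which fits the large dev-test gap.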
Evidence and comparison

The evidence supports ensemble efficacy but lacks critical baselines. While the comparison between practitioner and lay ensembles is thorough (Tables 4 and 5), the paper does not establish whether multiperspectivity outperforms simpler strategies like self-consistency, chain-of-thought prompting, or single-expert prompts with the same computational budget. The Condorcet Jury Theorem framing is appropriately qualified—the authors note the independence assumption is violated (substantial pairwise correlations exist) and cite limits of LLM ensembles. However, without human performance baselines on the same 400 test items, it remains unclear whether 70.5% accuracy represents progress on interpretive alignment or merely reflects the difficulty of the benchmark.

“substantial pairwise error correlations (r=.388 for [P], r=.461 for [L])”
paper · Section 6
“results across domains remain mixed”
paper · Section 1
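
To see how far the observed gains fall short of the idealized Condorcet setting, the following sketch computes the majority-vote accuracy that fully independent voters would reach. It treats the $E=1$ accuracy from Table 2 (66.0%) as a common per-voter accuracy, which is a simplifying assumption made here, not something the paper states. The function name is hypothetical.

```python
from math import comb


def majority_correct_prob(n, p):
    """Probability that a strict majority of n independent voters is
    correct, when each voter is correct with probability p (n odd)."""
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )


# Idealized prediction for 31 independent voters at 66.0% accuracy each.
independent_31 = majority_correct_prob(31, 0.66)

# Observed full-ensemble accuracy at E=31, as quoted from Table 2.
observed_31 = 0.752

print(f"independent: {independent_31:.3f}, observed: {observed_31:.3f}")
```

The independent-voter figure comes out far above the observed 75.2%. That fits the authors' own qualification that errors are substantially correlated, so the theorem applies only in a weakened form.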
Reproducibility

The methodology is transparently documented with full system prompts (Table 1, Appendix C) and structured generation schemas using Pydantic and vLLM (Appendix G). The use of open-weight models (Gemma 3 27B-it, Qwen3-14B, gpt-oss-20b) supports independent reproduction. However, the paper does not state whether code, data, or prediction logs will be released. Critical hyperparameters for the main experiments (beyond temperature $t=1$ for consistency analysis in Appendix A) are not specified, and the exhaustive search procedure for subset optimization (Appendix D) lacks implementation details—particularly how the authors handled the combinatorial explosion when selecting from 31 personas. The absence of runtimes or computational cost estimates further impedes reproducibility assessment.

“structured output enforced by the vllm Kwon et al. inference engine containing the following keys”
paper · Appendix G
“we run each persona 10 times at temperature t=1”
paper · Appendix A
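
For readers unfamiliar with sampling consistency, here is one plausible way to compute it from repeated runs of a single persona on a single item. The definition used here (the share of samples that agree with the most frequent answer) is an assumption, since this review does not quote the paper's exact formula. The answer labels and the sample list are made up.

```python
from collections import Counter


def sampling_consistency(answers):
    """Return (modal answer, fraction of samples agreeing with it).

    `answers` holds the repeated predictions of one persona for one item,
    e.g. 10 samples drawn at temperature 1.
    """
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Hypothetical outcome of 10 runs choosing between candidate stories A and B.
runs = ["A", "A", "B", "A", "A", "A", "B", "A", "A", "A"]
answer, consistency = sampling_consistency(runs)
print(answer, consistency)
```

A score near 1.0 means the persona gives the same answer almost every time. A score near 0.5 on a binary choice means its answer is close to a coin flip, which is the kind of uncertainty a verbalized confidence score may fail to reveal.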
Abstract

Predicting narrative similarity can be understood as an inherently interpretive task: different, equally valid readings of the same text can produce divergent interpretations and thus different similarity judgments, posing a fundamental challenge for semantic evaluation benchmarks that encode a single ground truth. Rather than treating this multiperspectivity as a challenge to overcome, we propose to incorporate it in the decision making process of predictive systems. To explore this strategy, we created an ensemble of 31 LLM personas. These range from practitioners following interpretive frameworks to more intuitive, lay-style characters. Our experiments were conducted on the SemEval-2026 Task 4 dataset, where the system achieved an accuracy score of 0.705. Accuracy improves with ensemble size, consistent with Condorcet Jury Theorem-like dynamics under weakened independence. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting. Our error analysis reveals a consistent negative association between gender-focused interpretive vocabulary and accuracy across all persona categories, suggesting either attention to dimensions not relevant for the benchmark or valid interpretations absent from the ground truth. This finding underscores the need for evaluation frameworks that account for interpretive plurality.
