More Isn't Always Better: Balancing Decision Accuracy and Conformity Pressures in Multi-AI Advice
As users increasingly consult multiple large language models for decision support, a critical question arises: does increasing the number of AI advisors improve accuracy or amplify harmful conformity pressures? This paper investigates how panel size, within-panel consensus, and human-likeness of presentation shape human reliance and decision accuracy across three prediction tasks (income, recidivism, and dating). Through two crowdsourced experiments with 348 participants, the authors reveal a surprising non-monotonic relationship: three AI advisors improve accuracy over a single advisor, but five provide no additional benefit, while unanimous consensus fosters overreliance and wide disagreement creates confusion.
The paper provides a timely and well-structured investigation into multi-AI decision support, offering valuable empirical evidence that more advisors are not always better. The finding that medium-sized panels (three AIs) outperform a single advisor, while larger panels (five AIs) add no further gain, represents a meaningful contribution to the human-AI collaboration literature. However, the study's generalizability is constrained by its reliance on a Japanese crowdworker sample and artificial decision-tree advisors rather than actual LLMs, limiting direct applicability to emerging multi-LLM interfaces. The work successfully bridges social psychology research on conformity with HCI design concerns, though some claims about 'conformity pressure' rely heavily on correlational subjective measures rather than behavioral manipulation checks.
The paper's core empirical finding—that decision accuracy saturates at three advisors—holds robustly across multiple tasks and represents a practical design insight for multi-AI systems. The use of a Rashomon set (decision trees sampled from a random forest with equivalent $70\%$ accuracy but diverse predictions) provides excellent experimental control over advisor competence while naturally generating consensus and disagreement patterns. The gradient of consensus effects is particularly compelling: unanimous agreement drives overreliance (Agreement Fraction $0.97$ in consensus vs. $0.73$ in divergence), single dissent reduces conformity pressure, and $3\text{--}2$ splits create decision paralysis. The methodology explicitly distinguishing informational from normative conformity through survey items $Q_3$ and $Q_6$ adds theoretical rigor.
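To make the consensus gradient concrete, the sketch below labels a panel's binary predictions by consensus pattern and computes an agreement fraction. It is an illustration under stated assumptions, not the authors' code: the function names, the label strings, and the definition of agreement fraction as the share of trials in which the participant's final answer matches the panel majority are all hypothetical.

```python
from collections import Counter


def panel_majority(advice):
    """Return the most common label in a panel's advice, or None on a tie."""
    counts = Counter(advice).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def consensus_pattern(advice):
    """Label a panel by how many advisors dissent from the largest bloc.

    The label strings are hypothetical, chosen only for this sketch.
    """
    counts = Counter(advice)
    dissenters = len(advice) - max(counts.values())
    if dissenters == 0:
        return "unanimous"
    if dissenters == 1:
        return "single_dissent"
    return "split"


def agreement_fraction(trials):
    """Share of trials where the participant's answer equals the panel majority.

    trials: list of (advice, participant_answer) pairs.
    Trials with a tied panel are skipped (an assumption of this sketch).
    """
    matches = 0
    total = 0
    for advice, answer in trials:
        majority = panel_majority(advice)
        if majority is None:
            continue
        total += 1
        matches += int(answer == majority)
    return matches / total if total else float("nan")
```

Under these assumptions, a five-advisor panel voting `[1, 1, 1, 0, 0]` is labeled `"split"`, which corresponds to the $3\text{--}2$ case discussed above.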
The study suffers from significant external validity limitations. Participants were Japanese crowdworkers (mean age $44.5$), and the authors acknowledge that 'conformity varies by culture,' yet make broad design claims without cross-cultural validation. More critically, the 'AI advisors' were decision trees from random forests presented through an LLM-generated explanation layer, not actual LLM agents. This abstraction misses critical phenomena in real multi-LLM consultation, where models may have correlated errors, shared training data, or conversational dynamics. The human-likeness manipulation (Study 2) showed null effects on accuracy despite successful manipulation checks, suggesting either insufficient power ($n=88$ across three tasks) or that the chosen anthropomorphic cues (static photos, names) lack ecological validity compared to voice or interactive personas. The authors' claim that 'panel consensus can elicit informational conformity' relies on correlational evidence (Table 6), where $R \approx 0.42$ for pressure-reliance relationships, so the direction of causality remains unclear.
The paper adequately positions itself against prior work, contrasting its discrete-panel approach with Lu et al.'s sequential second-opinion paradigm and Song et al.'s (2024) finding that informational conformity may not arise with AI panels. However, the comparison to Song et al. is somewhat strained—the current study uses accuracy-grounded tasks where correctness is verifiable, while Song et al. examined opinion change on societal issues without ground truth. The claim that 'simple statistical aggregation usually outperforms humans' deliberative selection' is supported by classic literature (Soll and Larrick, 2009), yet the paper does not empirically test whether participants would perform better with an explicit majority-vote aggregation compared to the sequential advice presentation used. The relationship to Asch conformity studies is well-articulated, though the translation from perceptual line-judgment tasks to predictive inference merits deeper theoretical justification.
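The untested comparison with explicit majority-vote aggregation could be run roughly as follows: score a plain majority vote over each panel and compare it with participants' final answers on the same trials. This is a minimal sketch. The trial layout (dicts with `advice`, `human`, and `truth` keys) and all function names are assumptions made for illustration, not the paper's data format.

```python
from collections import Counter


def majority_vote(advice):
    """Most common label in the panel.

    Ties go to the label seen first, which is an arbitrary choice for this sketch.
    """
    return Counter(advice).most_common(1)[0][0]


def accuracy(predictions, truths):
    """Fraction of predictions that match the ground truth."""
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)


def compare_to_majority(trials):
    """Compare majority-vote accuracy with human accuracy on the same trials.

    trials: list of dicts with hypothetical keys
        'advice' (list of panel labels), 'human' (final answer), 'truth'.
    """
    truths = [t["truth"] for t in trials]
    vote_acc = accuracy([majority_vote(t["advice"]) for t in trials], truths)
    human_acc = accuracy([t["human"] for t in trials], truths)
    return {"majority_vote": vote_acc, "human": human_acc}
```

A gap in favor of `majority_vote` would suggest that participants lose accuracy by deliberating over the panel rather than simply aggregating it, which is the pattern the classic literature predicts.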
Reproducibility is moderately supported but incomplete. The authors specify using GPT-4o via the OpenAI API for explanation generation with detailed prompts (Figure 3), and the Rashomon set construction from random forests is algorithmically precise ($70\%$ accuracy selection). However, the paper makes no mention of code, data, or stimuli availability: none of the decision tree models, the specific feature attributions, or the explanation generation prompts (beyond what Figure 3 shows) is publicly archived. The attention-check methodology (selecting features highlighted by AI) is described but not validated for efficacy. Hyperparameters for the random forest (number of trees, feature sampling) are unspecified. Without access to the exact tree structures or the complete GPT-4o prompt templates, independent reproduction of the advice stimuli would be very difficult. The study protocol appears rigorous (three attention checks, comprehension quizzes), yet the lack of open materials significantly impedes replication efforts.
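For readers attempting a replication, the selection step might look like the sketch below: keep candidate models whose validation accuracy falls close to the $70\%$ target, then draw a fixed-seed panel from them. The tolerance, the seed, the function names, and the representation of each model as a plain callable are assumptions; the paper does not specify these details.

```python
import random


def rashomon_set(models, X, y, target=0.70, tol=0.01):
    """Keep models whose accuracy on (X, y) is within `tol` of `target`.

    models: callables mapping one input to a predicted label.
    `tol` is a hypothetical tolerance, not a value from the paper.
    """
    kept = []
    for model in models:
        acc = sum(model(x) == label for x, label in zip(X, y)) / len(y)
        if abs(acc - target) <= tol:
            kept.append(model)
    return kept


def sample_panel(candidates, size, seed=0):
    """Draw `size` distinct advisors; the fixed seed makes the stimuli repeatable."""
    return random.Random(seed).sample(candidates, size)
```

With scikit-learn, the individual trees of a fitted random forest are available through its `estimators_` attribute, so each tree's `predict` method could be wrapped as one of the callables here. Archiving the seed and the selected trees alongside the paper would address most of the gap described above.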
Just as people improve decision-making by consulting diverse human advisors, they can now also consult with multiple AI systems. Prior work on group decision-making shows that advice aggregation creates pressure to conform, leading to overreliance. However, the conditions under which multi-AI consultation improves or undermines human decision-making remain unclear. We conducted experiments with three tasks in which participants received advice from panels of AIs. We varied panel size, within-panel consensus, and the human-likeness of presentation. Accuracy improved for small panels relative to a single AI; larger panels yielded no gains. The level of within-panel consensus affected participants' reliance on AI advice: High consensus fostered overreliance; a single dissent reduced pressure to conform; wide disagreement created confusion and undermined appropriate reliance. Human-like presentations increased perceived usefulness and agency in certain tasks, without raising conformity pressure. These findings yield design implications for presenting multi-AI advice that preserve accuracy while mitigating conformity.