More Isn't Always Better: Balancing Decision Accuracy and Conformity Pressures in Multi-AI Advice

cs.HC cs.AI Yuta Tsuchiya, Yukino Baba · Mar 23, 2026

Plain-language introduction

As users increasingly consult multiple large language models for decision support, a critical question arises: does increasing the number of AI advisors improve accuracy or amplify harmful conformity pressures? This paper investigates how panel size, within-panel consensus, and human-likeness of presentation shape human reliance and decision accuracy across three prediction tasks (income, recidivism, and dating). Through two crowdsourced experiments with 348 participants, the authors reveal a surprising non-monotonic relationship: three AI advisors improve accuracy over a single advisor, but five provide no additional benefit, while unanimous consensus fosters overreliance and wide disagreement creates confusion.

Critical review

Verdict

Bottom line

The paper provides a timely, well-structured investigation into multi-AI decision support, offering empirical evidence that more advisors are not always better. The finding that a three-AI panel improves accuracy over a single advisor, while a five-AI panel yields no further gain, is a meaningful contribution to the human-AI collaboration literature. However, generalizability is constrained by the reliance on a Japanese-language crowdworker sample and on decision-tree advisors rather than actual LLMs, which limits direct applicability to emerging multi-LLM interfaces. The work bridges social psychology research on conformity with HCI design concerns, though some claims about 'conformity pressure' rest heavily on correlational subjective measures rather than behavioral manipulation checks.

“AI_3 (M≈.737) achieved significantly higher accuracy than the single-AI condition (M≈.706, p=.012). AI_5 did not significantly differ from AI_1.”

Tsuchiya and Baba, Sec. 4.2.1

“In DIV_3, no significant pre-post accuracy differences were found in any task. This suggests that a near-even split among AI opinions created uncertainty and undermined effective decision-making.”

Tsuchiya and Baba, Sec. 4.3.1

What holds up

The paper's core empirical finding, that decision accuracy saturates at three advisors, holds across multiple tasks and is a practical design insight for multi-AI systems. The use of a Rashomon set (decision trees sampled from a random forest, each with the same $70\%$ test accuracy but differing predictions) gives tight experimental control over advisor competence while naturally generating consensus and disagreement patterns. The gradient of consensus effects is particularly compelling: unanimous agreement drives overreliance (Agreement Fraction $0.97$ under consensus vs. $0.73$ under divergence in the Income task), a single dissent reduces conformity pressure, and $3\text{--}2$ splits create uncertainty that undermines appropriate reliance. The methodology's explicit distinction between informational and normative conformity, through survey items $Q_3$ and $Q_6$, adds theoretical rigor.

“Among the obtained models, we identified 30–36 trees that achieved exactly 70% accuracy (35 correct out of 50 test cases) and adopted them as the Rashomon set.”

Tsuchiya and Baba, Sec. 3.3
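
The selection rule quoted above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the candidate predictions are synthetic stand-ins for trees drawn from a random forest, and the names `select_rashomon_set` and `make_synthetic_candidate` are hypothetical. Only the criterion itself (exactly 35 of 50 test cases correct) comes from the paper.

```python
import random


def accuracy_count(preds, labels):
    """Number of test cases on which the predictions match the labels."""
    return sum(p == y for p, y in zip(preds, labels))


def select_rashomon_set(candidates, labels, target_correct):
    """Keep only candidates with exactly `target_correct` correct predictions."""
    return [
        name
        for name, preds in candidates.items()
        if accuracy_count(preds, labels) == target_correct
    ]


def make_synthetic_candidate(labels, n_wrong, rng):
    """Hypothetical stand-in for one tree: flips `n_wrong` randomly chosen labels."""
    wrong = set(rng.sample(range(len(labels)), n_wrong))
    return [1 - y if i in wrong else y for i, y in enumerate(labels)]


rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(50)]  # 50 binary test cases

# Assumed error counts per candidate; 15 wrong means 35/50 = 70% accuracy.
error_counts = [15, 15, 12, 20, 15, 18]
candidates = {
    f"tree_{k}": make_synthetic_candidate(labels, n_wrong, rng)
    for k, n_wrong in enumerate(error_counts)
}

rashomon = select_rashomon_set(candidates, labels, target_correct=35)
print(rashomon)  # ['tree_0', 'tree_1', 'tree_4']
```

Candidates that pass the filter share the same accuracy but can err on different cases, which is what lets a panel built from them produce both consensus and disagreement.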

“Agreement Fraction: CON 0.97 ± 0.05 vs DIV 0.73 ± 0.12 (Income task, p<.001)”

Tsuchiya and Baba, Table 7
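
The review does not reproduce the paper's definition of Agreement Fraction. One plausible reading, assumed here purely for illustration, is the share of trials on which a participant's final decision matches the panel majority. The sketch below computes that quantity and labels each panel by its level of consensus. The function names, the label strings, and the toy trials are all hypothetical.

```python
from collections import Counter


def panel_majority(advice):
    """Most common recommendation in a panel (assumes an odd-sized binary panel)."""
    return Counter(advice).most_common(1)[0][0]


def consensus_label(advice):
    """Classify a panel by how many advisors dissent from the majority."""
    majority_size = Counter(advice).most_common(1)[0][1]
    dissenters = len(advice) - majority_size
    if dissenters == 0:
        return "unanimous"
    if dissenters == 1:
        return "single dissent"
    return "split"


def agreement_fraction(trials):
    """Assumed definition: share of trials where the final decision equals the panel majority."""
    agree = sum(final == panel_majority(advice) for advice, final in trials)
    return agree / len(trials)


# Toy trials: (panel advice, participant's final decision)
trials = [
    (["A", "A", "A"], "A"),
    (["A", "A", "B"], "A"),
    (["A", "B", "B"], "A"),
    (["B", "B", "B"], "B"),
]

print(agreement_fraction(trials))                    # 0.75
print(consensus_label(["A", "A", "A", "B", "B"]))    # split
```

Under this reading, a value near 1 in the consensus condition means participants almost always followed a unanimous panel, whether or not the panel was correct.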

Main concerns

The study suffers from significant external-validity limitations. It was conducted in Japanese with crowdworkers based primarily in Asia (mean age $44.5$), and although the authors acknowledge that 'conformity varies by culture,' they make broad design claims without cross-cultural validation. More critically, the 'AI advisors' were decision trees drawn from a random forest and presented through an LLM-generated explanation layer, not actual LLM agents. This abstraction misses phenomena that matter in real multi-LLM consultation, where models may have correlated errors, shared training data, or conversational dynamics. The human-likeness manipulation (Study 2) showed null effects on accuracy despite successful manipulation checks, suggesting either insufficient power ($n=88$ across three tasks) or that the chosen anthropomorphic cues (static photos, names) lack ecological validity compared with voice or interactive personas. The authors' claim that 'panel consensus can elicit informational conformity' rests on correlational evidence (Table 6), where $R \approx 0.42$ for the pressure-reliance relationships, so the direction of the effect remains unclear.

“The mean age was 44.5 years (SD=10.7). The sample consisted of 58% male and 42% female... conducted in Japanese, targeting residents of the Asian region.”

Tsuchiya and Baba, Sec. 3.5

“We simulated decision-making scenarios by presenting participants with individual decision trees sampled from a random forest as AI advisors... Explanations... were generated by incorporating feature attribution information into the LLM prompt.”

Tsuchiya and Baba, Sec. 3.3

“Participants were primarily online workers in Asia. Because conformity varies by culture, broader samples are needed for generalization.”

Tsuchiya and Baba, Sec. 6.7

Evidence and comparison

The paper adequately positions itself against prior work, contrasting its discrete-panel approach with Lu et al.'s sequential second-opinion paradigm and with Song et al.'s (2024) finding that informational conformity may not arise with AI panels. However, the comparison to Song et al. is somewhat strained: the current study uses accuracy-grounded tasks where correctness is verifiable, while Song et al. examined opinion change on societal issues without ground truth. The claim that 'simple statistical aggregation usually outperforms humans' deliberative selection' is supported by classic literature (Soll and Larrick, 2009), yet the paper does not empirically test whether participants would perform better with an explicit majority-vote aggregation than with the advice presentation used in the study. The relationship to the Asch conformity studies is well articulated, though the translation from perceptual line-judgment tasks to predictive inference merits deeper theoretical justification.

“Prior work also suggests that larger AI panels can trigger resistance or polarization, especially in discussions of value-sensitive social issues (Song et al., 2024).”

Tsuchiya and Baba, Sec. 2.1

“A landmark demonstration of this phenomenon is Asch's classic study. Even in a perceptually unambiguous task, when all surrounding peers provided the same incorrect response, participants conformed on average in 37% of the trials.”

Tsuchiya and Baba, Sec. 2.3
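
As a reference point for the majority-vote baseline discussed in this section, the sketch below computes how accurate a simple vote would be if every advisor were correct $70\%$ of the time and, as a strong assumption made only for this example, advisors erred independently. Trees drawn from the same forest are unlikely to be independent, so these numbers are an idealized benchmark and not a prediction for the study. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_advisors, p_correct):
    """Probability that a strict majority of independent advisors is correct.

    Assumes an odd panel size and identical, independent per-advisor accuracy.
    """
    if n_advisors % 2 == 0:
        raise ValueError("panel size must be odd so that a strict majority exists")
    k_min = n_advisors // 2 + 1  # smallest number of correct advisors that wins the vote
    return sum(
        comb(n_advisors, k) * p_correct**k * (1 - p_correct) ** (n_advisors - k)
        for k in range(k_min, n_advisors + 1)
    )


for n in (1, 3, 5):
    print(n, round(majority_vote_accuracy(n, 0.7), 3))
# 1 0.7
# 3 0.784
# 5 0.837
```

Under independence, the vote keeps improving from three to five advisors, whereas the reported human accuracy did not. Comparing participants against such a baseline, computed from the actual trees, would show how much of the available signal the panel presentation lets people use.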

Reproducibility

Reproducibility is moderately supported but incomplete. The authors specify that explanations were generated with GPT-4o via the OpenAI API and illustrate the prompt design (Figure 3), and the selection criterion for the Rashomon set is precise (exactly $70\%$ accuracy). However, the paper makes no mention of code, data, or stimuli availability: neither the decision tree models, the specific feature attributions, nor the full explanation-generation prompts are publicly archived. The attention-check methodology (selecting features highlighted by the AI) is described but not validated for efficacy. Hyperparameters for the random forest (number of trees, feature sampling) are unspecified. Without access to the exact tree structures and the complete GPT-4o prompt templates, exact reproduction of the advice stimuli would be impossible. The study protocol appears rigorous (three attention checks, comprehension quizzes), yet the lack of open materials significantly impedes replication efforts.

“We converted them into concise natural language explanations using GPT-4o via the OpenAI API... The prompt design for GPT-4o is illustrated in Figure 3.”

Tsuchiya and Baba, Sec. 3.4

“Participants were excluded if they (i) failed three attention checks and completed the tasks in less than the 5th percentile of total completion time... or (ii) showed mechanical response patterns.”

Tsuchiya and Baba, Sec. 3.5

Abstract

Just as people improve decision-making by consulting diverse human advisors, they can now also consult with multiple AI systems. Prior work on group decision-making shows that advice aggregation creates pressure to conform, leading to overreliance. However, the conditions under which multi-AI consultation improves or undermines human decision-making remain unclear. We conducted experiments with three tasks in which participants received advice from panels of AIs. We varied panel size, within-panel consensus, and the human-likeness of presentation. Accuracy improved for small panels relative to a single AI; larger panels yielded no gains. The level of within-panel consensus affected participants' reliance on AI advice: High consensus fostered overreliance; a single dissent reduced pressure to conform; wide disagreement created confusion and undermined appropriate reliance. Human-like presentations increased perceived usefulness and agency in certain tasks, without raising conformity pressure. These findings yield design implications for presenting multi-AI advice that preserve accuracy while mitigating conformity.
