Causal Evidence that Language Models use Confidence to Drive Behavior

cs.LG Dharshan Kumaran, Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean · Mar 23, 2026

AI Review

Plain-language introduction

This paper investigates whether large language models exhibit metacognitive control: specifically, whether they use internal confidence signals to guide abstention decisions (knowing when to answer and when to withhold a response). The authors develop a rigorous four-phase paradigm combining behavioral analysis, activation steering, and computational modeling to show that abstention arises from a two-stage confidence-decision pathway, in which a confidence representation is formed and a threshold-based policy then acts on it. Their findings suggest that LLMs deploy native confidence signals in a structured manner paralleling biological metacognition, with substantial implications for safe AI deployment.

Critical review

Verdict

Bottom line

The paper presents compelling causal evidence that LLMs use confidence to drive abstention behavior, though some limitations constrain the scope of the causal claims. The four-phase design elegantly isolates confidence effects from confounds like question difficulty, RAG scores, and surface semantic features. The activation steering experiments in Gemma 3 27B provide direct causal manipulation: injecting high-confidence vectors reduced abstention from 66.5% to 7.0%, while mediation analysis confirms confidence redistribution accounts for 67.1% of the total effect. The finding that standardized effect sizes for confidence are "approximately an order of magnitude larger" than alternatives ($|\beta_{std}| = 0.99$ vs. ~0.1 for RAG/difficulty/embeddings) strongly supports the claim that abstention is driven by metacognitive signals rather than superficial heuristics.

“confidence emerged as the dominant predictor of abstention behavior, with standardized effect sizes approximately an order of magnitude larger than alternative mechanisms based on knowledge retrieval accessibility (RAG scores), surface-level semantic features (sentence embeddings), or aggregate question difficulty”

paper · Abstract

“Confidence redistribution accounted for 67.1% of the total effect, while policy shifts accounted for 26.2%”

paper · Figure 5 caption
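The split between "confidence redistribution" and "policy shifts" quoted above can be illustrated with a simple additive decomposition. The sketch below is an assumption about how such a split can be computed, not the paper's mediation procedure: it treats the abstention rate as a sum over confidence bins of P(bin) × P(abstain | bin), and all bin probabilities are invented toy values.

```python
def decompose_abstention_shift(p_base, pi_base, p_steer, pi_steer):
    """Split a change in abstention rate into three additive parts.

    p_*  : probability mass in each confidence bin (each list sums to 1)
    pi_* : P(abstain | bin) under the baseline / steered condition
    """
    rate_base = sum(p * a for p, a in zip(p_base, pi_base))
    rate_steer = sum(p * a for p, a in zip(p_steer, pi_steer))
    total = rate_steer - rate_base

    # Confidence redistribution: move mass between bins, keep the baseline policy.
    redistribution = sum((ps - pb) * ab
                         for pb, ps, ab in zip(p_base, p_steer, pi_base))
    # Policy shift: keep the baseline bin masses, change P(abstain | bin).
    policy = sum(pb * (a_new - a_old)
                 for pb, a_old, a_new in zip(p_base, pi_base, pi_steer))
    # Whatever remains is the interaction of the two changes.
    interaction = total - redistribution - policy
    return {"total": total, "redistribution": redistribution,
            "policy": policy, "interaction": interaction}


# Toy example with three bins (low, mid, high confidence); values are hypothetical.
parts = decompose_abstention_shift(
    p_base=[0.6, 0.3, 0.1], pi_base=[0.9, 0.5, 0.1],
    p_steer=[0.1, 0.3, 0.6], pi_steer=[0.8, 0.3, 0.05],
)
shares = {k: v / parts["total"] for k, v in parts.items() if k != "total"}
```

Under this reading, the shares need not sum to 100% without the interaction term, which would be consistent with the reported 67.1% and 26.2% leaving a small remainder.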

What holds up

The experimental design is rigorous and multifaceted, and it successfully operationalizes metacognitive control in artificial systems. Phase 1 establishes uncontaminated confidence measures by eliciting confidence "in the absence of an abstention option"; Phases 2-4 then test natural thresholding, causal manipulation via activation steering, and instructed thresholds in turn. The logistic regression model is appropriate, with $\Pr(\text{abstain} = 1 \mid \text{Conf}, \text{Diff}) = \sigma(\beta_0 + \beta_C \text{Conf} + \beta_D \text{Diff})$ and variance inflation factors confirming no multicollinearity (all VIFs < 1.6). Replication across multiple models (GPT-4o, Gemma 3 27B, DeepSeek 671B, Qwen 80B) strengthens generalizability. The distinction between pre-decisional (Phase 1) and post-decisional (Phase 4) confidence signals, validated through the "bandness" index analysis, demonstrates theoretical sophistication.

“Phase 1 Chosen Confidence was used to model behavior in Phases 2 and 4 because it represents the model's pure confidence estimate uncontaminated by the presence of an abstention option”

paper · Section 3.4.1

“Variance inflation factors confirmed no multicollinearity among predictors (all VIFs < 1.6)”

paper · Section 3.4.5
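As a rough illustration of the regression and the VIF check described in this section, here is a minimal standard-library sketch. It fits the same functional form, sigma(b0 + bC*Conf + bD*Diff), by plain gradient ascent on synthetic data. The data, the "true" coefficients (-2.0 and 0.2), the learning rate, and the step count are all assumptions made for the example and are not taken from the paper.

```python
import math
import random


def standardize(xs):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / sd for x in xs]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(conf, diff, abstain, lr=0.5, steps=1000):
    """Maximize the mean log-likelihood by batch gradient ascent."""
    b0 = bc = bd = 0.0
    n = len(abstain)
    for _ in range(steps):
        g0 = gc = gd = 0.0
        for c, d, y in zip(conf, diff, abstain):
            err = y - sigmoid(b0 + bc * c + bd * d)
            g0 += err
            gc += err * c
            gd += err * d
        b0 += lr * g0 / n
        bc += lr * gc / n
        bd += lr * gd / n
    return b0, bc, bd


def vif_two_predictors(x1, x2):
    """With exactly two predictors, each VIF equals 1 / (1 - r^2)."""
    z1, z2 = standardize(x1), standardize(x2)
    r = sum(a * b for a, b in zip(z1, z2)) / len(z1)
    return 1.0 / (1.0 - r ** 2)


# Synthetic data: abstention depends strongly on confidence, weakly on difficulty.
random.seed(0)
conf = standardize([random.random() for _ in range(400)])
diff = standardize([random.random() for _ in range(400)])
abstain = [1 if random.random() < sigmoid(0.5 - 2.0 * c + 0.2 * d) else 0
           for c, d in zip(conf, diff)]

beta_0, beta_conf, beta_diff = fit_logistic(conf, diff, abstain)
vif = vif_two_predictors(conf, diff)
```

Because both predictors are standardized before fitting, the magnitudes of `beta_conf` and `beta_diff` are directly comparable, which is the sense in which the review compares standardized effect sizes.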

Main concerns

Several limitations warrant caution in interpreting the causal claims. First, the activation steering experiments (Phase 3) were conducted only on Gemma 3 27B due to API constraints, meaning the strongest causal evidence applies to just one open-weight model. Second, Phase 4 required extensive prompt engineering for Gemma 3 27B ("20 paraphrases of the GPT4o prompt") to elicit meaningful abstention behavior, suggesting findings may be sensitive to prompt formatting and model-specific instruction-following capabilities. Third, the claim that confidence effect sizes are "an order of magnitude larger" than alternatives relies on standardized coefficients from a model where predictors show some intercorrelation (RAG-difficulty correlation 0.034, embedding-difficulty up to 0.17), potentially inflating confidence's relative contribution despite VIFs < 1.6. Finally, the two-stage framework describes computational-level architecture but explicitly avoids claims about "specific neural or algorithmic implementation in transformer architectures," leaving unresolved whether confidence arises from first-order evidence accumulation or higher-order monitoring processes.

“The prompt used for GPT-4o proved ineffective for Gemma 3 27b, yielding abstention rates of below 5% until the 80% threshold... To establish robust abstention behavior, we generated 20 paraphrases of the GPT4o prompt”

paper · Section 3.4.4

“Linearity in the logit was assessed via binned plots... which showed adequate linearity. For Phase 4 analyses... standard errors may be slightly underestimated”

paper · Section 3.4.5

Evidence and comparison

The evidence strongly supports the core claims relative to alternative mechanisms. The comparison to RAG scores, sentence embeddings, and aggregate difficulty is comprehensive: likelihood ratio tests show that confidence adds substantial explanatory power beyond the alternatives ($|\Delta AIC| > 100$), while adding RAG scores or embeddings to a model that already includes confidence yields much smaller gains ($\Delta AIC \approx -1$ to $-21$). The paper appropriately distinguishes itself from prior ML work that relies on "post-hoc thresholding, or specialized fine-tuning procedures" (Wen et al., 2025; Chuang et al., 2024). The methodology aligns with established neuroscience frameworks (Kepecs et al., 2008; Fleming & Daw, 2017), successfully mapping the "confidence-decision pathway" onto biological two-stage models. However, the comparison would benefit from deeper engagement with recent self-consistency-based abstention methods that operate without explicit confidence calibration.

“Adding confidence to a model already containing RAG scores produced a highly significant improvement (LR $\chi^2(1) = 141.13, p < 10^{-32}$)... while embeddings provided a modest but significant contribution (LR $\chi^2(10) = 41.42, p < 10^{-5}, \Delta AIC = -21.4$)”

paper · Section 4.2

“While emerging research has demonstrated the use of confidence for abstention... these typically rely on post-hoc thresholding, or specialized fine-tuning procedures”

paper · Section 2
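The likelihood ratio statistics and AIC differences quoted in this section are linked by a standard identity for nested models: since AIC = 2k - 2 ln L, adding `extra_params` parameters changes AIC by 2 × extra_params minus the LR statistic. The short sketch below applies that identity to the two quoted tests; the function name is hypothetical, and the value computed for the confidence test is implied by the identity rather than quoted from the paper.

```python
def delta_aic(lr_chi2, extra_params):
    """AIC(full) - AIC(reduced) for nested models compared by an LR test.

    Negative values mean the larger model is preferred.
    """
    return 2 * extra_params - lr_chi2


# Embeddings added to a model containing confidence: LR chi2(10) = 41.42.
embeddings_gain = delta_aic(41.42, 10)

# Confidence added to a model containing RAG scores: LR chi2(1) = 141.13.
confidence_gain = delta_aic(141.13, 1)
```

The first value reproduces the reported $\Delta AIC = -21.4$ for embeddings up to rounding, and the second is consistent with a gain for confidence that exceeds 100 in magnitude.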

Reproducibility

Reproducibility is mixed. The paper uses publicly available models (Gemma 3 27B via JAX, DeepSeek/Qwen via Together.ai, GPT-4o via API) and the SimpleQA dataset, but critical implementation details are incomplete. The exact prompt templates are referenced as "see Methods" but not fully reproduced in the main text; activation steering code and the specific 1000-question subsets are not provided. Reproducing Phase 3 requires access to internal activations, feasible only for open-weight models like Gemma. Hyperparameters like the "3% of the residual norm" steering vector scaling are reported without sensitivity analysis. The reliance on proprietary APIs (GPT-4o) with potential version updates creates barriers for reproducing Phases 1, 2, and 4, particularly given that "20 paraphrases" were required to find an effective prompt for Gemma in Phase 4.

“By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift”

Turner et al., 2023 · arXiv:2308.10248

“CAA computes "steering vectors" by averaging the difference in residual stream activations between pairs of positive and negative examples”

Panickssery et al., 2023 · arXiv:2312.06681
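To make the steering procedure concrete, the sketch below follows the description quoted above: a steering vector is the average difference between residual-stream activations for paired positive and negative examples. How the "3% of the residual norm" scaling is applied is not specified in this review, so the `steer` function assumes one plausible reading (the added vector is rescaled so its norm is 3% of the current residual norm). The activations are toy three-dimensional lists, and all function names are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def steering_vector(pos_acts, neg_acts):
    """Average of (positive - negative) activation differences over pairs."""
    diffs = [[p - q for p, q in zip(pos, neg)]
             for pos, neg in zip(pos_acts, neg_acts)]
    n = len(diffs)
    return [sum(col) / n for col in zip(*diffs)]


def steer(residual, direction, fraction=0.03):
    """Add `direction`, rescaled so its norm is `fraction` of ||residual||.

    Assumption: this is one reading of "3% of the residual norm".
    """
    scale = fraction * norm(residual) / norm(direction)
    return [h + scale * d for h, d in zip(residual, direction)]


# Toy activations for two high-confidence / low-confidence example pairs.
high_conf = [[1.0, 2.0, 0.0], [3.0, 2.0, 1.0]]
low_conf = [[0.0, 2.0, 0.0], [1.0, 2.0, 0.0]]
direction = steering_vector(high_conf, low_conf)

residual = [3.0, 4.0, 0.0]
steered = steer(residual, direction)
shift = [s - h for s, h in zip(steered, residual)]
```

In a real model the same arithmetic would be applied to residual-stream tensors at a chosen layer during the forward pass; the choice of layer and of the scaling fraction are the hyperparameters that the review notes were reported without sensitivity analysis.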

Abstract

Metacognition -- the ability to assess one's own cognitive performance -- is documented across species, with internal confidence estimates serving as a key signal for adaptive behavior. While confidence can be extracted from Large Language Model (LLM) outputs, whether models actively use these signals to regulate behavior remains a fundamental question. We investigate this through a four-phase abstention paradigm. Phase 1 established internal confidence estimates in the absence of an abstention option. Phase 2 revealed that LLMs apply implicit thresholds to these estimates when deciding to answer or abstain. Confidence emerged as the dominant predictor of behavior, with effect sizes an order of magnitude larger than knowledge retrieval accessibility (RAG scores) or surface-level semantic features. Phase 3 provided causal evidence through activation steering: manipulating internal confidence signals correspondingly shifted abstention rates. Finally, Phase 4 demonstrated that models can systematically vary abstention policies based on instructed thresholds. Our findings indicate that abstention arises from the joint operation of internal confidence representations and threshold-based policies, mirroring the two-stage metacognitive control found in biological systems. This capacity is essential as LLMs transition into autonomous agents that must recognize their own uncertainty to decide when to act or seek help.
