Causal Evidence that Language Models use Confidence to Drive Behavior
This paper investigates whether large language models exhibit metacognitive control—specifically, whether they use internal confidence signals to guide abstention decisions (knowing when to answer versus withhold responses). The authors develop a rigorous four-phase paradigm combining behavioral analysis, activation steering, and computational modeling to demonstrate that abstention arises from a two-stage confidence-decision pathway involving confidence representation formation followed by threshold-based policy implementation. Their findings suggest that LLMs deploy native confidence signals in a structured manner paralleling biological metacognition, with substantial implications for safe AI deployment.
The paper presents compelling causal evidence that LLMs use confidence to drive abstention behavior, though some limitations constrain the scope of the causal claims. The four-phase design elegantly isolates confidence effects from confounds like question difficulty, RAG scores, and surface semantic features. The activation steering experiments in Gemma 3 27B provide direct causal manipulation: injecting high-confidence vectors reduced abstention from 66.5% to 7.0%, while mediation analysis confirms confidence redistribution accounts for 67.1% of the total effect. The finding that standardized effect sizes for confidence are "approximately an order of magnitude larger" than alternatives ($|\beta_{std}| = 0.99$ vs. ~0.1 for RAG/difficulty/embeddings) strongly supports the claim that abstention is driven by metacognitive signals rather than superficial heuristics.
The experimental design is rigorous and multifaceted, successfully operationalizing metacognitive control in artificial systems. Phase 1 establishes uncontaminated confidence measures by eliciting confidence "in the absence of an abstention option"; Phases 2-4 progressively test natural thresholding, causal manipulation via activation steering, and instructed thresholds. The computational modeling using logistic regression is appropriate, with $\Pr(\text{abstain} = 1 \mid \text{Conf}, \text{Diff}) = \sigma(\beta_0 + \beta_C \, \text{Conf} + \beta_D \, \text{Diff})$, where $\sigma$ is the logistic function, and variance inflation factors indicating no problematic multicollinearity (all VIFs < 1.6). The replication across multiple models (GPT-4o, Gemma 3 27B, DeepSeek 671B, Qwen 80B) strengthens generalizability, while the distinction between pre-decisional (Phase 1) and post-decisional (Phase 4) confidence signals, validated through the "bandness" index analysis, demonstrates theoretical sophistication.
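To make the logistic abstention model concrete, the following is a minimal sketch of how the predicted abstention probability responds to confidence and difficulty. The coefficient values are hypothetical placeholders chosen for illustration, not the paper's estimates, and the function names are not from the paper. The small VIF helper uses the standard two-predictor identity VIF = 1 / (1 - r^2); the paper's VIFs come from a model with more predictors, so this only illustrates the scale.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function sigma(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def p_abstain(conf: float, diff: float, b0: float, b_conf: float, b_diff: float) -> float:
    """Pr(abstain = 1 | Conf, Diff) = sigma(b0 + b_conf * Conf + b_diff * Diff)."""
    return sigmoid(b0 + b_conf * conf + b_diff * diff)


def vif_two_predictors(r: float) -> float:
    """VIF when a predictor is correlated at r with one other predictor."""
    return 1.0 / (1.0 - r * r)


# Hypothetical coefficients on standardized predictors (NOT the paper's fit):
# a large negative confidence weight and a small difficulty weight.
B0, B_CONF, B_DIFF = 0.0, -1.0, 0.1

# Confidence one standard deviation below vs. above the mean, average difficulty.
low_conf = p_abstain(conf=-1.0, diff=0.0, b0=B0, b_conf=B_CONF, b_diff=B_DIFF)
high_conf = p_abstain(conf=1.0, diff=0.0, b0=B0, b_conf=B_CONF, b_diff=B_DIFF)

print(f"P(abstain | low confidence)  = {low_conf:.3f}")
print(f"P(abstain | high confidence) = {high_conf:.3f}")
print(f"VIF at r = 0.17              = {vif_two_predictors(0.17):.3f}")
```

With these placeholder values, lower confidence yields a higher abstention probability, which is the qualitative pattern the review attributes to Phase 2.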
Several limitations warrant caution in interpreting the causal claims. First, the activation steering experiments (Phase 3) were conducted only on Gemma 3 27B due to API constraints, meaning the strongest causal evidence applies to just one open-weight model. Second, Phase 4 required extensive prompt engineering for Gemma 3 27B ("20 paraphrases of the GPT4o prompt") to elicit meaningful abstention behavior, suggesting findings may be sensitive to prompt formatting and model-specific instruction-following capabilities. Third, the claim that confidence effect sizes are "an order of magnitude larger" than alternatives relies on standardized coefficients from a model where predictors show some intercorrelation (RAG-difficulty correlation 0.034, embedding-difficulty up to 0.17), potentially inflating confidence's relative contribution despite VIFs < 1.6. Finally, the two-stage framework describes computational-level architecture but explicitly avoids claims about "specific neural or algorithmic implementation in transformer architectures," leaving unresolved whether confidence arises from first-order evidence accumulation or higher-order monitoring processes.
The evidence strongly supports the core claims relative to alternative mechanisms. The comparison to RAG scores, sentence embeddings, and aggregate difficulty is comprehensive: likelihood ratio tests show confidence adds substantial explanatory power beyond alternatives ($\Delta AIC > 100$), while adding RAG or embeddings to a confidence-inclusive model yields minimal gains ($\Delta AIC \approx -1$ to $-21$). The paper appropriately distinguishes itself from prior ML work that relies on "post-hoc thresholding, or specialized fine-tuning procedures" (Wen et al., 2025; Chuang et al., 2024). The methodology aligns with established neuroscience frameworks (Kepecs et al., 2008; Fleming & Daw, 2017), successfully mapping the "confidence-decision pathway" onto biological two-stage models. However, the comparison might benefit from deeper engagement with recent self-consistency-based abstention methods that operate without explicit confidence calibration.
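The model comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the log-likelihoods and parameter counts are invented for the example, and the sign convention (reduced-model AIC minus full-model AIC, so that a positive value favors the full model) is an assumption, since the review does not state which convention the paper uses.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: AIC = 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood


def lr_statistic(ll_reduced: float, ll_full: float) -> float:
    """Likelihood ratio statistic for nested models: 2 (ln L_full - ln L_reduced)."""
    return 2.0 * (ll_full - ll_reduced)


# Hypothetical fits (NOT values from the paper):
# reduced model = intercept + difficulty; full model adds confidence.
LL_REDUCED, K_REDUCED = -650.0, 2
LL_FULL, K_FULL = -590.0, 3

# Assumed convention: positive delta means the full model is preferred.
delta_aic = aic(LL_REDUCED, K_REDUCED) - aic(LL_FULL, K_FULL)
lr = lr_statistic(LL_REDUCED, LL_FULL)

print(f"Delta AIC    = {delta_aic:.1f}")
print(f"LR statistic = {lr:.1f}")
```

Under this convention the two quantities are linked by Delta AIC = LR - 2 (k_full - k_reduced), so a large likelihood ratio statistic for a single added predictor implies a correspondingly large Delta AIC.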
Reproducibility is mixed. The paper uses publicly available models (Gemma 3 27B via JAX, DeepSeek/Qwen via Together.ai, GPT-4o via API) and the SimpleQA dataset, but critical implementation details are incomplete. The exact prompt templates are referenced as "see Methods" but not fully reproduced in the main text; activation steering code and the specific 1000-question subsets are not provided. Reproducing Phase 3 requires access to internal activations, feasible only for open-weight models like Gemma. Hyperparameters like the "3% of the residual norm" steering vector scaling are reported without sensitivity analysis. The reliance on proprietary APIs (GPT-4o) with potential version updates creates barriers for reproducing Phases 1, 2, and 4, particularly given that "20 paraphrases" were required to find an effective prompt for Gemma in Phase 4.
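For readers unfamiliar with the "3% of the residual norm" hyperparameter, the sketch below shows one plausible reading of it: a steering direction is rescaled so that the added perturbation has an L2 norm equal to a fixed fraction of the residual activation's norm. This is an assumption about the scaling rule, not the authors' implementation; the layer, token positions, and the way the confidence direction is derived are not specified in this review, and the vectors here are toy values.

```python
import math
from typing import List


def l2_norm(v: List[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def steer(residual: List[float], direction: List[float], fraction: float = 0.03) -> List[float]:
    """Add `direction`, rescaled to `fraction` of the residual's L2 norm.

    Assumed rule: h' = h + fraction * ||h|| * d / ||d||.
    """
    if len(residual) != len(direction):
        raise ValueError("residual and direction must have the same length")
    d_norm = l2_norm(direction)
    if d_norm == 0.0:
        raise ValueError("steering direction must be non-zero")
    scale = fraction * l2_norm(residual) / d_norm
    return [h + scale * d for h, d in zip(residual, direction)]


# Toy residual activation and hypothetical "high-confidence" direction.
residual = [3.0, 4.0]      # norm 5.0
direction = [0.0, 2.0]     # arbitrary length; only its orientation matters

steered = steer(residual, direction, fraction=0.03)
perturbation = [s - h for s, h in zip(steered, residual)]

print("steered activation:", steered)
print("perturbation norm / residual norm:", l2_norm(perturbation) / l2_norm(residual))
```

A sensitivity analysis of the kind the review asks for would amount to sweeping `fraction` and recording the abstention rate at each setting.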
Metacognition -- the ability to assess one's own cognitive performance -- is documented across species, with internal confidence estimates serving as a key signal for adaptive behavior. While confidence can be extracted from Large Language Model (LLM) outputs, whether models actively use these signals to regulate behavior remains a fundamental question. We investigate this through a four-phase abstention paradigm. Phase 1 established internal confidence estimates in the absence of an abstention option. Phase 2 revealed that LLMs apply implicit thresholds to these estimates when deciding to answer or abstain. Confidence emerged as the dominant predictor of behavior, with effect sizes an order of magnitude larger than knowledge retrieval accessibility (RAG scores) or surface-level semantic features. Phase 3 provided causal evidence through activation steering: manipulating internal confidence signals correspondingly shifted abstention rates. Finally, Phase 4 demonstrated that models can systematically vary abstention policies based on instructed thresholds. Our findings indicate that abstention arises from the joint operation of internal confidence representations and threshold-based policies, mirroring the two-stage metacognitive control found in biological systems. This capacity is essential as LLMs transition into autonomous agents that must recognize their own uncertainty to decide when to act or seek help.