Entropy Alone is Insufficient for Safe Selective Prediction in LLMs

cs.CL Edward Phillips, Fredrik K. Gustafsson, Sean Wu, Anshul Thakur, David A. Clifton · Mar 22, 2026
Plain-language introduction

Selective prediction systems for LLMs abstain from answering when the model is uncertain, which reduces the harm hallucinations can cause in high-stakes domains. This paper identifies a critical failure mode of entropy-based uncertainty quantification: the 'confidently wrong' regime where models produce low-entropy hallucinations. The authors propose combining entropy signals with correctness probes using logistic regression, and advocate deployment-facing metrics (E-AURC and TCE) over AUROC, so that systems can be shown to operate reliably at strict safety thresholds.

Critical review
Verdict
Bottom line

The paper demonstrates that entropy-based uncertainty methods are insufficient on their own for safe selective prediction, identifying a model-dependent 'confidently wrong' failure mode where semantic entropy collapses to near-zero for both correct and hallucinated answers. The proposed combination with correctness probes generally improves the risk–coverage trade-off (E-AURC) and calibration (TCE) on TriviaQA, BioASQ, and MedicalQA across four model families. However, the evaluation is limited to short-form QA tasks and 3B–8B parameter models, leaving generalization to long-form generation and frontier-scale systems uncertain.

“A substantial fraction of questions are assigned near-zero semantic entropy, collapsing to a vertical line. This cluster includes both correct answers and hallucinations, the latter being confidently wrong instances”
Phillips et al., Section 3.2
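
To make the collapse concrete, the following is a minimal Python sketch of a discrete semantic entropy computed from meaning-cluster labels. It assumes the sampled answers have already been grouped by meaning. The function name, the toy answers, and the count-based estimator are illustrative assumptions rather than details taken from the paper.

```python
import math
from collections import Counter

def discrete_semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters.

    cluster_ids holds one label per sampled answer; answers judged to share
    a meaning carry the same label. The clustering step is assumed done.
    """
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical samples: the model repeats a single answer ten times.
# The entropy is zero whether that answer is right or wrong.
confident_correct = ["paris"] * 10
confident_wrong = ["lyon"] * 10
mixed = ["paris"] * 5 + ["lyon"] * 3 + ["nice"] * 2

se_correct = discrete_semantic_entropy(confident_correct)
se_wrong = discrete_semantic_entropy(confident_wrong)
se_mixed = discrete_semantic_entropy(mixed)
```

Because the score depends only on how the samples spread across clusters, a model that consistently repeats a hallucination is indistinguishable from one that consistently repeats the truth.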
What holds up

The identification of the 'confidently wrong' regime is compellingly demonstrated for the Qwen model, where many answers 'collapse to SE=0, yet receive distinct PC probe scores' (Figure 2). The moderate Spearman correlation ($\rho_s \in [0.34, 0.54]$) between entropy and correctness signals (Table 1) supports the claim that these methods capture complementary aspects of uncertainty. The critique of AUROC is well-founded: Figure 3 shows entropy-based methods hitting a 'risk floor where their RC curves fail to enter a high-trust regime at non-trivial coverage', causing sharp calibration divergence at strict targets ($\alpha \leq 0.15$). The combined method reaches a TCE as low as $0.029$, versus $0.036$ for SE alone (Table A3), which supports the approach.

“Many answers, both correct and hallucinated, collapse to SE=0, yet receive distinct PC probe scores”
Phillips et al., Sec. 3.2 · Figure 2 caption
“Entropy-based methods exhibit a risk floor where their RC curves fail to enter a high-trust regime at non-trivial coverage”
Phillips et al., Section 3.4
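
The risk floor can be illustrated with a short sketch of a risk–coverage curve and an excess AURC. It assumes the common definition of E-AURC as the AURC of the method minus the AURC of an oracle that ranks every correct answer first; the paper's exact definition may differ. All scores and labels below are hypothetical toy values.

```python
def risk_coverage_curve(confidence, correct):
    """Return (coverage, risk) points, answering most-confident items first."""
    order = sorted(range(len(confidence)), key=lambda i: -confidence[i])
    points, errors = [], 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        points.append((k / len(order), errors / k))
    return points

def aurc(confidence, correct):
    """Mean selective risk over all coverage levels."""
    curve = risk_coverage_curve(confidence, correct)
    return sum(risk for _, risk in curve) / len(curve)

def excess_aurc(confidence, correct):
    """AURC minus the AURC of an oracle that ranks correct answers first."""
    oracle = [1.0 if c else 0.0 for c in correct]
    return aurc(confidence, correct) - aurc(oracle, correct)

# Hypothetical data: 1 = correct, 0 = hallucinated.
correct = [1, 0, 1, 0, 1, 1, 0, 1]
# Entropy-only confidence is the negated SE; the lowest-entropy item is wrong.
semantic_entropy = [0.01, 0.0, 0.02, 0.03, 0.2, 0.5, 0.9, 1.1]
entropy_conf = [-s for s in semantic_entropy]
# Hypothetical combined scores that separate the near-zero-entropy items.
combined_conf = [0.9, 0.2, 0.8, 0.65, 0.7, 0.6, 0.1, 0.5]

e_aurc_entropy = excess_aurc(entropy_conf, correct)
e_aurc_combined = excess_aurc(combined_conf, correct)
min_risk_entropy = min(r for _, r in risk_coverage_curve(entropy_conf, correct))
```

In this toy case the entropy-only curve never drops below a risk of one third at any coverage, because a hallucination sits at the top of its ranking.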
Main concerns

The work is limited to short-form QA where correctness is judged against reference strings, raising concerns about whether the 'confidently wrong' failure mode generalizes to 'long-form generation, multi-step reasoning, or open-ended tasks where hallucinations can be harder to define' (Limitations). The evaluation covers only 3B–8B models; frontier models may exhibit different uncertainty characteristics. The correctness probe requires supervised training on labelled examples, creating 'an annotation dependency that may limit applicability in low-resource settings' (Limitations). Finally, the combiner is trained on the same split used to fit the PC probe, introducing 'a mild optimistic bias, mitigated but not eliminated by strong regularization' ($C=0.1$) (Limitations).

“Our experiments are limited to short-form QA tasks, where answers can be judged as hallucination relative to reference strings”
Phillips et al., Limitations section
“Our correctness probe requires supervised training on labelled examples, introducing an annotation dependency”
Phillips et al., Limitations section
“the combiner is trained on the same split used to fit the PC probe, meaning it sees probe predictions that were not held out; this introduces a mild optimistic bias”
Phillips et al., Limitations section
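
For readers unfamiliar with the combiner, here is a minimal pure-Python sketch of an L2-regularised logistic regression over two features, semantic entropy and a probe score. It assumes the usual parameterisation in which $C$ is the inverse regularisation strength. The helper names, the gradient-descent settings, and the eight toy rows are hypothetical and are not the authors' implementation or data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, C=0.1, lr=0.5, steps=2000):
    """Gradient descent on 0.5*||w||^2 + C*sum(log-loss); bias is unpenalised."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = list(w), 0.0  # the gradient of the L2 term is w itself
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for j in range(d):
                gw[j] += C * (p - yi) * xi[j]
            gb += C * (p - yi)
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def combined_score(w, b, x):
    """Estimated probability that the answer is correct."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical rows: [semantic_entropy, probe_score]; label 1 = correct.
X = [[0.0, 0.9], [0.0, 0.2], [0.0, 0.8], [0.0, 0.3],
     [0.2, 0.7], [0.5, 0.6], [0.9, 0.3], [1.1, 0.1]]
y = [1, 0, 1, 0, 1, 1, 0, 0]
w, b = fit_logistic(X, y)

# Two answers with identical SE = 0 now receive different scores.
score_probe_high = combined_score(w, b, [0.0, 0.9])
score_probe_low = combined_score(w, b, [0.0, 0.2])
```

With $C=0.1$ the fitted weights stay small, which is the sense in which strong regularisation limits how much the combiner can exploit in-sample probe predictions.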
Evidence and comparison

The evidence supports the core claims within the evaluated scope. Table 1 shows that the efficacy of entropy methods is highly model-dependent: SE outperforms the PC probe for Llama and Ministral, while the PC probe dominates for Qwen. Figure 1 shows that combining signals generally improves AUROC, AUPRC, E-AURC, and TCE across datasets, though the exception on MedicalQA (where 'neither the PC probe alone nor the combination reliably improves TCE') reveals limits when 'base model accuracy is low'. The comparison to prior work is fair: the authors acknowledge Xiong et al.'s finding that entropy outperforms probes on knowledge tasks, and extend it to show that combination helps in selective prediction when measured with metrics that reflect 'whether a system can be trusted to operate at a stated risk level' (Section 4).

“The clearest exception is SE on MedicalQA, where neither the PC probe alone nor the combination reliably improves TCE”
Phillips et al., Section 3.3
“metrics such as AURC and TCE better reflect whether a system can be trusted to operate at a stated risk level”
Phillips et al., Section 4
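
What "operating at a stated risk level" means in practice can be sketched as follows: pick an abstention threshold on a calibration split so that the error rate among answered questions stays at or below a target $\alpha$, then check the realized risk on a test split. This procedure is an illustrative assumption and is not the paper's definition of TCE. The function names and all values are hypothetical, and the sketch assumes distinct confidence values.

```python
def pick_threshold(cal_conf, cal_correct, alpha):
    """Lowest threshold whose calibration-set risk is <= alpha.

    Returns None when no threshold meets the target, meaning the score
    cannot reach the requested risk at any non-zero coverage.
    """
    order = sorted(range(len(cal_conf)), key=lambda i: -cal_conf[i])
    best, errors = None, 0
    for k, i in enumerate(order, start=1):
        errors += 0 if cal_correct[i] else 1
        if errors / k <= alpha:
            best = cal_conf[i]
    return best

def realized_risk(conf, correct, threshold):
    """Return (risk, coverage) over the items answered at this threshold."""
    answered = [c for s, c in zip(conf, correct) if s >= threshold]
    if not answered:
        return 0.0, 0.0
    risk = sum(1 for c in answered if not c) / len(answered)
    return risk, len(answered) / len(conf)

alpha = 0.15
cal_correct = [1, 0, 1, 0, 1, 1, 0, 1]
# Entropy-only confidence (negated SE): the most confident item is wrong.
cal_entropy_conf = [-s for s in [0.01, 0.0, 0.02, 0.03, 0.2, 0.5, 0.9, 1.1]]
cal_combined_conf = [0.9, 0.2, 0.8, 0.65, 0.7, 0.6, 0.1, 0.5]

thr_entropy = pick_threshold(cal_entropy_conf, cal_correct, alpha)
thr_combined = pick_threshold(cal_combined_conf, cal_correct, alpha)

# Hypothetical test split scored with the combined method.
test_conf = [0.95, 0.75, 0.4, 0.72]
test_correct = [1, 1, 0, 1]
test_risk, test_coverage = realized_risk(test_conf, test_correct, thr_combined)
```

In this toy setup the entropy-only score yields no valid threshold at $\alpha = 0.15$, which is the risk-floor behaviour in miniature, while the combined score admits a threshold that answers three of the four test questions.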
Reproducibility

The paper provides detailed methodological descriptions including hyperparameters (L2 regularization $C=0.1$; token position TBG, i.e. token before generating; 70:15:15 splits) and specifies exact model versions (Ministral-8B-Instruct-2410, Llama-3.2-3B-Instruct, Qwen3-4B-Instruct-2507, Gemma-3-4B) (Appendix A). However, code availability is not explicitly stated in the main text or acknowledgments. Reproduction would require access to the LLM-as-judge pipeline for correctness labels following Kossen et al., which introduces potential variability. The 200 bootstrap iterations used for confidence intervals give a measure of statistical uncertainty, though exact reproduction would also need the same combiner training split, since the optimistic bias noted in the Limitations depends on it.

“Both the PC probe and the SEP are logistic regression classifiers (L2 regularisation, C=0.1)”
Phillips et al., Appendix A
“examples are split deterministically into train, calibration, and test sets in a 70:15:15 ratio”
Phillips et al., Appendix A
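
A reproduction could start from something like the sketch below, which shows a deterministic 70:15:15 split and a 200-iteration bootstrap interval. It assumes the split is an in-order slice and that the interval is a 95% percentile interval. The paper states the ratio and the iteration count but not these details, so the function names, the seed, and the toy labels are hypothetical.

```python
import random

def deterministic_split(items, percents=(70, 15, 15)):
    """Slice in order, without shuffling, so reruns give identical sets."""
    n = len(items)
    n_train = n * percents[0] // 100
    n_cal = n * percents[1] // 100
    train = items[:n_train]
    cal = items[n_train:n_train + n_cal]
    test = items[n_train + n_cal:]
    return train, cal, test

def accuracy(labels):
    return sum(labels) / len(labels)

def bootstrap_ci(labels, metric, n_boot=200, seed=0):
    """95% percentile interval of metric over resampled copies of labels."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(labels)
    stats = sorted(
        metric([labels[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[n_boot * 25 // 1000]        # 2.5th percentile
    hi = stats[n_boot * 975 // 1000 - 1]   # 97.5th percentile
    return lo, hi

train, cal, test = deterministic_split(list(range(100)))

# Hypothetical test labels: 70 correct answers, 30 hallucinations.
test_labels = [1] * 70 + [0] * 30
ci_low, ci_high = bootstrap_ci(test_labels, accuracy)
```

The same resampling loop applies to any metric computed on the test split; accuracy is used here only to keep the example short.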
Abstract

Selective prediction systems can mitigate harms resulting from language model hallucinations by abstaining from answering in high-risk cases. Uncertainty quantification techniques are often employed to identify such cases, but are rarely evaluated in the context of the wider selective prediction policy and its ability to operate at low target error rates. We identify a model-dependent failure mode of entropy-based uncertainty methods that leads to unreliable abstention behaviour, and address it by combining entropy scores with a correctness probe signal. We find that across three QA benchmarks (TriviaQA, BioASQ, MedicalQA) and four model families, the combined score generally improves both the risk–coverage trade-off and calibration performance relative to entropy-only baselines. Our results highlight the importance of deployment-facing evaluation of uncertainty methods, using metrics that directly reflect whether a system can be trusted to operate at a stated risk level.
