Entropy Alone is Insufficient for Safe Selective Prediction in LLMs
Selective prediction systems in LLMs abstain from answering uncertain questions to mitigate hallucination harms in high-stakes domains. This paper identifies a critical failure mode of entropy-based uncertainty quantification: the 'confidently wrong' regime, in which models produce low-entropy hallucinations. The authors propose combining entropy signals with correctness probes using logistic regression, and advocate deployment-facing metrics (E-AURC and TCE) over AUROC, so that a system can be shown to operate reliably at strict safety thresholds.
The paper demonstrates that entropy-based uncertainty methods are insufficient on their own for safe selective prediction, identifying a model-dependent 'confidently wrong' failure mode where semantic entropy collapses to near-zero for both correct and hallucinated answers. The proposed combination with correctness probes generally improves the risk–coverage trade-off (E-AURC) and calibration (TCE) on TriviaQA, BioASQ, and MedicalQA across four model families. However, the evaluation is limited to short-form QA tasks and 3B–8B parameter models, leaving uncertainty about generalization to long-form generation and frontier-scale systems.
The identification of the 'confidently wrong' regime is compellingly demonstrated for the Qwen model, where many answers 'collapse to SE=0, yet receive distinct PC probe scores' (Figure 2). The moderate Spearman correlation ($\rho_s \in [0.34, 0.54]$) between entropy and correctness signals (Table 1) supports the claim that these methods capture complementary aspects of uncertainty. The critique of AUROC is well-founded: Figure 3 shows entropy-based methods hitting a 'risk floor where their RC curves fail to enter a high-trust regime at non-trivial coverage', causing sharp calibration divergence at strict targets ($\alpha \leq 0.15$). The combined method achieves TCE as low as $0.029$ versus $0.036$ for SE alone (Table A3), validating the approach.
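The 'risk floor' argument can be made concrete with a small risk–coverage computation. The following is a minimal sketch, not the paper's implementation: it assumes the common definition of E-AURC as the AURC of a confidence ranking minus the AURC of an oracle ranking that places all correct answers first, and the function names and toy data are hypothetical.

```python
def risk_coverage_curve(confidences, correct):
    """Return (coverage, risk) pairs when answering the k most confident items, k = 1..n."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve, errors = [], 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        curve.append((k / len(order), errors / k))
    return curve


def aurc(confidences, correct):
    """Area under the risk-coverage curve, as the mean risk over all coverage levels."""
    curve = risk_coverage_curve(confidences, correct)
    return sum(risk for _, risk in curve) / len(curve)


def e_aurc(confidences, correct):
    """Excess AURC over an oracle that ranks every correct answer first (assumed definition)."""
    oracle = [1.0 if c else 0.0 for c in correct]
    return aurc(confidences, correct) - aurc(oracle, correct)


def max_coverage_at_risk(confidences, correct, alpha):
    """Largest coverage whose selective risk stays at or below the target alpha."""
    feasible = [cov for cov, risk in risk_coverage_curve(confidences, correct) if risk <= alpha]
    return max(feasible, default=0.0)


# Toy data (hypothetical): the second most confident answer is a hallucination.
conf = [0.9, 0.8, 0.7, 0.6]
is_correct = [True, False, True, True]
print(aurc(conf, is_correct), e_aurc(conf, is_correct))
print(max_coverage_at_risk(conf, is_correct, alpha=0.15))
```

In this toy case a single confidently wrong answer caps coverage at 0.25 for a target risk of $\alpha = 0.15$, even though three of the four answers are correct. This is the kind of behaviour a ranking metric such as AUROC can hide.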
The work is limited to short-form QA where correctness is judged against reference strings, raising concerns about whether the 'confidently wrong' failure mode generalizes to 'long-form generation, multi-step reasoning, or open-ended tasks where hallucinations can be harder to define' (Limitations). The evaluation covers only 3B–8B models; frontier models may exhibit different uncertainty characteristics. The correctness probe requires supervised training on labelled examples, creating 'an annotation dependency that may limit applicability in low-resource settings' (Limitations). Finally, the combiner is trained on the same split used to fit the PC probe, introducing 'a mild optimistic bias, mitigated but not eliminated by strong regularization' ($C=0.1$) (Limitations).
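For readers unfamiliar with the combiner, the sketch below fits a two-feature logistic regression on an entropy score and a probe score using plain gradient descent. It is an illustration under stated assumptions rather than the authors' code: `fit_combiner` and `combined_score` are hypothetical names, the toy data are invented, and $C$ is interpreted in the scikit-learn sense (inverse regularization strength, so $C=0.1$ is a strong L2 penalty), which the paper may or may not follow.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_combiner(entropy, probe, labels, C=0.1, lr=0.1, steps=5000):
    """Fit p(correct) = sigmoid(w_e * entropy + w_p * probe + b) with an L2 penalty on the weights."""
    n = len(labels)
    w_e = w_p = b = 0.0
    lam = 1.0 / (C * n)  # assumed scikit-learn-style C: smaller C means a stronger penalty
    for _ in range(steps):
        g_e = g_p = g_b = 0.0
        for e, p, y in zip(entropy, probe, labels):
            err = sigmoid(w_e * e + w_p * p + b) - y  # gradient of the log loss w.r.t. the logit
            g_e += err * e
            g_p += err * p
            g_b += err
        w_e -= lr * (g_e / n + lam * w_e)
        w_p -= lr * (g_p / n + lam * w_p)
        b -= lr * g_b / n  # the intercept is left unpenalized
    return w_e, w_p, b


def combined_score(params, e, p):
    """Estimated probability that an answer is correct, given both signals."""
    w_e, w_p, b = params
    return sigmoid(w_e * e + w_p * p + b)


# Toy data (hypothetical): several answers have entropy 0 yet differ in correctness.
entropy = [0.0, 0.0, 0.0, 0.0, 1.2, 1.0, 0.1, 0.9]
probe = [0.9, 0.2, 0.8, 0.1, 0.3, 0.2, 0.85, 0.4]
labels = [1, 0, 1, 0, 0, 0, 1, 0]
params = fit_combiner(entropy, probe, labels)
print(combined_score(params, 0.0, 0.9) > combined_score(params, 0.0, 0.1))
```

When entropy collapses to zero for both correct and hallucinated answers, the entropy term contributes nothing and the probe term alone separates them, which is the complementarity the paper relies on. Fitting the combiner on data the probe has already seen, as the Limitations section notes, is what introduces the optimistic bias.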
The evidence supports the core claims within the evaluated scope. Table 1 demonstrates that entropy method efficacy is highly model-dependent: SE outperforms the PC probe for Llama and Ministral, while the PC probe dominates for Qwen. Figure 1 shows that combining signals generally improves AUROC, AUPRC, E-AURC, and TCE across datasets, though the exception on MedicalQA (where 'neither the PC probe alone nor the combination reliably improves TCE') reveals limits when 'base model accuracy is low'. The comparison to prior work is fair: the authors acknowledge Xiong et al.'s finding that entropy outperforms probes on knowledge tasks, and extend it by showing that combination helps in selective prediction contexts when evaluated with metrics that reflect 'whether a system can be trusted to operate at a stated risk level' (Section 4).
The paper provides detailed methodological descriptions including hyperparameters (L2 regularization $C=0.1$, token position TBG, 70:15:15 splits) and specifies exact model versions (Ministral-8B-Instruct-2410, Llama-3.2-3B-Instruct, Qwen3-4B-Instruct-2507, Gemma-3-4B) (Appendix A). However, code availability is not explicitly stated in the main text or acknowledgments. Reproduction would require access to the LLM-as-judge pipeline for correctness labels following Kossen et al., which introduces potential variability. The use of 200 bootstrap iterations for confidence intervals provides statistical robustness, though the optimistic bias from combiner training requires careful attention for exact reproduction.
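The review does not state which bootstrap variant the paper uses, so the following is only a sketch of one plausible reading: a percentile bootstrap over test examples with 200 resamples. The helper names, the example metric, and the toy data are all hypothetical.

```python
import random


def risk_at_half_coverage(scores, correct):
    """Error rate among the most confident half of the examples (illustrative metric)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    kept = order[: max(1, len(order) // 2)]
    return sum(0 if correct[i] else 1 for i in kept) / len(kept)


def bootstrap_ci(metric, scores, correct, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap interval for metric(scores, correct), resampling examples with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([scores[i] for i in idx], [correct[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * (n_boot - 1))]
    hi = stats[int((1 + level) / 2 * (n_boot - 1))]
    return lo, hi


# Toy data (hypothetical confidence scores and correctness labels).
scores = [0.95, 0.9, 0.85, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
correct = [True, True, False, True, True, False, True, False, False, False]
print(bootstrap_ci(risk_at_half_coverage, scores, correct))
```

Fixing the seed makes the interval reproducible. Note that resampling only the test set does not account for variability from the LLM-as-judge labels or from refitting the probe and combiner.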
Selective prediction systems can mitigate harms resulting from language model hallucinations by abstaining from answering in high-risk cases. Uncertainty quantification techniques are often employed to identify such cases, but are rarely evaluated in the context of the wider selective prediction policy and its ability to operate at low target error rates. We identify a model-dependent failure mode of entropy-based uncertainty methods that leads to unreliable abstention behaviour, and address it by combining entropy scores with a correctness probe signal. We find that across three QA benchmarks (TriviaQA, BioASQ, MedicalQA) and four model families, the combined score generally improves both the risk–coverage trade-off and calibration performance relative to entropy-only baselines. Our results highlight the importance of deployment-facing evaluation of uncertainty methods, using metrics that directly reflect whether a system can be trusted to operate at a stated risk level.