Deterministic Hallucination Detection in Medical VQA via Confidence-Evidence Bayesian Gain

cs.AI Mohammad Asadi, Tahoura Nedaee, Jack W. O'Sullivan, Euan Ashley, Ehsan Adeli · Mar 23, 2026

Plain-language introduction

The paper proposes CEBaG (Confidence-Evidence Bayesian Gain), a deterministic hallucination detection method for medical Visual Question Answering that eliminates the need for costly stochastic sampling. By combining token-level predictive variance with visual evidence magnitude derived from log-probabilities, the method flags responses that are likely to contradict the input image. It achieves the highest AUC in 13 of 16 settings while reducing the cost from the 10 to 20 stochastic generations required by sampling-based methods such as Semantic Entropy (SE) and Vision-Amplified Semantic Entropy (VASE) to one generation plus two scoring passes, addressing a critical safety bottleneck in clinical AI deployment.

Critical review

Verdict

Bottom line

The paper presents a well-motivated solution to a critical safety problem in medical AI. The core insight, that hallucinations manifest as inconsistent token-level confidence and weak visual grounding, both directly accessible in log-probabilities, is empirically validated across 16 experimental settings. CEBaG's deterministic formulation $\sigma \cdot (1+E)$, with evidence magnitude $E = |G|/L$, requires no task-specific hyperparameters, external models, or stochastic sampling, yet outperforms Vision-Amplified Semantic Entropy (VASE) by an average of 8 AUC points. However, the reliance on white-box access to model internals limits applicability to proprietary APIs, and the use of the GREEN model for automated ground-truth labeling introduces potential evaluation bias that remains unquantified.

“The resulting score, $\sigma \cdot (1+|G|/L)$, is parameter-free, requires only one generation and two scoring passes, uses no external models, and is fully deterministic.”

paper · Section 2.2
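
To make the quoted formula concrete, the following is a minimal Python sketch of how such a score could be computed from two lists of per-token log-probabilities for the same generated answer, one scored with the image and one without. It is not the authors' implementation. The function name `cebag_score`, the use of the population standard deviation for $\sigma$, and the definition of $G$ as the summed per-token log-probability shift are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cebag_score(logp_with_image: Sequence[float],
                logp_text_only: Sequence[float]) -> float:
    """Illustrative sigma * (1 + |G| / L) score for one generated answer.

    Both inputs hold per-token log-probabilities of the SAME answer tokens,
    scored once with the image and once without it (hypothetical interface).
    """
    if len(logp_with_image) != len(logp_text_only):
        raise ValueError("both passes must score the same tokens")
    L = len(logp_with_image)
    if L == 0:
        raise ValueError("empty response")

    # Assumption: sigma is the population standard deviation of the
    # image-conditioned token log-probabilities.
    mean = sum(logp_with_image) / L
    sigma = math.sqrt(sum((x - mean) ** 2 for x in logp_with_image) / L)

    # Assumption: G sums the per-token log-probability shift caused by the image.
    gain = sum(a - b for a, b in zip(logp_with_image, logp_text_only))
    evidence = abs(gain) / L

    return sigma * (1.0 + evidence)


# Two tokens with log-probs -1 and -3 (sigma = 1); the image raises each by 1,
# so G = 2, |G| / L = 1, and the score is 1 * (1 + 1).
print(cebag_score([-1.0, -3.0], [-2.0, -4.0]))  # 2.0
```

Under this reading, a response whose tokens all have identical log-probabilities scores zero regardless of the evidence term, which follows directly from the multiplicative form.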

“CEBaG achieves the highest AUC in 13 of 16 settings and improves over VASE by 8 AUC points on average, while being fully deterministic and self-contained.”

paper · Abstract

What holds up

The theoretical grounding connecting evidence gain $G$ to Pointwise Mutual Information (Equations 3-4) provides a principled justification for measuring visual grounding. The ablation study shows that token-level variance alone ($\sigma$ only) already achieves 67.0% average AUC, outperforming all baselines (SE 57.4%, VASE 59.6%, RadFlag 57.8%), while the multiplicative combination with evidence magnitude captures complementary failure modes. The parameter-free formula reaches 67.9% average AUC against 71.9% for the best hyperparameter-tuned variant, a 4.0-point gap, which suggests the fixed formulation captures most, though not all, of the available signal without dataset-specific tuning.

“Token-level variance is the dominant signal. $\sigma$ only already achieves 67.0% average AUC, substantially outperforming all baselines (SE 57.4%, VASE 59.6%, RadFlag 57.8%).”

paper · Section 3.3 (Table 3 caption)

“The parameter-free formula is competitive with tuned variants. CEBaG$_\lambda$... achieves 71.9% average AUC, only 4.0 points above CEBaG.”

paper · Section 3.3
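
The link between evidence gain and Pointwise Mutual Information noted under "What holds up" can be written out as a short identity. This is a reconstruction from the standard definition of conditional PMI, not a copy of the paper's Equations 3-4. The symbols $v$ (image) and $q$ (question) are assumed notation, and $G$ is assumed to be the log-probability difference between image-conditioned and text-only scoring of the answer $r$ of length $L$.

```latex
\begin{aligned}
G &= \log p(r \mid v, q) - \log p(r \mid q) \\
  &= \sum_{t=1}^{L} \Big[ \log p(r_t \mid r_{<t}, v, q) - \log p(r_t \mid r_{<t}, q) \Big] \\
  &= \log \frac{p(r, v \mid q)}{p(r \mid q)\, p(v \mid q)}
   = \mathrm{PMI}(r ; v \mid q)
\end{aligned}
```

The second line uses the autoregressive factorization of both terms, and the third follows from $p(r \mid v, q) = p(r, v \mid q) / p(v \mid q)$.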

Main concerns

The primary methodological concern is the reliance on the GREEN model for ground-truth hallucination labels, so that reported detection performance depends on another model's judgment rather than on human annotation. The paper states they "use the GREEN model [23] to produce reference-based ground-truth labels: a response is labeled as hallucinated if its GREEN score falls below 1.0," yet provides no analysis of GREEN's error rate, bias, or correlation with expert judgments. Additionally, although the detector itself is described as "parameter-free," the evaluation labels still depend on a fixed threshold (GREEN score below 1.0), a choice that affects every reported AUC. The analysis also lacks an investigation of failure modes, specifically which types of hallucinations CEBaG misses. Finally, the restriction to white-box models that expose log-probabilities significantly limits deployment options for proprietary medical AI systems.

“we use the GREEN model [23] to produce reference-based ground-truth labels: a response is labeled as hallucinated if its GREEN score falls below 1.0.”

paper · Section 3
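
The evaluation protocol described in this quote can be sketched as follows: GREEN scores are thresholded into binary labels, and a detector's scores are then summarized by AUC. This is a minimal illustration only. The helper names `green_labels` and `auroc` are hypothetical, the example numbers are made up, and the sketch assumes that higher detector scores indicate hallucination.

```python
from typing import List, Sequence


def green_labels(green_scores: Sequence[float], threshold: float = 1.0) -> List[int]:
    """1 = hallucinated (GREEN score below threshold), 0 = not hallucinated."""
    return [1 if g < threshold else 0 for g in green_scores]


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes higher score = more likely hallucinated.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up GREEN scores and detector scores for four responses.
green = [1.0, 0.6, 1.0, 0.9]
detector = [0.1, 0.9, 0.4, 0.3]

labels = green_labels(green)                  # [0, 1, 0, 1]
print(auroc(detector, labels))                # 0.75
print(green_labels(green, threshold=0.8))     # [0, 1, 0, 0]
```

The last line shows why the threshold matters: moving it from 1.0 to 0.8 relabels one response, which changes the positive set that every AUC is computed against.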

“A limitation is its reliance on access to model log-probabilities, restricting use to white-box models.”

paper · Section 4

Evidence and comparison

The evidence strongly supports the efficiency claims and relative performance gains. The comparison between CEBaG and sampling-based methods (SE, VASE) covers all 16 settings, with Table 2 documenting that CEBaG reduces the number of passes from VASE's 20 to 3 (one generation plus two scoring passes) while improving average AUC from 59.6% to 67.9%. However, the evaluation lacks comparison against simpler deterministic baselines such as minimum token probability or sequence perplexity, which would isolate whether the multiplicative combination of $\sigma$ and $E$ provides benefits beyond simpler uncertainty metrics. While the cross-model evaluation (MedGemma variants, HuatuoGPT, LLaVA-Med) supports generalizability, all datasets focus on radiology and pathology within English-language benchmarks, leaving open questions about performance in other medical domains or languages.

“With the standard setting $M=10$, SE requires 10 autoregressive generations plus $O(M^2)=O(100)$ pairwise entailment comparisons per sample. VASE doubles this to 20 passes plus the same entailment overhead.”

paper · Section 3.2

“On average, CEBaG attains 67.9% AUC, outperforming VASE (59.6%), SE (57.4%), and RadFlag (57.8%) by substantial margins, with an improvement of +8.2 AUC points over the previous state-of-the-art VASE.”

paper · Section 3.1
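
The simpler deterministic baselines suggested above need only the same per-token log-probabilities from a single scoring pass. The sketch below shows one plausible way to compute them. The function names are hypothetical, and neither baseline is reported in the paper.

```python
import math
from typing import Sequence


def min_token_score(token_logprobs: Sequence[float]) -> float:
    """Negated log-probability of the least confident token.

    Higher values mean the weakest token was less likely.
    """
    if not token_logprobs:
        raise ValueError("empty response")
    return -min(token_logprobs)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Sequence perplexity: exp of the mean negative log-probability."""
    if not token_logprobs:
        raise ValueError("empty response")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Made-up natural-log probabilities for a three-token answer.
logps = [-0.1, -2.0, -0.3]
print(min_token_score(logps))                       # 2.0
print(perplexity([math.log(0.5), math.log(0.5)]))   # approximately 2.0
```

Scoring these alongside $\sigma$ alone on the same responses would show whether the variance signal adds anything beyond the weakest-token or average-confidence view.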

Reproducibility

The paper provides detailed implementation specifics including exact decoding parameters (greedy decoding with $T=0.1$), the procedure for text-only inference (removing image tokens at the interface level for both encoder-decoder and LLaVA architectures), and baseline hyperparameters ($M=10$, $T=1.0$ for SE/VASE). Because CEBaG is deterministic, run-to-run variance is eliminated, making reported scores exact for given model outputs. However, reproduction is currently blocked because "The code will be made available upon acceptance." Critical missing details include exact prompt templates, batch sizes, and the specific checkpoint versions of the medical MLLMs used in the experiments.

“For CEBaG, the generated answer $r$ is produced via greedy decoding (temperature $T=0.1$).”

paper · Section 3

“The code will be made available upon acceptance.”

paper · Abstract

Abstract

Multimodal large language models (MLLMs) have shown strong potential for medical Visual Question Answering (VQA), yet they remain prone to hallucinations, defined as generating responses that contradict the input image, posing serious risks in clinical settings. Current hallucination detection methods, such as Semantic Entropy (SE) and Vision-Amplified Semantic Entropy (VASE), require 10 to 20 stochastic generations per sample together with an external natural language inference model for semantic clustering, making them computationally expensive and difficult to deploy in practice. We observe that hallucinated responses exhibit a distinctive signature directly in the model's own log-probabilities: inconsistent token-level confidence and weak sensitivity to visual evidence. Based on this observation, we propose Confidence-Evidence Bayesian Gain (CEBaG), a deterministic hallucination detection method that requires no stochastic sampling, no external models, and no task-specific hyperparameters. CEBaG combines two complementary signals: token-level predictive variance, which captures inconsistent confidence across response tokens, and evidence magnitude, which measures how much the image shifts per-token predictions relative to text-only inference. Evaluated across four medical MLLMs and three VQA benchmarks (16 experimental settings), CEBaG achieves the highest AUC in 13 of 16 settings and improves over VASE by 8 AUC points on average, while being fully deterministic and self-contained. The code will be made available upon acceptance.
