Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models

cs.CV Binesh Sadanandan, Vahid Behzadan · Mar 22, 2026
AI Review AI reviewed
Plain-language introduction

Medical vision-language models (VLMs) are increasingly evaluated for consistency—the invariance of predictions under paraphrased prompts—as a proxy for clinical reliability. This paper demonstrates that consistency alone is a fundamentally flawed safety metric because models can achieve perfect consistency by learning text shortcuts while completely ignoring the input image. The authors introduce a four-quadrant per-sample taxonomy that jointly evaluates consistency and image reliance, revealing that models optimized for low flip rates often shift samples into a 'Dangerous' quadrant where predictions are stable, accurate, and confident yet unchanged when the image is removed. Their findings expose a critical deployment trap: standard evaluation pipelines risk preferentially selecting models that appear reliable while being decision-invariant to visual evidence.

Critical review
Verdict
Bottom line

The paper presents a compelling critique of consistency-based evaluation in medical VLMs. The four-quadrant framework is conceptually elegant and operationally lightweight, requiring only one additional forward pass per sample. The empirical demonstration of the consistency-safety paradox, in which LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous, is striking evidence that consistency checks alone are insufficient for deployment. However, the evaluation is limited to binary classification on two chest X-ray datasets with small sample sizes (98 samples for MIMIC-CXR and 732–861 for PadChest), and the binary definition of image reliance may miss cases where a model uses the image to calibrate confidence without changing its discrete prediction.

“LLaVA-Rad Base... 98.5% Dangerous”
paper · Table 1
“The extreme case is LLaVA-Rad Base on PadChest: 1.5% flip rate (the lowest across all settings) yet 98.5% Dangerous (the highest).”
paper · Section 4.2
What holds up

The four-quadrant taxonomy operationalizes the distinction between apparent and genuine reliability and gives deployment evaluation an actionable framework. The strong negative correlation between flip rate and Dangerous fraction ($r=-0.89$, $\rho=-0.79$) across ten model-dataset combinations supports the central thesis that consistency optimization trades image grounding for paraphrase stability, although ten points is a small basis for a correlation estimate. Most compelling is the per-quadrant accuracy analysis: Dangerous samples can be more accurate than Ideal ones (99.6% vs. 26.1% for Targeted LoRA on PadChest), which makes them invisible to accuracy-based screening.

“Across all $n=10$ model-dataset combinations, the Pearson correlation is $r=-0.89$ and the Spearman rank correlation is $\rho=-0.79$”
paper · Section 4.2
“On PadChest, Targeted LoRA achieves 99.6% accuracy within the Dangerous quadrant vs. 26.1% in Ideal”
paper · Section 4.3
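The two correlation statistics quoted above can be reproduced from per-setting (flip rate, Dangerous fraction) pairs. The following is a minimal pure-Python sketch; the function names are hypothetical, and the numbers in the usage note are illustrative placeholders, not the paper's measurements.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, `pearson([0.30, 0.10, 0.02], [0.20, 0.60, 0.95])` on made-up flip rates and Dangerous fractions returns a strongly negative value. With only ten real points, a single outlying model-dataset pair can move both statistics noticeably.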
Main concerns

The primary limitation is restricted scope: the evaluation covers only binary yes/no questions on two chest X-ray datasets, and MIMIC-CXR contributes just 98 samples. PadChest's severe label imbalance (81% 'yes' ground truth) means that a text shortcut that always answers 'yes' would already reach about 81% accuracy, which may inflate Dangerous fractions in ways that do not generalize to balanced clinical scenarios. Additionally, the binary definition of image reliance is coarse: the authors acknowledge it cannot detect cases where 'a model might use the image to calibrate confidence without changing the final prediction'. Finally, the paper offers no theoretical analysis of when consistency training induces text shortcuts, which leaves the mechanism underspecified.

“Our evaluation covers binary (yes/no) questions on two chest X-ray datasets; quadrant distributions may differ for open-ended generation, multi-class tasks, or other modalities.”
paper · Section 5 (Limitations)
“This operationalization is intentionally binary and conservative: if the prediction is the same with and without the image, the model's consistency may be grounded in text patterns rather than visual evidence.”
paper · Section 3.1
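To make the label-imbalance concern concrete, the sketch below evaluates a degenerate model that ignores both prompt and image. Everything here is an illustrative assumption: the synthetic 81/19 label split mirrors only the reported 'yes' rate, the prompts are invented, and the flip rate is taken to be the fraction of samples with any disagreement across paraphrases.

```python
from typing import Callable, Dict, List, Optional


def always_yes(prompt: str, image: Optional[object] = None) -> str:
    """Degenerate text shortcut: answers 'yes' regardless of input."""
    return "yes"


def evaluate(model: Callable[[str, Optional[object]], str],
             labels: List[str],
             paraphrases: List[str]) -> Dict[str, float]:
    """Accuracy, flip rate, and share of predictions unchanged without the image."""
    correct = flips = unchanged = 0
    for label in labels:
        image = object()  # stand-in for a real image
        preds = [model(p, image) for p in paraphrases]
        text_only = model(paraphrases[0], None)  # same prompt, image removed
        correct += preds[0] == label
        flips += len(set(preds)) > 1
        unchanged += text_only == preds[0]
    n = len(labels)
    return {
        "accuracy": correct / n,
        "flip_rate": flips / n,
        "text_only_unchanged": unchanged / n,
    }


# Synthetic labels with an 81% 'yes' rate (not the actual PadChest data).
labels = ["yes"] * 81 + ["no"] * 19
# Hypothetical paraphrases of one question.
paraphrases = ["Is there an abnormality?", "Does the image show an abnormality?"]
result = evaluate(always_yes, labels, paraphrases)
```

Under these assumptions the shortcut scores 81% accuracy with a zero flip rate, and every prediction is unchanged when the image is removed, so every sample would count as consistent but not image-reliant.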
Evidence and comparison

The evidence strongly supports the claim that consistency metrics alone are insufficient, with the consistency-safety paradox demonstrated across five model configurations from two families (MedGemma and LLaVA-Rad). The comparison to prior work appropriately positions the four-quadrant framework against PSF-Med's flip-rate-only approach. Validation experiments using KL divergence between image-conditioned and text-only distributions (AUROC=0.76 for detecting Dangerous samples) and image-swap tests (showing Dangerous samples are 78–97% swap-invariant vs. 26–77% for Ideal) provide orthogonal support that strengthens the binary reliance metric's credibility.

“KL divergence between image-conditioned and text-only distributions achieves AUROC=0.76 for detecting Dangerous samples (mean KL: Dangerous=0.12 vs. Ideal=1.14).”
paper · Section 5 (Complementary grounding checks)
“An image-swap test across all 4,497 samples... shows Dangerous samples are 78–97% swap-invariant vs. 26–77% for Ideal”
paper · Section 5
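Both complementary checks reduce to simple per-sample quantities. The sketch below assumes each model output is available as a probability distribution over the answers (for example, [P(yes), P(no)]) and as a discrete prediction. The KL direction, image-conditioned relative to text-only, is an assumption, since the quoted passage does not state it, and the function names are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same answers.

    Here p is the image-conditioned distribution and q the text-only one
    (direction assumed). q is floored at eps to avoid division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


def swap_invariance_rate(original_preds: Sequence[str],
                         swapped_preds: Sequence[str]) -> float:
    """Fraction of samples whose prediction stays the same after the image is swapped."""
    if len(original_preds) != len(swapped_preds) or not original_preds:
        raise ValueError("need two non-empty prediction lists of equal length")
    same = sum(a == b for a, b in zip(original_preds, swapped_preds))
    return same / len(original_preds)
```

A KL value near zero means the image barely moved the answer distribution, which is the continuous counterpart of the binary "prediction unchanged without the image" test. A high swap-invariance rate points the same way.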
Reproducibility

The experimental setup uses publicly available models and standard datasets (MIMIC-CXR, PadChest) with published paraphrases (PSF-Med), enhancing reproducibility. LoRA hyperparameters are specified (rank 16, $\alpha=32$, layers 15–19 for Targeted) and the text-only baseline procedure is described in detail. However, the paper does not mention code availability or provide a repository link, which would be essential for independent verification of the quadrant classification pipeline. Dataset sizes are small: MIMIC-CXR has 98 test samples (78 evaluable for LLaVA-Rad models), and the PadChest evaluation uses 732–861 samples depending on model parsing differences, which raises questions about statistical power for some comparisons.

“Targeted LoRA: low-rank adaptation applied to layers 15–19 (rank 16, $\alpha=32$, 0.1% of parameters)”
paper · Section 3.2
“MIMIC-CXR... with 98 balanced test samples... LLaVA-Rad models yield 78 evaluable samples due to parsing differences”
paper · Section 3.3
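One way to quantify the statistical-power concern is to put a confidence interval on any reported quadrant fraction. The sketch below uses the Wilson score interval. This is a choice made here for illustration rather than a procedure described in the paper, and the function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width
```

With 98 samples, an observed fraction of 50% (`wilson_interval(49, 98)`) comes with a 95% interval of roughly 40% to 60%, so differences of a few percentage points between configurations on MIMIC-CXR are hard to distinguish from noise.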
Abstract

Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening. We observe a negative correlation between flip rate and Dangerous fraction (r = -0.89, n=10) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.
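The four-quadrant rule in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the first entry of the paraphrase list is assumed to be the original-prompt prediction, and a sample is treated as consistent only if every paraphrase yields the same answer.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

QUADRANTS = ("Ideal", "Fragile", "Dangerous", "Worst")


def classify_sample(paraphrase_preds: Sequence[str], text_only_pred: str) -> str:
    """Assign one sample to a quadrant.

    paraphrase_preds: image-conditioned predictions, one per paraphrased prompt;
        the first entry is assumed to come from the original prompt.
    text_only_pred: prediction for the original prompt with the image removed.
    """
    if not paraphrase_preds:
        raise ValueError("need at least one image-conditioned prediction")
    reference = paraphrase_preds[0]
    consistent = all(pred == reference for pred in paraphrase_preds)
    image_reliant = text_only_pred != reference  # prediction changes without the image
    if consistent and image_reliant:
        return "Ideal"
    if consistent:
        return "Dangerous"
    if image_reliant:
        return "Fragile"
    return "Worst"


def quadrant_fractions(samples: Iterable[Tuple[Sequence[str], str]]) -> Dict[str, float]:
    """Share of samples in each quadrant."""
    counts = Counter(classify_sample(preds, text_only) for preds, text_only in samples)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples provided")
    return {name: counts.get(name, 0) / total for name in QUADRANTS}
```

For instance, `classify_sample(["yes", "yes", "yes"], "yes")` returns `"Dangerous"`: the answer is stable across paraphrases but does not change when the image is removed. The text-only prediction is the single extra forward pass the abstract recommends.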
