Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models
Medical vision-language models (VLMs) are increasingly evaluated for consistency—the invariance of predictions under paraphrased prompts—as a proxy for clinical reliability. This paper demonstrates that consistency alone is a fundamentally flawed safety metric because models can achieve perfect consistency by learning text shortcuts while completely ignoring the input image. The authors introduce a four-quadrant per-sample taxonomy that jointly evaluates consistency and image reliance, revealing that models optimized for low flip rates often shift samples into a 'Dangerous' quadrant where predictions are stable, accurate, and confident yet unchanged when the image is removed. Their findings expose a critical deployment trap: standard evaluation pipelines risk preferentially selecting models that appear reliable while being decision-invariant to visual evidence.
The paper presents a compelling critique of consistency-based evaluation in medical VLMs. The four-quadrant framework is conceptually elegant and operationally lightweight, requiring only one additional forward pass per sample. The empirical demonstration of the consistency-safety paradox—where LLaVA-Rad Base achieves a 1.5% flip rate while 98.5% of samples are Dangerous on PadChest—is striking evidence that current deployment checks are insufficient. However, the evaluation is limited to binary classification on two chest X-ray datasets with relatively small sample sizes, and the binary definition of image-reliance may miss nuanced cases where models use images for confidence calibration without changing discrete predictions.
The four-quadrant taxonomy successfully operationalizes the distinction between apparent and genuine reliability, providing an actionable framework for deployment evaluation. The strong negative correlation between flip rate and Dangerous fraction ($r=-0.89$, $\rho=-0.79$) across ten model-dataset combinations robustly supports the central thesis that consistency optimization trades image grounding for paraphrase stability. Most compelling is the per-quadrant accuracy analysis showing that Dangerous samples often achieve higher accuracy than Ideal ones, rendering them invisible to accuracy-based screening.
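For readers who want to check a correlation of this kind on their own results, the two statistics quoted above (Pearson's $r$ and Spearman's $\rho$) can be computed as in the minimal sketch below. The function names are hypothetical and the sketch is not taken from the paper; the inputs would be one flip rate and one Dangerous fraction per model-dataset combination.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

With only ten points, either coefficient is sensitive to a single outlying combination, so reporting both, as the paper does, is a reasonable safeguard.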
The primary limitation is restricted scope: the evaluation covers only binary yes/no questions on two chest X-ray datasets, with MIMIC-CXR using just 98 samples. The PadChest dataset's severe label imbalance (81% 'yes' ground truth) creates conditions where text-shortcut heuristics are naturally high-performing, potentially inflating Dangerous fractions in ways that may not generalize to balanced clinical scenarios. Additionally, the binary definition of image-reliance is coarse—the authors acknowledge it cannot detect cases where 'a model might use the image to calibrate confidence without changing the final prediction'—and the lack of theoretical analysis regarding when consistency training induces text shortcuts leaves the mechanism underspecified.
The evidence strongly supports the claim that consistency metrics alone are insufficient, with the consistency-safety paradox demonstrated across five model configurations from two families (MedGemma and LLaVA-Rad). The comparison to prior work appropriately positions the four-quadrant framework against PSF-Med's flip-rate-only approach. Validation experiments using KL divergence between image-conditioned and text-only distributions (AUROC=0.76 for detecting Dangerous samples) and image-swap tests (showing Dangerous samples are 78–97% swap-invariant vs. 26–77% for Ideal) provide orthogonal support that strengthens the binary reliance metric's credibility.
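The KL-divergence validation can be illustrated with a short sketch. Everything here is an assumption for illustration rather than the authors' implementation: the direction of the divergence (image-conditioned relative to text-only), the use of a two-way yes/no distribution, and the function names `kl_divergence` and `auroc`. The idea is that a sample whose answer distribution barely moves when the image is removed has low KL, so the negated KL can serve as a score for detecting Dangerous samples.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same answers.

    p: image-conditioned probabilities, q: text-only probabilities
    (the direction is an assumption made for this sketch).
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def auroc(scores, labels):
    """AUROC by pairwise comparison: the fraction of (positive, negative)
    pairs in which the positive sample scores higher; ties count as 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

To score Dangerous samples, one would pass `scores = [-kl for kl in kl_values]` and `labels = 1` for samples the quadrant classification marks as Dangerous, since lower divergence should indicate less reliance on the image.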
The experimental setup uses publicly available models and standard datasets (MIMIC-CXR, PadChest) with published paraphrases (PSF-Med), enhancing reproducibility. LoRA hyperparameters are specified (rank 16, $\alpha=32$, layers 15–19 for Targeted) and the text-only baseline procedure is described in detail. However, the paper does not mention code availability or provide a repository link, which would be essential for independent verification of the quadrant classification pipeline. Dataset sizes are small (98 samples for MIMIC-CXR), and the PadChest evaluation uses 732–861 samples depending on model parsing differences, raising questions about statistical power for some comparisons.
Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening. We observe a negative correlation between flip rate and Dangerous fraction (r = -0.89, n=10) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.
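The quadrant assignment described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the names `classify_sample` and `quadrant_fractions` are hypothetical, predictions are assumed to be already parsed into discrete labels such as "yes" and "no", and image reliance is assumed to be judged by comparing the original-prompt prediction (first in the list) against the text-only prediction for the same prompt.

```python
from collections import Counter


def classify_sample(paraphrase_preds, text_only_pred):
    """Assign one sample to a quadrant.

    paraphrase_preds: predictions made with the image, one per prompt
        variant, with the original prompt first (assumed ordering).
    text_only_pred: prediction for the original prompt without the image.
    """
    # Consistent: every paraphrase yields the same discrete prediction.
    consistent = len(set(paraphrase_preds)) == 1
    # Image-reliant: removing the image changes the prediction.
    image_reliant = paraphrase_preds[0] != text_only_pred

    if consistent and image_reliant:
        return "Ideal"
    if image_reliant:
        return "Fragile"
    if consistent:
        return "Dangerous"
    return "Worst"


def quadrant_fractions(samples):
    """Fraction of samples per quadrant, given (paraphrase_preds,
    text_only_pred) pairs."""
    counts = Counter(classify_sample(preds, text) for preds, text in samples)
    total = len(samples)
    return {
        name: counts.get(name, 0) / total
        for name in ("Ideal", "Fragile", "Dangerous", "Worst")
    }
```

The only cost beyond a standard paraphrase-consistency check is the text-only prediction, which is the single additional forward pass the abstract recommends.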