When Minor Edits Matter: LLM-Driven Prompt Attack for Medical VLM Robustness in Ultrasound
Medical Vision-Language Models (Med-VLMs) for ultrasound analysis are vulnerable to subtle prompt variations that mimic real clinical communication patterns. This paper proposes a black-box attack framework using an LLM to generate minimal, clinically plausible text edits guided by Monte Carlo Tree Search (MCTS), requiring no access to the target model's weights or gradients. The study reveals that small adversarial rewrites can drastically degrade diagnostic QA accuracy—raising critical safety concerns for deploying such systems in point-of-care settings where prompt variability is inherent.
The paper presents a practically relevant threat model and a well-engineered attack pipeline. The core finding, that Med-VLMs can be fooled by minimal, semantics-preserving prompt edits found via MCTS, is convincingly demonstrated. However, the scope is narrow: the paper evaluates only multiple-choice question answering on 1,305 ultrasound images and lacks both a systematic comparison to simpler adversarial text baselines (e.g., random substitutions, character-level typos) and a cross-model transfer analysis. The plausibility of the 'humanized' edits rests solely on automated perplexity and similarity metrics, without clinical validation.
The combination of LLM-based edit generation with MCTS selection is a sound methodological choice for navigating discrete text perturbations. The discovery that attack success correlates with low logit margin (high uncertainty) provides actionable insight for defense—prioritizing low-confidence examples during training could improve robustness. The observation that smaller attacker LLMs (Qwen-7B) outperform larger ones (Qwen-30B) due to their tendency to produce less 'conservative' edits is a nuanced finding that challenges assumptions about scale always improving attack effectiveness.
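The link between attack success and low logit margin can be made concrete with a small sketch. The definition used here (correct-option logit minus the highest competing logit) is a common convention and an assumption on our part; the paper's exact definition may differ, and the function name is hypothetical.

```python
def logit_margin(logits, correct_index):
    """Margin between the correct option and its strongest competitor.

    Assumed definition: logit of the correct answer minus the largest
    logit among the remaining answer options. Small or negative values
    indicate low confidence in the correct answer.
    """
    best_other = max(v for i, v in enumerate(logits) if i != correct_index)
    return logits[correct_index] - best_other


# Illustrative logits for a four-option question (made-up numbers).
example_logits = [2.0, 0.5, 1.5, -1.0]
margin = logit_margin(example_logits, correct_index=0)
```

Under this definition, samples could be ranked by margin so that the lowest-margin ones are prioritized when constructing robustness training or evaluation sets.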
The evaluation is limited to a single task (multiple-choice QA) on one dataset subset (U2-Bench disease-diagnosis), leaving open whether findings generalize to open-ended report generation or other imaging modalities. The claim of 'clinical plausibility' rests on automated PPL and embedding similarity thresholds rather than expert human evaluation of whether edited prompts would realistically appear in clinical workflows. No comparison is made to established adversarial text attack methods (e.g., TextFooler, BERT-Attack), so it remains unclear if the LLM-driven approach offers unique advantages over simpler strategies. Additionally, the PPL<15 filter appears arbitrary and not validated against clinician judgment.
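For reference, a perplexity threshold of this kind can be sketched as follows. This assumes per-token natural-log probabilities are already available from some scoring language model; which model the authors use for scoring is not stated here, and the function names are hypothetical. Only the threshold of 15 comes from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity as exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def passes_ppl_filter(token_logprobs, threshold=15.0):
    """Keep an edited prompt only if its perplexity is below the threshold."""
    return perplexity(token_logprobs) < threshold


# Made-up log-probabilities: a fluent prompt vs. an implausible one.
fluent = [math.log(0.5)] * 4
implausible = [math.log(0.01)] * 3
```

Because the cutoff is a single scalar on a model-dependent quantity, changing the scoring model shifts which edits pass, which is one reason validation against clinician judgment would be informative.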
The evidence supports the existence of the vulnerability but does not adequately contextualize it. While Table 1 shows large accuracy drops, the paper does not demonstrate that LLM-generated edits are more effective than random word substitutions or typographical errors, an essential baseline omitted from the analysis. The depth distribution (Fig. 2) shows that most attacks succeed within 2-3 edits, but this efficiency claim is not compared against non-MCTS greedy strategies. The paper cites related work (the CARES benchmark, oncology prompt injection) but does not benchmark the proposed method against those studies' attack templates, leaving its relative effectiveness unclear.
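A character-level typo baseline of the kind requested above could be as simple as the following sketch. It is an illustration of one possible baseline (random adjacent-letter transposition), not a method from the paper; the function name and parameters are hypothetical.

```python
import random


def random_typo(text, n_edits=1, seed=0):
    """Swap randomly chosen pairs of adjacent, distinct letters.

    A seeded generator keeps the perturbation reproducible. Each edit
    transposes two neighbouring alphabetic characters, so the string
    length and character multiset are unchanged.
    """
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_edits):
        candidates = [
            i for i in range(len(chars) - 1)
            if chars[i].isalpha() and chars[i + 1].isalpha()
            and chars[i] != chars[i + 1]
        ]
        if not candidates:
            break
        i = rng.choice(candidates)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


prompt = "Which finding is shown in this ultrasound image?"
perturbed = random_typo(prompt, n_edits=1, seed=0)
```

Running the target model on such perturbed prompts under the same edit budget would indicate how much of the reported accuracy drop is attributable to the LLM-guided search rather than to prompt noise in general.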
Hyperparameters are reasonably specified: MCTS uses $c=1.4$, max depth 8, max 80 iterations, with UCT selection $\mathrm{UCT}(p,i) = \frac{V_i}{N_i} + c\sqrt{\frac{\ln(N_p+1)}{N_i}}$. However, reproducibility is hindered by the use of GPT-4.1 mini (a proprietary attacker LLM) and the fact that code is 'to be released publicly following the review process.' The restriction to initially correct predictions and the U2-Bench disease-diagnosis subset (1,305 samples) limits experimental scope, and no random seed information is provided for the stochastic MCTS and LLM generation processes.
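The UCT rule quoted above translates directly into code. The sketch below assumes each child is summarized by its accumulated value $V_i$ and visit count $N_i$; giving unvisited children an infinite score is a common convention and an assumption here, since the formula is undefined for $N_i = 0$. Function names are hypothetical.

```python
import math


def uct_score(value_sum, visits, parent_visits, c=1.4):
    """UCT(p, i) = V_i / N_i + c * sqrt(ln(N_p + 1) / N_i)."""
    if visits == 0:
        # Assumption: unvisited children are explored first.
        return math.inf
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits + 1) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.4):
    """Return the index of the child with the highest UCT score.

    `children` is a list of (value_sum, visits) pairs.
    """
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], parent_visits, c),
    )


# Made-up statistics: a well-explored child vs. a rarely visited one.
children = [(3.0, 4), (1.0, 1)]
best = select_child(children, parent_visits=5)
```

With $c = 1.4$, the exploration term favors the rarely visited child in this example, which illustrates why seeds for the stochastic components matter for reproducing a specific search trajectory.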
Ultrasound is widely used in clinical practice due to its portability, cost-effectiveness, safety, and real-time imaging capabilities. However, image acquisition and interpretation remain highly operator dependent, motivating the development of robust AI-assisted analysis methods. Vision-language models (VLMs) have recently demonstrated strong multimodal reasoning capabilities and competitive performance in medical image analysis, including ultrasound. However, emerging evidence highlights significant concerns about their trustworthiness. In particular, adversarial robustness is critical because Med-VLMs operate via natural-language instructions, rendering prompt formulation a realistic and practically exploitable point of vulnerability. Small variations (typos, shorthand, underspecified requests, or ambiguous wording) can meaningfully shift model outputs. We propose a scalable adversarial evaluation framework that leverages a large language model (LLM) to generate clinically plausible adversarial prompt variants via "humanized" rewrites and minimal edits that mimic routine clinical communication. Using ultrasound multiple-choice question answering benchmarks, we systematically assess the vulnerability of SOTA Med-VLMs to these attacks, examine how attacker LLM capacity influences attack success, analyze the relationship between attack success and model confidence, and identify consistent failure patterns across models. Our results highlight realistic robustness gaps that must be addressed for safe clinical translation. Code will be released publicly following the review process.