When Minor Edits Matter: LLM-Driven Prompt Attack for Medical VLM Robustness in Ultrasound

cs.CV Yasamin Medghalchi, Milad Yazdani, Amirhossein Dabiriaghdam, Moein Heidari, Mojan Izadkhah, Zahra Kavian, Giuseppe Carenini, Lele Wang, Dena Shahriari, Ilker Hacihaliloglu · Mar 22, 2026

AI Review

Plain-language introduction

Medical Vision-Language Models (Med-VLMs) for ultrasound analysis are vulnerable to subtle prompt variations that mimic real clinical communication patterns. This paper proposes a black-box attack framework in which an LLM generates minimal, clinically plausible text edits and Monte Carlo Tree Search (MCTS) selects among them, requiring no access to the target model's weights or gradients. The study shows that small adversarial rewrites can sharply degrade diagnostic QA accuracy (for example, from 42.22% to 13.72% with Qwen-7B as the attacker), which raises critical safety concerns for deploying such systems in point-of-care settings where prompt variability is inherent.

Critical review

Verdict

Bottom line

The paper presents a practically relevant threat model and a well-engineered attack pipeline. The core finding, that Med-VLMs can be fooled by minimal, semantically preserved prompt edits found via MCTS, is convincingly demonstrated. However, the scope is narrow: the paper evaluates only multiple-choice question answering on 1,305 ultrasound images, and it lacks both a systematic comparison to simpler adversarial text baselines (e.g., random substitutions, character-level typos) and a cross-model transfer analysis. The plausibility of the 'humanized' edits rests solely on automated perplexity and similarity metrics, without clinical validation.

“with Qwen-7B as the attacker, accuracy drops from 42.22% to 13.72%”

paper · Sec. 3, Table 1

“post-attack accuracy decreases by 14.89%–26.08% in absolute terms”

paper · Sec. 3

What holds up

The combination of LLM-based edit generation with MCTS selection is a sound methodological choice for navigating discrete text perturbations. The discovery that attack success correlates with a low logit margin (high uncertainty) provides actionable insight for defense: prioritizing low-confidence examples during training could improve robustness. The observation that a smaller attacker LLM (Qwen-7B) outperforms a larger one (Qwen-30B), attributed to its tendency to produce less 'conservative' edits, is a nuanced finding that challenges the assumption that scale always improves attack effectiveness.

“successes concentrate at low margins, consistent with boundary-proximal decisions being easier to flip with small edits”

paper · Sec. 3, RQ3

“Qwen-7B produces more effective attacks (lower post-attack accuracy) than Qwen-30B, despite the latter being newer and generally stronger”

paper · Sec. 3, RQ2
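
To make the margin finding concrete, here is a minimal sketch of one common way to compute a logit margin for a multiple-choice answer. It is an assumption rather than the paper's exact definition: the margin is taken as the correct option's logit minus the highest competing logit, and the function name `logit_margin` and the example logits are hypothetical.

```python
def logit_margin(logits, correct_index):
    """Correct option's logit minus the best competing option's logit.

    Assumed definition (not confirmed by the paper): a small positive
    value means the decision sits close to the boundary, and a negative
    value means the model already prefers a wrong option.
    """
    competitors = [v for i, v in enumerate(logits) if i != correct_index]
    return logits[correct_index] - max(competitors)


# Hypothetical logits over four answer options (A-D); option 0 is correct.
confident = [4.0, 0.5, 0.2, -1.0]
borderline = [1.5, 1.25, 0.2, -1.0]

print(logit_margin(confident, 0))   # 3.5
print(logit_margin(borderline, 0))  # 0.25
```

Under this definition, the `borderline` example is the kind of boundary-proximal case that the quoted RQ3 result suggests is easier to flip with a small edit.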

Main concerns

The evaluation is limited to a single task (multiple-choice QA) on one dataset subset (U2-Bench disease-diagnosis), leaving open whether the findings generalize to open-ended report generation or to other imaging modalities. The claim of 'clinical plausibility' rests on automated perplexity (PPL) and embedding-similarity thresholds rather than on expert human evaluation of whether the edited prompts would realistically appear in clinical workflows. No comparison is made to established adversarial text attack methods (e.g., TextFooler, BERT-Attack), so it remains unclear whether the LLM-driven approach offers unique advantages over simpler strategies. Additionally, the PPL < 15 filter appears arbitrary and is not validated against clinician judgment.

“we retain only successful attacks with PPL < 15”

paper · Sec. 3
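
For readers unfamiliar with the filter, the sketch below shows how a perplexity threshold of this kind is typically applied. It assumes per-token natural-log probabilities are already available from some scoring language model; which model the paper uses to score prompts is not stated in this review, and the names `perplexity` and `passes_fluency_filter` are hypothetical.

```python
import math

PPL_THRESHOLD = 15.0  # value quoted from the paper


def perplexity(token_logprobs):
    """Perplexity as exp of the mean negative log-likelihood per token.

    Assumes natural-log probabilities from a scoring language model.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def passes_fluency_filter(token_logprobs, threshold=PPL_THRESHOLD):
    """Keep an edited prompt only if its perplexity is below the threshold."""
    return perplexity(token_logprobs) < threshold


# Hypothetical prompts: every token has probability 0.1 (PPL about 10)
# or 0.05 (PPL about 20) under the scoring model.
fluent = [math.log(0.1)] * 5
disfluent = [math.log(0.05)] * 5

print(passes_fluency_filter(fluent))     # True
print(passes_fluency_filter(disfluent))  # False
```

Because perplexity depends on the scoring model and its tokenizer, the same threshold can admit or reject different prompts under different scorers, which is one reason the choice of 15 would benefit from validation.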

“For the attack scenario, we restrict our pipeline to instances that the model answers correctly before attack”

paper · Sec. 3

Evidence and comparison

The evidence supports the existence of the vulnerability but does not adequately contextualize it. Table 1 shows large accuracy drops, but the paper does not demonstrate that LLM-generated edits are more effective than random word substitutions or typographical errors, an essential baseline that the analysis omits. The depth distribution (Fig. 2) shows that most attacks succeed within 2–3 edits, but this efficiency claim is not compared against greedy, non-MCTS search strategies. The paper cites related work (the CARES benchmark, oncology prompt injection) but does not benchmark the proposed method against those studies' attack templates, so its relative effectiveness remains unclear.

“most successes occur at shallow depths, peaking at 2–3 edits”

paper · Sec. 3, Fig. 2

“Qwen-30B and GPT-4.1 mini yield slightly higher Sim. than Qwen-7B across targets”

paper · Sec. 3, Table 2

Reproducibility

Hyperparameters are reasonably specified: MCTS uses $c=1.4$, max depth 8, max 80 iterations, with UCT selection $\mathrm{UCT}(p,i) = \frac{V_i}{N_i} + c\sqrt{\frac{\ln(N_p+1)}{N_i}}$, where, in standard UCT notation, $V_i$ and $N_i$ are presumably the accumulated value and visit count of child $i$, and $N_p$ is the visit count of its parent $p$. However, reproducibility is hindered by the use of GPT-4.1 mini (a proprietary attacker LLM) and by the fact that code is 'to be released publicly following the review process.' The restriction to initially correct predictions and to the U2-Bench disease-diagnosis subset (1,305 samples) limits the experimental scope, and no random seed information is provided for the stochastic MCTS and LLM generation processes.

“MCTS uses up to 80 iterations, maximum depth 8, and exploration constant $c=1.4$”

paper · Sec. 2
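
The UCT rule quoted in this section can be sketched in a few lines. This is an illustrative reading of the formula, not the authors' implementation: the function names are hypothetical, and scoring unvisited children as infinite is an added assumption, since the formula is undefined when $N_i = 0$.

```python
import math


def uct_score(value_sum, visits, parent_visits, c=1.4):
    """UCT(p, i) = V_i / N_i + c * sqrt(ln(N_p + 1) / N_i)."""
    if visits == 0:
        # Assumption: unvisited children are explored first.
        return math.inf
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits + 1) / visits)
    return exploit + explore


def select_child(children, parent_visits, c=1.4):
    """Return the index of the child with the highest UCT score.

    children: list of (value_sum, visits) tuples, one per candidate edit.
    """
    scores = [uct_score(v, n, parent_visits, c) for v, n in children]
    return max(range(len(children)), key=scores.__getitem__)


# Hypothetical node with three candidate edits and 9 parent visits.
children = [(3.0, 4), (1.0, 4), (0.5, 1)]
print(select_child(children, parent_visits=9))
```

With $c=1.4$, the exploration term can outweigh a higher average value for rarely visited children, which is the usual trade-off this constant controls.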

“Codes will be released publicly following the review process”

paper · Sec. 3

Abstract

Ultrasound is widely used in clinical practice due to its portability, cost-effectiveness, safety, and real-time imaging capabilities. However, image acquisition and interpretation remain highly operator dependent, motivating the development of robust AI-assisted analysis methods. Vision-language models (VLMs) have recently demonstrated strong multimodal reasoning capabilities and competitive performance in medical image analysis, including ultrasound. However, emerging evidence highlights significant concerns about their trustworthiness. In particular, adversarial robustness is critical because Med-VLMs operate via natural-language instructions, rendering prompt formulation a realistic and practically exploitable point of vulnerability. Small variations (typos, shorthand, underspecified requests, or ambiguous wording) can meaningfully shift model outputs. We propose a scalable adversarial evaluation framework that leverages a large language model (LLM) to generate clinically plausible adversarial prompt variants via "humanized" rewrites and minimal edits that mimic routine clinical communication. Using ultrasound multiple-choice question answering benchmarks, we systematically assess the vulnerability of SOTA Med-VLMs to these attacks, examine how attacker LLM capacity influences attack success, analyze the relationship between attack success and model confidence, and identify consistent failure patterns across models. Our results highlight realistic robustness gaps that must be addressed for safe clinical translation. Code will be released publicly following the review process.
