Mirage: The Illusion of Visual Understanding

cs.AI Mohammad Asadi, Jack W. O'Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, Euan Ashley · Mar 23, 2026


AI Review

Plain-language introduction

This paper identifies a critical failure mode in multimodal AI evaluation called the 'mirage effect,' where vision-language models generate confident descriptions and reasoning about images that were never provided. The authors demonstrate that frontier models (GPT-5, Gemini-3-Pro, Claude Opus 4.5) retain 70–80% of their benchmark accuracy when evaluated without any visual input, with medical benchmarks showing 60–99% susceptibility to such non-visual inference. A text-only 3B-parameter model fine-tuned on chest X-ray questions outperforms both frontier multimodal systems and human radiologists, exposing how current benchmarks fail to distinguish genuine visual understanding from sophisticated textual pattern matching. The findings challenge the validity of accuracy metrics for multimodal systems and propose B-Clean, a method to filter benchmark questions that can be answered without images.

Critical review

Verdict

Bottom line

The paper presents a compelling conceptual distinction between 'mirage reasoning' (constructing false epistemic frames) and standard hallucination (filling ungrounded details within valid frames), supported by rigorous empirical measurements across multiple frontier models and benchmarks. The demonstration that a text-only 'super-guesser' outperforms radiologists on ReXVQA without image access, combined with the pathology bias observed in medical mirages, provides strong evidence that current evaluation paradigms are fundamentally flawed. However, the claim that mirage-mode and guess-mode represent distinct 'operating regimes' relies heavily on prompt sensitivity interpretations that could benefit from mechanistic validation, and the paper occasionally overstates the uniqueness of this phenomenon given prior work on language priors in VQA.

What holds up

The quantitative evidence for the mirage effect is robust. It is measured by $\text{Mirage Score} = \frac{\text{Accuracy}_{\text{mirage}}}{\text{Accuracy}_{\text{original}}} \times 100\%$, the share of image-enabled accuracy that a model retains when no image is supplied. Across four frontier models and six benchmarks, the authors report values of 70–80%, with medical benchmarks showing the highest susceptibility (60–99%). The pathology bias analysis is particularly striking: when asked to diagnose non-existent medical images, Gemini-3-Pro frequently generates serious conditions like STEMI and melanoma, creating dangerous silent failure modes for clinical deployment. The B-Clean methodology provides a practical, post-hoc solution for benchmark decontamination that removes 74–77% of questions as 'compromised,' revealing significant drops in accuracy and changed model rankings when only vision-necessary questions remain.

“frontier models retain on average 70–80% of their fully image-enabled accuracies”

paper · Section 3.1
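As a minimal sketch of how the Mirage Score above is computed, the snippet below applies the formula to a pair of accuracies. The function name and the example numbers are assumptions for illustration; they are not taken from the paper.

```python
def mirage_score(accuracy_mirage: float, accuracy_original: float) -> float:
    """Percentage of image-enabled accuracy retained when no image is given."""
    if accuracy_original <= 0:
        raise ValueError("accuracy_original must be positive")
    return accuracy_mirage / accuracy_original * 100.0


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
score = mirage_score(accuracy_mirage=0.60, accuracy_original=0.80)
print(f"{score:.1f}%")  # prints 75.0%
```

A score near 100% would mean that removing the image barely changes accuracy, which is the situation the review describes for the most susceptible medical benchmarks.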

“hyper time-sensitive and resource intensive conditions such as ST-elevation myocardial infarction (STEMI), melanoma and carcinoma among the most commonly stated”

paper · Figure 2

“B-Clean benchmarks retained 240 of 1,042 questions for MicroVQA (77.0% removed), 514 of 2,000 for MedXpertQA-MM (74.3% removed), and 428 of 1,730 for MMMU-Pro (75.3% removed)”

paper · Section 5
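The filtering idea can be sketched as follows. The review does not state B-Clean's exact removal criterion, so this sketch assumes a question counts as compromised if any model in the evaluated set answers it correctly without the image; the function names and toy data are hypothetical. The second half recomputes the removal percentages from the retained counts quoted above.

```python
from typing import Dict, Iterable, List, Set


def b_clean_filter(question_ids: Iterable[str],
                   no_image_correct: Dict[str, Set[str]]) -> List[str]:
    """Keep only questions that no model answered correctly without the image.

    Assumed criterion, not confirmed by the source.
    """
    compromised: Set[str] = set()
    for correct_ids in no_image_correct.values():
        compromised |= set(correct_ids)
    return [q for q in question_ids if q not in compromised]


def percent_removed(retained: int, total: int) -> float:
    """Share of questions dropped by the filter, in percent."""
    return (total - retained) / total * 100.0


# Toy example with hypothetical model names and question ids.
questions = ["q1", "q2", "q3", "q4"]
no_image_correct = {"model_a": {"q1", "q2"}, "model_b": {"q2", "q3"}}
clean = b_clean_filter(questions, no_image_correct)
print(clean)  # prints ['q4']

# Retained/total counts as quoted from Section 5 of the paper.
reported = {
    "MicroVQA": (240, 1042),
    "MedXpertQA-MM": (514, 2000),
    "MMMU-Pro": (428, 1730),
}
for name, (retained, total) in reported.items():
    print(f"{name}: {percent_removed(retained, total):.1f}% removed")
```

Under this assumed criterion the retained set shrinks as more models are added, which is consistent with the authors' own caveat that B-Clean is model-set dependent.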

Main concerns

The interpretation of the guess-mode versus mirage-mode performance gap as evidence for 'distinct operating regimes' is underdetermined by the data provided; the observed decline in accuracy could reflect prompt sensitivity or changed uncertainty calibration rather than mechanistically different reasoning pathways. The paper also elides the distinction between questions that require visual information by definition and those that happen to be answerable from world knowledge: if a question about a chest X-ray can be answered from clinical vignette text alone, this may indicate sophisticated medical reasoning rather than benchmark contamination. Additionally, while the authors note that B-Clean provides 'relative rather than absolute evaluation,' they do not adequately address how to validate that the remaining 23–26% of 'clean' questions actually require visual reasoning, rather than reflecting unmeasured residual biases or data contamination patterns.

“When models were explicitly told that the image was missing and were instructed to guess, performance declined across most benchmark categories. This implies at least two distinct operating regimes”

paper · Section 4.1

“B-Clean is model-set dependent and provides relative rather than absolute evaluation”

paper · Section 6

Evidence and comparison

The evidence convincingly establishes that current multimodal benchmarks are vulnerable to text-only solutions, extending prior work on language biases in VQA (Goyal et al., Agrawal et al.) to modern frontier models and medical contexts. The comparison between 'mirage' behavior and traditional hallucination is well-articulated: whereas hallucinations involve 'filling in ungrounded details within a valid epistemic frame,' mirages involve 'constructing a false epistemic frame' entirely. However, the paper's claim that hidden patterns and benchmark structures enable answering in ways 'not captured by standard no-image guessing controls' somewhat understates the prior literature on VQA shortcut learning and data contamination. The authors cite this literature but do not clearly distinguish it from their novel contribution, which is the comparison of implicit (mirage) versus explicit (guess) image-absent conditions.

“Unlike hallucinations, which are defined as AI models filling in ungrounded details within a valid epistemic frame, the mirage effect involves constructing a false epistemic frame, i.e., describing a multi-modal input never provided by the user”

paper · Section 2.1

“Previous works in the field of AI evaluation have attempted to create benchmarks that truly evaluate the visual understanding by manually detecting and categorizing questions that are possible to answer without images”

paper · Section 4

Reproducibility

The methodology is documented with sufficient detail for reproduction, including specific API versions (Azure OpenAI 2024-12-01), model configurations (temperature=1, reasoning_effort settings), and prompt templates for each benchmark. The super-guesser training uses standard tools (LLaMA-Factory, LoRA rank=8, α=16) and publicly available base models (Qwen2.5-3B-Instruct), with hyperparameters fully specified. However, the paper does not mention release of code, the Phantom-0 benchmark, or the trained super-guesser weights, which would be necessary for full reproduction. The reliance on proprietary APIs (GPT-5, Gemini-3-Pro, Claude Opus 4.5) whose exact architectures and training data are undisclosed limits mechanistic interpretability of the observed effects, and the use of GPT-5 as an automated judge for mirage detection introduces potential circularity that is not fully validated.

“We used parameter-efficient fine-tuning via LoRA (rank =8, α=16, dropout =0) applied to all linear layers, with LoRA+ (learning rate ratio =16). Training was performed using the LLaMA-Factory library”

paper · Section 7.8
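To make the quoted hyperparameters concrete, here is a minimal sketch of what they imply numerically. It assumes the standard LoRA convention of scaling the low-rank update by α/rank and the LoRA+ convention of giving the B matrices a learning rate that is a fixed multiple of the one used for A. The class and field names are hypothetical and are not LLaMA-Factory's configuration keys; the base learning rate in the usage line is a placeholder, since the review does not quote one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hyperparameters quoted from Section 7.8; field names are illustrative."""
    rank: int = 8
    alpha: int = 16
    dropout: float = 0.0
    loraplus_lr_ratio: float = 16.0

    @property
    def scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update B @ A by alpha / rank
        # (assumed here; variants such as rsLoRA scale differently).
        return self.alpha / self.rank

    def lr_for_b(self, base_lr: float) -> float:
        # LoRA+ trains the B matrices with a learning rate that is
        # loraplus_lr_ratio times the rate used for the A matrices.
        return base_lr * self.loraplus_lr_ratio


settings = LoraSettings()
print(settings.scaling)  # prints 2.0

# base_lr is a placeholder value, not a number from the paper.
print(settings.lr_for_b(base_lr=1e-4))
```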

“To determine whether a model's response exhibited the mirage effect, we used GPT-5 as an automated judge”

paper · Section 7.3

Abstract

Multimodal AI systems have achieved remarkable performance across a broad range of real-world tasks, yet the mechanisms underlying visual-language reasoning remain surprisingly poorly understood. We report three findings that challenge prevailing assumptions about how these systems process and integrate visual information. First, frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images. Third, when models were explicitly instructed to guess answers without image access, rather than being implicitly prompted to assume images were present, performance declined markedly. Explicit guessing appears to engage a more conservative response regime, in contrast to the mirage regime in which models behave as though images have been provided. These findings expose fundamental vulnerabilities in how visual-language models reason and are evaluated, pointing to an urgent need for private benchmarks that eliminate textual cues enabling non-visual inference, particularly in medical contexts where miscalibrated AI carries the greatest consequence. We introduce B-Clean as a principled solution for fair, vision-grounded evaluation of multimodal AI systems.
