Mirage: The Illusion of Visual Understanding
This paper identifies a critical failure mode in multimodal AI evaluation called the 'mirage effect,' where vision-language models generate confident descriptions and reasoning about images that were never provided. The authors demonstrate that frontier models (GPT-5, Gemini-3-Pro, Claude Opus 4.5) retain 70–80% of their benchmark accuracy when evaluated without any visual input, with medical benchmarks showing 60–99% susceptibility to such non-visual inference. A text-only 3B-parameter model fine-tuned on chest X-ray questions outperforms both frontier multimodal systems and human radiologists, exposing how current benchmarks fail to distinguish genuine visual understanding from sophisticated textual pattern matching. The findings challenge the validity of accuracy metrics for multimodal systems and propose B-Clean, a method to filter benchmark questions that can be answered without images.
The paper presents a compelling conceptual distinction between 'mirage reasoning' (constructing false epistemic frames) and standard hallucination (filling ungrounded details within valid frames), supported by rigorous empirical measurements across multiple frontier models and benchmarks. The demonstration that a text-only 'super-guesser' outperforms radiologists on ReXVQA without image access, combined with the pathology bias observed in medical mirages, provides strong evidence that current evaluation paradigms are fundamentally flawed. However, the claim that mirage-mode and guess-mode represent distinct 'operating regimes' relies heavily on prompt sensitivity interpretations that could benefit from mechanistic validation, and the paper occasionally overstates the uniqueness of this phenomenon given prior work on language priors in VQA.
The quantitative evidence for the mirage effect is robust. The authors define $\text{Mirage Score} = \frac{\text{Accuracy}_{\text{mirage}}}{\text{Accuracy}_{\text{original}}} \times 100\%$ and, across four frontier models and six benchmarks, report values of 70–80%, with medical benchmarks showing the highest susceptibility (60–99%). The pathology bias analysis is particularly striking: when asked to diagnose non-existent medical images, Gemini-3-Pro frequently generates serious conditions like STEMI and melanoma, creating dangerous silent failure modes for clinical deployment. The B-Clean methodology provides a practical, post-hoc solution for benchmark decontamination that removes 74–77% of questions as 'compromised,' revealing significant drops in accuracy and changed model rankings when only vision-necessary questions remain.
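As a minimal sketch of the ratio defined above, the following Python function computes a mirage score from two accuracy values. The function name and the example numbers are illustrative assumptions and are not taken from the paper.

```python
def mirage_score(acc_mirage: float, acc_original: float) -> float:
    """Return image-free accuracy as a percentage of with-image accuracy.

    Both arguments are accuracies on the same benchmark and must use the
    same scale (both fractions in [0, 1], or both percentages).
    """
    if acc_original <= 0:
        raise ValueError("acc_original must be positive")
    return 100.0 * acc_mirage / acc_original


# Hypothetical numbers for illustration only (not reported in the paper):
# a model scoring 0.80 with images and 0.60 without them.
example = mirage_score(acc_mirage=0.60, acc_original=0.80)
print(f"Mirage score: {example:.1f}%")  # about 75%
```

A score near 100% would mean that removing the images barely changes measured accuracy, which is the situation the review flags as most concerning.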
The interpretation of the guess-mode versus mirage-mode performance gap as evidence for 'distinct operating regimes' is underdetermined by the data provided; the observed decline in accuracy could reflect prompt sensitivity or changed uncertainty calibration rather than mechanistically different reasoning pathways. The paper elides the distinction between questions that require visual information by definition and those that happen to be answerable from world knowledge—if a question about a chest X-ray can be answered from clinical vignette text alone, this may indicate sophisticated medical reasoning rather than benchmark contamination. Additionally, while the authors note that B-Clean provides 'relative rather than absolute evaluation,' they do not adequately address how to validate that the remaining 23–26% of 'clean' questions actually require visual reasoning versus representing unmeasured residual biases or data contamination patterns.
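The concern about the 'clean' subset can be made concrete with a small sketch of a B-Clean-style filter. This review does not state the exact criterion B-Clean uses, so the rule below is an assumption: a question is treated as compromised if at least one evaluated model answers it correctly with the image withheld. All function names and data are hypothetical.

```python
from typing import Dict, List, Set


def clean_question_ids(no_image_correct: Dict[str, Dict[str, bool]]) -> List[str]:
    """Keep questions that no model answered correctly without the image.

    no_image_correct maps model name -> {question id -> correct?} for runs
    in which the image was withheld. ASSUMPTION (not confirmed by the
    source): one correct image-free answer marks a question as compromised.
    """
    all_ids: Set[str] = set()
    compromised: Set[str] = set()
    for per_question in no_image_correct.values():
        for qid, correct in per_question.items():
            all_ids.add(qid)
            if correct:
                compromised.add(qid)
    return sorted(all_ids - compromised)


def accuracy_on(ids: List[str], with_image_correct: Dict[str, bool]) -> float:
    """With-image accuracy restricted to the given question ids."""
    if not ids:
        raise ValueError("no questions left after filtering")
    return sum(with_image_correct[q] for q in ids) / len(ids)


# Toy data, invented for illustration.
no_image = {
    "model_a": {"q1": True, "q2": False, "q3": False, "q4": False},
    "model_b": {"q1": False, "q2": True, "q3": False, "q4": False},
}
with_image_a = {"q1": True, "q2": True, "q3": True, "q4": False}

clean_ids = clean_question_ids(no_image)          # q1 and q2 are dropped
clean_acc = accuracy_on(clean_ids, with_image_a)  # accuracy on q3 and q4 only
```

Under this assumed rule, a question survives only because the evaluated models failed on it without the image. That is evidence relative to those models, not proof that the question requires visual reasoning, which is the validation gap noted above.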
The evidence convincingly establishes that current multimodal benchmarks are vulnerable to text-only solutions, extending prior work on language biases in VQA (Goyal et al., Agrawal et al.) to modern frontier models and medical contexts. The comparison between 'mirage' behavior and traditional hallucination is well-articulated: whereas hallucinations involve 'filling in ungrounded details within a valid epistemic frame,' mirages involve 'constructing a false epistemic frame' entirely. However, the paper's claim that hidden patterns and benchmark structures enable answering in ways 'not captured by standard no-image guessing controls' somewhat understates the prior literature on VQA shortcut learning and data contamination, which the authors cite but do not clearly distinguish from their novel contribution of comparing implicit (mirage) versus explicit (guess) image-absent conditions.
The methodology is documented with sufficient detail for reproduction, including specific API versions (Azure OpenAI 2024-12-01), model configurations (temperature=1, reasoning_effort settings), and prompt templates for each benchmark. The super-guesser training uses standard tools (LLaMA-Factory, LoRA rank=8, α=16) and publicly available base models (Qwen2.5-3B-Instruct), with hyperparameters fully specified. However, the paper does not mention release of code, the Phantom-0 benchmark, or the trained super-guesser weights, which would be necessary for full reproduction. The reliance on proprietary APIs (GPT-5, Gemini-3-Pro, Claude Opus 4.5) whose exact architectures and training data are undisclosed limits mechanistic interpretability of the observed effects, and the use of GPT-5 as an automated judge for mirage detection introduces potential circularity that is not fully validated.
Multimodal AI systems have achieved remarkable performance across a broad range of real-world tasks, yet the mechanisms underlying visual-language reasoning remain surprisingly poorly understood. We report three findings that challenge prevailing assumptions about how these systems process and integrate visual information. First, frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images. Third, when models were explicitly instructed to guess answers without image access, rather than being implicitly prompted to assume images were present, performance declined markedly. Explicit guessing appears to engage a more conservative response regime, in contrast to the mirage regime in which models behave as though images have been provided. These findings expose fundamental vulnerabilities in how visual-language models reason and are evaluated, pointing to an urgent need for private benchmarks that eliminate textual cues enabling non-visual inference, particularly in medical contexts where miscalibrated AI carries the greatest consequence. We introduce B-Clean as a principled solution for fair, vision-grounded evaluation of multimodal AI systems.