The Dual Mechanisms of Spatial Reasoning in Vision-Language Models
This paper investigates how vision-language models (VLMs) perform spatial reasoning—the binding of objects to spatial relations. It reveals that VLMs rely on two concurrent mechanisms: a dominant one where the vision encoder encodes object layout globally across visual tokens (extending into background regions), and a secondary one where the language model backbone forms ordering representations over object tokens. The finding that enhancing these vision-derived spatial representations improves performance without fine-tuning challenges the prevailing focus on LM backbones and highlights the critical role of vision encoders in multimodal reasoning.
The paper presents a compelling mechanistic analysis of spatial reasoning in VLMs, convincingly demonstrating that vision encoders provide the primary source of ordering information while LM backbones play a secondary, backup role. The causal evidence from interchange interventions is rigorous, and the practical demonstration that amplifying vision-derived signals corrects more than 30% of previously incorrect predictions in both models tested validates the mechanistic findings. As noted in Section 5.4, "amplifying visual ordering representations corrects more than 50% of previously incorrect predictions for Gemma-3-4b-it and more than 30% for Qwen2-VL-7B-Instruct."
The dual-mechanism finding is well-supported by systematic causal interventions that distinguish probing from causation. The authors show that "patching only the square-localized visual tokens, without the surrounding strip, does not reliably induce this change" (Section 5.2.2), whereas patching strip-aligned tokens immediately switches model outputs, demonstrating that ordering information is distributed across background tokens. The backup mechanism in the LM backbone—where ordering representations emerge in intermediate layers (13-17) when vision signals are ablated—provides elegant evidence for the secondary pathway. The consistency of results across two distinct architectures (Qwen2-VL-7B and Gemma-3-4b-it) and multiple datasets (synthetic Squares/Shapes/Objects and naturalistic What’sUp) strengthens the generalizability of the findings.
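To make the logic of the interchange intervention concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes token activations are plain lists of floats, and `toy_readout` is a hypothetical stand-in for whatever the downstream model computes. The readout is deliberately built to pool over all tokens, so the example only illustrates why patching a single object token can leave the output unchanged while patching the whole strip flips it; it is not evidence for the paper's claim.

```python
from typing import List, Sequence

Vector = List[float]


def interchange(
    clean: Sequence[Vector],
    counterfactual: Sequence[Vector],
    patch_idx: Sequence[int],
) -> List[Vector]:
    """Copy `clean`, replacing the tokens at `patch_idx` with the
    corresponding tokens from `counterfactual`."""
    if len(clean) != len(counterfactual):
        raise ValueError("clean and counterfactual must have the same length")
    patched = [list(v) for v in clean]  # copy so the inputs are not mutated
    for i in patch_idx:
        patched[i] = list(counterfactual[i])
    return patched


def toy_readout(tokens: Sequence[Vector]) -> str:
    """Hypothetical readout: pools coordinate 0 over ALL tokens.
    This distributed pooling is an assumption made for illustration."""
    mean = sum(t[0] for t in tokens) / len(tokens)
    return "right" if mean > 0 else "left"


# Six visual tokens in one row; index 2 plays the role of the object token.
clean = [[-1.0] for _ in range(6)]          # encodes "left" everywhere
counterfactual = [[1.0] for _ in range(6)]  # encodes "right" everywhere

object_only = interchange(clean, counterfactual, [2])
strip = interchange(clean, counterfactual, range(6))

print(toy_readout(object_only))  # one patched token does not move the pooled signal past zero
print(toy_readout(strip))        # patching the full strip does
```

In a real VLM the same swap would be applied to hidden states at a chosen layer (for example through forward hooks), and the readout would be the model's own answer rather than a hand-written function.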
The intervention on naturalistic images requires adapting the What’sUp dataset by merging image pairs to create three-object scenes, which may introduce artificial spatial configurations; the authors acknowledge this but do not quantify the impact. The amplification intervention uses a hand-tuned coefficient $\alpha \in [1,15]$ without systematic analysis of how this hyperparameter generalizes across different image types, raising questions about the robustness of the fix. Additionally, while the paper establishes that spatial information forms "strip-like" patterns in the vision encoder, it does not explain why this information distributes globally rather than localizing to object tokens, leaving a gap in the mechanistic understanding of how the vision encoder constructs these representations in the first place. The claim that the LM mechanism is "secondary" relies on the magnitude of performance drop when vision signals are removed, but the extent to which these mechanisms operate redundantly versus complementarily across different spatial reasoning tasks remains unclear.
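The concern about the coefficient $\alpha$ is easier to assess with a concrete form of the intervention in mind. The review does not state the exact update rule, so the sketch below assumes one plausible variant: the component of a hidden state along a unit-normalized probe direction is scaled by $\alpha$, leaving the orthogonal component untouched. The function name `amplify_along` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def amplify_along(
    h: Sequence[float], direction: Sequence[float], alpha: float
) -> List[float]:
    """Scale the component of `h` along `direction` by `alpha`.

    Assumed rule (not taken from the paper):
        h' = h + (alpha - 1) * (h . d) * d,  with d = direction / ||direction||
    With alpha = 1 the hidden state is returned unchanged.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [x / norm for x in direction]
    proj = sum(a * b for a, b in zip(h, unit))  # scalar projection h . d
    return [a + (alpha - 1.0) * proj * u for a, u in zip(h, unit)]


# Toy hidden state and probe direction (illustrative values only).
h = [2.0, 3.0]
probe_direction = [1.0, 0.0]

# Sweep a few coefficients from the reported range [1, 15].
for alpha in (1.0, 5.0, 15.0):
    print(alpha, amplify_along(h, probe_direction, alpha))
```

Under this assumed rule the orthogonal component is preserved for every $\alpha$, so a sensitivity analysis would mainly need to track how the amplified component interacts with downstream layers across image types.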
The evidence substantially supports the central claim that vision encoders dominate spatial reasoning, challenging the prevailing focus on LM backbones. The comparison to prior work on variable binding in language models (Dai et al., 2024; Prakash et al., 2025) is fair and well-cited, with the authors clearly distinguishing that while LMs localize ordering information to specific entity tokens, VLMs distribute this information across visual tokens. However, the paper could better situate its findings against concurrent work on spatial reasoning failures (e.g., Chen et al., 2025 on attention mechanisms) to clarify whether the distributed encoding represents a failure mode or a functional feature. The probing results showing that "positional information is distributed coherently across background tokens rather than localized to object regions" (Section 5.2.1) provide strong correlational support that is effectively validated by subsequent causal interventions.
The paper provides substantial experimental detail including prompts, image generation parameters (Appendix A.2), and model configurations. The authors commit to releasing code and data upon acceptance. However, the interchange intervention experiments require careful construction of clean-counterfactual pairs (50 per setting) and precise bounding box identification for What’sUp images, which may introduce implementation subtleties not fully specified (e.g., exact token selection criteria for "strips"). The amplification intervention relies on probe directions trained on specific bounding boxes, and without exact specifications of how these boxes are determined across all datasets, independent reproduction of the exact quantitative results may be challenging. The paper does not specify computational requirements or random seed settings, which could affect the reproducibility of the intervention results.
Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.