CVT-Bench: Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs
CVT-Bench evaluates whether multimodal LLMs can maintain stable spatial representations under counterfactual viewpoint transformations—such as inferring object relationships from a camera angle never shown in the image. Using 100 synthetic tabletop scenes and 6,000 relational queries across rotations from $0^{\circ}$ to $360^{\circ}$, the benchmark reveals that state-of-the-art models, despite high single-view accuracy, systematically fail at mental rotation tasks and degrade further under extended sequential context. These findings challenge the assumption that strong episodic spatial performance implies robust viewpoint-invariant representations, with critical implications for embodied AI and robotics applications requiring perspective-taking.
The paper presents a compelling diagnostic study that rigorously isolates viewpoint-conditioned spatial reasoning from perception using controlled CLEVR-based scenes. The core claim—that single-view accuracy overestimates robustness—is convincingly supported by U-shaped degradation curves and survival analysis across episodic and sequential settings. The finding that structured inputs (text, scene graphs) only partially mitigate instability suggests fundamental limitations in how MLLMs maintain relational state under transformation. However, the evaluation is restricted to synthetic tabletop environments with single-axis vertical rotations, leaving uncertain whether these representational fragilities generalize to complex real-world scenes with cluttered backgrounds and full 6DoF viewpoint changes.
The experimental design controls for confounding factors by using procedurally generated CLEVR scenes with exact 3D coordinates and unique object tags, eliminating object recognition ambiguity (verified by perfect tagging accuracy). The survival metric $\text{Survival}(t)=\frac{1}{N}\sum_{n=1}^{N}\prod_{k=1}^{t}\mathbb{I}(\hat{r}_{n,k}=r_{n,k})$, where $\hat{r}_{n,k}$ and $r_{n,k}$ denote the predicted and ground-truth relations at step $k$ of sequence $n$, gives the fraction of the $N$ sequences answered correctly at every step up to $t$. It quantifies how spatial reasoning coherence decays over sequential interaction, revealing that accuracy drops precipitously even when prompts occupy only $\sim$10–33% of model context windows. The three-modality comparison (image, text-only, scene graph) separates perceptual failures from reasoning failures, demonstrating that explicit geometric structure improves but does not eliminate viewpoint transformation errors.
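The survival metric above can be computed in a few lines. The following is a minimal sketch, not the authors' evaluation code: the function name `survival_curve` and the list-of-lists input layout (one list of relation labels per sequence) are assumptions made for illustration.

```python
from typing import List, Sequence


def survival_curve(
    predictions: Sequence[Sequence[str]],
    targets: Sequence[Sequence[str]],
) -> List[float]:
    """Return [Survival(1), ..., Survival(T)].

    Survival(t) is the fraction of sequences whose first t answers
    are all correct. Input layout is an assumption for illustration.
    """
    n = len(predictions)
    if n == 0 or n != len(targets):
        raise ValueError("predictions and targets must be non-empty and aligned")

    # Use the shortest sequence length so every step is defined for all sequences.
    horizon = min(min(len(p), len(t)) for p, t in zip(predictions, targets))

    alive = [True] * n  # running product of the indicator terms, per sequence
    curve = []
    for k in range(horizon):
        for i in range(n):
            alive[i] = alive[i] and predictions[i][k] == targets[i][k]
        curve.append(sum(alive) / n)
    return curve


targets = [["left", "right", "front"], ["left", "right", "front"]]
predictions = [["left", "right", "front"], ["left", "left", "front"]]
print(survival_curve(predictions, targets))  # [1.0, 0.5, 0.5]
```

Because the product runs over all steps up to $t$, a sequence that makes one error never counts as surviving again, even if its later answers are correct. This is why the curve is non-increasing and is a stricter measure than per-step accuracy.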
The benchmark's ecological validity is limited by its reliance on synthetic CLEVR scenes featuring simple geometric primitives on empty tabletops, which lack the texture, lighting complexity, and semantic diversity of real-world environments. The authors exclude Qwen-3-VL and MolMo-2 from analysis due to "degenerate response patterns" without quantitative characterization in the main text, potentially introducing selection bias toward models that happen to format outputs correctly for their parsing pipeline. The sequential protocol concatenates up to 20 independent scenes without intermediate conversational turns or agent actions, creating an artificial stress test that differs from realistic embodied interaction where spatial queries are interleaved with observation and navigation. Furthermore, restricting counterfactual transformations to vertical-axis rotations ($\theta \in \{45^{\circ}, \dots, 360^{\circ}\}$) leaves unexamined whether models can handle arbitrary 6DoF viewpoint changes involving pitch, roll, or translation.
The angle-dependent failure patterns provide strong evidence for systematic rather than random errors: models consistently exhibit sharp performance drops at $90^{\circ}$ and $270^{\circ}$ rotations with partial recovery at $180^{\circ}$, suggesting brittle heuristic transformation strategies rather than robust mental rotation. The comparison between episodic and sequential settings (BS=1 vs BS=20) clearly demonstrates that spatial instability emerges from context accumulation, not just geometric difficulty. The authors appropriately distinguish their work from embodied benchmarks like ALFRED and Habitat by isolating passive spatial reasoning from control and navigation. However, the paper does not explicitly reconcile its findings with recent structured-reasoning approaches such as SpatialVLM or Struct2D, which target similar spatial grounding challenges but emphasize training-time interventions rather than diagnostic evaluation of frozen models.
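The geometry behind the angle dependence can be made concrete with a small sketch. It assumes a convention that the paper may not share: ground-plane coordinates with $x$ pointing to the camera's right and $y$ pointing away from the camera at $0^{\circ}$, and a camera that orbits counterclockwise (seen from above) about the vertical axis. The function name `relation_from_view` and the label strings are hypothetical.

```python
import math
from typing import Tuple


def relation_from_view(
    a: Tuple[float, float],
    b: Tuple[float, float],
    theta_deg: float,
) -> Tuple[str, str]:
    """Relation of object a relative to object b after orbiting the camera.

    Assumed convention (not taken from the paper): at 0 degrees the camera's
    right axis is +x and its forward axis is +y; the camera orbits
    counterclockwise about the vertical axis by theta_deg.
    """
    theta = math.radians(theta_deg)
    dx, dy = a[0] - b[0], a[1] - b[1]

    # Project the displacement onto the rotated camera axes.
    lateral = dx * math.cos(theta) + dy * math.sin(theta)   # along camera right
    depth = -dx * math.sin(theta) + dy * math.cos(theta)    # along camera forward

    horizontal = "right" if lateral > 0 else "left"
    longitudinal = "behind" if depth > 0 else "front"
    return horizontal, longitudinal


a, b = (2.0, 1.0), (0.0, 0.0)
for angle in (0, 90, 180, 270, 360):
    print(angle, relation_from_view(a, b, angle))
```

Under this convention, a $180^{\circ}$ orbit negates both components, so every label simply flips, and $360^{\circ}$ restores the original answers. At $90^{\circ}$ and $270^{\circ}$ the lateral and depth axes exchange roles, so the new left/right answer depends on the original front/behind offset. That the second case is harder for a label-flipping heuristic is one possible reading of the reported pattern, not a claim made in the review.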
Reproducibility is supported by detailed prompt templates for all three input representations (Image, Text-only, Scene Graph) provided in Appendix Section 11, along with exact API settings and model specifications including context utilization statistics. The benchmark construction uses 100 procedurally generated scenes with precise occlusion ratios, density levels (Sparse/Dense), and ground-truth coordinate annotations documented in Table S1. However, the paper does not explicitly state whether the generation code, scene assets, or evaluation scripts will be publicly released beyond the project page URL. Independent reproduction is complicated by reliance on proprietary closed-source models (GPT-5.2, Gemini-3.1 Pro) whose weights and training corpora are undisclosed and subject to API version drift, though the inclusion of open-weights models (Qwen-3.5-OS, Kimi-2.5) partially mitigates this concern.
Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency, $360^{\circ}$ cycle agreement, and relational stability over sequential transformations. Despite high single-view accuracy, state-of-the-art MLLMs exhibit systematic degradation under counterfactual viewpoint changes, with frequent violations of cycle consistency and rapid decay in relational stability. We further evaluate multiple input representations (visual input, textual bounding boxes, and structured scene graphs) and show that increasing representational structure improves stability. Our results suggest that single-view spatial accuracy overestimates the robustness of induced spatial representations and that representation structure plays a critical role in counterfactual spatial reasoning.
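A full orbit returns the camera to its starting pose, so a consistent model should give the same answer at $0^{\circ}$ and at $360^{\circ}$. The sketch below shows one way such a cycle agreement score could be computed. The paper's exact definition is not given here, so the function name `cycle_agreement` and the choice to score plain label equality are assumptions.

```python
from typing import Sequence


def cycle_agreement(
    answers_at_0: Sequence[str],
    answers_at_360: Sequence[str],
) -> float:
    """Fraction of queries answered identically at 0 and 360 degrees.

    Assumed definition for illustration; the benchmark's own metric
    may differ in detail.
    """
    if len(answers_at_0) != len(answers_at_360):
        raise ValueError("answer lists must be aligned query by query")
    if not answers_at_0:
        return 0.0
    matches = sum(x == y for x, y in zip(answers_at_0, answers_at_360))
    return matches / len(answers_at_0)


base = ["left", "front", "right", "behind"]
full_turn = ["left", "behind", "right", "behind"]
print(cycle_agreement(base, full_turn))  # 0.75
```

Defined this way, the score measures self-consistency rather than correctness: a model that is wrong in the same way at both angles still scores 1.0, so it is best read alongside accuracy against ground truth.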