Mind over Space: Can Multimodal Large Language Models Mentally Navigate?

cs.AI Qihui Zhu, Shouwei Ruan, Xiao Yang, Hao Jiang, Yao Huang, Shiji Zhao, Hanwei Fan, Hang Su, Xingxing Wei · Mar 23, 2026

AI Review

Plain-language introduction

This paper addresses a critical gap in embodied AI: while Multimodal Large Language Models (MLLMs) excel at reactive, short-horizon planning, they fail at biological-like "mental navigation" — the ability to construct global cognitive maps from long egocentric videos and simulate paths before acting. To tackle this, the authors introduce Video2Mental, a benchmark requiring models to generate structured hierarchical cognitive maps from videos exceeding five minutes, then produce landmark-grounded navigation plans validated in the Habitat simulator. They also propose NavMind, a Qwen3-VL-based model trained via difficulty-stratified progressive supervised fine-tuning to internalize these structured representations.

Critical review

Verdict

Bottom line

The paper makes a valuable contribution by formalizing mental navigation as an explicit multi-stage task and releasing a substantial benchmark with physical validation. However, the central claim that NavMind "internalizes" mental navigation capabilities is overstated. The approach amounts to supervised fine-tuning on structured JSON outputs (cognitive maps) followed by path planning, which is closer to structured chain-of-thought generation than to acquiring emergent spatial reasoning. While the progressive training strategy with rejection sampling improves performance on the target distribution, the paper does not establish that the model generalizes compositional spatial reasoning beyond training set patterns.

“NavMind, a reasoning model that internalizes mental navigation using explicit, fine-grained cognitive maps as learnable intermediate representations”

paper · Abstract

What holds up

The Video2Mental benchmark is methodologically sound and addresses a real evaluation gap. By requiring explicit cognitive map generation as an intermediate step and validating plans through simulator-based physical interaction ($\text{SR}_p$ and SPL metrics) rather than just text matching, the protocol ensures that models cannot exploit superficial pattern matching. The scale is substantial with 23,700 samples, and the difficulty stratification by spatio-temporal span provides meaningful progressive evaluation. The finding that frontier models fail catastrophically at zero-shot structured spatial representation is well-demonstrated and important for the field.

“We validate generated plans through physical interaction in the Habitat simulator, ensuring faithful evaluation of their physical correctness”

paper · Introduction

“Video2Mental, a large-scale benchmark comprising 23,700 high-difficulty mental navigation samples”

paper · Section 1
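As a point of reference for the simulator-based metrics mentioned above, the following is a minimal sketch of SPL (Success weighted by Path Length) in its commonly used form from the embodied navigation literature, alongside a plain success rate. The `Episode` fields and function names are hypothetical, and the paper's exact definition of $\text{SR}_p$ is not given in the material reviewed here, so the success rate below is only the ordinary fraction of successful episodes and may differ from the paper's metric.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    # Hypothetical record of one simulated navigation episode.
    success: bool                 # did the agent reach the goal?
    shortest_path_length: float   # geodesic start-to-goal distance
    agent_path_length: float      # distance the agent actually travelled


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(ep.success for ep in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    Each successful episode contributes l / max(p, l), where l is the
    shortest path length and p is the agent's path length. Failed
    episodes contribute 0.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        if not ep.success:
            continue
        denom = max(ep.agent_path_length, ep.shortest_path_length)
        # Degenerate case: start and goal coincide and the agent did not move.
        total += 1.0 if denom == 0 else ep.shortest_path_length / denom
    return total / len(episodes)
```

Under this definition a plan that reaches the goal by a path twice as long as the shortest one earns half credit, which is why SPL is harder to inflate through format compliance alone than a text-matching score.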

Main concerns

First, the paper conflates structured output generation with genuine cognitive map formation. The "cognitive maps" are textual JSON structures emitted before planning, not learned latent representations; the result is a supervised two-step pipeline in which mapping errors propagate into planning, and the planner has no way to recover from them. Second, the rejection sampling strategy filters out "low-perplexity, simplistic trajectories" to focus on "difficult samples," which risks overfitting to dataset-specific hard examples rather than learning transferable spatial reasoning. Third, the comparison to "frontier commercial MLLMs" in zero-shot settings is misleading because these models were not designed for this specific structured task; the performance gap may reflect format compliance rather than fundamental capability differences.

“rejection sampling to filter out low-perplexity, simplistic trajectories, we steer the optimization toward difficult samples that demand deep spatial reasoning rather than mere pattern memorization”

paper · Abstract

“Frontier MLLMs struggle profoundly with zero-shot structured spatial representation”

paper · Introduction
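To make the second concern concrete, here is a minimal sketch of what perplexity-based rejection sampling could look like. It is an illustration under stated assumptions, not the authors' procedure: the paper does not report the threshold or which model scores the trajectories, so `threshold`, the `token_logprobs` field, and both function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a sequence: exp of the mean negative log-likelihood."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def keep_hard_samples(samples: List[Dict], threshold: float) -> List[Dict]:
    """Drop samples whose target trajectory the scoring model finds easy.

    Each sample is assumed to carry per-token natural-log probabilities
    of its reference trajectory under some scoring model. Samples with
    perplexity below `threshold` are rejected; the rest are kept.
    """
    return [
        s for s in samples
        if perplexity(s["token_logprobs"]) >= threshold
    ]
```

The sketch shows why the choice of threshold matters for the overfitting concern: the retained set is defined entirely by what one scoring model finds surprising on this dataset, so a different scorer or cutoff would select a different notion of "difficult".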

Evidence and comparison

The evidence supports the claim that standard pre-training does not produce mental navigation capabilities, as zero-shot $\text{SR}_p$ remains low for GPT-4o and Qwen3-VL. Table 1 (partially shown) suggests NavMind achieves high success rates, but the paper does not analyze whether the improvements stem from better spatial reasoning or simply from learning to output the specific JSON schema required by the simulator. Comparison to prior work is also sparse: the paper mentions "spatial MLLMs" but does not include recent embodied navigation systems like UniNL or Navila as baselines, so it is unclear whether NavMind improves on the state of the art in embodied agents or merely outperforms generalist models on a specific benchmark.

“NavMind achieves superior mental navigation capabilities, significantly outperforming frontier commercial and spatial MLLMs”

paper · Table 1 caption

Reproducibility

Reproducibility is significantly hampered by missing implementation details. The paper does not report hyperparameters for the difficulty-stratified SFT (learning rates, batch sizes, training steps), the specific criteria for rejection sampling perplexity thresholds, or computational resources required. While the Qwen3-VL base architecture is public, the paper makes no commitment to release the Video2Mental dataset, NavMind weights, or evaluation code. Without these, independent verification of the claimed "significant outperformance" is impossible. The simulator validation pipeline using "downstream navigation expert models" is mentioned but not described, leaving ambiguity about whether the same expert policy was used across all test conditions.

“Built upon the Qwen3-VL architecture, NavMind is trained on the training split of Video2Mental through a two-stage process”

paper · Section 1

“The generated plans are further evaluated in a simulator using downstream navigation expert models across multiple metrics”

paper · Figure 1 caption

Abstract

Despite the widespread adoption of MLLMs in embodied agents, their capabilities remain largely confined to reactive planning from immediate observations, consistently failing in spatial reasoning across extensive spatiotemporal scales. Cognitive science reveals that Biological Intelligence (BI) thrives on "mental navigation": the strategic construction of spatial representations from experience and the subsequent mental simulation of paths prior to action. To bridge the gap between AI and BI, we introduce Video2Mental, a pioneering benchmark for evaluating the mental navigation capabilities of MLLMs. The task requires constructing hierarchical cognitive maps from long egocentric videos and generating landmark-based path plans step by step, with planning accuracy verified through simulator-based physical interaction. Our benchmarking results reveal that mental navigation capability does not naturally emerge from standard pre-training. Frontier MLLMs struggle profoundly with zero-shot structured spatial representation, and their planning accuracy decays precipitously over extended horizons. To overcome this, we propose NavMind, a reasoning model that internalizes mental navigation using explicit, fine-grained cognitive maps as learnable intermediate representations. Through a difficulty-stratified progressive supervised fine-tuning paradigm, NavMind effectively bridges the gap between raw perception and structured planning. Experiments demonstrate that NavMind achieves superior mental navigation capabilities, significantly outperforming frontier commercial and spatial MLLMs.
