Mind over Space: Can Multimodal Large Language Models Mentally Navigate?
This paper addresses a critical gap in embodied AI: while Multimodal Large Language Models (MLLMs) excel at reactive, short-horizon planning, they fail at biological-like "mental navigation", the ability to construct global cognitive maps from long egocentric videos and simulate paths before acting. To tackle this, the authors introduce Video2Mental, a benchmark requiring models to generate structured hierarchical cognitive maps from videos exceeding five minutes, then produce landmark-grounded navigation plans validated in the Habitat simulator. They also propose NavMind, a Qwen3-VL-based model trained via difficulty-stratified progressive supervised fine-tuning to internalize these structured representations.
The paper makes a valuable contribution by formalizing mental navigation as an explicit multi-stage task and releasing a substantial benchmark with physical validation. However, the central claim that NavMind "internalizes" mental navigation capabilities is overstated. The approach amounts to supervised fine-tuning on structured JSON outputs (cognitive maps) followed by path planning, which is closer to structured chain-of-thought generation than to acquiring emergent spatial reasoning. While the progressive training strategy with rejection sampling improves performance on the target distribution, the paper does not establish that the model generalizes compositional spatial reasoning beyond training-set patterns.
The Video2Mental benchmark is methodologically sound and addresses a real evaluation gap. By requiring explicit cognitive map generation as an intermediate step and validating plans through simulator-based physical interaction ($\text{SR}_p$ and SPL metrics) rather than just text matching, the protocol ensures that models cannot exploit superficial pattern matching. The scale is substantial with 23,700 samples, and the difficulty stratification by spatio-temporal span provides meaningful progressive evaluation. The finding that frontier models fail catastrophically at zero-shot structured spatial representation is well-demonstrated and important for the field.
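To make the two simulator-based metrics concrete, here is a minimal sketch of how a success rate and SPL (success weighted by path length) are commonly computed in embodied navigation. The review does not reproduce the paper's exact definitions of $\text{SR}_p$ and SPL, so this follows the widely used formulation and may differ in detail from the authors' implementation; the `Episode` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    # Hypothetical record of one simulated navigation episode.
    success: bool         # did the agent stop at the goal?
    shortest_path: float  # geodesic shortest-path length to the goal (assumed > 0)
    agent_path: float     # length of the path the agent actually travelled


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes in which the agent reached the goal."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by path length: mean of S_i * l_i / max(p_i, l_i)."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if e.success:
            # Failed episodes contribute 0; successful detours are penalized.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),  # optimal
    Episode(success=True, shortest_path=10.0, agent_path=20.0),  # 2x detour
    Episode(success=False, shortest_path=5.0, agent_path=7.0),   # failure
]
print(success_rate(episodes), spl(episodes))
```

Under this formulation, a plan that reaches the goal by an inefficient route still counts fully toward the success rate but only partially toward SPL, which is why reporting both is more informative than text matching against a reference plan.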
First, the paper conflates structured output generation with genuine cognitive map formation. The "cognitive maps" are textual JSON structures output before planning, not learned latent representations; this is a supervised pipeline vulnerable to error propagation, in which planning cannot recover from mapping errors. Second, the rejection sampling strategy filters out "low-perplexity, simplistic trajectories" to focus on "difficult samples," which risks overfitting to dataset-specific hard examples rather than learning transferable spatial reasoning. Third, the comparison to "frontier commercial MLLMs" in zero-shot settings is misleading because these models were not designed for this specific structured task; the performance gap may reflect format compliance rather than fundamental capability differences.
The evidence supports the claim that standard pre-training does not produce mental navigation capabilities, as zero-shot $\text{SR}_p$ remains low for GPT-4o and Qwen3-VL. Table 1 (partially shown) suggests NavMind achieves high success rates, but the paper lacks analysis of whether the improvements stem from better spatial reasoning or simply from learning to output the specific JSON schema required by the simulator. Comparison to prior work is also sparse: the paper mentions "spatial MLLMs" but does not include recent embodied navigation systems like UniNL or Navila as baselines, making it unclear whether NavMind improves upon the state-of-the-art in embodied agents or merely outperforms generalist models on a specific benchmark.
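One way to run the missing analysis is to report format compliance separately from success conditioned on a well-formed output. The sketch below illustrates the idea only; the required keys `map` and `plan` are placeholders rather than the paper's actual schema, and `decompose` is a hypothetical helper, not part of any released evaluation code.

```python
import json
from typing import Dict, Iterable, Tuple

# Placeholder top-level keys; the real Video2Mental schema is not given here.
REQUIRED_KEYS = {"map", "plan"}


def parses_as_schema(raw: str) -> bool:
    """True if the raw model output is a JSON object containing the required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS.issubset(obj)


def decompose(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Split overall success into format compliance and success given valid format.

    Each record is (raw_model_output, simulator_success).
    """
    n = valid = valid_success = 0
    for raw, success in records:
        n += 1
        if parses_as_schema(raw):
            valid += 1
            valid_success += bool(success)
    return {
        "format_compliance": valid / n if n else 0.0,
        "success_given_valid": valid_success / valid if valid else 0.0,
    }


records = [
    ('{"map": {}, "plan": []}', True),
    ("not json at all", False),
    ('{"map": {}, "plan": []}', False),
    ('{"plan": []}', False),  # parses, but is missing a required key
]
print(decompose(records))
```

If a zero-shot baseline shows low format compliance but a success-given-valid rate comparable to the fine-tuned model, the headline gap would mostly reflect schema adherence; a large gap in the conditional rate would point instead to a difference in spatial reasoning.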
Reproducibility is significantly hampered by missing implementation details. The paper does not report hyperparameters for the difficulty-stratified SFT (learning rates, batch sizes, training steps), the specific criteria for rejection sampling perplexity thresholds, or computational resources required. While the Qwen3-VL base architecture is public, the paper makes no commitment to release the Video2Mental dataset, NavMind weights, or evaluation code. Without these, independent verification of the claimed "significant outperformance" is impossible. The simulator validation pipeline using "downstream navigation expert models" is mentioned but not described, leaving ambiguity about whether the same expert policy was used across all test conditions.
Despite the widespread adoption of MLLMs in embodied agents, their capabilities remain largely confined to reactive planning from immediate observations, consistently failing in spatial reasoning across extensive spatiotemporal scales. Cognitive science reveals that Biological Intelligence (BI) thrives on "mental navigation": the strategic construction of spatial representations from experience and the subsequent mental simulation of paths prior to action. To bridge the gap between AI and BI, we introduce Video2Mental, a pioneering benchmark for evaluating the mental navigation capabilities of MLLMs. The task requires constructing hierarchical cognitive maps from long egocentric videos and generating landmark-based path plans step by step, with planning accuracy verified through simulator-based physical interaction. Our benchmarking results reveal that mental navigation capability does not naturally emerge from standard pre-training. Frontier MLLMs struggle profoundly with zero-shot structured spatial representation, and their planning accuracy decays precipitously over extended horizons. To overcome this, we propose NavMind, a reasoning model that internalizes mental navigation using explicit, fine-grained cognitive maps as learnable intermediate representations. Through a difficulty-stratified progressive supervised fine-tuning paradigm, NavMind effectively bridges the gap between raw perception and structured planning. Experiments demonstrate that NavMind achieves superior mental navigation capabilities, significantly outperforming frontier commercial and spatial MLLMs.