Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models
Omni-WorldBench addresses the gap between passive video generation metrics and active world model evaluation by focusing on interactive response—how actions causally drive state transitions across space and time. It introduces Omni-WorldSuite, a 1,068-prompt hierarchical taxonomy spanning three interaction levels (single-object to global environmental effects), and Omni-Metrics, an agent-based evaluation protocol that aggregates Interaction Effect Fidelity, Generated Video Quality, and Camera-Object Controllability into an adaptive AgenticScore.
The paper successfully argues that current world models are over-optimized for visual fidelity at the expense of causal interaction coherence. The evaluation of 18 models reveals a stark gap: while most achieve >95% on motion smoothness, Interaction Effect Fidelity scores range from 42-67% (Table 1). However, the framework relies heavily on MLLM-based judgments for metric aggregation (AgenticScore) and semantic verification (InterCov), introducing potential instability and opacity. The benchmark claims to be "the first benchmark dedicated to assessing the interactive response capabilities," which is supported by the systematic coverage of interaction levels, though the empirical validation of these metrics against downstream task performance (e.g., planning accuracy) remains absent.
The three-level interaction hierarchy provides a principled scaffolding for evaluation prompts, progressing from "actions confined to the acting object" to effects that "influence multiple objects and lead to broader environmental changes" (Section 3.1). The quantitative results demonstrate meaningful discrimination between model paradigms, with Image-to-Video models generally outperforming Text-to-Video, supporting the design choice to condition on initial frames. The qualitative comparisons (Figures 5-6) align with quantitative scores, showing Matrix-Game2.0's "catastrophic collapse" versus Wan2.2's coherent motion.
The benchmark construction involves substantial manual intervention ("all generated captions are manually verified and refined," "all candidates are manually screened," Section 3.1) and depends on large models, most of them proprietary (ChatGPT-5.2, Gemini, DeepSeek-R1); both factors limit reproducibility. The InterCov metric (Equation 6) relies on a "VLM-based semantic verifier" that yields binary judgments with no calibration for model uncertainty or bias. Additionally, AgenticScore's adaptive weighting (Equation 8) is not ablated: the paper neither compares it to fixed weights nor shows that it correlates better with human judgments than the component metrics do. The 1,068-prompt suite, while diverse, may be too small for robust estimation across the full combinatorial space of annotated physical principles.
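The missing ablation could be approximated along the following lines. This is only a sketch: the component names, the per-model scores, and the human ratings are made-up placeholders, not values from the paper, and the fixed-weight aggregate is a hypothetical baseline rather than the paper's Equation 8. It compares how well an equal-weight aggregate and a single component rank models relative to human ratings, using Spearman rank correlation.

```python
from statistics import mean


def ranks(values):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def fixed_weight_score(components, weights):
    """Hypothetical baseline: weighted mean of component scores with fixed weights."""
    total = sum(weights.values())
    return sum(weights[k] * components[k] for k in weights) / total


# Placeholder data (NOT from the paper): one dict of component scores per model.
models = [
    {"interaction": 0.40, "quality": 0.95, "control": 0.50},
    {"interaction": 0.55, "quality": 0.90, "control": 0.60},
    {"interaction": 0.65, "quality": 0.97, "control": 0.70},
    {"interaction": 0.50, "quality": 0.99, "control": 0.40},
]
human_ratings = [2.0, 3.5, 4.5, 2.5]  # placeholder human scores, one per model

equal = {"interaction": 1.0, "quality": 1.0, "control": 1.0}
equal_scores = [fixed_weight_score(m, equal) for m in models]
quality_only = [m["quality"] for m in models]

rho_equal = spearman(equal_scores, human_ratings)
rho_quality = spearman(quality_only, human_ratings)
print(f"equal weights vs human: {rho_equal:.2f}")
print(f"quality only  vs human: {rho_quality:.2f}")
```

An adaptive aggregate would be added as a third score list and compared in the same way. With only a handful of models, the correlations are too noisy to separate weighting schemes, so a real ablation would need per-prompt or per-video human ratings.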
The evidence strongly supports the claim that visual quality does not imply interactive competence: models achieve 95-100% on Temporal Flickering and Motion Smoothness but only 37-55% on InterOrder and InterCov (Table 1). The comparison to VBench and WorldScore (Figure 2d) is accurate regarding coverage of interaction types and modalities. However, the paper does not establish predictive validity, that is, whether higher AgenticScores correlate with better performance on downstream world model applications (e.g., control, planning, counterfactual reasoning). Without this external validation, the benchmark measures "interactive response" as the authors' metrics define it, which is not necessarily the same as usefulness for the downstream tasks world models are intended to serve.
Reproduction is hindered by dataset construction dependencies on manual curation and closed-source MLLMs for prompt generation and metric computation. While Section 5.2 details inference hyperparameters for the 18 evaluated models, the Omni-WorldSuite construction pipeline requires human-in-the-loop verification that cannot be automated. The evaluation protocol depends on VLM APIs (for InterCov and AgenticScore) with potentially version-dependent behavior. Standard components (GroundingDINO, SAM, RAFT) are open-source, but the aggregation logic relies on "an MLLM to adaptively fuse these signals" (Section 4.5) without disclosed system prompts or calibration procedures. Code release is promised but not yet available.
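One way to gauge the version- and sampling-dependent behavior of a binary VLM verifier is to repeat each query and record how often the verdicts agree. The sketch below assumes a hypothetical `judge` callable that returns `True` or `False` for a prompt; it stands in for whatever API call an evaluator would actually make and does not reflect the paper's implementation.

```python
from collections import Counter
from typing import Callable, Tuple


def judge_stability(
    judge: Callable[[str], bool], prompt: str, trials: int = 5
) -> Tuple[bool, float]:
    """Query a binary judge repeatedly; return (majority verdict, agreement rate).

    `judge` is a hypothetical stand-in for a VLM-based verifier call.
    `trials` must be a positive odd number so the majority is never tied.
    """
    if trials < 1 or trials % 2 == 0:
        raise ValueError("trials must be a positive odd number")
    votes = [bool(judge(prompt)) for _ in range(trials)]
    majority, majority_count = Counter(votes).most_common(1)[0]
    return majority, majority_count / trials


# Usage with a deterministic stub in place of a real model call.
stub_answers = iter([True, False, True, True, False])
verdict, agreement = judge_stability(
    lambda _prompt: next(stub_answers),
    "Did the ball fall after being pushed?",
    trials=5,
)
print(verdict, agreement)
```

Reporting the agreement rate alongside each binary verdict, together with the model version and decoding settings used, would make results easier to compare across API updates.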
Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.