Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

cs.CV Meiqi Wu, Zhixin Cai, Fufangchen Zhao, Xiaokun Feng, Rujing Dang, Bingze Song, Ruitian Tian, Jiashu Zhu, Jiachen Lei, Hao Dou, Jing Tang, Lei Sun, Jiahong Wu, Xiangxiang Chu, Zeming Liu, Kaiqi Huang · Mar 23, 2026

AI Review

Plain-language introduction

Omni-WorldBench addresses the gap between passive video generation metrics and active world model evaluation by focusing on interactive response—how actions causally drive state transitions across space and time. It introduces Omni-WorldSuite, a 1,068-prompt hierarchical taxonomy spanning three interaction levels (single-object to global environmental effects), and Omni-Metrics, an agent-based evaluation protocol that aggregates Interaction Effect Fidelity, Generated Video Quality, and Camera-Object Controllability into an adaptive AgenticScore.

Critical review

Verdict

Bottom line

The paper makes a convincing case that current world models are over-optimized for visual fidelity at the expense of causal interaction coherence. The evaluation of 18 models reveals a stark gap: while most achieve >95% on motion smoothness, Interaction Effect Fidelity scores range from 42-67% (Table 1). However, the framework relies heavily on MLLM-based judgments for metric aggregation (AgenticScore) and semantic verification (InterCov), which makes the scores potentially unstable and hard to audit. The paper's claim to be "the first benchmark dedicated to assessing the interactive response capabilities" is supported by its systematic coverage of interaction levels, but the metrics are never validated against downstream task performance (e.g., planning accuracy).

“Although these metrics are effective in measuring visual fidelity and text–video alignment, they do not adequately capture the core capability of world models—the ability to generate consistent and plausible responses under varying interaction conditions.”

paper · Section 1

“Wan2.2 achieves 67.34% average Interaction Effect Fidelity while scoring 99.09% on Motion Smoothness and 98.36% on Temporal Flickering.”

paper · Table 1

What holds up

The three-level interaction hierarchy provides a principled scaffolding for evaluation prompts, progressing from "actions confined to the acting object" to effects that "influence multiple objects and lead to broader environmental changes" (Section 3.1). The quantitative results demonstrate meaningful discrimination between model paradigms, with Image-to-Video models generally outperforming Text-to-Video, supporting the design choice to condition on initial frames. The qualitative comparisons (Figures 5-6) align with quantitative scores, showing Matrix-Game2.0's "catastrophic collapse" versus Wan2.2's coherent motion.

Main concerns

The benchmark construction depends on substantial manual intervention ("all generated captions are manually verified and refined," "all candidates are manually screened," Section 3.1) and on external MLLMs (ChatGPT-5.2, Gemini, DeepSeek-R1), most of which are proprietary; both limit reproducibility. The InterCov metric (Equation 6) relies on a "VLM-based semantic verifier" that yields binary judgments with no calibration for model uncertainty or bias. Additionally, AgenticScore's adaptive weighting (Equation 8) comes without ablation studies that compare it to fixed weights or show that it correlates better with human judgments than the component metrics do. The 1,068-prompt suite, while diverse, may be too small for robust estimation across the full combinatorial space of annotated physical principles.

“all generated captions are manually verified and refined to ensure consistency with the source sequence”

paper · Section 3.1

“We employ a VLM-based semantic verifier to evaluate the video sequence, yielding a binary validity signal $v_o \in \{0,1\}$ for each entity”

paper · Section 4.4
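The calibration concern can be made concrete with a small sketch. The exact form of Equation 6 is not reproduced in this review, so the code below assumes InterCov is simply the fraction of expected entities whose binary signal $v_o$ equals 1; the function names `inter_cov` and `flip_rate`, and the entity labels, are hypothetical. The second function illustrates one cheap stability check: rerun the verifier several times and count how many entities change verdict.

```python
def inter_cov(validity):
    """Assumed coverage score: share of entities with v_o == 1."""
    if not validity:
        raise ValueError("no entities to score")
    if any(v not in (0, 1) for v in validity.values()):
        raise ValueError("validity signals must be 0 or 1")
    return sum(validity.values()) / len(validity)


def flip_rate(runs):
    """Share of entities whose binary verdict differs across repeated runs."""
    if len(runs) < 2:
        raise ValueError("need at least two verifier runs")
    entities = list(runs[0].keys())
    unstable = sum(1 for e in entities if len({run[e] for run in runs}) > 1)
    return unstable / len(entities)


# Hypothetical verdicts from three repeated verifier calls on the same video.
runs = [
    {"cup": 1, "table": 1, "water": 0},
    {"cup": 1, "table": 1, "water": 1},
    {"cup": 1, "table": 1, "water": 0},
]
scores = [inter_cov(run) for run in runs]
instability = flip_rate(runs)
```

A nonzero flip rate on identical inputs would indicate that the reported InterCov values carry verifier noise that a single binary pass hides.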

Evidence and comparison

The evidence strongly supports the claim that visual quality does not imply interactive competence: models achieve 95-100% on Temporal Flickering and Motion Smoothness but only 37-55% on InterOrder and InterCov (Table 1). The comparison to VBench and WorldScore (Figure 2d) is accurate regarding coverage of interaction types and modalities. However, the paper does not establish predictive validity, that is, whether higher AgenticScores correlate with better performance on downstream world model applications (e.g., control, planning, counterfactual reasoning). Without this external validation, the benchmark measures "interactive response" as the authors' metrics define it, which is not necessarily the same as usefulness for the downstream tasks world models are intended to serve.

“Current models are already strong in conventional video quality metrics, but still show clear limitations in action-conditioned world evolution, causal interaction consistency, and joint camera-object control.”

paper · Section 5.3
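One way the missing predictive-validity check could be run is a rank correlation between per-model benchmark scores and a downstream task metric. The sketch below is not taken from the paper; it implements Spearman's rank correlation with the standard library only, and the function names are hypothetical. No real scores are included, since the paper reports no downstream results.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(benchmark_scores, downstream_scores):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(benchmark_scores) != len(downstream_scores) or len(benchmark_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(benchmark_scores), average_ranks(downstream_scores))
```

A coefficient near 1 across the evaluated models would indicate that AgenticScore orders models the same way the downstream task does; with only 18 models, any such estimate would need a confidence interval.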

Reproducibility

Reproduction is hindered by dataset construction dependencies on manual curation and closed-source MLLMs for prompt generation and metric computation. While Section 5.2 details inference hyperparameters for the 18 evaluated models, the Omni-WorldSuite construction pipeline requires human-in-the-loop verification that cannot be automated. The evaluation protocol depends on VLM APIs (for InterCov and AgenticScore) with potentially version-dependent behavior. Standard components (GroundingDINO, SAM, RAFT) are open-source, but the aggregation logic relies on "an MLLM to adaptively fuse these signals" (Section 4.5) without disclosed system prompts or calibration procedures. Code release is promised but not yet available.

“The aggregation agent then analyzes the relative importance of these three evaluation dimensions using an MLLM conditioned on the evaluation prompt, and maps the resulting ranking to predefined weight coefficients $w_1, w_2, w_3$.”

paper · Section 4.5
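The ranking-to-weights step described in this excerpt can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weight values in `RANK_WEIGHTS` are hypothetical (the paper's predefined coefficients are not given in this review), the dimension keys are shorthand for the three components named above, and the MLLM call is replaced by a ranking passed in as an argument. A fixed-weight baseline is included because it is the comparison the review finds missing.

```python
# Hypothetical weights for the dimensions ranked 1st, 2nd and 3rd by the MLLM.
RANK_WEIGHTS = (0.5, 0.3, 0.2)

DIMENSIONS = (
    "interaction_effect_fidelity",
    "generated_video_quality",
    "camera_object_controllability",
)


def agentic_score(scores, ranking, rank_weights=RANK_WEIGHTS):
    """Weighted sum where weights follow a prompt-specific importance ranking."""
    if sorted(ranking) != sorted(DIMENSIONS):
        raise ValueError("ranking must be a permutation of the three dimensions")
    weights = dict(zip(ranking, rank_weights))
    return sum(weights[d] * scores[d] for d in DIMENSIONS)


def fixed_weight_score(scores):
    """Baseline for an ablation: equal weights regardless of the prompt."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


# Placeholder component scores in [0, 1] for one generated video.
scores = {
    "interaction_effect_fidelity": 0.60,
    "generated_video_quality": 0.90,
    "camera_object_controllability": 0.30,
}
# Stand-in for the MLLM output: most important dimension first.
ranking = [
    "interaction_effect_fidelity",
    "camera_object_controllability",
    "generated_video_quality",
]
adaptive = agentic_score(scores, ranking)
baseline = fixed_weight_score(scores)
```

Because the final score depends on the ranking, two runs that receive different rankings for the same prompt would score the same video differently, which is why undisclosed system prompts matter for reproduction.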

Abstract

Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni-WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni-WorldBench comprises two key components: Omni-WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni-Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.
