WorldCache: Content-Aware Caching for Accelerated Video World Models
WorldCache addresses the prohibitive latency of Diffusion Transformers (DiTs) for video world models by replacing static feature caching with a content-aware dynamical approximation framework. The method introduces motion-adaptive thresholds, saliency-weighted drift estimation, and optimal feature blending to eliminate ghosting artifacts during fast motion. Achieving 2.3× speedup on Cosmos-Predict2.5 with 99.4% quality retention, it offers a training-free path toward interactive world simulation.
The paper presents a technically sound and well-engineered acceleration framework that convincingly outperforms existing training-free caching baselines on PAI-Bench. The four-module design addresses specific failure modes (global averaging, static thresholds, zero-order holds) with lightweight, composable solutions. However, the maximal speedup claim relies on a configuration that slightly degrades quality relative to intermediate ablations, and the reliance on concurrent preprints (dated 2026) limits historical contextualization.
The formalization of caching as a dynamical approximation problem is apt, and the specific mechanisms are well-motivated. Causal Feature Caching's velocity-dependent threshold $\tau_{\text{CFC}}(v_t) = \tau_0 / (1 + \alpha \cdot v_t)$ and Saliency-Weighted Drift's channel-variance weighting directly counteract the blindspots of global drift metrics. The least-squares optimal state interpolation (OSI) using vector projection $\gamma^{*} = \langle\Delta_{\text{tgt}}, \Delta_{\text{src}}\rangle / (\|\Delta_{\text{src}}\|^2 + \epsilon)$ is mathematically elegant and empirically validated to reduce error accumulation.
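The two formulas above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the function names, the flat-list representation of the feature deltas, and the sample inputs are assumptions. The default values of `tau_0` and `alpha` are the hyperparameters quoted in this review.

```python
# Illustrative sketch; names and sample values are assumptions, not the paper's code.

def cfc_threshold(v_t: float, tau_0: float = 0.08, alpha: float = 2.0) -> float:
    """Motion-adaptive threshold tau_CFC(v_t) = tau_0 / (1 + alpha * v_t).

    Larger motion v_t gives a smaller threshold, so cached features are
    reused less readily in fast-moving scenes.
    """
    return tau_0 / (1.0 + alpha * v_t)


def osi_gain(delta_tgt, delta_src, eps: float = 1e-8) -> float:
    """Least-squares gain gamma* = <d_tgt, d_src> / (||d_src||^2 + eps).

    gamma* is the scalar that minimises ||d_tgt - gamma * d_src||^2,
    i.e. the projection coefficient of d_tgt onto d_src.
    """
    dot = sum(t * s for t, s in zip(delta_tgt, delta_src))
    norm_sq = sum(s * s for s in delta_src)
    return dot / (norm_sq + eps)


if __name__ == "__main__":
    # Hypothetical motion magnitudes: static, moderate, fast.
    for v in (0.0, 0.5, 2.0):
        print(f"v_t={v:.1f} -> tau={cfc_threshold(v):.4f}")

    # Hypothetical flattened feature deltas.
    print("gamma* =", osi_gain([2.0, 4.0], [1.0, 2.0]))
```

The `eps` term only guards against division by zero when the source delta vanishes. How the resulting gain is applied to the cached features is not specified in this review, so the sketch stops at computing it.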
The ablation study (Table 4) reveals a tension: the configuration with CFC+SWD+OFA (without ATS) achieves an Overall score of 0.8035, matching the baseline, while the full WorldCache with ATS drops to 0.7977 (a 0.7% relative degradation) to achieve the advertised 2.3× speedup. The "invest-and-spend" framing acknowledges this trade-off but obscures the fact that the maximum-speedup configuration sacrifices quality relative to intermediate configurations. Additionally, the evaluation focuses on Cosmos-Predict2.5 (2B and 14B), with WAN2.1 results relegated to a single table, which limits claims of broad transferability.
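The relative degradation cited above can be checked directly from the two Overall scores. A minimal sketch follows; the variable names are illustrative, and only the two scores come from the text.

```python
# Overall scores quoted above from the ablation (Table 4).
score_without_ats = 0.8035  # CFC+SWD+OFA, no ATS
score_full = 0.7977         # full WorldCache, with ATS

# Relative drop of the full configuration versus the intermediate one, in percent.
rel_drop_pct = 100.0 * (score_without_ats - score_full) / score_without_ats
print(f"relative drop: {rel_drop_pct:.2f}%")
```

The drop is measured against the CFC+SWD+OFA configuration, not against the uncached baseline, whose exact score is not quoted in this review.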
The evidence supports the claim that WorldCache dominates DiCache and FasterCache on the speed-quality frontier for video world models. On PAI-Bench I2W, WorldCache (2.3×) substantially outperforms DiCache (1.4×) and FasterCache (1.7×) in speed with competitive quality. However, comparisons in Appendix 0.C show that EasyCache and TeaCache (Slow) achieve marginally higher quality scores (0.7979 and 0.7454, versus 0.7977 and 0.7450 for WorldCache) at conservative speedups (1.1–1.3×). The paper would benefit from clearer justification for prioritizing aggressive acceleration over conservative, higher-fidelity caching strategies.
Reproducibility is strong: hyperparameters are explicitly listed ($\tau_0=0.08$, $\alpha=2.0$, $\beta_s=0.12$, $\beta_d=4.0$), hardware is specified (NVIDIA H200), and the code is publicly available. The paper details the probe depth, ping-pong buffer implementation, and quadratic threshold decay curve $C(u)=u^2/6+u/2+10/3$. However, the motion-compensated warping relies on Lucas-Kanade flow with a spatial downsampling factor $s_{\text{flow}}$ in latent space; exact reproduction of the <3% overhead claim requires the specific implementation details (e.g., correlation window size) only partially described in Appendix 0.E.1.
Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed at World-Cache (https://umair1221.github.io/World-Cache/).