Envisioning the Future, One Step at a Time
Predicting how complex scenes evolve is essential for intelligent systems, yet dense video generation expends enormous compute on appearance rather than dynamics. This paper introduces Myriad, an autoregressive diffusion model that predicts future motion via sparse point trajectories, explicitly avoiding the 'visual tax' of pixel-level generation. By modeling step-wise uncertainty accumulation through flow matching and utilizing fused transformer blocks, the method achieves throughput of 2200 samples/min compared to less than 1 for video models, while matching or exceeding their predictive accuracy on motion-focused benchmarks.
The paper presents a compelling case for dynamics-centric future prediction, demonstrating that sparse trajectory modeling can match dense video simulators in accuracy while delivering orders-of-magnitude speedups. The technical contributions—including the flow matching posterior with scale cascade and efficient fused reasoning blocks—are sound and well-motivated. However, the evaluation relies heavily on a static camera assumption and metrics that favor high-sample exploration, limiting real-world applicability compared to general video generation models.
The core argument that sparse trajectory prediction avoids the 'visual tax' is convincingly demonstrated through throughput comparisons showing three orders of magnitude speedup over dense video models. The technical implementation is sophisticated: the scale cascade effectively addresses the heavy-tailed motion distribution where 'excess kurtosis $\kappa$ in the hundreds instead of around 0' (Section 3, Scale Cascade), and the flow matching head provides faster convergence than GMM alternatives. The autoregressive factorization $p_\theta(\mathbf{x}_{1:T} \mid \mathbf{x}_0, \mathcal{I}_0) = \prod_{t=1}^T p_\theta(\mathbf{x}_t \mid \mathbf{x}_{<t}, \mathcal{I}_0)$ [...]
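To make the heavy-tail remark concrete, the following minimal sketch computes sample excess kurtosis (fourth standardized moment minus 3) for a Gaussian and for a toy displacement distribution in which most points barely move and a few move a lot. The mixture, its parameters, and the function names are illustrative assumptions, not the paper's data or code.

```python
import random
import statistics


def excess_kurtosis(values):
    """Sample excess kurtosis: fourth standardized moment minus 3 (about 0 for a Gaussian)."""
    mean = statistics.fmean(values)
    centered = [v - mean for v in values]
    m2 = statistics.fmean(c ** 2 for c in centered)
    m4 = statistics.fmean(c ** 4 for c in centered)
    return m4 / (m2 ** 2) - 3.0


def sample_displacements(n, rng, moving_fraction=0.02, small=0.01, large=1.0):
    """Toy per-step displacements (assumed mixture): mostly tiny motion, occasionally large."""
    return [
        rng.gauss(0.0, large if rng.random() < moving_fraction else small)
        for _ in range(n)
    ]


rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(20000)]
heavy_tailed = sample_displacements(20000, rng)

print(f"Gaussian excess kurtosis:     {excess_kurtosis(gaussian):.2f}")
print(f"Heavy-tailed excess kurtosis: {excess_kurtosis(heavy_tailed):.1f}")
```

Under these assumed parameters the mixture's excess kurtosis is far above the Gaussian value, which is the kind of statistic the quoted passage attributes to real motion data.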
The primary limitation is the restrictive static camera assumption: 'Our main formulation assumes a static camera... a setting that contemporary video generation baselines already handle' (Limitations). This significantly narrows applicability compared to the dense video models used as baselines. Furthermore, the model inherits 'biases and failure modes' from off-the-shelf trackers used for pseudo-ground truth training (Limitations). The evaluation metric $\mathrm{minADE}_N$ inherently favors methods that generate thousands of samples, yet practical applications often require accurate single-shot predictions rather than coverage of the hypothesis space. Finally, by predicting only sparse trajectories, the method cannot model appearance changes, deformations, or occlusion dynamics that dense video models capture.
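The concern about best-of-N metrics can be illustrated with a minimal sketch of a minADE-style computation: the average displacement error (ADE) of each sampled future against the ground truth, followed by a minimum over samples. The trajectory layout, the function names `ade` and `min_ade`, and the toy data are assumptions for illustration; the paper's exact normalization may differ.

```python
import math


def ade(pred, target):
    """Average displacement error between two trajectories.

    Each trajectory is a list of time steps; each time step is a list of (x, y) points.
    """
    total, count = 0.0, 0
    for pred_step, target_step in zip(pred, target):
        for (px, py), (tx, ty) in zip(pred_step, target_step):
            total += math.hypot(px - tx, py - ty)
            count += 1
    return total / count


def min_ade(samples, target):
    """Best-of-N error: the smallest ADE among all sampled futures."""
    return min(ade(sample, target) for sample in samples)


# Ground truth: one point moving right by 0.1 per step for three steps.
target = [[(0.1, 0.0)], [(0.2, 0.0)], [(0.3, 0.0)]]

# Three hypothetical sampled futures: static, moving up, moving right slightly too slowly.
samples = [
    [[(0.0, 0.0)], [(0.0, 0.0)], [(0.0, 0.0)]],
    [[(0.0, 0.1)], [(0.0, 0.2)], [(0.0, 0.3)]],
    [[(0.08, 0.0)], [(0.16, 0.0)], [(0.24, 0.0)]],
]

# The score can only stay the same or improve as more samples are allowed.
for n in range(1, len(samples) + 1):
    print(n, round(min_ade(samples[:n], target), 4))
```

Because the minimum is taken over all samples, adding samples never worsens the score, so a method that can afford thousands of rollouts is rewarded even if any single rollout is unreliable.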
Comparisons to dense video models (MAGI-1, SVD, etc.) show the proposed method achieves '$0.029$ $\mathrm{minADE}$ vs $0.037$ for MAGI-1' on OWM (Table 1a), but these baselines solve a strictly harder problem involving joint appearance and motion generation. The OWM benchmark itself comprises only 95 curated videos with static cameras, which is small for evaluating open-set generalization. While the method demonstrates strong performance on physics diagnostics (PhysicsIQ, Physion), the claim of surpassing dense models should be tempered by recognition that those models were not optimized solely for trajectory prediction and handle dynamic cameras.
The paper provides detailed architectural specifications ($665$M parameters, DINOv3-L/16 encoder, flow matching head depth $3$) and training hyperparameters (batch size $128$, $400$k steps, AdamW with peak LR $3 \times 10^{-5}$). However, reproducibility is hindered by reliance on pseudo-ground truth from proprietary trackers (TAPNext, V-DPM) for training data generation. The paper does not explicitly state that code will be released: it says only that 'OWM is solely used for evaluation and will be made publicly available' (Section 4.1), and it references the project page without mentioning code availability. Absent a release, independent reproduction would require reimplementing the fused transformer blocks and scale cascade architecture exactly as described.
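For anyone attempting a reimplementation, the reported settings can be collected in one place. The sketch below records only the values quoted in this review; the class name `TrainConfig`, the field names, and the `samples_seen` helper are hypothetical, and unreported settings (learning-rate schedule, weight decay, and so on) are deliberately left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the review; anything not listed there is intentionally omitted."""

    num_parameters: int = 665_000_000  # "665M parameters"
    encoder: str = "DINOv3-L/16"       # image encoder named in the review
    fm_head_depth: int = 3             # flow matching head depth
    batch_size: int = 128
    train_steps: int = 400_000
    optimizer: str = "AdamW"
    peak_lr: float = 3e-5

    @property
    def samples_seen(self) -> int:
        """Training samples processed, assuming one batch per step (no accumulation)."""
        return self.batch_size * self.train_steps


cfg = TrainConfig()
print(f"{cfg.samples_seen:,} samples seen over training")
```

Under the one-batch-per-step assumption, the reported batch size and step count imply roughly 51 million training samples processed.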
Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical. Project page: http://compvis.github.io/myriad.