Envisioning the Future, One Step at a Time

cs.CV cs.AI cs.LG · Stefan Andreas Baumann, Jannik Wiese, Tommaso Martorella, Mahdi M. Kalayeh, Björn Ommer · Apr 10, 2026

Plain-language introduction

Predicting how complex scenes evolve is essential for intelligent systems, yet dense video generation expends enormous compute on appearance rather than dynamics. This paper introduces Myriad, an autoregressive diffusion model that predicts future motion via sparse point trajectories, explicitly avoiding the 'visual tax' of pixel-level generation. By modeling step-wise uncertainty accumulation through flow matching and using fused transformer blocks, the method reaches a throughput of 2200 samples/min, compared with less than 1 sample/min for video models, while matching or exceeding their predictive accuracy on motion-focused benchmarks.

Critical review

Verdict

Bottom line

The paper presents a compelling case for dynamics-centric future prediction, demonstrating that sparse trajectory modeling can match dense video simulators in accuracy while delivering orders-of-magnitude speedups. The technical contributions—including the flow matching posterior with scale cascade and efficient fused reasoning blocks—are sound and well-motivated. However, the evaluation relies heavily on a static camera assumption and metrics that favor high-sample exploration, limiting real-world applicability compared to general video generation models.

“Our main formulation assumes a static camera, which simplifies evaluation and improves interpretability of predictions, but limits applicability to scenes with ego-motion or dynamic viewpoints”

paper · Limitations

“Myriad (Ours) ... 2200 ... 0.029 ... MAGI-1 ... 0.303 ... 0.037”

paper · Table 1
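
The size of the gap in the Table 1 excerpt can be checked with a few lines of arithmetic. This is a minimal sketch that reads 0.303 as MAGI-1's throughput in samples/min (an assumption, consistent with the "less than 1" figure quoted in the introduction) and 0.029 and 0.037 as the two minADE values; the variable names are illustrative only.

```python
import math

# Values as quoted from Table 1. Reading 0.303 as MAGI-1's throughput in
# samples/min is an assumption based on the "less than 1" remark.
myriad_samples_per_min = 2200.0
magi1_samples_per_min = 0.303
myriad_min_ade = 0.029
magi1_min_ade = 0.037

speedup = myriad_samples_per_min / magi1_samples_per_min
orders_of_magnitude = math.log10(speedup)
relative_error_drop = (magi1_min_ade - myriad_min_ade) / magi1_min_ade

print(f"speedup: {speedup:.0f}x")
print(f"orders of magnitude: {orders_of_magnitude:.2f}")
print(f"relative minADE reduction: {relative_error_drop:.1%}")
```

Under that reading, the speedup falls between three and four orders of magnitude, while the accuracy gap is a much more modest relative reduction.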

What holds up

The core argument that sparse trajectory prediction avoids the 'visual tax' is convincingly demonstrated through throughput comparisons showing three orders of magnitude speedup over dense video models. The technical implementation is sophisticated: the scale cascade effectively addresses the heavy-tailed motion distribution where 'excess kurtosis $\kappa$ in the hundreds instead of around 0' (Section 3, Scale Cascade), and the flow matching head provides faster convergence than GMM alternatives. The autoregressive factorization $p_\theta(\mathbf{x}_{1:T} \mid \mathbf{x}_0, \mathcal{I}_0) = \prod_{t=1}^T p_\theta(\mathbf{x}_t \mid \mathbf{x}_{<t}, \mathcal{I}_0)$ ...

“Motion shows significant heavy tail-like behavior, unlike typical image distributions for which similar heads were previously applied, with excess kurtosis $\kappa$ in the hundreds instead of around 0”

paper · Section 3, Scale Cascade

“Substituting previously used GMM-based heads with flow matching heads leads to significant improvements in accuracy and increases convergence substantially”

paper · Table 3
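
To make the kurtosis remark concrete, the sketch below compares the excess kurtosis of a Gaussian sample with that of a toy heavy-tailed sample. The mixture used here (mostly tiny displacements, with rare large ones) is a hypothetical stand-in for motion data, not the paper's data, and `excess_kurtosis` is an illustrative helper using the standard moment-based definition $\kappa = m_4 / m_2^2 - 3$.

```python
import random

def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2^2 - 3 (0 for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0

rng = random.Random(0)
n_samples = 50_000

# Reference case: a plain Gaussian, whose excess kurtosis is close to 0.
gaussian = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

# Hypothetical "motion-like" displacements: 99% near-static points and
# 1% points with large displacements (an assumption for illustration).
heavy_tailed = [
    rng.gauss(0.0, 0.01) if rng.random() < 0.99 else rng.gauss(0.0, 1.0)
    for _ in range(n_samples)
]

print(f"gaussian excess kurtosis:     {excess_kurtosis(gaussian):.2f}")
print(f"heavy-tailed excess kurtosis: {excess_kurtosis(heavy_tailed):.2f}")
```

For this toy mixture the population value is in the hundreds, which is the regime the paper describes; a single-scale output head tuned for near-Gaussian targets would have to cover both the tiny and the large displacements at once.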

Main concerns

The primary limitation is the restrictive static camera assumption: 'Our main formulation assumes a static camera... a setting that contemporary video generation baselines already handle' (Limitations). This significantly narrows applicability compared to the dense video models used as baselines. Furthermore, the model inherits 'biases and failure modes' from off-the-shelf trackers used for pseudo-ground truth training (Limitations). The evaluation metric $\mathrm{minADE}_N$ inherently favors methods that generate thousands of samples, yet practical applications often require accurate single-shot predictions rather than coverage of the hypothesis space. Finally, by predicting only sparse trajectories, the method cannot model appearance changes, deformations, or occlusion dynamics that dense video models capture.

“Our model relies on pseudo ground-truth trajectories from off-the-shelf trackers for training, inheriting their biases and failure modes”

paper · Limitations

“From the multiple generated hypotheses, we compute prediction error via the pointwise distance... minADE”

paper · Section 4.2
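
The concern about $\mathrm{minADE}_N$ can be illustrated with a small sketch. It uses the common best-of-N definition (the lowest average displacement error among N sampled hypotheses); the paper's exact normalization may differ. The single-point ground truth and the random-velocity "predictor" are hypothetical toys, chosen only to show how the metric behaves as N grows.

```python
import math
import random

def ade(pred, target):
    """Average displacement error: mean Euclidean distance over time steps."""
    return sum(math.dist(p, q) for p, q in zip(pred, target)) / len(target)

def min_ade(samples, target):
    """Best-of-N error: the ADE of the closest sampled hypothesis."""
    return min(ade(s, target) for s in samples)

rng = random.Random(0)
horizon = 10

# Hypothetical ground truth: one point moving at constant velocity.
target = [(0.10 * t, 0.05 * t) for t in range(1, horizon + 1)]

def sample_trajectory():
    # Hypothetical toy predictor: draws a random constant velocity per rollout.
    vx, vy = rng.gauss(0.0, 0.2), rng.gauss(0.0, 0.2)
    return [(vx * t, vy * t) for t in range(1, horizon + 1)]

samples = [sample_trajectory() for _ in range(1024)]

# Evaluate nested prefixes of the same sample pool for growing N.
results = {n: min_ade(samples[:n], target) for n in (1, 16, 256, 1024)}
for n, err in results.items():
    print(f"N={n:5d}  minADE_N={err:.4f}")
```

Because each larger N only adds candidates to the pool, the reported error can never increase with N, even for a predictor that knows nothing about the scene. That is why a method that samples cheaply has a structural advantage under this metric, independent of its single-shot accuracy.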

Evidence and comparison

Comparisons to dense video models (MAGI-1, SVD, etc.) show the proposed method achieves '$0.029$ $\mathrm{minADE}$ vs $0.037$ for MAGI-1' on OWM (Table 1a), but these baselines solve a strictly harder problem involving joint appearance and motion generation. The OWM benchmark itself comprises only 95 curated videos with static cameras, which is small for evaluating open-set generalization. While the method demonstrates strong performance on physics diagnostics (PhysicsIQ, Physion), the claim of surpassing dense models should be tempered by recognition that those models were not optimized solely for trajectory prediction and handle dynamic cameras.

“We curate a set of 95 diverse in-the-wild videos selected for varied motion dynamics”

paper · Section 4.1

“Eliminating the need to model fine-grained pixel-level details lets our model focus on the dynamics of the scene, making it competitive with state-of-the-art video models”

paper · Table 1

Reproducibility

The paper provides detailed architectural specifications ($665$M parameters, DINOv3-L/16 encoder, flow matching head depth $3$) and training hyperparameters (batch size $128$, $400$k steps, AdamW with peak LR $3\text{e-}5$). However, reproducibility is hindered by reliance on pseudo-ground truth from off-the-shelf trackers (TAPNext, V-DPM) for training data generation. The paper does not explicitly state that code will be released: it says only that 'OWM is solely used for evaluation and will be made publicly available' (Section 4.1), and it references the project page without mentioning code availability. Independent reproduction would therefore require reimplementing the fused transformer blocks and scale cascade architecture from the description alone.

“In total, we have 665M trainable parameters... peak learning rate of 3e-5... 400k steps”

paper · Section 5.1

“OWM is solely used for evaluation and will be made publicly available”

paper · Section 4.1
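
For anyone attempting a reproduction, the reported settings can be collected in one place. The sketch below is only a convenience record of the values cited in this review; the class and field names are hypothetical and do not come from the paper or any released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReportedTrainingSetup:
    """Values as cited in the review; field names are illustrative only."""
    trainable_params: int = 665_000_000
    image_encoder: str = "DINOv3-L/16"
    flow_head_depth: int = 3
    batch_size: int = 128
    train_steps: int = 400_000
    optimizer: str = "AdamW"
    peak_lr: float = 3e-5

    def samples_seen(self) -> int:
        # Total training examples processed, counting repeats.
        return self.batch_size * self.train_steps

setup = ReportedTrainingSetup()
print(f"training examples processed: {setup.samples_seen():,}")
```

Details the review does not cite, such as the learning-rate schedule or warmup length, are deliberately left out of this record and would have to be taken from the paper itself.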

Abstract

Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical. Project page: http://compvis.github.io/myriad.
