Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation

cs.CV cs.AI Yuyang You, Yongzhi Li, Jiahui Li, Yadong Mu, Quan Chen, Peng Jiang · Mar 23, 2026
AI Review
Plain-language introduction

Video diffusion models suffer from prohibitive inference costs, but standard image distillation techniques like DMD cause severe oversaturation and temporal collapse when naively extended to video. This work introduces a video-specific distillation framework featuring an adaptive regression loss that dynamically reweights real-data supervision to prevent color artifacts, a temporal variance regularizer to combat static output, and an inference-time frame interpolation module that halves sequence length during high-noise steps to accelerate generation. Applied to Wan2.1, the method enables stable 4-step synthesis with state-of-the-art VBench scores.

Critical review
Verdict
Bottom line

The paper presents a technically sound and well-validated solution to video distillation artifacts. The adaptive regression mechanism is elegant: a timestep-aware EMA cache suppresses gradient contributions from outlier samples, which effectively resolves the tearing and object-fusion artifacts that plague naive regression. However, the temporal regularization loss lacks intrinsic stability and requires manual clipping to prevent hallucinatory frame jumps, suggesting the objective is somewhat ad hoc. Additionally, the reliance on a proprietary 150K-video dataset for regression supervision and the delicate balancing of three competing loss terms may hinder reproducibility and practical adoption.

“without any clipping, this loss causes the model to generate videos with severe frame jumps or hallucinatory artifacts in the later stages of training”
You et al., Appendix Sec. 9.1
“The regression loss, in contrast, is computed on a high-quality subset of 150,000 video samples, which we curated and cleaned from online sources”
You et al., Sec. 4.1
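
The review describes the adaptive regression mechanism only at a high level, so the following is a minimal sketch of one way a timestep-aware EMA cache could gate outlier samples, not the authors' implementation. The class name `EmaOutlierGate`, the bucketing of timesteps, the warm-up period, and the hard 0/1 weight are all assumptions made for illustration. The values 0.95 and 3.0 are the hyperparameters quoted in the review, but mapping them to an EMA decay and a standard-deviation threshold is also an assumption.

```python
import math


class EmaOutlierGate:
    """Hypothetical timestep-bucketed EMA gate for per-sample regression losses."""

    def __init__(self, alpha=0.95, k=3.0, num_buckets=10, warmup=5):
        self.alpha = alpha              # assumed role: EMA decay
        self.k = k                      # assumed role: threshold in standard deviations
        self.num_buckets = num_buckets  # assumption: timesteps are grouped into buckets
        self.warmup = warmup            # assumption: accept everything until stats exist
        self.mean = [0.0] * num_buckets
        self.var = [0.0] * num_buckets
        self.count = [0] * num_buckets

    def _bucket(self, t):
        # t is a normalized diffusion timestep in [0, 1].
        return min(int(t * self.num_buckets), self.num_buckets - 1)

    def weight(self, t, loss):
        """Return 1.0 to keep a sample's regression gradient, 0.0 to suppress it."""
        b = self._bucket(t)
        if self.count[b] >= self.warmup:
            threshold = self.mean[b] + self.k * math.sqrt(self.var[b])
            if loss > threshold:
                return 0.0  # outlier: suppressed and kept out of the cache
        self._update(b, loss)
        return 1.0

    def _update(self, b, loss):
        # Incremental exponentially weighted mean and variance.
        if self.count[b] == 0:
            self.mean[b] = loss
        else:
            delta = loss - self.mean[b]
            self.mean[b] += (1.0 - self.alpha) * delta
            self.var[b] = self.alpha * (self.var[b] + (1.0 - self.alpha) * delta * delta)
        self.count[b] += 1


gate = EmaOutlierGate()
for i in range(20):
    gate.weight(0.35, 0.9 if i % 2 == 0 else 1.1)  # typical losses at one noise level
print(gate.weight(0.35, 10.0))  # a far-off loss at the same noise level
```

Keeping separate statistics per timestep bucket matters because regression losses at high-noise steps are naturally larger than at low-noise steps, so a single global threshold would misclassify them. Excluding rejected samples from the cache is a design choice in this sketch; it keeps one extreme sample from inflating the variance, at the cost of adapting more slowly if the loss distribution genuinely shifts upward.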
What holds up

The central claim that naive regression on real data causes object fusion while adaptive weighting fixes it is convincingly demonstrated. In ablations, naive regression drops Instance Preservation to 83.04 versus the DMD baseline of 88.88, whereas the adaptive loss restores it to 92.39, matching the teacher. The frame interpolation strategy is also practical, cutting inference time by about 28% (7.8 s vs. 10.8 s for DMD on the 1.3B model) with negligible quality loss by exploiting the observation that high-noise steps exhibit minimal temporal variance.

“directly incorporating the regression loss leads to a significant drop in the Instance Preservation score... When replaced with the adaptive regression loss, the score improves markedly, even surpassing the DMD baseline”
You et al., Sec. 4.4
“reduces the frame rate by half during the high-noise stage (e.g., first 2 of 4 steps)”
You et al., Sec. 3.4
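
The half-frame-rate schedule quoted above can be sketched as a sampler loop. This is an illustrative sketch under stated assumptions, not the paper's code: `denoise_step` is a hypothetical callable standing in for the distilled model, frames are plain lists of floats, and `restore_frames` uses simple linear interpolation where the paper trains a U-Net on VAE latents.

```python
def halve_frames(frames):
    """Keep every other frame (indices 0, 2, 4, ...)."""
    return frames[::2]


def restore_frames(half, target_len):
    """Linear-interpolation stand-in for the learned interpolation module."""
    out = []
    for i in range(target_len):
        left = half[min(i // 2, len(half) - 1)]
        if i % 2 == 0 or i // 2 + 1 >= len(half):
            out.append(list(left))  # kept frame, or no right neighbour to blend with
        else:
            right = half[i // 2 + 1]
            out.append([(a + b) / 2.0 for a, b in zip(left, right)])
    return out


def sample(denoise_step, latents, num_steps=4, low_rate_steps=2):
    """Run the first `low_rate_steps` steps on half the frames, the rest on all."""
    full_len = len(latents)
    x = halve_frames(latents)
    for step in range(num_steps):
        if step == low_rate_steps:
            x = restore_frames(x, full_len)
        x = denoise_step(x, step)
    if len(x) != full_len:  # only reached if every step ran at the low rate
        x = restore_frames(x, full_len)
    return x


# Usage with a dummy step that just records how many frames it was given.
seen = []
def dummy_step(x, step):
    seen.append(len(x))
    return x

video = sample(dummy_step, [[float(i)] for i in range(8)])
print(seen, len(video))
```

If cost were linear in the number of frames, running two of four steps at half length would process 24 frame-steps instead of 32, a 25% saving. That is in the same range as the reported 7.8 s versus 10.8 s, although attention cost grows faster than linearly and the interpolation module adds its own overhead, so the linear model is only a rough guide.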
Main concerns

The temporal regularization loss $\mathcal{L}_{\text{temp}}=-\log(\mathbb{E}_{x\sim p_{\theta}}[\mathrm{Var}(x)]+\epsilon)$ lacks a convergence mechanism and must be clipped at 0.6 to prevent 'severe frame jumps,' indicating the loss can destabilize training if left unbounded. The paper also claims the method supports simultaneous fine-tuning to new domains (e.g., anime) during distillation, but provides only a single qualitative example without dataset details or quantitative validation. Furthermore, the distribution matching loss uses text-only conditions while the regression loss requires high-quality proprietary videos, creating a data imbalance that may limit generalization.

“This can lead to numerical instability, excessively large gradients, and potentially exploding gradients”
You et al., Appendix Sec. 9.1
“the distribution matching loss is conditioned on text annotations from a mixed dataset... without direct use of the video data itself”
You et al., Sec. 4.1
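
To make the stability concern concrete: the derivative of $-\log(v+\epsilon)$ with respect to $v$ is $-1/(v+\epsilon)$, which is nonzero for every finite $v$, so the unclipped loss always rewards more variance and becomes very steep as $v \to 0$. The sketch below computes the loss on toy data. It assumes $\mathrm{Var}(x)$ is the per-pixel variance along the frame axis, and it assumes the 0.6 clip is a cap on the variance term; the review does not say which quantity is clipped, so `var_clip` and both function names are hypothetical.

```python
import math


def temporal_variance(video):
    """Mean over pixels of the variance across frames.

    `video` is a list of frames; each frame is a flat list of floats.
    """
    num_frames = len(video)
    num_pixels = len(video[0])
    total = 0.0
    for p in range(num_pixels):
        values = [frame[p] for frame in video]
        mean = sum(values) / num_frames
        total += sum((v - mean) ** 2 for v in values) / num_frames
    return total / num_pixels


def temporal_reg_loss(batch, eps=1e-6, var_clip=0.6):
    """-log(E[Var(x)] + eps), with the variance term capped (assumed clip)."""
    expected_var = sum(temporal_variance(v) for v in batch) / len(batch)
    capped = min(expected_var, var_clip)  # above the cap the loss stops decreasing
    return -math.log(capped + eps)


static_video = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # no motion at all
moving_video = [[0.0, 0.0], [2.0, 2.0]]              # per-pixel variance of 1.0
print(temporal_reg_loss([static_video]))
print(temporal_reg_loss([moving_video]))
```

Under this reading, a collapsed (static) output is penalized heavily, while any output whose temporal variance already exceeds the cap receives no further reward for adding motion. In an autograd framework the `min` would also zero the gradient above the cap, which is one plausible way the clip could prevent the frame jumps described in the appendix.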
Evidence and comparison

Quantitative evidence is robust across the VBench and VBench2 benchmarks, with the method achieving the highest total scores against DMD, LCM, PCM, DCM, and rCM baselines on both 1.3B and 14B models. Table 2 effectively isolates components: temporal regularization boosts Dynamic Degree from 72.22 to 97.77, while adaptive regression fixes the Instance Preservation degradation caused by naive regression. However, comparisons to non-distillation acceleration methods (e.g., improved ODE solvers) are absent, and the user study relies on only 12 annotators with majority voting, offering limited statistical rigor.

“+TR+RegLoss... Instance Preservation 83.04... +TR+AdaLoss... Instance Preservation 92.39”
You et al., Table 2
“We recruited 12 professional, independent annotators; each paired sample was evaluated by at least three distinct annotators”
You et al., Sec. 4.2
Reproducibility

The authors release source code and provide detailed hyperparameters ($\alpha=0.95$, $k=3.0$, $\omega_{\text{temp}}=0.05$), which aids reproduction. However, the regression loss depends on a curated proprietary dataset of 150K videos that is not publicly available, potentially blocking exact reproduction of the reported scores. Training requires 24 GPUs, which is resource-intensive but feasible. The frame interpolation module requires pre-training a separate U-Net for 10,000 iterations on VAE latents, which adds engineering complexity, although the procedure is fully specified.

“The regression loss, in contrast, is computed on a high-quality subset of 150,000 video samples, which we curated”
You et al., Sec. 4.1
“We train the U-Net interpolation module on a dataset of 150,000 real-world video clips... for 10,000 iterations”
You et al., Appendix Sec. 8
Abstract

Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.