Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation
Autoregressive video diffusion models struggle with minute-scale generation because errors accumulate over long-horizon rollouts. This paper challenges the assumption that more memory is better and instead decomposes KV-cache conditioning into three functional roles: Sink for global anchors, Tail for recent continuity, and dynamically selected History for mid-range structure. The result is a training-free inference method that improves motion dynamics (Dynamic Degree) by a relative 66.8% while cutting attention overhead by roughly 2.6×.
The paper presents a compelling, well-motivated thesis: temporal degradation in long videos stems from how memory is structured, not how much is stored. The proposed Relax Forcing mechanism is technically sound, training-free, and backed by systematic ablations that isolate the contributions of Sink, Tail, and History frames. While the absolute metric gains on VBench-Long are modest (≈1–5% over strong baselines), the dramatic improvement in Dynamic Degree (66.8% relative gain) suggests the method genuinely unlocks richer motion evolution without sacrificing stability.
The empirical analysis in Figure 2 is the strongest asset: it cleanly demonstrates that increasing Sink, History, or Tail memory beyond a small budget does not improve quality and can actually constrain motion dynamics. The decomposition of memory into heterogeneous functional roles (Figures 4–5) is well-executed, showing distinct failure modes when any component is removed. The relaxation scoring formulation $r(h) = S(h) - \lambda R(h)$ is simple yet principled, and the hybrid RoPE indexing for non-contiguous memory is a necessary technical detail that is thoroughly explained.
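To make the scoring rule concrete, here is a minimal sketch of how a History entry could be picked with $r(h) = S(h) - \lambda R(h)$. The review does not define $S(h)$ or $R(h)$, so this sketch simply treats them as precomputed per-candidate numbers; the function name `select_history` and the example scores are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def select_history(
    s_scores: Sequence[float],
    r_scores: Sequence[float],
    lam: float = 2.0,  # lambda = 2.0 is the value quoted in the review
    k: int = 1,        # History = 1 in the reported configuration
) -> List[int]:
    """Return the indices of the k candidates with the highest r(h) = S(h) - lam * R(h)."""
    if len(s_scores) != len(r_scores):
        raise ValueError("S and R must have one value per candidate")
    relax = [s - lam * r for s, r in zip(s_scores, r_scores)]
    # Rank candidates by relaxation score, highest first.
    ranked = sorted(range(len(relax)), key=lambda i: relax[i], reverse=True)
    return sorted(ranked[:k])


# Four hypothetical candidates with made-up S and R values (not from the paper).
S = [0.9, 0.8, 0.6, 0.7]
R = [0.40, 0.10, 0.05, 0.30]
chosen = select_history(S, R)
print(chosen)
```

With these made-up numbers the candidate with the largest $S(h)$ is not the one selected, because its $R(h)$ penalty is weighted by $\lambda$; setting `lam=0.0` would fall back to ranking by $S(h)$ alone.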
The evaluation is limited to a single benchmark (VBench-Long) with only 128 prompts from MovieGen, raising questions about generalisation to other domains or longer horizons beyond 60 seconds. Although the authors claim robustness to $\lambda$, the sensitivity plot (Figure 6) lacks exact numeric values, making it difficult to assess variance. Furthermore, the method assumes a specific chunk-wise sliding-window AR setup; its applicability to other autoregressive paradigms (e.g., per-token or hierarchical) is not discussed. Finally, the paper does not explicitly confirm whether code will be released, which limits reproducibility verification.
The comparisons to Self Forcing, Rolling Forcing, and Deep Forcing are fair and use identical inference configurations. The choice of VBench-Long metrics (Subject/Background Consistency, Dynamic Degree, etc.) is standard for the field. Claimed improvements are statistically meaningful on this benchmark, particularly the Dynamic Degree gains. However, the paper does not compare against very recent compression-based methods (e.g., PackCache) under identical throughput constraints, leaving open the question of whether the gains come purely from memory selection or also from reduced attention length.
Experimental details are comprehensive: the paper specifies exact hyperparameters (Sink=2, Tail=1, History=1 selected from 4 candidates, $\lambda=2.0$, chunk size $U=3$) and provides Algorithm 1 in the appendix. The latency breakdown in Appendix D is thorough, showing that Flash Attention time drops from 444.2 ms to 168.1 ms per block. However, no mention of code release appears in the provided text, and the reliance on a specific Self Forcing base model (which itself requires specific training) means independent reproduction requires access to that checkpoint. The candidate scoring overhead (3.4 ms) is negligible but should be confirmed on other hardware.
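The latency figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the three numbers reported in this review; whether the 3.4 ms scoring cost is already included in the 168.1 ms figure is not stated, so treating it as a separate additive cost is an assumption.

```python
# Per-block timings quoted in the review (milliseconds).
dense_attn_ms = 444.2    # Flash Attention with the baseline memory
relaxed_attn_ms = 168.1  # Flash Attention with the relaxed memory
scoring_ms = 3.4         # candidate scoring overhead

# Attention-only speedup.
speedup = dense_attn_ms / relaxed_attn_ms

# Assumption: scoring is an extra cost on top of the relaxed attention time.
speedup_with_scoring = dense_attn_ms / (relaxed_attn_ms + scoring_ms)
scoring_share = scoring_ms / (relaxed_attn_ms + scoring_ms)

print(f"attention-only speedup: {speedup:.2f}x")
print(f"speedup including scoring: {speedup_with_scoring:.2f}x")
print(f"scoring share of relaxed cost: {scoring_share:.1%}")
```

Both ratios come out close to 2.6×, which matches the headline reduction in attention overhead, and under this assumption scoring accounts for about 2% of the relaxed per-block cost.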
Autoregressive (AR) video diffusion has recently emerged as a promising paradigm for long video generation, enabling causal synthesis beyond the limits of bidirectional models. To address training-inference mismatch, a series of self-forcing strategies have been proposed to improve rollout stability by conditioning the model on its own predictions during training. While these approaches substantially mitigate exposure bias, extending generation to minute-scale horizons remains challenging due to progressive temporal degradation. In this work, we show that this limitation is not primarily caused by insufficient memory, but by how temporal memory is utilised during inference. Through empirical analysis, we find that increasing memory does not consistently improve long-horizon generation, and that the temporal placement of historical context significantly influences motion dynamics while leaving visual quality largely unchanged. These findings suggest that temporal memory should not be treated as a homogeneous buffer. Motivated by this insight, we introduce Relax Forcing, a structured temporal memory mechanism for AR diffusion. Instead of attending to the dense generated history, Relax Forcing decomposes temporal context into three functional roles: Sink for global stability, Tail for short-term continuity, and dynamically selected History for structural motion guidance, and selectively incorporates only the most relevant past information. This design mitigates error accumulation during extrapolation while preserving motion evolution. Experiments on VBench-Long demonstrate that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead. Our results suggest that structured temporal memory is essential for scalable long video generation, complementing existing forcing-based training strategies.
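As an illustration of the three-role decomposition described in the abstract, the following sketch assembles a sparse set of memory indices from Sink, History, and Tail. It is not the authors' algorithm: the function name `build_memory`, the evenly spaced candidate pool, the unit of indexing (frames versus chunks), and the toy scoring function are all assumptions made for the example. Only the budget values (Sink=2, Tail=1, History=1 from 4 candidates) come from the review.

```python
from typing import Callable, List


def build_memory(
    num_past: int,
    score: Callable[[int], float],
    n_sink: int = 2,
    n_tail: int = 1,
    n_hist: int = 1,
    n_cand: int = 4,
) -> List[int]:
    """Return the sorted indices of past units kept as KV memory."""
    past = list(range(num_past))
    if num_past <= n_sink + n_tail + n_hist:
        return past  # history is short enough to keep in full

    sink = past[:n_sink]                        # earliest units: global anchors
    tail = past[num_past - n_tail:]             # most recent units: continuity
    middle = past[n_sink:num_past - n_tail]     # mid-range units

    # Assumption: the candidate pool is spread evenly over the mid-range.
    step = max(1, len(middle) // n_cand)
    candidates = middle[::step][:n_cand]

    # Keep the n_hist highest-scoring candidates as History.
    history = sorted(candidates, key=score, reverse=True)[:n_hist]
    return sorted(sink + history + tail)


# Toy score that favours units near index 9 (purely illustrative).
memory = build_memory(20, lambda i: -abs(i - 9))
print(memory)
```

The point of the sketch is that the attended set stays at a fixed small size (four units here) regardless of how long the rollout becomes, which is what keeps attention cost from growing with the generated history.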