Relax Forcing: Relaxed KV-Memory for Consistent Long Video Generation

cs.CV · Zengqun Zhao, Yanzuo Lu, Ziquan Liu, Jifei Song, Jiankang Deng, Ioannis Patras · Mar 22, 2026

Plain-language introduction

Autoregressive video diffusion models struggle with minute-scale generation because errors accumulate over long-horizon rollouts. This paper challenges the assumption that more memory is better and instead decomposes KV-cache conditioning into three functional roles: Sink for global anchors, Tail for recent continuity, and dynamically selected History for mid-range structure. The result is a training-free inference method that reports a 66.8% relative gain in Dynamic Degree, the benchmark's motion-dynamics metric, while cutting attention overhead by roughly 2.6×.

Critical review

Verdict

Bottom line

The paper presents a compelling, well-motivated thesis: temporal degradation in long videos stems from how memory is structured, not how much is stored. The proposed Relax Forcing mechanism is technically sound, training-free, and backed by systematic ablations that isolate the contributions of Sink, Tail, and History frames. While the absolute metric gains on VBench-Long are modest (≈1–5% over strong baselines), the dramatic improvement in Dynamic Degree (66.8% relative gain) suggests the method genuinely unlocks richer motion evolution without sacrificing stability.

“We conduct a systematic study of how temporal memory influences long-video extrapolation in AR diffusion. Our analysis identifies dense historical conditioning as a key bottleneck that limits long-horizon motion evolution.”

paper · Section 1, contributions

“Relax Forcing achieves ... Dynamic Degree ... 65.67 ... Deep Forcing ... 57.56”

paper · Table 1
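
As a quick check on the two Dynamic Degree values quoted from Table 1, the arithmetic below computes the gap to Deep Forcing. It uses only the two quoted numbers, and the variable names are illustrative. The gap works out to about 8.1 points, or roughly 14% relative, so the 66.8% figure cited in the bottom line is presumably measured against a different baseline, which the quoted excerpt does not identify.

```python
# Dynamic Degree values as quoted from Table 1 of the paper.
relax_forcing = 65.67
deep_forcing = 57.56

# Absolute gap in metric points and relative gap in percent.
absolute_gain = relax_forcing - deep_forcing
relative_gain = absolute_gain / deep_forcing * 100.0

print(f"absolute gain over Deep Forcing: {absolute_gain:.2f} points")
print(f"relative gain over Deep Forcing: {relative_gain:.1f}%")
```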

What holds up

The empirical analysis in Figure 2 is the strongest asset: it cleanly demonstrates that increasing Sink, History, or Tail memory beyond a small budget does not improve quality and can actually constrain motion dynamics. The decomposition of memory into heterogeneous functional roles (Figures 4–5) is well-executed, showing distinct failure modes when any component is removed. The relaxation scoring formulation $r(h) = S(h) - \lambda R(h)$ is simple yet principled, and the hybrid RoPE indexing for non-contiguous memory is a necessary technical detail that is thoroughly explained.

“excessive memory often constrains motion evolution rather than enhancing temporal coherence, indicating that dense conditioning may introduce redundancy instead of useful guidance”

paper · Section 3.2

“r(h) = S(h) - \lambda R(h), where \lambda controls the trade-off between global stability and local redundancy”

paper · Section 3.3
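
To make the scoring rule concrete, here is a minimal sketch of picking one History frame from a candidate pool with $r(h) = S(h) - \lambda R(h)$. The review does not spell out how $S$ and $R$ are computed, so this sketch assumes that $S(h)$ is the cosine similarity between a candidate and a sink feature (global stability) and that $R(h)$ is the cosine similarity between a candidate and the tail feature (local redundancy). The function names `cosine` and `select_history` and the per-frame feature vectors are hypothetical, not the paper's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # epsilon guards against zero vectors


def select_history(candidates, sink_feat, tail_feat, lam=2.0):
    """Score each candidate with r(h) = S(h) - lam * R(h) and return the best.

    Assumption: S(h) = similarity to the sink feature,
                R(h) = similarity to the tail feature.
    """
    scores = [
        cosine(h, sink_feat) - lam * cosine(h, tail_feat)
        for h in candidates
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy 2-D features: candidate 0 matches the sink, candidate 1 matches the tail.
sink_feat = [1.0, 0.0]
tail_feat = [0.0, 1.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

best, scores = select_history(candidates, sink_feat, tail_feat, lam=2.0)
print(best, [round(s, 3) for s in scores])
```

Under these assumptions, a candidate that duplicates the most recent frame is penalised twice as heavily as it is rewarded for agreeing with the global anchor, which is one way to read the "trade-off between global stability and local redundancy" in the quoted passage.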

Main concerns

The evaluation is limited to a single benchmark (VBench-Long) with only 128 prompts from MovieGen, raising questions about generalisation to other domains or longer horizons beyond 60 seconds. Although the authors claim robustness to $\lambda$, the sensitivity plot (Figure 6) lacks exact numeric values, making it difficult to assess variance. Furthermore, the method assumes a specific chunk-wise sliding-window AR setup; its applicability to other autoregressive paradigms (e.g., per-token or hierarchical) is not discussed. Finally, the paper does not explicitly confirm whether code will be released, which limits reproducibility verification.

“We evaluate long-horizon video generation using the VBench-Long benchmark ... 128 prompts from MovieGen”

paper · Section 4.1

“Varying \lambda leads to minor changes in visual consistency and image quality, while moderately influencing motion dynamics”

paper · Figure 6 caption

Evidence and comparison

The comparisons to Self Forcing, Rolling Forcing, and Deep Forcing are fair and use identical inference configurations. The choice of VBench-Long metrics (Subject/Background Consistency, Dynamic Degree, etc.) is standard for the field. Claimed improvements are statistically meaningful on this benchmark, particularly the Dynamic Degree gains. However, the paper does not compare against very recent compression-based methods (e.g., PackCache) under identical throughput constraints, leaving open the question of whether the gains come purely from memory selection or also from reduced attention length.

“Relax Forcing achieves the highest overall score in both regimes ... For 30-second videos, our method reaches 80.87%, outperforming the strongest training-free baseline, Deep Forcing, by +0.93%”

paper · Section 4.2

Reproducibility

Experimental details are comprehensive: the paper specifies exact hyperparameters (Sink=2, Tail=1, History=1 selected from 4 candidates, $\lambda=2.0$, chunk size $U=3$) and provides Algorithm 1 in the appendix. The latency breakdown in Appendix D is thorough, showing Flash Attention drops from 444.2 ms to 168.1 ms per block. However, no mention of code release appears in the provided text, and the reliance on a specific Self Forcing base model (which itself requires specific training) means independent reproduction requires access to that checkpoint. The candidate scoring overhead (3.4 ms) is negligible but should be confirmed on other hardware.

“the number of sink frames is set to 2, the number of tail frames is set to 1, and one history frame is selected from a candidate pool of size 4. The coefficient \lambda is set to 2.0”

paper · Section 4.1
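
The quoted settings can be read as a small memory layout: two Sink slots, one Tail slot, and one History slot chosen from four candidates. The sketch below shows one way to split past frame indices into those roles. The hyperparameter values come from the quoted text, but the `MemoryConfig` and `split_roles` names are hypothetical, and the rule for picking candidates (evenly spaced across the middle of the history) is an assumption, since the review does not say how the pool is formed. Hybrid RoPE indexing is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryConfig:
    sink: int = 2        # number of sink frames (quoted)
    tail: int = 1        # number of tail frames (quoted)
    history: int = 1     # history frames selected per step (quoted)
    candidates: int = 4  # candidate pool size (quoted)
    lam: float = 2.0     # trade-off coefficient lambda (quoted)
    chunk: int = 3       # chunk size U (quoted)


def split_roles(num_past, cfg):
    """Split past frame indices 0..num_past-1 into sink, candidates, tail."""
    sink_ids = list(range(min(cfg.sink, num_past)))
    tail_start = max(num_past - cfg.tail, len(sink_ids))
    tail_ids = list(range(tail_start, num_past))
    middle = list(range(len(sink_ids), tail_start))

    # Assumption: candidates are evenly spaced over the middle of the history.
    if len(middle) <= cfg.candidates:
        candidate_ids = middle
    else:
        denom = max(cfg.candidates - 1, 1)
        candidate_ids = [
            middle[(i * (len(middle) - 1)) // denom]
            for i in range(cfg.candidates)
        ]
    return sink_ids, candidate_ids, tail_ids


cfg = MemoryConfig()
sink_ids, candidate_ids, tail_ids = split_roles(12, cfg)
print(sink_ids, candidate_ids, tail_ids)

# Slots attended per step once one history frame has been chosen.
attended = cfg.sink + cfg.history + cfg.tail
print(attended)
```

With these settings only four memory slots are attended per step regardless of how long the rollout has run, which is consistent with the reduced attention cost the review describes.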

“Flash Attention ... 444.2 ms ... 168.1 ms ... Candidate Scoring ... 3.4 ms”

paper · Appendix D, Table D.1
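
The quoted latency figures can be tied back to the "roughly 2.6×" claim with a short calculation. It uses only the three quoted per-block timings; other latency components from the appendix are not included, so the net figure is a partial view rather than an end-to-end speedup.

```python
# Per-block timings in milliseconds, as quoted from Appendix D.
flash_before_ms = 444.2   # Flash Attention with the baseline memory
flash_after_ms = 168.1    # Flash Attention with Relax Forcing memory
scoring_ms = 3.4          # added cost of candidate scoring

# Attention-only speedup, and speedup once scoring cost is charged back.
attention_speedup = flash_before_ms / flash_after_ms
net_speedup = flash_before_ms / (flash_after_ms + scoring_ms)

# Scoring cost as a share of the attention time saved.
saved_ms = flash_before_ms - flash_after_ms
scoring_share = scoring_ms / saved_ms * 100.0

print(f"attention speedup: {attention_speedup:.2f}x")
print(f"net of scoring:    {net_speedup:.2f}x")
print(f"scoring uses {scoring_share:.1f}% of the {saved_ms:.1f} ms saved")
```

The attention-only ratio is about 2.64×, and it stays near 2.59× after adding the scoring cost, so the overhead claim holds on the quoted numbers.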

Abstract

Autoregressive (AR) video diffusion has recently emerged as a promising paradigm for long video generation, enabling causal synthesis beyond the limits of bidirectional models. To address training-inference mismatch, a series of self-forcing strategies have been proposed to improve rollout stability by conditioning the model on its own predictions during training. While these approaches substantially mitigate exposure bias, extending generation to minute-scale horizons remains challenging due to progressive temporal degradation. In this work, we show that this limitation is not primarily caused by insufficient memory, but by how temporal memory is utilised during inference. Through empirical analysis, we find that increasing memory does not consistently improve long-horizon generation, and that the temporal placement of historical context significantly influences motion dynamics while leaving visual quality largely unchanged. These findings suggest that temporal memory should not be treated as a homogeneous buffer. Motivated by this insight, we introduce Relax Forcing, a structured temporal memory mechanism for AR diffusion. Instead of attending to the dense generated history, Relax Forcing decomposes temporal context into three functional roles: Sink for global stability, Tail for short-term continuity, and dynamically selected History for structural motion guidance, and selectively incorporates only the most relevant past information. This design mitigates error accumulation during extrapolation while preserving motion evolution. Experiments on VBench-Long demonstrate that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead. Our results suggest that structured temporal memory is essential for scalable long video generation, complementing existing forcing-based training strategies.
