Manifold-Aware Exploration for Reinforcement Learning in Video Generation

cs.CV cs.AI Mingzhe Zheng, Weijie Kong, Yue Wu, Dengyang Jiang, Yue Ma, Xuanhua He, Bin Lin, Kaixiong Gong, Zhao Zhong, Liefeng Bo, Qifeng Chen, Harry Yang · Mar 23, 2026
AI Review
Plain-language introduction

This paper tackles the instability of Group Relative Policy Optimization (GRPO) when applied to video generation. The core problem is that converting deterministic ODE samplers to SDE for exploration injects excess noise in high-noise regimes, causing off-manifold drift that degrades rollout quality and destabilizes reward updates. SAGE-GRPO introduces a precise SDE with logarithmic curvature correction to keep exploration closer to the flow trajectory, plus a Dual Trust Region mechanism combining periodic moving anchors with stepwise KL constraints to prevent long-horizon drift. The method is evaluated on HunyuanVideo1.5 using VideoAlign rewards, showing improvements over DanceGRPO, FlowGRPO, and CPS.

Critical review
Verdict
Bottom line

SAGE-GRPO presents a technically sound solution to an important problem: stabilizing GRPO for high-dimensional video generation. The micro-level SDE correction and gradient equalizer are well-derived, and the macro-level Dual Trust Region offers a practical position-velocity control mechanism. Results on HunyuanVideo1.5 show clear gains in Setting B (alignment-focused), where SAGE-GRPO with Dual Moving KL achieves 0.8066 overall reward versus 0.4773 for FlowGRPO and 0.3694 for CPS, both without KL. Table 2 also shows that the w/o KL variant of SAGE-GRPO underperforms CPS in Setting A (0.4859 versus 0.6343), suggesting the full method is necessary for best results. The manuscript is clearly written and the ablation studies (Figures 3, 7, 8) validate individual components.

“SAGE-GRPO w/ Dual Mov KL achieves 0.8066 Overall in Setting B vs 0.4773 for FlowGRPO w/o KL and 0.3694 for CPS w/o KL”
paper · Table 2
“Σ_t^{1/2} = η√[-(σ_t - σ_{t+1}) + log((1-σ_{t+1})/(1-σ_t))]”
paper · Section 3.2.1
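
The logarithmic term in the expression quoted from Section 3.2.1 can be recovered with a short calculation. The sketch below assumes that the instantaneous noise variance rate is η²σ/(1-σ), which is the coefficient appearing in the FlowGRPO entry of Table 1, and that σ decreases from σ_t to σ_{t+1} over one step. The paper's own derivation may differ in detail.

```latex
\begin{aligned}
\Sigma_t
  &= \eta^2 \int_{\sigma_{t+1}}^{\sigma_t} \frac{\sigma}{1-\sigma}\, d\sigma
   = \eta^2 \int_{\sigma_{t+1}}^{\sigma_t} \left( \frac{1}{1-\sigma} - 1 \right) d\sigma \\
  &= \eta^2 \Big[ -\log(1-\sigma) - \sigma \Big]_{\sigma_{t+1}}^{\sigma_t}
   = \eta^2 \left[ -(\sigma_t - \sigma_{t+1}) + \log\frac{1-\sigma_{t+1}}{1-\sigma_t} \right].
\end{aligned}
```

Freezing the integrand at its value at σ_t instead gives η²·σ_t/(1-σ_t)·(σ_t - σ_{t+1}). Because σ/(1-σ) is increasing in σ, that frozen value is an upper bound on the integral, which is consistent with the review's description of first-order schemes injecting more noise than the integrated form.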
What holds up

The derivation of the Precise Manifold-Aware SDE (Equation 6) is rigorous: the authors integrate the diffusion coefficient over each step rather than using first-order approximations, yielding a logarithmic correction term that accounts for geometric contraction of the signal coefficient. Figure 2 illustrates the geometric intuition well—keeping exploration noise tangent to the manifold versus the larger off-manifold regions of FlowGRPO. The Gradient Norm Equalizer (Equation 9) addresses a real issue: the paper demonstrates empirically (Figure 4) that gradient norms follow ||∇log π|| ∝ 1/Σ_t^{1/2}, varying by orders of magnitude across timesteps. The Dual Trust Region formulation (Equation 14) combining position control (periodic moving anchor) and velocity control (step-wise KL) is conceptually elegant and the ablation in Figure 8 confirms Dual Moving KL achieves the highest and most stable rewards.

“Σ_t^{1/2} = η√[-(σ_t - σ_{t+1}) + log((1-σ_{t+1})/(1-σ_t))]”
paper · Equation 6
“Observed norms (blue) decrease rapidly as σ increases and match the predicted relationship (red) ||∇log π|| ∝ 1/Σ_t^{1/2}”
paper · Figure 4
“L_KL = β_pos · D_KL(π_θ||π_ref_N) + β_vel · D_KL(π_θ||π_{k-1})”
paper · Equation 14
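
To make the position and velocity terms of Equation 14 concrete, here is a minimal sketch. It assumes each per-step policy is an isotropic Gaussian whose standard deviation is shared across the compared policies, so that each KL term reduces to a scaled squared distance between means. The function names, the `MovingAnchor` class, and the refresh rule are hypothetical illustrations, not the authors' implementation.

```python
def gaussian_kl_same_std(mu_p, mu_q, std):
    """KL(N(mu_p, std^2 I) || N(mu_q, std^2 I)) for equal isotropic covariances."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))
    return sq_dist / (2.0 * std ** 2)


def dual_kl_loss(mu_theta, mu_anchor, mu_prev, std, beta_pos, beta_vel):
    """beta_pos * KL(pi_theta || pi_anchor) + beta_vel * KL(pi_theta || pi_prev)."""
    position_term = gaussian_kl_same_std(mu_theta, mu_anchor, std)  # drift from anchor
    velocity_term = gaussian_kl_same_std(mu_theta, mu_prev, std)    # change since last step
    return beta_pos * position_term + beta_vel * velocity_term


class MovingAnchor:
    """Reference parameters that are refreshed every `interval` updates (assumed rule)."""

    def __init__(self, params, interval):
        self.params = list(params)
        self.interval = interval

    def maybe_update(self, step, params):
        """Copy the current parameters into the anchor on refresh steps."""
        if step > 0 and step % self.interval == 0:
            self.params = list(params)
            return True
        return False


if __name__ == "__main__":
    anchor = MovingAnchor([0.0, 0.0], interval=3)  # interval is a placeholder value
    loss = dual_kl_loss([1.0, 0.0], anchor.params, [1.0, 0.0],
                        std=1.0, beta_pos=2.0, beta_vel=3.0)
    print(loss)
```

In this toy setting the velocity term vanishes when the policy has not moved since the previous step, while the position term keeps penalizing distance from the anchor until the anchor is refreshed.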
Main concerns

The "manifold-aware" framing is more metaphorical than technical: there is no explicit manifold learning or constraint enforcement on ℳ ⊂ ℝ^D, only an SDE that stays closer to the flow trajectory. The claim that existing methods inject "excess noise energy" is qualitatively illustrated in Figure 1 but lacks quantitative verification of off-manifold distance. Table 2 reveals inconsistent performance: in Setting A, SAGE-GRPO w/o KL (0.4859) clearly underperforms CPS w/o KL (0.6343), suggesting the gradient equalizer alone is insufficient without the Dual KL mechanism. The logarithmic correction term log((1-σ_{t+1})/(1-σ_t)) diverges as σ_t approaches 1, because the denominator 1-σ_t goes to zero; while the paper mentions clamping, this ad-hoc fix undermines the "precise" claim. Finally, the user study (29 evaluators, 32 prompts) is small and the paper does not report statistical significance or inter-annotator agreement.

“We treat this model as defining a valid data manifold M ⊂ ℝ^D”
paper · Section 1
“SAGE-GRPO w/o KL: 0.4859 Overall in Setting A vs CPS w/o KL: 0.6343”
paper · Table 2
“User preference study with 29 evaluators on 32 prompts”
paper · Section 4.4
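
The concern about the logarithmic term near σ = 1 can be illustrated numerically. The sketch below evaluates the quoted variance expression and applies a simple upper clamp on σ. The `sigma_max` parameter and its trial values are hypothetical, since the review does not state how the paper clamps.

```python
import math


def precise_step_variance(sigma_t, sigma_next, eta=1.0, sigma_max=None):
    """Step variance with the logarithmic correction.

    sigma_max is a hypothetical clamp; the value used in the paper is not
    given in this review.
    """
    if sigma_max is not None:
        sigma_t = min(sigma_t, sigma_max)
        sigma_next = min(sigma_next, sigma_max)
    log_term = math.log((1.0 - sigma_next) / (1.0 - sigma_t))
    return eta ** 2 * (-(sigma_t - sigma_next) + log_term)


if __name__ == "__main__":
    # The variance grows without bound as sigma_t approaches 1.
    for sigma_t in (0.9, 0.99, 0.999, 0.9999):
        print(sigma_t, precise_step_variance(sigma_t, 0.85))

    # For a step that starts at sigma = 1, the result is set by the clamp.
    for sigma_max in (0.99, 0.999, 0.9999):
        print(sigma_max, precise_step_variance(1.0, 0.95, sigma_max=sigma_max))
```

Without a clamp, a step starting exactly at σ_t = 1 raises a division-by-zero error in this sketch. With a clamp, the noise scale of that step is determined by the chosen threshold rather than by the derivation, which is the substance of the reviewer's objection.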
Evidence and comparison

The evidence qualitatively supports the main claims but has gaps. The comparison to FlowGRPO and DanceGRPO in Table 1 correctly identifies their first-order approximations: DanceGRPO uses η√(σ_t - σ_{t+1}) and FlowGRPO uses η√((σ_t/(1-σ_t)) · (σ_t - σ_{t+1})), both of which the paper argues are approximations of the integral-derived variance. This matches the FlowGRPO paper (Liu et al., 2025b), which indeed uses that formulation. Figure 3 shows that without the Gradient Equalizer, reward curves plateau or become unstable, validating its necessity. However, the paper does not provide evidence that the "manifold" interpretation is more than a geometric intuition: there are no measurements of actual manifold distance or tangent space alignment. The comparison to CPS (Wang & Yu, 2025) is fair in that both use flow matching, though CPS focuses on preserving coefficients rather than exploration constraints. The use of VideoAlign without fine-tuning (frozen evaluator) is appropriate for consistent comparison.

“DanceGRPO: η√(σ_t - σ_{t+1}); FlowGRPO: η√(σ_t/(1-σ_t)(σ_t - σ_{t+1}))”
paper · Table 1
“ODE-to-SDE conversion that transforms a deterministic ODE into an equivalent SDE”
Liu et al., 2025b · FlowGRPO paper
“We use VideoAlign as the reward oracle... using the original VideoAlign model as a frozen evaluator (no reward-model fine-tuning)”
paper · Section 4.1
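
The three noise scales compared in Table 1 can be evaluated side by side. The sketch below implements the formulas as quoted in this review, reading the FlowGRPO expression as the coefficient σ_t/(1-σ_t) multiplied by the step size. The uniform step of 0.05 in σ and η = 1 are arbitrary choices for illustration, not the paper's schedule.

```python
import math


def std_dancegrpo(sigma_t, sigma_next, eta=1.0):
    """First-order form listed for DanceGRPO: eta * sqrt(sigma_t - sigma_next)."""
    return eta * math.sqrt(sigma_t - sigma_next)


def std_flowgrpo(sigma_t, sigma_next, eta=1.0):
    """First-order form listed for FlowGRPO, with the coefficient frozen at sigma_t."""
    return eta * math.sqrt(sigma_t / (1.0 - sigma_t) * (sigma_t - sigma_next))


def std_precise(sigma_t, sigma_next, eta=1.0):
    """Integral-derived form with the logarithmic correction (Equation 6)."""
    var = -(sigma_t - sigma_next) + math.log((1.0 - sigma_next) / (1.0 - sigma_t))
    return eta * math.sqrt(var)


if __name__ == "__main__":
    step = 0.05  # hypothetical uniform step in sigma
    for sigma_t in (0.95, 0.50, 0.10):
        sigma_next = sigma_t - step
        print(
            f"sigma_t={sigma_t:.2f}  "
            f"dance={std_dancegrpo(sigma_t, sigma_next):.4f}  "
            f"flow={std_flowgrpo(sigma_t, sigma_next):.4f}  "
            f"precise={std_precise(sigma_t, sigma_next):.4f}"
        )
```

Under these assumptions the FlowGRPO value is never smaller than the integral-derived one, because σ/(1-σ) is increasing in σ and the first-order form evaluates it at the larger endpoint. The gap is widest near σ = 1, which is the high-noise regime the paper singles out.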
Reproducibility

The paper claims code will be available at a GitHub Pages link but this appears to be a placeholder at time of review. Critical hyperparameters are provided: per-GPU batch size 22, 44 gradient accumulation steps (effective batch 88), 81 frames per video, GRPO updates every 20 sampling steps, KL weight λ_KL ∈ [10^{-7}, 10^{-5}]. The Dual Trust Region requires two coefficients β_pos and β_vel (Equation 14) but their values are not explicitly stated in the main text. The gradient equalizer uses a median heuristic with small constant ϵ (Equation 9) but ϵ's value is unspecified. The moving anchor update interval N is mentioned but its value (e.g., every 100 steps) is not clearly stated in the main experimental setup. Without the code release, these missing details would block exact reproduction, though the appendix may contain them.

“per-GPU batch size 22 and 44 gradient accumulation steps (effective batch size 88)... 81 frames... GRPO updates every 20 sampling steps”
paper · Section 4.1
“Code and visual gallery are available at here”
paper · Abstract
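
The role of the unspecified constant ϵ is easier to see in code. The sketch below shows one plausible median-based equalizer: each timestep's gradient is rescaled so that its norm moves toward the median norm across timesteps. The exact form of Equation 9 is not reproduced in this review, so the formula, the function name, and the sample norms are all assumptions.

```python
import statistics


def equalizer_scales(grad_norms, eps):
    """Per-timestep multipliers that pull gradient norms toward their median.

    eps guards against division by very small norms; its value is not
    specified in the review, so it is a required argument here.
    """
    target = statistics.median(grad_norms)
    return [target / (norm + eps) for norm in grad_norms]


if __name__ == "__main__":
    # Hypothetical norms spanning two orders of magnitude across timesteps.
    norms = [0.1, 1.0, 10.0]
    for eps in (0.0, 1e-3, 1e-1):
        scales = equalizer_scales(norms, eps)
        equalized = [s * n for s, n in zip(scales, norms)]
        print(eps, equalized)
```

With ϵ = 0 the rescaled norms all coincide with the median. As ϵ grows relative to the smallest norms, those timesteps are equalized less, so the choice of ϵ changes the effective weighting across timesteps and matters for exact reproduction.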
Abstract

Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at https://dungeonmassster.github.io/SAGE-GRPO-Page/.
