Manifold-Aware Exploration for Reinforcement Learning in Video Generation
This paper tackles the instability of Group Relative Policy Optimization (GRPO) when applied to video generation. The core problem is that converting deterministic ODE samplers to SDE for exploration injects excess noise in high-noise regimes, causing off-manifold drift that degrades rollout quality and destabilizes reward updates. SAGE-GRPO introduces a precise SDE with logarithmic curvature correction to keep exploration closer to the flow trajectory, plus a Dual Trust Region mechanism combining periodic moving anchors with stepwise KL constraints to prevent long-horizon drift. The method is evaluated on HunyuanVideo1.5 using VideoAlign rewards, showing improvements over DanceGRPO, FlowGRPO, and CPS.
SAGE-GRPO presents a technically sound solution to an important problem: stabilizing GRPO for high-dimensional video generation. The micro-level SDE correction and gradient equalizer are well-derived, and the macro-level Dual Trust Region offers a practical position-velocity control mechanism. Results on HunyuanVideo1.5 show clear gains in Setting B (alignment-focused), with SAGE-GRPO achieving 0.8066 overall reward versus 0.4773 for FlowGRPO and 0.3694 for CPS. Table 2 shows that while the Dual Moving KL variant performs strongly, the w/o KL variant underperforms CPS in Setting A, suggesting the full method is necessary for best results. The manuscript is clearly written, and the ablation studies (Figures 3, 7, 8) validate individual components.
The derivation of the Precise Manifold-Aware SDE (Equation 6) is rigorous: the authors integrate the diffusion coefficient over each step rather than using first-order approximations, yielding a logarithmic correction term that accounts for geometric contraction of the signal coefficient. Figure 2 illustrates the geometric intuition well: the proposed SDE keeps exploration noise tangent to the manifold, whereas FlowGRPO's noise covers larger off-manifold regions. The Gradient Norm Equalizer (Equation 9) addresses a real issue: the paper demonstrates empirically (Figure 4) that gradient norms follow ||∇log π|| ∝ 1/Σ_t^{1/2}, varying by orders of magnitude across timesteps. The Dual Trust Region formulation (Equation 14), which combines position control (periodic moving anchor) with velocity control (step-wise KL), is conceptually elegant, and the ablation in Figure 8 confirms that Dual Moving KL achieves the highest and most stable rewards.
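To make the gradient-scale issue concrete, the following is a minimal sketch of a median-based equalizer. It is not the paper's Equation 9: the weight form `median / (norm + eps)`, the function name `equalizer_weights`, the value of `eps`, and the list of variances are all assumptions chosen for illustration. Only the trend ||∇log π|| ∝ 1/Σ_t^{1/2} comes from the review above.

```python
import statistics


def equalizer_weights(grad_norms, eps=1e-8):
    """Per-timestep weights that pull every gradient norm toward the median.

    Assumed form (not taken from the paper): w_t = median(g) / (g_t + eps).
    """
    target = statistics.median(grad_norms)
    return [target / (g + eps) for g in grad_norms]


# Hypothetical per-step noise variances Sigma_t spanning four orders of magnitude.
variances = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]

# Gradient norms following the reported trend: norm proportional to 1 / sqrt(Sigma_t).
grad_norms = [v ** -0.5 for v in variances]

weights = equalizer_weights(grad_norms)
equalized = [w * g for w, g in zip(weights, grad_norms)]

print("raw spread      :", max(grad_norms) / min(grad_norms))
print("equalized spread:", max(equalized) / min(equalized))
```

Under these assumptions the raw norms differ by about two orders of magnitude, while the reweighted norms all sit near the median, which is the effect the review attributes to the equalizer.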
The "manifold-aware" framing is more metaphorical than technical: there is no explicit manifold learning or constraint enforcement on ℳ ⊂ ℝ^D, only an SDE that stays closer to the flow trajectory. The claim that existing methods inject "excess noise energy" is qualitatively illustrated in Figure 1 but lacks quantitative verification of off-manifold distance. Table 2 reveals inconsistent performance: in Setting A, SAGE-GRPO w/o KL (0.4859) substantially underperforms CPS w/o KL (0.6343), suggesting the gradient equalizer alone is insufficient without the Dual KL mechanism. The logarithmic correction term log((1-σ_{t+1})/(1-σ_t)) becomes unstable when σ_t approaches 1; while the paper mentions clamping, this ad-hoc fix undermines the "precise" claim. Finally, the user study (29 evaluators, 32 prompts) is small, and the paper does not report statistical significance or inter-annotator agreement.
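The instability near σ_t = 1 can be checked numerically. The sketch below evaluates the logarithmic term quoted above next to its first-order expansion (σ_t - σ_{t+1})/(1-σ_t), which follows from the identity (1-σ_{t+1})/(1-σ_t) = 1 + (σ_t - σ_{t+1})/(1-σ_t). The clamp threshold and the step size 0.02 are hypothetical choices, not the paper's, and the function names are illustrative.

```python
import math


def log_correction(sigma_t, sigma_next, clamp=1e-4):
    """log((1 - sigma_next) / (1 - sigma_t)) with 1 - sigma clamped away from 0.

    The clamp value is a hypothetical choice used only for this illustration.
    """
    numerator = max(1.0 - sigma_next, clamp)
    denominator = max(1.0 - sigma_t, clamp)
    return math.log(numerator / denominator)


def first_order(sigma_t, sigma_next):
    """First-order expansion of the same term; undefined at sigma_t == 1."""
    return (sigma_t - sigma_next) / (1.0 - sigma_t)


step = 0.02  # hypothetical uniform step in sigma
for sigma_t in (0.5, 0.9, 0.99):
    exact = log_correction(sigma_t, sigma_t - step)
    approx = first_order(sigma_t, sigma_t - step)
    print(f"sigma_t={sigma_t:.2f}  log term={exact:.4f}  first-order={approx:.4f}")

# At sigma_t = 1 the unclamped term diverges; the clamp decides the value returned.
print("sigma_t=1.00  clamped log term =", log_correction(1.0, 1.0 - step))
```

The two expressions agree at moderate noise levels and separate as σ_t approaches 1. At σ_t = 1 the returned value depends entirely on the chosen clamp, which is the concern raised about the "precise" label.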
The evidence qualitatively supports the main claims but has gaps. The comparison to FlowGRPO and DanceGRPO in Table 1 correctly identifies their first-order approximations: DanceGRPO uses η√(σ_t - σ_{t+1}) and FlowGRPO uses η√(σ_t/(1-σ_t) · (σ_t - σ_{t+1})), both of which the paper argues are approximations of the integral-derived variance. This matches the FlowGRPO paper (Liu et al., 2025b), which indeed uses that formulation. Figure 3 shows that without the Gradient Equalizer, reward curves plateau or become unstable, validating its necessity. However, the paper does not provide evidence that the "manifold" interpretation is more than a geometric intuition: there are no measurements of actual manifold distance or tangent space alignment. The comparison to CPS (Wang & Yu, 2025) is fair in that both use flow matching, though CPS focuses on preserving coefficients rather than exploration constraints. The use of VideoAlign without fine-tuning (frozen evaluator) is appropriate for consistent comparison.
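As a rough illustration of how the two baseline noise scales diverge in the high-noise regime, the sketch below evaluates both expressions as quoted in this review. It assumes the FlowGRPO expression is read as η√(σ_t/(1-σ_t) · (σ_t - σ_{t+1})); the function names, η = 1, and the step size are illustrative choices, and nothing here reproduces the paper's Table 1 or its own SDE.

```python
import math


def dance_std(sigma_t, sigma_next, eta=1.0):
    """Noise scale attributed to DanceGRPO: eta * sqrt(sigma_t - sigma_next)."""
    return eta * math.sqrt(sigma_t - sigma_next)


def flow_std(sigma_t, sigma_next, eta=1.0):
    """Noise scale attributed to FlowGRPO, under the reading stated in the lead-in."""
    return eta * math.sqrt(sigma_t / (1.0 - sigma_t) * (sigma_t - sigma_next))


step = 0.02  # hypothetical uniform step in sigma
for sigma_t in (0.5, 0.8, 0.9, 0.98):
    d = dance_std(sigma_t, sigma_t - step)
    f = flow_std(sigma_t, sigma_t - step)
    print(f"sigma_t={sigma_t:.2f}  dance={d:.4f}  flow={f:.4f}  ratio={f / d:.2f}")
```

Under this reading the ratio between the two scales is √(σ_t/(1-σ_t)), so they coincide at σ_t = 0.5 and the FlowGRPO scale grows without bound as σ_t approaches 1, consistent with the excess-noise argument for high-noise steps.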
The paper claims code will be available at a GitHub Pages link but this appears to be a placeholder at time of review. Critical hyperparameters are provided: per-GPU batch size 22, 44 gradient accumulation steps (effective batch 88), 81 frames per video, GRPO updates every 20 sampling steps, KL weight λ_KL ∈ [10^{-7}, 10^{-5}]. The Dual Trust Region requires two coefficients β_pos and β_vel (Equation 14) but their values are not explicitly stated in the main text. The gradient equalizer uses a median heuristic with small constant ϵ (Equation 9) but ϵ's value is unspecified. The moving anchor update interval N is mentioned but its value (e.g., every 100 steps) is not clearly stated in the main experimental setup. Without the code release, these missing details would block exact reproduction, though the appendix may contain them.
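To show where the three unreported values enter, here is a toy sketch of a dual trust-region penalty with a periodically refreshed anchor. It is not the paper's Equation 14: the policies are stand-in Gaussians with a shared standard deviation, the "update" is a fixed drift, and `BETA_POS`, `BETA_VEL`, and `ANCHOR_INTERVAL` are hypothetical placeholders for β_pos, β_vel, and N.

```python
def gaussian_kl(mu_p, mu_q, std):
    """KL divergence between two isotropic Gaussians that share the same std."""
    return sum((a - b) ** 2 for a, b in zip(mu_p, mu_q)) / (2.0 * std ** 2)


def dual_trust_penalty(mu_cur, mu_anchor, mu_prev, std, beta_pos, beta_vel):
    """Position term (distance to anchor) plus velocity term (distance to previous step)."""
    position = gaussian_kl(mu_cur, mu_anchor, std)
    velocity = gaussian_kl(mu_cur, mu_prev, std)
    return beta_pos * position + beta_vel * velocity


# Hypothetical placeholders; the review notes the actual values are not stated.
BETA_POS, BETA_VEL, ANCHOR_INTERVAL = 1.0, 1.0, 100


def run(num_steps, drift=0.1):
    """Simulate a policy mean drifting by a constant amount each step."""
    mu = [0.0]
    anchor = list(mu)
    penalties = []
    for step in range(1, num_steps + 1):
        prev = list(mu)
        mu = [m + drift for m in mu]  # stand-in for a policy update
        penalties.append(
            dual_trust_penalty(mu, anchor, prev, 1.0, BETA_POS, BETA_VEL)
        )
        if step % ANCHOR_INTERVAL == 0:
            anchor = list(mu)  # periodic moving anchor: refresh to the current policy
    return penalties


penalties = run(150)
print("penalty just before anchor refresh:", penalties[99])
print("penalty just after anchor refresh :", penalties[100])
```

In this toy setting the position term grows while the anchor is stale and drops once the anchor is refreshed, while the velocity term stays constant. The balance between the two terms depends directly on β_pos, β_vel, and N, which is why their absence from the main text matters for reproduction.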
Group Relative Policy Optimization (GRPO) methods for video generation like FlowGRPO remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise, lowering rollout quality and making reward estimates less reliable, which destabilizes post-training alignment. To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable. We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a precise manifold-aware SDE with a logarithmic curvature correction and introduce a gradient norm equalizer to stabilize sampling and updates across timesteps. At the macro level, we use a dual trust region with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift. We evaluate SAGE-GRPO on HunyuanVideo1.5 using the original VideoAlign as the reward model and observe consistent gains over previous methods in VQ, MQ, TA, and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality. The code and visual gallery are available at https://dungeonmassster.github.io/SAGE-GRPO-Page/.