Proximal Policy Optimization in Path Space: A Schrödinger Bridge Perspective

cs.LG Yuehu Gong, Zeyuan Wang, Yulin Chen, Yanwei Fu · Mar 23, 2026
AI Review
Plain-language introduction

Generative policies represent actions as multi-step denoising trajectories, rendering standard PPO's single-step action-space ratios mismatched to the policy structure. This paper proposes GSB-PPO, a path-space formulation inspired by Generalized Schrödinger Bridge that lifts proximal updates from terminal actions to full generation paths. The central finding is that a penalty-based objective substantially outperforms the direct clipping extension, establishing trajectory-level regularization as the preferred inductive bias for on-policy generative RL.

Critical review
Verdict
Bottom line

The paper offers a conceptually clean extension of PPO to generative policies through path-space objectives, backed by consistent empirical results across ten continuous control environments. The authors are commendably transparent that their formulation is 'inspired by' rather than derived from GSB, and the stark performance gap between penalty and clipping variants provides a clear practical lesson. However, the deliberate restriction to comparisons with only standard PPO and FPO limits broader empirical validation, while the poor performance of the clipping variant raises questions about the general fragility of path-space likelihood ratios.

“Our formulation is inspired by the GSB perspective, but it is not a strict GSB objective: in particular, the penalty variant remains a PPO style proximal objective rather than an exact bridge matching formulation.”
Gong et al. (GSB-PPO) · Section 1
What holds up

The path-space lifting elegantly resolves the structural mismatch between PPO's single-step ratios and multi-step generative processes. The mathematical derivation showing equivalence between marginal and joint expectations (Eq. 14) justifies the trajectory-level formulation, while the concrete instantiation of two objectives—clipping and penalty—provides reproducible algorithmic details. The MSE-style penalty $\mathcal{R}_{\text{MSE}}(\theta,\theta_{\text{old}})=\mathbb{E}[\sum_{n=1}^N \frac{|\Delta t_n|}{2\sigma(t_n)^2} \|f_\theta(a^{(n)},t_n,s) - f_{\theta_{\text{old}}}(a^{(n)},t_n,s)\|_2^2]$ offers a tractable surrogate for path-space divergence.

“This is exactly the penalty used in our implementation.”
Gong et al. (GSB-PPO) · Section 4.3
“Equation (14) provides the bridge from standard PPO to our path space formulation.”
Gong et al. (GSB-PPO) · Section 4.1
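To make the MSE-style penalty concrete, the following is a minimal sketch of a single-sample estimate of $\mathcal{R}_{\text{MSE}}$ for one stored denoising path. It is not the authors' implementation: the function name `mse_path_penalty`, the drift signature `f(a_n, t_n, s)`, and the choice to pass the step sizes $\Delta t_n$ explicitly are assumptions made for illustration, and the toy drifts and noise scale are invented values.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
# Hypothetical drift signature: f(a_n, t_n, s) returns a vector shaped like a_n.
Drift = Callable[[Vector, float, Vector], Vector]


def mse_path_penalty(
    f_new: Drift,
    f_old: Drift,
    path: Sequence[Vector],      # a^(1), ..., a^(N) from one sampled path
    times: Sequence[float],      # t_1, ..., t_N
    dts: Sequence[float],        # Delta t_1, ..., Delta t_N (passed in, not derived)
    sigma: Callable[[float], float],
    state: Vector,
) -> float:
    """Sum over steps of |dt_n| / (2 sigma(t_n)^2) * ||f_new - f_old||^2."""
    total = 0.0
    for a_n, t_n, dt_n in zip(path, times, dts):
        new = f_new(a_n, t_n, state)
        old = f_old(a_n, t_n, state)
        sq_dist = sum((x - y) ** 2 for x, y in zip(new, old))
        total += abs(dt_n) / (2.0 * sigma(t_n) ** 2) * sq_dist
    return total


if __name__ == "__main__":
    # Toy drifts (assumptions): the new drift is a slightly scaled old drift.
    f_old = lambda a, t, s: [-x for x in a]
    f_new = lambda a, t, s: [-1.1 * x for x in a]
    penalty = mse_path_penalty(
        f_new, f_old,
        path=[[1.0, 0.0], [0.5, 0.5]],
        times=[1.0, 0.5],
        dts=[-0.5, -0.5],
        sigma=lambda t: 1.0,
        state=[0.0],
    )
    print(penalty)
```

Averaging this quantity over a batch of sampled paths would approximate the expectation in the formula. The penalty is zero whenever the two drifts agree along the path, which is the sense in which it acts as a proximal term around $\theta_{\text{old}}$.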
Main concerns

The clipping variant performs poorly, which the authors attribute to its being 'sensitive to the accumulation of likelihood shifts across denoising steps' (Section 1). Because the path-space ratio $r_\theta(s,a^{(0:N)}) = \prod_{n=1}^N \frac{p_\theta(a^{(n-1)}|a^{(n)},s)}{p_{\theta_{\text{old}}}(a^{(n-1)}|a^{(n)},s)}$ is a product of $N$ per-step ratios, small per-step deviations compound multiplicatively, which suggests the ratio may be inherently unstable for multi-step trajectories. The experimental scope is explicitly narrow: the authors state 'the purpose of this experiment is not to provide an exhaustive benchmark against all existing generative RL methods' (Section 5.2) and do not compare against GenPO, DPPO, or other contemporary methods. The GSB inspiration provides intuition but little theoretical grounding, as the paper acknowledges that the penalty objective is 'not a strict GSB objective' but rather a 'PPO style proximal objective'.

“While the clipping objective is the most direct extension of standard PPO, it is sensitive to the accumulation of likelihood shifts across denoising steps and performs poorly in our setting.”
Gong et al. (GSB-PPO) · Section 1
“We stress that the purpose of this experiment is not to provide an exhaustive benchmark against all existing generative RL methods.”
Gong et al. (GSB-PPO) · Section 5.2
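The accumulation concern can be illustrated with a small sketch. It forms the path-space ratio as the exponential of summed per-step log-ratios and feeds it to the standard PPO clipped surrogate. The function names, the clip range `eps=0.2`, and the 5% per-step shift are assumptions chosen for illustration, not values taken from the paper.

```python
import math
from typing import Sequence


def path_ratio(logp_new: Sequence[float], logp_old: Sequence[float]) -> float:
    """Path-space ratio: product of per-step ratios, computed in log space."""
    log_ratio = sum(n - o for n, o in zip(logp_new, logp_old))
    return math.exp(log_ratio)


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped objective for one sample (eps is an assumed value)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    n_steps = 8  # matches the 8 denoising steps reported in Appendix A
    # Assume every step's likelihood shifts by only 5%.
    logp_old = [0.0] * n_steps
    logp_new = [math.log(1.05)] * n_steps
    r = path_ratio(logp_new, logp_old)
    print(r)                          # equals 1.05 ** 8, well above 1 + eps
    print(clipped_surrogate(r, 1.0))  # the positive-advantage term is clipped
```

Under these assumed numbers, a per-step shift that would sit comfortably inside a single-step clip range pushes the full path ratio outside it, so the clipped objective stops providing gradient signal for that sample. This is one plausible reading of the sensitivity the authors describe, not a reproduction of their analysis.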
Evidence and comparison

The evidence clearly supports the superiority of GSB-PPO-Penalty over GSB-PPO-Clip, with Figures 1 and 2 showing consistent performance gains across ten playground environments. The paper documents that 'the penalty formulation consistently delivers better stability and performance than the clipping counterpart' (Abstract). However, the comparison to prior work is limited: while standard PPO and FPO baselines are included, the paper explicitly avoids benchmarking against other on-policy generative methods like GenPO or DPPO, making it difficult to assess whether the penalty formulation represents a net improvement over the broader literature.

“Experimental results show that while both objectives are compatible with on-policy training, the penalty formulation consistently delivers better stability and performance than the clipping counterpart.”
Gong et al. (GSB-PPO) · Abstract
Reproducibility

The implementation builds upon the publicly available FPO playground codebase, and Appendix A provides detailed hyperparameters including learning rates ($3 \times 10^{-3}$), denoising steps (8), and the KL penalty coefficient ($\beta=0.1$). However, no public code repository or open-source release is mentioned in the manuscript, which may impede exact reproduction. Additionally, the necessity of 'step-level log-ratio clipping for numerical stabilization' (Section 5.1) suggests the optimization involves sensitive numerical dynamics that may not transfer cleanly without the specific implementation details.

“For numerical stability, our implementation applies an additional step-level clipping to the per-step log-ratio before forming the full path ratio.”
Gong et al. (GSB-PPO) · Section 5.1
“Table 1 summarizes the main hyperparameters used for PPO, FPO, and GSB-PPO-Penalty in our playground experiments.”
Gong et al. (GSB-PPO) · Appendix A
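The step-level stabilization quoted from Section 5.1 might look like the sketch below, which clips each per-step log-ratio before summing and exponentiating. The function name and the bound `step_clip=0.5` are hypothetical; the review does not state the paper's actual clip value or exact procedure.

```python
import math
from typing import Sequence


def stabilized_path_ratio(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    step_clip: float = 0.5,  # hypothetical bound on each per-step log-ratio
) -> float:
    """Clip every per-step log-ratio to [-step_clip, step_clip], then form the path ratio."""
    total = 0.0
    for n, o in zip(logp_new, logp_old):
        step = n - o
        total += max(-step_clip, min(step_clip, step))
    return math.exp(total)


if __name__ == "__main__":
    # One outlier step among 8 (invented numbers) dominates the unclipped ratio.
    logp_old = [0.0] * 8
    logp_new = [10.0] + [0.0] * 7
    unclipped = math.exp(sum(n - o for n, o in zip(logp_new, logp_old)))
    print(unclipped)
    print(stabilized_path_ratio(logp_new, logp_old))
```

With this scheme the path ratio is always bounded between `exp(-N * step_clip)` and `exp(N * step_clip)` for `N` steps, so a single extreme step cannot overflow the ratio. The trade-off is that clipped steps contribute no gradient, which is part of why such details matter for reproduction.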
Abstract

On-policy reinforcement learning with generative policies is promising but remains underexplored. A central challenge is that proximal policy optimization (PPO) is traditionally formulated in terms of action-space probability ratios, whereas diffusion- and flow-based policies are more naturally represented as trajectory-level generative processes. In this work, we propose GSB-PPO, a path-space formulation of generative PPO inspired by the Generalized Schrödinger Bridge (GSB). Our framework lifts PPO-style proximal updates from terminal actions to full generation trajectories, yielding a unified view of on-policy optimization for generative policies. Within this framework, we develop two concrete objectives: a clipping-based objective, GSB-PPO-Clip, and a penalty-based objective, GSB-PPO-Penalty. Experimental results show that while both objectives are compatible with on-policy training, the penalty formulation consistently delivers better stability and performance than the clipping counterpart. Overall, our results highlight path-space proximal regularization as an effective principle for training generative policies with PPO.
