P^2O: Joint Policy and Prompt Optimization

cs.LG cs.AI Xinyu Lu, Kaiqi Zhang, Jinglin Yang, Boxi Cao, Yaojie Lu, Hongyu Lin, Min He, Xianpei Han, Le Sun · Mar 23, 2026

Plain-language introduction

P^2O tackles a bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR): hard samples with near-zero success rates yield vanishing gradients, effectively starving the model of supervision signals. The method combines policy optimization with evolutionary prompt optimization (GEPA): optimized prompts are used to discover successful trajectories for hard samples, and the resulting capabilities are then distilled into the model parameters via context distillation, so the prompts are not needed at inference time. Experiments on mathematical reasoning benchmarks show gains over the GRPO baseline, particularly on challenging AIME problems (+12.3% avg.).

Critical review

Verdict

Bottom line

P^2O presents a technically sound framework that addresses the exploration bottleneck in RLVR for reasoning tasks. The alternating maximization between policy updates and prompt evolution, coupled with context distillation to internalize prompt-induced capabilities, is well-motivated and empirically validated. The paper demonstrates consistent improvements over GRPO, though the 2.8-point gap between the Teacher-Ref and Self-Ref variants on DeepScaler-5K (65.2% vs. 62.4%) suggests the method's practical effectiveness may depend on access to a high-quality external reflection model.

“Specifically, P2O identifies challenging instances during training and employs the Genetic-Pareto (GEPA) prompt optimization algorithm to evolve prompts that elicit successful reasoning chains.”

Paper · Section 1

“On the DeepScaler-5K dataset, our best configuration achieves an average accuracy of 65.2%, surpassing the GRPO baseline by 4.7%.”

Paper · Section 4.2

What holds up

The formulation of the exploration bottleneck is rigorous: Equation 2 formalizes $\nabla_\theta J(x) \approx \mathbb{E}_{y\sim\pi_{\theta}}[\underbrace{(r(x,y)-b)}_{\approx 0}\nabla_{\theta}\log\pi_{\theta}(y|x)] \approx 0$ for hard samples, providing a clear mathematical justification for the approach. The context distillation mechanism is particularly elegant: it computes gradients on the original input $x$ while using trajectories generated from augmented inputs $\tilde{x}=\mathcal{T}(x,z)$, forcing internalization of reasoning patterns. The ablation study shows that removing context distillation degrades average accuracy to 55.6%, below even the GRPO baseline (60.5%), which indicates that the distillation component is essential rather than merely auxiliary.
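The vanishing-signal argument can be illustrated with a minimal sketch. It assumes binary verifier rewards and the commonly used GRPO-style group normalization (mean-centered, divided by the group standard deviation); the paper's Equation 2 only specifies a generic baseline $b$, so the exact normalization here is an assumption, and `group_advantages` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center on the group mean, scale by group std.

    Assumption: GRPO-style normalization; the reviewed paper may use a
    different baseline b.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps  # eps avoids division by zero
    return [(r - baseline) / scale for r in rewards]


# K = 6 rollouts, matching the rollout count reported in the paper.
hard = group_advantages([0, 0, 0, 0, 0, 0])   # hard sample: every rollout fails
mixed = group_advantages([0, 0, 1, 0, 0, 0])  # one rollout succeeds

print(hard)   # every advantage is zero, so the policy-gradient term vanishes
print(mixed)  # the successful rollout gets a positive advantage
```

When every rollout in a group fails, each reward equals the baseline and the sample contributes no gradient, which is the situation P^2O tries to escape by finding at least one successful trajectory.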

“By decoupling the rollout context ($\tilde{x}$) from the gradient context ($x$), P2O forces the model to learn the reasoning path $y$ as an intrinsic capability, independent of the auxiliary prompt $z$.”

Paper · Section 3.3

“the average accuracy drops significantly from 65.2% (P2O$_{\text{Teacher-Ref}}$) to 55.6%... falls behind the GRPO baseline (60.5%) by 4.9%”

Paper · Section 4.4
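The decoupling of rollout context and gradient context described above might look roughly like the following sketch. Everything here is hypothetical scaffolding rather than the authors' implementation: `apply_template` stands in for $\mathcal{T}(x,z)$, while `sample` and `verify` are caller-supplied stand-ins for the policy sampler and the reward verifier.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TrainingExample:
    gradient_context: str  # input the loss conditions on: the original x
    response: str          # trajectory y, sampled from the augmented input
    reward: float


def apply_template(x: str, z: str) -> str:
    """Hypothetical T(x, z): prepend the evolved prompt z to the question x."""
    return f"{z}\n\n{x}"


def collect_distillation_examples(
    x: str,
    z: str,
    sample: Callable[[str], str],
    verify: Callable[[str, str], float],
    k: int = 6,
) -> List[TrainingExample]:
    """Roll out from x_tilde = T(x, z) but record x as the gradient context."""
    x_tilde = apply_template(x, z)
    examples = []
    for _ in range(k):
        y = sample(x_tilde)  # rollout context: augmented input
        examples.append(
            TrainingExample(gradient_context=x, response=y, reward=verify(x, y))
        )
    return examples


# Toy stand-ins (assumptions for illustration only).
def fake_sample(prompt: str) -> str:
    return "answer: 4" if "step by step" in prompt else "answer: 5"


def fake_verify(x: str, y: str) -> float:
    return 1.0 if y.endswith("4") else 0.0


examples = collect_distillation_examples(
    "What is 2 + 2?", "Think step by step.", fake_sample, fake_verify, k=2
)
```

A training step would then score $\log\pi_\theta(y|x)$ on `gradient_context` rather than on the augmented prompt, so the auxiliary prompt $z$ never appears in the input the model is updated on.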

Main concerns

The computational cost of the GEPA phase is not fully characterized. Algorithm 3 shows that each iteration consumes $2B + |\mathcal{D}_\text{hard}^\text{dev}|$ evaluations per candidate template, which could be prohibitive when scaling to larger datasets or model sizes. The threshold $\tau$ for hard sample mining is described only as "typically nearly zero" without sensitivity analysis or theoretical guidance for setting this hyperparameter. Most critically, the paper lacks any theoretical analysis of convergence properties for the alternating maximization procedure, leaving open questions about whether the joint optimization guarantees monotonic improvement or risks oscillation between suboptimal prompt and policy configurations.

“Update Cost: $C_\text{left} \leftarrow C_\text{left} - 2B$... $C_\text{left} \leftarrow C_\text{left} - |\mathcal{D}_\text{hard}^\text{dev}|$”

Paper · Algorithm 3

“where $\tau$ is a threshold (typically nearly zero)”

Paper · Section 3.3
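To make the two concerns above concrete, here is a small sketch of hard-sample mining and of the budget accounting implied by Algorithm 3. The comparison `rate <= tau`, the function names, and all numeric values are assumptions for illustration; the paper does not report $\tau$, $B$, $|\mathcal{D}_\text{hard}^\text{dev}|$, or $C_\text{total}$ in the excerpts quoted here, and the per-iteration cost is treated as an upper bound in which both deductions occur every iteration.

```python
def mine_hard_samples(success_rates, tau=0.0):
    """Return sample ids whose empirical success rate is at most tau.

    Assumption: a sample counts as hard when rate <= tau.
    """
    return [sid for sid, rate in success_rates.items() if rate <= tau]


def max_gepa_iterations(c_total, minibatch_size, dev_size):
    """Upper bound on iterations if each one costs 2B + |D_hard^dev| evaluations."""
    per_iteration = 2 * minibatch_size + dev_size
    return c_total // per_iteration


# Hypothetical values, chosen only to show the arithmetic.
rates = {"q1": 0.0, "q2": 0.5, "q3": 1 / 6}
hard_ids = mine_hard_samples(rates, tau=0.0)
iterations = max_gepa_iterations(c_total=1000, minibatch_size=8, dev_size=64)

print(hard_ids)
print(iterations)
```

With these illustrative numbers each iteration costs 80 evaluations, so a budget of 1000 allows 12 iterations. The same calculation shows how quickly the dev-set term dominates as the pool of hard samples grows.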

Evidence and comparison

The empirical evidence supports the claim that P^2O outperforms GRPO on hard reasoning tasks, with the largest gains on AIME24 (+12.9%) and AIME25 (+11.7%) using the DeepScaler-5K dataset. However, the comparison to related work is incomplete: DAPO and outcome-based exploration methods (Song et al., 2025) are discussed in Section 5 but are absent from the experimental comparison tables. Neither reflection variant is consistently better (Teacher-Ref wins on DeepScaler, Self-Ref wins on DeepMath), which complicates the practical recommendation and suggests the method may require domain-specific tuning of the reflection model.

“P2O improves upon GRPO by 12.9% and 11.7% respectively”

Paper · Section 4.2

“on DeepScaler-5K, the Teacher-Reflection variant dominates (65.2% vs. 62.4% avg.), whereas on DeepMath-5K, P2O$_{\text{Self-Ref}}$ (61.7% avg.) surprisingly outperforms”

Paper · Section 4.2
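As a quick consistency check, the figures quoted in this review can be recomputed from one another. The sketch below uses only numbers that appear in the quoted passages; the dictionary layout and key names are hypothetical, and differences are treated as percentage points.

```python
# Average accuracies quoted for DeepScaler-5K (percent).
reported_avg = {
    "grpo": 60.5,
    "p2o_teacher_ref": 65.2,
    "p2o_self_ref": 62.4,
    "p2o_without_context_distillation": 55.6,
}
# Gains over GRPO quoted for the two AIME benchmarks (percent).
aime_gains = {"aime24": 12.9, "aime25": 11.7}

gain_over_grpo = round(reported_avg["p2o_teacher_ref"] - reported_avg["grpo"], 1)
ablation_deficit = round(
    reported_avg["grpo"] - reported_avg["p2o_without_context_distillation"], 1
)
teacher_vs_self = round(
    reported_avg["p2o_teacher_ref"] - reported_avg["p2o_self_ref"], 1
)
mean_aime_gain = round(sum(aime_gains.values()) / len(aime_gains), 1)

print(gain_over_grpo, ablation_deficit, teacher_vs_self, mean_aime_gain)
```

The recomputed values agree with the quoted 4.7 and 4.9 point differences, and the mean of the two AIME gains matches the 12.3 figure cited in the introduction.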

Reproducibility

The paper provides detailed experimental configuration including hyperparameters (learning rate $1\times 10^{-6}$, global batch size 128, temperature $T=0.6$, $K=6$ rollouts), model specifications (Qwen3-4B), and dataset descriptions (DeepScaler-5K, DeepMath-5K). However, no code repository or open-source implementation is indicated. While Algorithms 1-4 provide pseudocode for the complete training procedure, the specific meta-prompt for "Propose Improvement" is referenced as adopted from Agrawal et al. (2025) rather than being explicitly stated. The exact sizes of the train/dev splits for hard samples ($|\mathcal{D}_\text{hard}^\text{train}|$, $|\mathcal{D}_\text{hard}^\text{dev}|$) and the total evolution budget $C_\text{total}$ are not specified, which could impede independent reproduction.

“The training hyperparameters include a maximum learning rate of $1\times 10^{-6}$, a global batch size of 128, and a maximum generation length of 12k tokens... temperature of $T=0.6$ and sample $K=6$ trajectories”

Paper · Section 4.1

“We adopt the same meta-prompt as in Agrawal et al. (2025)”

Paper · Section 3.4
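For anyone attempting a reproduction, the reported settings and the unreported ones can be separated explicitly. The following is a sketch of such a configuration record; the class and field names are hypothetical, the filled values are those quoted from Section 4.1, and fields the paper leaves unspecified are set to `None` rather than guessed.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class P2OReportedConfig:
    # Values quoted in the review from the paper's Section 4.1.
    model: str = "Qwen3-4B"
    max_learning_rate: float = 1e-6
    global_batch_size: int = 128
    max_generation_length: str = "12k tokens"  # exact token count not given
    temperature: float = 0.6
    rollouts_per_prompt: int = 6
    # Not specified in the paper according to the review; left unset.
    hard_threshold_tau: Optional[float] = None  # "typically nearly zero"
    hard_train_size: Optional[int] = None
    hard_dev_size: Optional[int] = None
    evolution_budget_total: Optional[int] = None


def missing_fields(cfg: P2OReportedConfig) -> list:
    """List the settings a reproduction would still have to choose."""
    return [name for name, value in asdict(cfg).items() if value is None]


config = P2OReportedConfig()
print(missing_fields(config))
```

Keeping the unknowns as explicit `None` entries makes it obvious which choices in a reproduction are the reproducer's own rather than the authors'.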

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting "hard samples" that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these instances. To address this, we propose P^2O, a novel framework that synergizes Prompt Optimization with Policy Optimization. P^2O identifies hard samples during training iterations and leverages the Genetic-Pareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward discovering successful trajectories. Crucially, unlike traditional prompt engineering methods that rely on input augmentation, P^2O distills the reasoning gains induced by these optimized prompts directly into the model parameters. This mechanism provides denser positive supervision signals for hard samples and accelerates convergence. Extensive experiments demonstrate that P^2O not only achieves superior performance on in-distribution datasets but also exhibits strong generalization, yielding substantial improvements on out-of-distribution benchmarks (+4.7% avg.).
