P^2O: Joint Policy and Prompt Optimization
P^2O tackles a critical bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR): hard samples with near-zero success rates yield vanishing gradients, effectively starving the model of supervision signals. The solution synergizes policy optimization with evolutionary prompt optimization (GEPA), using optimized prompts to discover successful trajectories for hard samples, then distilling these capabilities into model parameters via context distillation to avoid inference-time dependencies. Experiments on mathematical reasoning benchmarks demonstrate significant gains over GRPO baselines, particularly on challenging AIME problems (+12.3% avg.).
P^2O presents a technically sound framework that effectively addresses the exploration bottleneck in RLVR for reasoning tasks. The alternating maximization between policy updates and prompt evolution, coupled with context distillation to internalize prompt-induced capabilities, is well-motivated and empirically validated. The paper demonstrates consistent improvements over GRPO, though the significant performance gap between Teacher-Ref and Self-Ref variants (65.2% vs 62.4%) suggests the method's practical effectiveness may depend heavily on access to high-quality external reflection models.
The formulation of the exploration bottleneck is rigorous: Equation 2 formalizes $\nabla_\theta J(x) \approx \mathbb{E}_{y\sim\pi_{\theta}}[\underbrace{(r(x,y)-b)}_{\approx 0}\nabla_{\theta}\log\pi_{\theta}(y|x)] \approx 0$ for hard samples, providing a clear mathematical justification for the approach. The context distillation mechanism is particularly elegant—it computes gradients on the original input $x$ while using trajectories generated from augmented inputs $\tilde{x}=\mathcal{T}(x,z)$, forcing internalization of reasoning patterns. The ablation study convincingly demonstrates that removing context distillation catastrophically degrades performance to 55.6%, falling below even the GRPO baseline (60.5%), proving that the distillation component is essential rather than merely auxiliary.
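The two mechanisms described above can be illustrated with a minimal sketch. It shows (a) how group-standardized advantages collapse to zero when every rollout fails, and (b) how context distillation pairs trajectories sampled from the augmented input $\tilde{x}$ with the original input $x$. Everything in it is an assumption made for illustration: `group_advantages`, `build_batch`, the toy sampler, and the hint template are hypothetical names not taken from the paper, and the GRPO-style standardization is only one plausible choice for the baseline $b$.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]


def make_toy_sampler():
    """Hypothetical policy: succeeds on every third call, but only with a hint."""
    state = {"calls": 0}

    def sample(prompt):
        state["calls"] += 1
        if "Hint:" in prompt and state["calls"] % 3 == 0:
            return "42"
        return "0"

    return sample


def reward(y):
    """Verifiable outcome reward for the toy question below."""
    return 1.0 if y == "42" else 0.0


def build_batch(x, template, sample, reward_fn, k=6):
    """Sample from the templated input, but store the ORIGINAL input x.

    With template "{x}" this is plain on-policy sampling. With an optimized
    template it mimics context distillation: the later policy-gradient step
    would use log pi(y | x), not log pi(y | x_tilde).
    """
    x_tilde = template.format(x=x)
    ys = [sample(x_tilde) for _ in range(k)]
    advantages = group_advantages([reward_fn(y) for y in ys])
    return [(x, y, a) for y, a in zip(ys, advantages)]


question = "What is 6 * 7?"
plain_batch = build_batch(question, "{x}", make_toy_sampler(), reward)
distill_batch = build_batch(
    question, "Hint: multiply carefully.\n{x}", make_toy_sampler(), reward
)

print([a for _, _, a in plain_batch])  # all zeros: no learning signal
print([round(a, 2) for _, _, a in distill_batch])  # mixed signs: usable signal
```

In the plain batch every advantage is zero, so the hard sample contributes nothing to the update. In the distillation batch the successful trajectories carry positive advantages while the stored input is still the unaugmented question, which is the property the review credits for removing the inference-time dependency on the prompt.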
The computational cost of the GEPA phase is not fully characterized—Algorithm 3 reveals each iteration consumes $2B + |\mathcal{D}_\text{hard}^\text{dev}|$ evaluations per candidate template, which could be prohibitive when scaling to larger datasets or model sizes. The threshold $\tau$ for hard sample mining is described only as "typically nearly zero" without sensitivity analysis or theoretical guidance for setting this hyperparameter. Most critically, the paper lacks any theoretical analysis of convergence properties for the alternating maximization procedure, leaving open questions about whether the joint optimization guarantees monotonic improvement or risks oscillation between suboptimal prompt and policy configurations.
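As a rough illustration of the cost concern, the sketch below tabulates how the $2B + |\mathcal{D}_\text{hard}^\text{dev}|$ term grows with the dev split. The function name and all numeric values are hypothetical, since the paper does not report $B$, the split sizes, or the number of candidates. Each evaluation is also assumed to be a single rollout.

```python
def gepa_rollout_evaluations(minibatch_size, dev_size, candidates):
    """Total evaluations, assuming each candidate costs 2B + |D_hard^dev|."""
    if min(minibatch_size, dev_size, candidates) < 0:
        raise ValueError("sizes must be non-negative")
    return candidates * (2 * minibatch_size + dev_size)


# Hypothetical sizes; none of these values come from the paper.
for dev_size in (50, 200, 1000):
    total = gepa_rollout_evaluations(minibatch_size=8, dev_size=dev_size, candidates=20)
    print(f"dev={dev_size:5d} -> {total} evaluations")
```

Under these assumptions the dev-set term dominates once $|\mathcal{D}_\text{hard}^\text{dev}|$ exceeds $2B$, so the reported cost would scale roughly linearly with the number of hard samples held out for validation.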
The empirical evidence supports the claim that P^2O outperforms GRPO on hard reasoning tasks, with the largest gains on AIME24 (+12.9%) and AIME25 (+11.7%) using the DeepScaler-5K dataset. However, the comparison to related work is incomplete: DAPO and outcome-based exploration methods (Song et al., 2025) are discussed in Section 5 but are absent from the experimental comparison tables. Neither reflection variant is consistently better across datasets (Teacher-Ref wins on DeepScaler, Self-Ref wins on DeepMath), which complicates the practical recommendation and suggests the method may require domain-specific tuning of the reflection model.
The paper provides detailed experimental configuration including hyperparameters (learning rate $1\times 10^{-6}$, global batch size 128, temperature $T=0.6$, $K=6$ rollouts), model specifications (Qwen3-4B), and dataset descriptions (DeepScaler-5K, DeepMath-5K). However, no code repository or open-source implementation is indicated. While Algorithms 1-4 provide pseudocode for the complete training procedure, the specific meta-prompt for "Propose Improvement" is referenced as adopted from Agrawal et al. (2025) rather than being explicitly stated. The exact sizes of the train/dev splits for hard samples ($|\mathcal{D}_\text{hard}^\text{train}|$, $|\mathcal{D}_\text{hard}^\text{dev}|$) and the total evolution budget $C_\text{total}$ are not specified, which could impede independent reproduction.
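For anyone attempting a reproduction, the reported settings and the gaps noted above can be collected in one place. The following is a sketch only: the class and field names are hypothetical, and `None` marks values that this review identifies as unspecified rather than values known to be absent from every version of the paper.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ReportedConfig:
    # Values reported in the paper, as summarized in this review.
    model: str = "Qwen3-4B"
    learning_rate: float = 1e-6
    global_batch_size: int = 128
    temperature: float = 0.6
    rollouts_per_prompt: int = 6  # K
    # Not specified according to this review; left unset on purpose.
    hard_train_size: Optional[int] = None
    hard_dev_size: Optional[int] = None
    evolution_budget: Optional[int] = None  # C_total


def missing_fields(cfg):
    """Return the names of settings that still need to be chosen."""
    return [f.name for f in fields(cfg) if getattr(cfg, f.name) is None]


config = ReportedConfig()
print(missing_fields(config))
# ['hard_train_size', 'hard_dev_size', 'evolution_budget']
```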
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting "hard samples" that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these instances. To address this, we propose P^2O, a novel framework that synergizes Prompt Optimization with Policy Optimization. P^2O identifies hard samples during training iterations and leverages the Genetic-Pareto (GEPA) prompt optimization algorithm to evolve prompt templates that guide the model toward discovering successful trajectories. Crucially, unlike traditional prompt engineering methods that rely on input augmentation, P^2O distills the reasoning gains induced by these optimized prompts directly into the model parameters. This mechanism provides denser positive supervision signals for hard samples and accelerates convergence. Extensive experiments demonstrate that P^2O not only achieves superior performance on in-distribution datasets but also exhibits strong generalization, yielding substantial improvements on out-of-distribution benchmarks (+4.7% avg.).