Prompt Replay: Speeding Up GRPO with On-Policy Reuse of High-Signal Prompts

cs.LG cs.AI Andrei Baroian, Rutger Berger · Mar 22, 2026

AI Review

Plain-language introduction

GRPO training for LLM reasoning suffers from expensive rollouts and wastes compute on zero-variance prompts, where every sampled answer is correct or every sampled answer is wrong. This paper proposes Prompt Replay, an overhead-free online method that buffers and reuses medium-difficulty prompts (pass rate near 0.5) to maximize gradient signal, while staying on-policy by regenerating responses. By mixing replayed prompts with fresh samples and controlling reuse via cooldown steps and reuse caps, the method aims to accelerate early training, though it eventually plateaus at baseline performance.
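To make the cost of zero-variance prompts concrete, here is a minimal sketch. It assumes binary rewards and rollouts that are independent given the prompt, so a prompt with pass rate p produces an all-correct or all-wrong group of G rollouts with probability p^G + (1-p)^G. The function name and the group size of 16 used in the demo are illustrative choices, not taken from the paper's code.

```python
def zero_variance_probability(pass_rate: float, group_size: int) -> float:
    """Probability that every rollout in a group receives the same binary reward.

    Assumption: each rollout is an independent Bernoulli(pass_rate) trial.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must lie in [0, 1]")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    all_correct = pass_rate ** group_size
    all_wrong = (1.0 - pass_rate) ** group_size
    return all_correct + all_wrong


if __name__ == "__main__":
    # Illustrative group size; prompts near 0.5 almost never collapse.
    for p in (0.05, 0.25, 0.5, 0.75, 0.95):
        prob = zero_variance_probability(p, group_size=16)
        print(f"pass rate {p:.2f}: P(zero-variance group) = {prob:.5f}")
```

Under these assumptions, a prompt with pass rate 0.95 yields a zero-variance group about 44% of the time with 16 rollouts, while a prompt at 0.5 does so with probability of roughly 3e-5, which is the intuition behind preferring medium-difficulty prompts.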

Critical review

Verdict

Bottom line

The paper presents a theoretically clean extension to GRPO that targets a genuine computational bottleneck. The core insight—that prompts with pass rate $p_\theta(x) \approx 0.5$ maximize the variance term $p_\theta(x)(1-p_\theta(x))$ and thus gradient signal—is well-motivated. However, the empirical results are mixed: while the method achieves higher mean absolute advantage and reduces wasted rollouts, it ultimately "plateaus and converges with the baseline, as too aggressive configuration was used." The candid admission that hyperparameter tuning was incomplete and that Qwen2.5-Math exhibits spurious rewards which invalidate ablations tempers the practical impact.

“Prompt Replay... shows faster initial accuracy gains. Yet, it plateaus and converges with the baseline, as too aggressive configuration was used.”

paper · Section 5.1

What holds up

The theoretical justification linking pass rate to optimization efficiency is solid. The derivation establishes that $\mathbb{E}[\|\nabla_\theta J(x)\|^2] \propto p_\theta(x)(1-p_\theta(x))$, showing the function is "strictly concave over $p_\theta(x)\in[0,1]$ and symmetric around its global maximum at $p_\theta(x)=0.5$." The on-policy design—storing only prompts and regenerating responses—neatly avoids off-policy noise while capturing variance benefits. Validation across multiple model families (Llama-3.2-3B, Qwen3-8B) and datasets (Dolci, Polaris) consistently shows reduced zero-variance prompts and increased mean absolute advantage. The identification of spurious reward effects in Qwen2.5-Math is a valuable warning to the community.

“The function $v(x)=p_\theta(x)(1-p_\theta(x))$ is strictly concave over $p_\theta(x)\in[0,1]$ and symmetric around its global maximum at $p_\theta(x)=0.5$.”

paper · Section 4.2

“This can be explained by spurious rewards: Shao et al. (2025) explained the phenomenon, where Qwen 2.5 models, in particular, show an increase in performance even when giving random rewards.”

paper · Section 6.1
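The pass-rate argument in this section can be checked numerically. The sketch below assumes binary rewards and the common GRPO-style group normalization, advantage = (reward - group mean) / (group std + eps); whether the paper's runs use exactly this normalization is an assumption here, and the helper names are hypothetical. With k correct answers out of G, the empirical pass rate is k/G, and the mean absolute advantage works out to about 2·sqrt(p(1-p)), which peaks at p = 0.5 just as the variance term p(1-p) does.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


def mean_abs_advantage(num_correct, group_size):
    """Mean |advantage| for a group with num_correct rewards of 1, rest 0."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    return fmean([abs(a) for a in group_advantages(rewards)])


if __name__ == "__main__":
    group_size = 16  # illustrative group size
    for k in (0, 2, 4, 8, 12, 14, 16):
        p = k / group_size
        print(
            f"pass rate {p:.3f}: p(1-p) = {p * (1 - p):.4f}, "
            f"mean |A| = {mean_abs_advantage(k, group_size):.4f}"
        )
```

Groups at pass rate 0 or 1 give a mean absolute advantage of exactly zero in this sketch, so they contribute no gradient signal, while a group split evenly gives a value of about 1.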

Main concerns

The primary issue is the plateau and the lack of any final performance gain. Despite faster initial wall-clock convergence, the method eventually "plateaus and converges with the baseline," suggesting it trades diversity for speed without improving asymptotic accuracy. The authors acknowledge the hyperparameter search was limited ("small experiments were run... the search space is large"), leaving open whether the plateau stems from a fundamental limitation or from poor configuration. The Qwen2.5-Math spurious reward phenomenon is deeply troubling: Figure 4 shows that training on merely 32 static prompts matches the full-dataset baseline, rendering ablations on this model meaningless and forcing their exclusion from the main results. This raises questions about the robustness of other findings and whether similar pathologies exist in other models. Finally, the "zero additional overhead" claim ignores buffer maintenance costs, though these are likely minor compared to rollouts.

“Prompt Replay... shows: ... earlier gains in the average accuracy over 6 benchmarks, but plateaus and converges with the baseline.”

paper · Section 5.1

“The plateau of the prompt replay can be explained by two factors: (i) the hyperparameter configuration chosen might be too aggressive... An extensive hyperparameter optimization remains for future work.”

paper · Section 5.1

“Baseline OLMo-RL Qwen 2.5 1.5B on Dolci, training on full dataset vs 32 prompts... shockingly, it performs similarly or better than the baseline.”

paper · Figure 4 caption

Evidence and comparison

The evidence strongly supports intermediate efficiency metrics (reduced zero-variance prompts, higher advantage) but fails to demonstrate sustained performance improvements over the baseline. The comparison to related work distinguishes Prompt Replay from trajectory-replay methods (which introduce off-policy noise) and static offline filtering like LIMR. However, the main results lack direct comparisons to recent online filtering methods such as GRESO, or to GRPO variants such as Dr. GRPO, making relative gains hard to assess. The observation that benefits diminish when rollouts are not the bottleneck (Polaris with doubled GPUs) appropriately limits the claimed applicability to scenarios where generation dominates training time.

“Recently, GRESO (Zheng et al., 2025) proposed a method to overcome this measurement tax... However, the authors leave sorting prompts on learnability for future research.”

paper · Section 1

“On Qwen3-8B Polaris, the longer context forced us to double rollout GPUs... so rollouts stopped being the bottleneck and Prompt Replay's compute savings from fewer zero-variance prompts didn't translate into faster training.”

paper · Section 5.1

Reproducibility

The paper builds on the open-source OLMo-RL codebase and provides detailed hyperparameters in Appendix C (batch size 32, 16 rollouts per prompt, learning rate $1.0\times 10^{-6}$), which aids reproduction. Algorithm 1 in Appendix B gives pseudocode for the replay mechanism. However, the actual implementation code is not released or linked. The authors explicitly state that "multiple runs with different seeds were not performed to get statistical significance of the results, an unhealthy practice common in the literature," which limits confidence in the plateauing claims. Computational constraints also prevented thorough hyperparameter sweeps or scaling experiments to industry-scale budgets (3k GPU hours vs 100k in related work), leaving questions about robustness at scale unanswered.

“Computing is also the reason multiple runs with different seeds were not performed to get statistical significance of the results, an unhealthy practice common in the literature.”

paper · Section 6.2

“Learning rate: $1.0\times 10^{-6}$, Batch size: 32, Rollouts per Batch: 16, Prompt replay fraction: 0.75.”

paper · Appendix C

Abstract

Reinforcement learning with verifiable rewards (RLVR) plays a crucial role in expanding the capacities of LLM reasoning, but GRPO-style training is dominated by expensive rollouts and wastes compute on unusable prompts. We propose Prompt Replay, an overhead-free online data selection method for GRPO that reuses prompts only (not trajectories), to preserve on-policy optimization. After each step, we insert prompts with medium difficulty into a buffer, and prioritize prompts closer to a pass rate of 0.5 (half answers correct, half wrong) to maximize the advantage, thus learning signal. Training batches are formed by mixing reused prompts with fresh samples, with cooldown steps and max reuse times controlling aggressiveness vs risk of overfitting. Across multiple model families (Llama-3.2-3B, Qwen3-8B) and training datasets (Dolci, Polaris), evaluated using average accuracy on six standard math benchmarks, Prompt Replay reduces zero-variance prompts, increases mean absolute advantage and shows faster initial accuracy gains. Yet, it plateaus and converges with the baseline, as too aggressive configuration was used. The method is most efficient when the rollouts are the primary bottleneck and the dataset is difficult for the model. We additionally observe that Qwen2.5-Math can exhibit spurious-reward effects that invalidates ablations, raising a warning signal for using it as a sole testbed for GRPO method research.
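The buffer mechanics described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the paper's Algorithm 1 or the OLMo-RL implementation: the class and function names, the 0.25 to 0.75 "medium difficulty" band, and the default cooldown and reuse cap are all hypothetical. Only prompt identifiers are stored, so responses are always regenerated by the current policy.

```python
from dataclasses import dataclass


@dataclass
class BufferEntry:
    prompt_id: str
    pass_rate: float      # most recent observed pass rate
    last_used_step: int   # step at which the prompt was last in a batch
    reuse_count: int = 0  # how many times it has been replayed


class PromptReplayBuffer:
    """Hypothetical prompt-only replay buffer (no trajectories stored)."""

    def __init__(self, low=0.25, high=0.75, cooldown_steps=2, max_reuse=3):
        self.low, self.high = low, high
        self.cooldown_steps = cooldown_steps
        self.max_reuse = max_reuse
        self.entries: dict[str, BufferEntry] = {}

    def update(self, prompt_id, pass_rate, step):
        """After a step, keep the prompt only if it is medium difficulty
        and has not hit the reuse cap; otherwise drop it."""
        old = self.entries.get(prompt_id)
        reuse_count = old.reuse_count if old else 0
        if self.low <= pass_rate <= self.high and reuse_count < self.max_reuse:
            self.entries[prompt_id] = BufferEntry(
                prompt_id, pass_rate, step, reuse_count
            )
        else:
            self.entries.pop(prompt_id, None)

    def sample(self, k, step):
        """Return up to k prompts past their cooldown, closest to 0.5 first."""
        eligible = [
            e for e in self.entries.values()
            if step - e.last_used_step > self.cooldown_steps
        ]
        eligible.sort(key=lambda e: abs(e.pass_rate - 0.5))
        chosen = eligible[:k]
        for e in chosen:
            e.reuse_count += 1
            e.last_used_step = step
        return [e.prompt_id for e in chosen]


def build_batch(buffer, fresh_prompts, batch_size, replay_fraction, step):
    """Mix replayed prompts with fresh ones; fresh prompts fill any shortfall."""
    n_replay = int(batch_size * replay_fraction)
    replayed = buffer.sample(n_replay, step)
    fresh = [next(fresh_prompts) for _ in range(batch_size - len(replayed))]
    return replayed + fresh
```

A training loop would call `build_batch` before each step, for example with a batch size of 32 and a replay fraction of 0.75 as reported in the review's Appendix C quote, then call `update` for every prompt in the batch with its newly measured pass rate. A longer cooldown or lower reuse cap makes replay less aggressive, which is the knob the abstract ties to the overfitting risk.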
