Prompt Replay: Speeding up GRPO with On-Policy Reuse of High-Signal Prompts
GRPO training for LLM reasoning suffers from expensive rollouts and wastes compute on zero-variance prompts, where the sampled answers are either all correct or all wrong. This paper proposes Prompt Replay, an overhead-free online method that buffers and reuses medium-difficulty prompts (pass rate near 0.5) to maximize gradient signal while staying on-policy by regenerating responses. By mixing replayed prompts with fresh samples and controlling reuse via cooldown steps and reuse caps, the method aims to accelerate early training, though it eventually plateaus at baseline performance.
The paper presents a theoretically clean extension to GRPO that targets a genuine computational bottleneck. The core insight—that prompts with pass rate $p_\theta(x) \approx 0.5$ maximize the variance term $p_\theta(x)(1-p_\theta(x))$ and thus gradient signal—is well-motivated. However, the empirical results are mixed: while the method achieves higher mean absolute advantage and reduces wasted rollouts, it ultimately "plateaus and converges with the baseline, as too aggressive configuration was used." The candid admission that hyperparameter tuning was incomplete and that Qwen2.5-Math exhibits spurious rewards which invalidate ablations tempers the practical impact.
The theoretical justification linking pass rate to optimization efficiency is solid. The derivation establishes that $\mathbb{E}[\|\nabla_\theta J(x)\|^2] \propto p_\theta(x)(1-p_\theta(x))$, showing the function is "strictly concave over $p_\theta(x)\in[0,1]$ and symmetric around its global maximum at $p_\theta(x)=0.5$." The on-policy design—storing only prompts and regenerating responses—neatly avoids off-policy noise while capturing variance benefits. Validation across multiple model families (Llama-3.2-3B, Qwen3-8B) and datasets (Dolci, Polaris) consistently shows reduced zero-variance prompts and increased mean absolute advantage. The identification of spurious reward effects in Qwen2.5-Math is a valuable warning to the community.
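The shape of the variance term can be checked numerically. The sketch below is only an illustration, not code from the paper: it evaluates $p_\theta(x)(1-p_\theta(x))$ at every pass rate reachable with 16 rollouts per prompt (the group size the review cites from Appendix C). The function name `reward_variance` and the other identifiers are hypothetical.

```python
def reward_variance(pass_rate: float) -> float:
    """Variance p(1-p) of a binary (correct/incorrect) reward with pass rate p."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must lie in [0, 1]")
    return pass_rate * (1.0 - pass_rate)


GROUP_SIZE = 16  # rollouts per prompt, as cited from the paper's Appendix C

# Variance for every empirical pass rate k/16 that a group of 16 rollouts can produce.
variance_by_correct = {
    k: reward_variance(k / GROUP_SIZE) for k in range(GROUP_SIZE + 1)
}
best_k = max(variance_by_correct, key=variance_by_correct.get)

for k, v in variance_by_correct.items():
    print(f"{k:2d}/{GROUP_SIZE} correct  p={k / GROUP_SIZE:.4f}  p(1-p)={v:.4f}")
print("largest variance at", best_k, "correct rollouts")
```

The table is symmetric around 8 of 16 correct and drops to zero at both ends, which is the zero-variance case the method tries to avoid spending rollouts on.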
The primary issue is the plateauing behavior and lack of final performance gain. Despite faster wall-clock convergence initially, the method "eventually plateaus and converges with the baseline," suggesting it trades diversity for speed without improving asymptotic accuracy. The authors acknowledge the hyperparameter search was limited ("small experiments were run... the search space is large"), leaving open whether the plateau stems from fundamental limitations or poor configuration. The Qwen2.5-Math spurious reward phenomenon is deeply troubling: Figure 4 shows training on merely 32 static prompts matches the full dataset baseline, rendering ablations on this model meaningless and forcing their exclusion from main results. This raises questions about the robustness of other findings and whether similar pathologies exist in other models. Finally, the "zero additional overhead" claim ignores buffer maintenance costs, though these are likely minor compared to rollouts.
The evidence strongly supports intermediate efficiency metrics (reduced zero-variance prompts, higher advantage) but fails to demonstrate sustained performance improvements over the baseline. The comparison to related work distinguishes Prompt Replay from trajectory-replay methods (which introduce off-policy noise) and static offline filtering like LIMR. However, the paper lacks direct comparisons in the main results to recent online filtering methods such as GRESO, or to GRPO variants such as Dr. GRPO, making relative gains hard to assess. The observation that benefits diminish when rollouts are not the bottleneck (Polaris with doubled GPUs) appropriately limits the claimed applicability to scenarios where generation dominates training time.
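For readers unfamiliar with the two intermediate metrics, the following sketch shows one way they could be computed from a batch of binary rewards. It assumes the common group-normalized advantage $(r - \text{mean}) / (\text{std} + \epsilon)$ with a population standard deviation; the paper's exact normalization and logging code may differ, and `group_advantages` and `signal_metrics` are hypothetical names.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (r - mean) / (std + eps) for one prompt.

    Assumption: population std and a small eps; real implementations vary.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def signal_metrics(reward_groups):
    """Fraction of zero-variance prompts and mean |advantage| over all rollouts."""
    zero_variance = sum(1 for group in reward_groups if max(group) == min(group))
    abs_advantages = [
        abs(a) for group in reward_groups for a in group_advantages(group)
    ]
    return {
        "zero_variance_fraction": zero_variance / len(reward_groups),
        "mean_abs_advantage": fmean(abs_advantages),
    }


# Four prompts with four rollouts each (1 = correct, 0 = wrong).
reward_groups = [
    [1, 1, 1, 1],  # all correct: every advantage is zero
    [0, 0, 0, 0],  # all wrong: every advantage is zero
    [1, 0, 1, 0],  # pass rate 0.5
    [1, 0, 0, 0],  # pass rate 0.25
]
metrics = signal_metrics(reward_groups)
print(metrics)
```

In this toy batch half of the prompts contribute no gradient at all, which is why replacing them with prompts near a pass rate of 0.5 raises the mean absolute advantage.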
The paper builds on the open-source OLMo-RL codebase and provides detailed hyperparameters in Appendix C (batch size 32, 16 rollouts per prompt, learning rate $1.0\times 10^{-6}$), which aids reproduction. Algorithm 1 in Appendix B gives pseudocode for the replay mechanism. However, the actual implementation code is not released or linked. The authors explicitly state that "multiple runs with different seeds were not performed to get statistical significance of the results, an unhealthy practice common in the literature," which limits confidence in the plateauing claims. Computational constraints also prevented thorough hyperparameter sweeps or scaling experiments to industry-scale budgets (3k GPU hours vs 100k in related work), leaving questions about robustness at scale unanswered.
Reinforcement learning with verifiable rewards (RLVR) plays a crucial role in expanding the capacities of LLM reasoning, but GRPO-style training is dominated by expensive rollouts and wastes compute on unusable prompts. We propose Prompt Replay, an overhead-free online data selection method for GRPO that reuses prompts only (not trajectories), to preserve on-policy optimization. After each step, we insert prompts with medium difficulty into a buffer, and prioritize prompts closer to a pass rate of 0.5 (half answers correct, half wrong) to maximize the advantage, thus learning signal. Training batches are formed by mixing reused prompts with fresh samples, with cooldown steps and max reuse times controlling aggressiveness vs risk of overfitting. Across multiple model families (Llama-3.2-3B, Qwen3-8B) and training datasets (Dolci, Polaris), evaluated using average accuracy on six standard math benchmarks, Prompt Replay reduces zero-variance prompts, increases mean absolute advantage and shows faster initial accuracy gains. Yet, it plateaus and converges with the baseline, as too aggressive configuration was used. The method is most efficient when the rollouts are the primary bottleneck and the dataset is difficult for the model. We additionally observe that Qwen2.5-Math can exhibit spurious-reward effects that invalidate ablations, raising a warning signal for using it as a sole testbed for GRPO method research.
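The mechanism described in the abstract can be sketched roughly as follows. This is not the paper's Algorithm 1 or its implementation: the class and function names, the pass-rate band (0.25 to 0.75), the replay fraction, and the default cooldown and reuse cap are all hypothetical values chosen for illustration. Only prompt identifiers are stored, so responses would be regenerated by the current policy each time a prompt is replayed.

```python
from dataclasses import dataclass


@dataclass
class BufferEntry:
    pass_rate: float      # most recent observed pass rate for this prompt
    last_used_step: int   # training step at which the prompt was last rolled out
    reuse_count: int = 0  # how many times the prompt has been replayed


class PromptReplayBuffer:
    """Stores prompt ids only; responses are regenerated on every reuse."""

    def __init__(self, low=0.25, high=0.75, cooldown_steps=2, max_reuse=3):
        # Assumption: "medium difficulty" is modeled as a pass-rate band [low, high].
        self.low, self.high = low, high
        self.cooldown_steps = cooldown_steps
        self.max_reuse = max_reuse
        self.entries = {}

    def update(self, prompt_id, pass_rate, step):
        """Record the latest pass rate; drop prompts outside the difficulty band."""
        if not (self.low <= pass_rate <= self.high):
            self.entries.pop(prompt_id, None)
            return
        entry = self.entries.get(prompt_id)
        if entry is None:
            self.entries[prompt_id] = BufferEntry(pass_rate, step)
        else:
            entry.pass_rate = pass_rate
            entry.last_used_step = step

    def sample(self, k, step):
        """Return up to k eligible prompt ids, closest to a pass rate of 0.5 first."""
        eligible = [
            pid for pid, e in self.entries.items()
            if step - e.last_used_step > self.cooldown_steps
            and e.reuse_count < self.max_reuse
        ]
        eligible.sort(key=lambda pid: abs(self.entries[pid].pass_rate - 0.5))
        chosen = eligible[:k]
        for pid in chosen:
            self.entries[pid].reuse_count += 1
            self.entries[pid].last_used_step = step
        return chosen


def build_batch(buffer, fresh_prompts, batch_size, replay_fraction, step):
    """Mix replayed prompt ids with fresh ones, topping up with fresh prompts."""
    replayed = buffer.sample(int(batch_size * replay_fraction), step)
    return replayed + fresh_prompts[: batch_size - len(replayed)]


buffer = PromptReplayBuffer(cooldown_steps=2, max_reuse=1)
buffer.update("a", 0.5, step=0)
buffer.update("b", 0.25, step=0)
buffer.update("c", 1.0, step=0)  # zero-variance prompt, never stored
batch = build_batch(buffer, ["f1", "f2", "f3", "f4"], batch_size=4,
                    replay_fraction=0.5, step=3)
print(batch)
```

A shorter cooldown or a higher reuse cap makes the replay more aggressive, which is the trade-off against overfitting that the abstract attributes the eventual plateau to.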