On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
This paper investigates how Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning by focusing on the *direction* of policy updates rather than their magnitude. The authors introduce $\Delta \log p$, the signed log-probability difference between base and RLVR models, and argue it better captures reasoning-critical tokens than magnitude-based metrics like entropy or KL divergence. They validate this through token-replacement interventions and propose two practical applications: a test-time extrapolation method that amplifies the learned direction without additional training, and a training-time reweighting scheme that focuses learning on low-probability tokens.
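To make the central quantity concrete, the following is a minimal sketch of how a per-token $\Delta \log p$ could be computed from the logits of two models on the same token sequence. The sign convention (RLVR minus base, so tokens whose probability RLVR raised come out positive) is an assumption inferred from the abstract, and `delta_log_p` and its arguments are hypothetical names, not the authors' code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def delta_log_p(base_logits, rlvr_logits, token_ids):
    """Signed per-position difference log p_rlvr(y_t) - log p_base(y_t).

    ASSUMPTION: sign convention is RLVR minus base.
    base_logits / rlvr_logits: one list of logits per position.
    token_ids: the token actually present at each position.
    """
    deltas = []
    for b, r, y in zip(base_logits, rlvr_logits, token_ids):
        deltas.append(log_softmax(r)[y] - log_softmax(b)[y])
    return deltas

# Toy example: a 3-token vocabulary, one position.
base = [[0.0, 0.0, 0.0]]            # uniform: p = 1/3 each
rlvr = [[math.log(2.0), 0.0, 0.0]]  # p = (1/2, 1/4, 1/4)
up = delta_log_p(base, rlvr, [0])   # probability of token 0 was raised
down = delta_log_p(base, rlvr, [1]) # probability of token 1 was lowered
```

Unlike entropy or KL divergence, the result keeps its sign, so positions where RLVR promoted a token are distinguishable from positions where it suppressed one.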
The paper presents a compelling conceptual shift from magnitude-based to direction-based analysis of RLVR updates, supported by strong empirical evidence that $\Delta \log p$ more precisely identifies sparse, reasoning-critical tokens. The proposed test-time extrapolation and training-time reweighting methods demonstrate consistent gains across models and benchmarks. However, the theoretical justification relies on simplified tabular bandit assumptions that don't fully capture the complexities of policy gradient methods with clipped objectives, and the practical utility of the two-model extrapolation method is limited by computational costs.
The token-level analysis is rigorous and well-executed. The statistical comparison clearly demonstrates that while entropy and KL divergence distributions overlap nearly completely between base and RLVR models, $\Delta \log p$ exhibits a distinct bimodal pattern that separates the two. The selective token replacement experiment provides strong causal evidence: replacing only 10% of tokens selected by $\Delta \log p$ recovers RLVR performance, whereas entropy requires significantly more replacements. The gradient analysis (Lemma 3.1) correctly derives that policy gradients concentrate on low-probability tokens via the $(1 - \pi_\theta(y_t))$ term, providing a theoretical foundation for the observed sparsity.
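The $(1 - \pi_\theta(y_t))$ factor follows from a standard softmax identity: the gradient of $\log \pi(y)$ with respect to the logit of $y$ itself is $1 - \pi(y)$, so a sampled token that is already likely receives almost no push. The sketch below checks that identity numerically on a toy logit vector. It illustrates the general identity only and is not a reproduction of Lemma 3.1; all function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def grad_log_prob(logits, y):
    """Analytic gradient of log pi(y) w.r.t. every logit: onehot(y) - pi."""
    pi = softmax(logits)
    return [(1.0 if k == y else 0.0) - pi[k] for k in range(len(logits))]

def numeric_grad(logits, y, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for k in range(len(logits)):
        up = list(logits)
        dn = list(logits)
        up[k] += eps
        dn[k] -= eps
        diff = math.log(softmax(up)[y]) - math.log(softmax(dn)[y])
        grads.append(diff / (2 * eps))
    return grads

logits = [3.0, 0.0, -1.0]          # token 0 is likely, token 2 is rare
rare = grad_log_prob(logits, 2)    # own-logit entry is close to 1
common = grad_log_prob(logits, 0)  # own-logit entry is close to 0
```

In this toy case the rare token's own-logit gradient is more than ten times larger than the common token's, which matches the review's point that updates concentrate on low-probability tokens.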
The theoretical justification for extrapolation (Theorem 4.1) assumes an idealized tabular softmax bandit with Natural Policy Gradient updates. This abstracts away critical aspects of the actual training setup, including the clipping mechanism in PPO/GRPO/DAPO, the response-level reward structure, and the KL divergence regularization. The authors acknowledge this limitation but do not quantify how the mismatch affects the theoretical guarantees. Additionally, the test-time extrapolation method requires keeping two full models (base and RLVR) in memory and computing logits from both at each decoding step, which incurs significant memory and compute overhead and limits practical deployment. The link between low-probability tokens and 'reasoning-critical' tokens also remains largely correlational: the replacement experiments show that these tokens matter for performance, but the paper does not establish *why* these specific tokens are reasoning-critical beyond their probability values.
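The overhead concern is easier to see with a sketch of what a two-model extrapolation step might look like. The review does not give the paper's exact formula, so the form below, which adds $\gamma$ times the per-token $\Delta \log p$ to the RLVR log-probabilities and renormalizes, is an assumption chosen for illustration. The role of $\tau$ is not specified in the review and is deliberately not modeled. `extrapolate` is a hypothetical name.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def extrapolate(base_logits, rlvr_logits, gamma):
    """Next-token log-probs amplified along the base -> RLVR direction.

    ASSUMED form (not taken from the paper):
        score(y) = log p_rlvr(y) + gamma * (log p_rlvr(y) - log p_base(y))
    followed by renormalization. gamma = 0 recovers the RLVR model.
    Both logit vectors are needed at every decoding step, which is
    the source of the doubled memory and compute cost.
    """
    lb = log_softmax(base_logits)
    lr = log_softmax(rlvr_logits)
    scores = [r + gamma * (r - b) for b, r in zip(lb, lr)]
    return log_softmax(scores)

base = [0.0, 0.0]  # base model is indifferent between two tokens
rlvr = [1.0, 0.0]  # RLVR model prefers token 0
p_rlvr = math.exp(extrapolate(base, rlvr, 0.0)[0])
p_extra = math.exp(extrapolate(base, rlvr, 1.0)[0])
```

With `gamma=1.0` the preference RLVR introduced for token 0 is strengthened further, while `gamma=0.0` leaves the RLVR distribution unchanged.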
The evidence supports the core claim that $\Delta \log p$ outperforms magnitude metrics for identifying sparse updates. The comparison to related reweighting methods (Table 3) shows the proposed approach outperforming PPL-based reweighting (Deng et al., 2025) and Dominate (Yang et al., 2025b), though the Dominate baseline uses a more restrictive clip-high ratio (0.24 vs. 0.28), which the authors note reduces exploration. The evaluation is somewhat narrow, focusing primarily on math reasoning (AIME-24, AMC) with limited generalization analysis (Minerva in Appendix C shows gains, but with smaller effect sizes). The comparison to Yang et al. (2025b) also leaves a tension unresolved: Yang et al. argue that low-probability tokens 'over-dominate' and should be downweighted, while this paper argues they should be upweighted, yet both report improvements over baselines. This suggests the optimal weighting may depend on training stability or model scale.
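For readers unfamiliar with token-level reweighting, the sketch below shows one way a policy-gradient surrogate could upweight low-probability tokens. The review does not state the paper's weighting function, so the linear form $w_t = 1 + \alpha\,(1 - \pi(y_t))$ is purely an illustrative assumption. Only the coefficient name $\alpha$ and the value 0.2 come from the review, and the function names are hypothetical.

```python
import math

def token_weights(token_probs, alpha=0.2):
    """ASSUMED weighting: w_t = 1 + alpha * (1 - pi(y_t)).

    Rare tokens (pi near 0) get weight close to 1 + alpha;
    confident tokens (pi near 1) keep weight close to 1.
    """
    return [1.0 + alpha * (1.0 - p) for p in token_probs]

def weighted_pg_loss(token_logprobs, advantages, alpha=0.2):
    """Token-weighted surrogate: -mean_t( w_t * A_t * log pi(y_t) ).

    In an autodiff framework the weights would be treated as constants
    (no gradient flows through them); here everything is plain floats.
    """
    probs = [math.exp(lp) for lp in token_logprobs]
    weights = token_weights(probs, alpha)
    terms = [w * a * lp for w, a, lp in zip(weights, advantages, token_logprobs)]
    return -sum(terms) / len(terms)

# A confident token (p = 0.9) and a rare token (p = 0.05), same advantage.
logprobs = [math.log(0.9), math.log(0.05)]
weights = token_weights([0.9, 0.05])
loss_plain = weighted_pg_loss(logprobs, [1.0, 1.0], alpha=0.0)
loss_weighted = weighted_pg_loss(logprobs, [1.0, 1.0], alpha=0.2)
```

A downweighting scheme in the spirit of the opposing argument would simply flip the sign of the $\alpha$ term, which is why the two prescriptions can be compared under otherwise identical training settings.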
The paper provides adequate experimental detail for reproduction. The authors use publicly available models (Qwen2.5-Math-7B, Qwen3-8B-Base, DAPO-32B, ORZ-32B) and standard datasets (AIME-24, AIME-25, AMC). Training hyperparameters are detailed in Appendix B, including learning rates (1e-6), batch sizes (512 prompts, 16 responses each), and reweighting coefficients ($\alpha=0.2$ for Qwen2.5). Figure 9 shows reproducibility analysis with 4 independent training runs demonstrating consistent convergence. However, the code and exact training scripts are not explicitly released or linked in the provided text, which would be necessary for full reproducibility. The test-time extrapolation requires careful tuning of two hyperparameters ($\tau$ and $\gamma$) with grid search ranges provided in Appendix A.3.
Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the *magnitude* of these updates, largely overlooking their *direction*. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability difference $\Delta\log p$ between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that $\Delta\log p$ more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a *test-time extrapolation* method that amplifies the policy along the learned $\Delta\log p$ direction to improve reasoning accuracy without further training; (2) a *training-time reweighting* method that focuses learning on low-probability (corresponding to higher $\Delta\log p$) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.