On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation

cs.LG cs.AI Kexin Huang, Haoming Meng, Junkang Wu, Jinda Lu, Chiyu Ma, Ziqian Chen, Xue Wang, Bolin Ding, Jiancan Wu, Xiang Wang, Xiangnan He, Guoyin Wang, Jingren Zhou · Mar 23, 2026


Plain-language introduction

This paper investigates how Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning by focusing on the *direction* of policy updates rather than their magnitude. The authors introduce $\Delta \log p$, the signed log-probability difference between base and RLVR models, and argue it better captures reasoning-critical tokens than magnitude-based metrics like entropy or KL divergence. They validate this through token-replacement interventions and propose two practical applications: a test-time extrapolation method that amplifies the learned direction without additional training, and a training-time reweighting scheme that focuses learning on low-probability tokens.
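To make the central quantity concrete, here is a minimal NumPy sketch of a per-token $\Delta \log p$. It assumes the sign convention $\log p_{\text{RLVR}} - \log p_{\text{base}}$ (positive when RLVR raises a token's probability) and that both models expose logits over the same vocabulary for the same sequence. The function names and the toy numbers are illustrative, not taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def delta_log_p(base_logits, rlvr_logits, token_ids):
    """Signed per-token change, log p_RLVR - log p_base (assumed convention).

    base_logits, rlvr_logits: arrays of shape (T, V), one row per position.
    token_ids: int array of shape (T,), the tokens actually generated.
    """
    idx = np.arange(len(token_ids))
    base_lp = log_softmax(base_logits)[idx, token_ids]
    rlvr_lp = log_softmax(rlvr_logits)[idx, token_ids]
    return rlvr_lp - base_lp

# Toy example: 3 positions, vocabulary of 4 (made-up numbers).
base = np.zeros((3, 4))      # base model is uniform at every position
rlvr = np.zeros((3, 4))
rlvr[1, 2] = 3.0             # RLVR strongly prefers token 2 at position 1
tokens = np.array([0, 2, 1])
d = delta_log_p(base, rlvr, tokens)
```

In this toy case only position 1 has a nonzero value, and it is positive. Unlike entropy or KL divergence, the value keeps its sign, so a token that RLVR suppresses would show up as negative rather than merely "changed".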

Critical review

Verdict

Bottom line

The paper presents a compelling conceptual shift from magnitude-based to direction-based analysis of RLVR updates, supported by strong empirical evidence that $\Delta \log p$ identifies sparse, reasoning-critical tokens more precisely than entropy or KL divergence. The proposed test-time extrapolation and training-time reweighting methods show consistent gains across models and benchmarks. However, the theoretical justification relies on simplified tabular bandit assumptions that do not capture the clipped objectives used in practical policy gradient methods, and the practical utility of the extrapolation method is limited by the memory and compute cost of running two models.

“For a given prompt x, if a tabular softmax policy is updated via natural policy gradient, then the extrapolated policy satisfies: exists gamma>0, E[R] >= E[R]”

Huang et al., Sec. 4.1 · Theorem 4.1

“$\Delta \log p$-based replacement reaches the RLVR model's accuracy with the fewest substitutions (around 10% of tokens)”

Huang et al., Sec. 3.2 · Figure 2

What holds up

The token-level analysis is rigorous and well-executed. The statistical comparison shows that while the entropy and KL divergence distributions overlap almost completely between base and RLVR model outputs, $\Delta \log p$ exhibits two distinct tails that separate them. The selective token replacement experiment provides strong causal evidence: replacing only about 10% of tokens selected by $\Delta \log p$ recovers RLVR performance, whereas entropy-based selection requires significantly more replacements. The gradient analysis (Lemma 3.1) shows that policy gradients concentrate on low-probability tokens via the $(1 - \pi_\theta(y_t))$ term, providing a theoretical foundation for the observed sparsity.

“the entropy and KL divergence distributions are nearly identical for both the base and RLVR model outputs. In contrast, the $\Delta \log p$ distribution exhibits two distinct tails”

Huang et al., Sec. 3.1 · Section 3.1

“the $\ell_1$-norm of the DAPO objective's gradient is given by: $2|w_{i,t}| \cdot (1 - \pi_\theta(y_t))$”

Huang et al., Sec. 3.3 · Lemma 3.1
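The $2(1 - \pi_\theta(y_t))$ factor can be checked numerically at the logit level. The gradient of $\log \mathrm{softmax}(z)_y$ with respect to the logits $z$ is $e_y - \pi$, whose $\ell_1$-norm is $(1 - \pi_y) + \sum_{j \neq y} \pi_j = 2(1 - \pi_y)$. The sketch below verifies this for a single weighted log-likelihood term; it assumes the lemma refers to the gradient with respect to the logits, treats `w` as a stand-in for $w_{i,t}$, and does not model clipping or the rest of the DAPO objective.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_l1_norm(logits, y, w):
    """l1-norm of the gradient of w * log softmax(logits)[y] w.r.t. the logits."""
    pi = softmax(logits)
    one_hot = np.zeros_like(pi)
    one_hot[y] = 1.0
    grad = w * (one_hot - pi)  # analytic gradient of w * log pi[y]
    return np.abs(grad).sum()

logits = np.array([2.0, 0.5, -1.0, 0.0])  # made-up logits
w = -0.7                                  # hypothetical token weight
for y in range(len(logits)):
    pi_y = softmax(logits)[y]
    closed_form = 2 * abs(w) * (1 - pi_y)
    assert np.isclose(grad_l1_norm(logits, y, w), closed_form)
```

The norm shrinks toward zero as $\pi_\theta(y_t) \to 1$, which is why already-confident tokens receive almost no update and low-probability tokens receive the largest ones.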

Main concerns

The theoretical justification for extrapolation (Theorem 4.1) assumes an idealized tabular softmax bandit with Natural Policy Gradient (NPG) updates, which abstracts away critical aspects of the actual training setup: the clipping mechanism in PPO/GRPO/DAPO, the response-level reward structure, and the KL divergence regularization. The authors acknowledge this limitation but do not quantify how the mismatch affects the theoretical guarantee. Additionally, the test-time extrapolation method requires keeping two full models (base and RLVR) in memory and computing logits from both at every decoding step, an overhead that limits practical deployment. Finally, the link between low-probability tokens and 'reasoning-critical' tokens remains somewhat correlational: the replacement experiments show that these tokens matter for performance, but the paper does not fully establish *why* these specific tokens are reasoning-critical beyond their probability values.

“Nevertheless, we need to note that the proof relies on the idealized NPG's update rule... In contrast, our empirical analysis has shown that the updates learned by RLVR concentrate only on a minority of tokens”

Huang et al., Sec. 4.1 · Section 4.1

“One primary limitation of our extrapolation method is the requirement of two models”

Huang et al., Sec. 6 · Limitations
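The two-model cost is easiest to see in a sketch of one decoding step. The review does not reproduce the paper's exact extrapolation formula, so the form below is an assumption: it pushes the RLVR log-probabilities further along $\Delta \log p$ with strength `gamma` and renormalizes. The $\tau$ hyperparameter mentioned in the review is not modeled, and all function names are illustrative.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def extrapolated_log_probs(base_logits, rlvr_logits, gamma):
    """Assumed form: log pi_extra ∝ log pi_rlvr + gamma * (log pi_rlvr - log pi_base).

    Both logit vectors come from separate forward passes on the same prefix,
    which is the source of the memory and compute overhead.
    """
    base_lp = log_softmax(base_logits)
    rlvr_lp = log_softmax(rlvr_logits)
    return log_softmax(rlvr_lp + gamma * (rlvr_lp - base_lp))

# Toy next-token distributions over a vocabulary of 3 (made-up numbers).
base_logits = np.array([0.0, 0.0, 0.0])
rlvr_logits = np.array([1.0, 0.0, 0.0])   # RLVR raised token 0

p_rlvr = np.exp(extrapolated_log_probs(base_logits, rlvr_logits, gamma=0.0))
p_extra = np.exp(extrapolated_log_probs(base_logits, rlvr_logits, gamma=1.0))
```

With `gamma=0.0` the function returns the RLVR distribution unchanged; with `gamma=1.0` the token that RLVR already favored gains further probability mass. Every step needs logits from both models, so serving cost is roughly that of two models rather than one.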

Evidence and comparison

The evidence supports the core claim that $\Delta \log p$ outperforms magnitude metrics for identifying sparse updates. The comparison to related reweighting methods (Table 3) shows the authors' approach outperforming PPL-based reweighting (Deng et al., 2025) and Dominate (Yang et al., 2025b), though the Dominate baseline uses a more restrictive clip-higher ratio (0.24 vs. 0.28), which the authors note reduces exploration. The evaluation is somewhat narrow, focusing primarily on math reasoning (AIME-24, AMC) with limited generalization analysis (Minerva in Appendix C shows gains, but with smaller effect sizes). The two papers also reach opposite prescriptions: Yang et al. (2025b) argue that low-probability tokens 'over-dominate' and should be downweighted, whereas this paper argues they should be upweighted. Both report improvements over baselines, which suggests the optimal weighting may depend on training stability or model scale.

“Our method of directly amplifying low-probability tokens achieves the best overall performance for both Avg@32 and Pass@16”

Huang et al., Sec. 4.2 · Table 3

“they adopt a more restrictive clip-higher ratio of $\epsilon_{\text{high}}=0.24$ than the default $\epsilon_{\text{high}}=0.28$ in DAPO”

Huang et al., Sec. 4.2 · Footnote 5
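To illustrate what "amplifying low-probability tokens" could look like in a training loop, here is a purely hypothetical weighting rule. The review does not state the paper's actual formula, so the linear form `1 + alpha * (1 - p)` is an assumption chosen for simplicity; `alpha=0.2` only echoes the coefficient the review reports for Qwen2.5, and the function name is invented.

```python
import numpy as np

def upweight_low_prob(advantages, token_probs, alpha=0.2):
    """Hypothetical rule: scale each token's advantage by 1 + alpha * (1 - p).

    advantages: per-token advantages, shape (T,).
    token_probs: current policy probability of each sampled token, shape (T,).
    Tokens with p near 1 are left almost unchanged; tokens with p near 0
    are scaled by up to (1 + alpha). The sign of the advantage is preserved.
    """
    advantages = np.asarray(advantages, dtype=float)
    token_probs = np.asarray(token_probs, dtype=float)
    return advantages * (1.0 + alpha * (1.0 - token_probs))

# Made-up example: same advantage, very different token probabilities.
weighted = upweight_low_prob([1.0, 1.0, -1.0], [1.0, 0.0, 0.5], alpha=0.2)
```

A downweighting scheme in the spirit of what the review attributes to Yang et al. (2025b) would flip the sign of the `alpha` term, which makes the disagreement between the two papers easy to state: both act on the same per-token probability signal but push in opposite directions.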

Reproducibility

The paper provides adequate experimental detail for reproduction. The authors use publicly available models (Qwen2.5-Math-7B, Qwen3-8B-Base, DAPO-32B, ORZ-32B) and standard datasets (AIME-24, AIME-25, AMC). Training hyperparameters are detailed in Appendix B, including the learning rate (1e-6), batch size (512 prompts with 16 responses each), and reweighting coefficients ($\alpha=0.2$ for Qwen2.5). Figure 9 shows a reproducibility analysis in which 4 independent training runs converge consistently. However, the code and exact training scripts, which would be necessary for full reproducibility, are not explicitly released or linked in the provided text. The test-time extrapolation also requires careful tuning of two hyperparameters ($\tau$ and $\gamma$), with grid search ranges provided in Appendix A.3.

“learning rate of 1e-6 with a 10-step warmup. Each RLVR step consists of 512 prompts with 16 sampled responses each”

Huang et al., Appendix B · Appendix B

“The learning curves across 4 independent runs on Qwen2.5-Math-7B with our reweighting method show consistent convergence and performance”

Huang et al., Appendix B · Figure 9

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the *magnitude* of these updates, largely overlooking their *direction*. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability difference $\Delta\log p$ between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that $\Delta\log p$ more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a *test-time extrapolation* method that amplifies the policy along the learned $\Delta\log p$ direction to improve reasoning accuracy without further training; (2) a *training-time reweighting* method that focuses learning on low-probability (corresponding to higher $\Delta\log p$) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
