Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe
This paper tackles the challenge of scaling reinforcement learning for long-horizon tool-using agents, where LLMs must orchestrate dozens of tool calls to satisfy multifaceted constraints. The authors propose STAR, a post-training pipeline that decomposes the RL design space across five axes (reward shaping, model scaling, data composition, algorithm selection, and environmental stability) to derive a practical, scale-aware recipe for training.
The paper delivers a rigorous, systematic empirical study of RL scaling for long-horizon agents, though its findings are bounded by the simulated environment. The distilled recipe (staged curriculum rewards for small models at 1.5B, dense rewards for large models at 7B, approximately 1K training samples, and standard GRPO for stronger base models) provides actionable guidelines for the community. However, the OOD evaluation is restricted to QA tasks and there is no validation against real-world APIs, which limits claims about broader generalization.
The controlled ablation across five design dimensions is methodologically sound, with strict isolation of variables and transparent reporting of training dynamics. The discovery that reward requirements are scale-dependent represents a genuine contribution: smaller models benefit from staged curriculum rewards and exploration-heavy algorithms like ARPO, while larger models converge efficiently with dense rewards and vanilla GRPO. The identification of a data "sweet spot" at approximately 1K samples challenges the assumption that RL agents always benefit from massive datasets, showing instead that over-scaling to 2K degrades OOD generalization.
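For readers unfamiliar with the "vanilla GRPO" baseline mentioned here, the following is a minimal sketch of its core step: standardizing each rollout's reward against the other rollouts sampled for the same query. It assumes the commonly described formulation (subtract the group mean, divide by the group standard deviation); the function name, the `eps` stabilizer, the use of the population standard deviation, and the example rewards are illustrative assumptions, not taken from the paper's code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same query.

    Illustrative sketch only: `eps` and the population standard deviation
    are assumptions, not details confirmed by the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Rollouts above the group mean get a positive advantage, those below
    # a negative one; a group with identical rewards yields all zeros.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for a group of 8 rollouts (the review reports G = 8).
example_rewards = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
example_advantages = group_relative_advantages(example_rewards)
```

One consequence visible in the sketch: when every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one plausible reason reward density and exploration matter more for weaker base models.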
The study's reliance on TravelPlanner, a local sandbox with zero-cost, deterministic tool execution, raises significant questions about transfer to real-world APIs, which involve latency, cost, and failure modes. The authors acknowledge this, noting that real-world scenarios expose agents to "diverse, unpredictable dynamics" not captured in simulation. Furthermore, the OOD evaluation is limited to knowledge-intensive QA benchmarks, leaving cross-domain robustness on complex agentic tasks unexplored. Table 2 also reveals an "alignment tax" where dense rewards maximize in-domain success but degrade general information-seeking abilities, yet the paper does not resolve how to optimize for both simultaneously.
The evidence supports the claims regarding scale-dependent algorithm selection and reward shaping: Table 4 shows that at 1.5B, ARPO achieves 37.5% success versus GRPO's 30.1%, while at 7B, vanilla GRPO reaches 62.8%, outperforming both DAPO and ARPO. However, comparisons to proprietary LLMs (e.g., Kimi-K2.5, GPT-5) in Figure 1 conflate model capacity (1.5B-7B vs. 100B+) with training methodology, making it unclear whether gains stem from RL optimization or specialized domain training. The paper would benefit from comparing against similarly sized models trained with supervised fine-tuning on equivalent data volumes to isolate the marginal contribution of RL.
The authors provide strong reproducibility infrastructure, releasing code at https://github.com/WxxShirley/Agent-STAR and detailing hyperparameters in Appendix C, including learning rates ($2\times 10^{-6}$), batch sizes (32), and group sizes ($G=8$). They specify exact reward formulations, such as $r^{\text{sum}}=s_{\text{cs}}^{\text{micro}}+s_{\text{cs}}^{\text{macro}}+s_{\text{hard}}^{\text{micro}}+s_{\text{hard}}^{\text{macro}}+s^{\text{success}}$, and data compositions (4:3:3 easy:medium:hard ratio). However, reproduction is computationally demanding (368 GPU hours on 16×A100 for 7B models) and relies on an external formatting model (DeepSeek-V3.2-Exp) for parsing outputs, creating potential hidden dependencies on specific model versions and API availability.
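The reward sum and the 4:3:3 difficulty mixture quoted above can be illustrated with a short sketch. It assumes that each micro and macro score is a pass rate in [0, 1] and that the success term is 0 or 1, so the summed reward lies in [0, 5]; the class, field, and function names are hypothetical and do not come from the released code.

```python
from dataclasses import dataclass


@dataclass
class ConstraintScores:
    """Per-rollout scores; names and [0, 1] ranges are assumptions."""
    cs_micro: float    # s_cs^micro
    cs_macro: float    # s_cs^macro
    hard_micro: float  # s_hard^micro
    hard_macro: float  # s_hard^macro
    success: float     # s^success, assumed to be 0 or 1


def sum_reward(s: ConstraintScores) -> float:
    """r^sum: the unweighted sum of the five score terms."""
    return s.cs_micro + s.cs_macro + s.hard_micro + s.hard_macro + s.success


def difficulty_counts(total, ratio=(4, 3, 3), names=("easy", "medium", "hard")):
    """Split `total` samples across difficulty levels by an integer ratio."""
    weight = sum(ratio)
    counts = [total * r // weight for r in ratio]
    # Hand out any remainder from integer division, one sample at a time.
    for i in range(total - sum(counts)):
        counts[i % len(counts)] += 1
    return dict(zip(names, counts))


# A hypothetical rollout that satisfies most constraints but fails overall.
partial = ConstraintScores(0.9, 0.5, 0.8, 0.5, 0.0)
reward = sum_reward(partial)

# The ~1K-sample budget at a 4:3:3 ratio: 400 easy, 300 medium, 300 hard.
mixture = difficulty_counts(1000)
```

Under these assumptions a rollout earns partial credit for the constraints it satisfies even when the final plan fails, which is what makes the signal dense compared with a success-only reward.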
Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance, and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.