Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe

cs.LG, cs.CL · Xixi Wu, Qianguo Sun, Ruiyang Zhang, Chao Song, Junlong Wu, Yiyan Qi, Hong Cheng · Mar 23, 2026

AI Review

Plain-language introduction

This paper tackles the challenge of scaling reinforcement learning for long-horizon tool-using agents, where LLMs must orchestrate dozens of tool calls to satisfy multifaceted constraints. The authors propose STAR, a post-training pipeline that decomposes the RL design space across five axes—reward shaping, model scaling, data composition, algorithm selection, and environmental stability—to derive a practical, scale-aware recipe for training.

Critical review

Verdict

Bottom line

The paper delivers a rigorous, systematic empirical study of RL scaling for long-horizon agents, though its findings are bounded by the simulated TravelPlanner environment. The distilled recipe (staged curriculum rewards for small models at 1.5B, dense rewards for larger models at 7B, approximately 1K training samples, and standard GRPO for stronger base models) provides actionable guidelines for the community. However, the OOD evaluation is restricted to QA tasks and there is no real-world API validation, which limits claims about broader generalization.

“While TravelPlanner is a challenging testbed, it remains simulated”

paper · Section 5

What holds up

The controlled ablation across five design dimensions is methodologically sound, with strict isolation of variables and transparent reporting of training dynamics. The discovery that reward requirements are scale-dependent represents a genuine contribution: smaller models benefit from staged curriculum rewards and exploration-heavy algorithms like ARPO, while larger models converge efficiently with dense rewards and vanilla GRPO. The identification of a data "sweet spot" at approximately 1K samples challenges the assumption that RL agents always benefit from massive datasets, showing instead that over-scaling to 2K degrades OOD generalization.

“The necessity for sophisticated exploration is inversely correlated with model capability”

paper · Section 4.5

“A sweet spot emerges at 1K prompts, balancing in-domain success with strong OOD generalization”

paper · Figure 5 caption
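The scale-dependent choices summarized above can be written down as a small lookup, which makes the gaps in the recipe explicit. This is a minimal sketch, not the authors' code: the names `Recipe`, `pick_recipe`, and the string labels are hypothetical, and treating "approximately 1K" as exactly 1000 prompts is an assumption. Only the two scales this review gives concrete guidance for (1.5B and 7B) are encoded; anything else raises rather than guessing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Hypothetical container for the design choices discussed in the review."""
    reward: str          # label for the reward-shaping strategy
    algorithm: str       # RL algorithm named in the review
    train_prompts: int   # assumed value for "approximately 1K" prompts


# Only scales with explicit guidance in the review are listed.
_RECIPES = {
    "1.5B": Recipe(reward="staged_curriculum", algorithm="ARPO", train_prompts=1000),
    "7B": Recipe(reward="dense_sum", algorithm="GRPO", train_prompts=1000),
}


def pick_recipe(scale: str) -> Recipe:
    """Return the summarized recipe for a model scale, or fail loudly."""
    try:
        return _RECIPES[scale]
    except KeyError:
        raise ValueError(
            f"no recipe summarized for scale {scale!r}; known: {sorted(_RECIPES)}"
        ) from None


if __name__ == "__main__":
    for scale in ("1.5B", "7B"):
        print(scale, pick_recipe(scale))
```

Raising on unknown scales is a deliberate choice here: the review does not say where the boundary between "small" and "large" falls, so interpolating (for example, for a 3B model) would be speculation.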

Main concerns

The study relies on TravelPlanner, a local sandbox with zero-cost, deterministic tool execution, which raises significant questions about transfer to real-world APIs that have latency, cost, and failure modes. The authors acknowledge this, noting that real-world scenarios expose agents to "diverse, unpredictable dynamics" not captured in simulation. Furthermore, the OOD evaluation is limited to knowledge-intensive QA benchmarks, leaving cross-domain robustness on complex agentic tasks unexplored. Table 2 also reveals an "alignment tax": for the 7B model, the dense Sum reward maximizes in-domain success, but its average OOD accuracy falls behind the SFT checkpoint. The paper does not resolve how to optimize for both simultaneously.

“Limited OOD evaluation: Our OOD evaluation is currently restricted to the knowledge-intensive QA task”

paper · Section 5

“While the Sum reward maximizes in-domain performance for the 7B model, Table 2 reveals a severe alignment tax: its average OOD accuracy falls significantly behind the SFT checkpoint”

paper · Section 4.2

Evidence and comparison

The evidence robustly supports claims regarding scale-dependent algorithm selection and reward shaping, with Table 4 clearly demonstrating that at 1.5B, ARPO achieves 37.5% success versus GRPO's 30.1%, while at 7B, vanilla GRPO reaches 62.8%, outperforming both DAPO and ARPO. However, comparisons to proprietary LLMs (e.g., Kimi-K2.5, GPT-5) in Figure 1 conflate model capacity (1.5B-7B vs. 100B+) with training methodology, making it unclear whether gains stem from RL optimization or specialized domain training. The paper would benefit from comparing against similarly-sized models trained with supervised fine-tuning on equivalent data volumes to isolate the marginal contribution of RL.

“At the 1.5B scale, algorithmic interventions like ARPO and DAPO significantly outperform GRPO... Strikingly, at the 7B scale, GRPO achieves the highest success rate of 62.8%, outperforming both DAPO and ARPO”

paper · Section 4.5
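For readers less familiar with the baseline in this comparison, the core of vanilla GRPO is a group-relative advantage: each rollout's reward is normalized against the other rollouts sampled for the same prompt. The sketch below shows that step in isolation for a group of 8 rollouts, matching the group size ($G=8$) reported under Reproducibility. It is an illustration of the commonly described formulation, not the paper's implementation; the function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward by the mean and std of its own group.

    Assumes population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    if len(rewards) == 0:
        raise ValueError("need at least one rollout reward")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for G = 8 rollouts of a single prompt.
    group = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
    print(group_relative_advantages(group))
```

When all rollouts in a group receive the same reward, every advantage is zero and the group contributes no learning signal, which is one reason reward density matters in the long-horizon setting the paper studies.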

Reproducibility

The authors provide strong reproducibility infrastructure, releasing code at https://github.com/WxxShirley/Agent-STAR and detailing hyperparameters in Appendix C, including the learning rate ($2\times 10^{-6}$), batch size (32), and group size ($G=8$). They also specify the exact reward formulation, $r^{\text{sum}}=s_{\text{cs}}^{\text{micro}}+s_{\text{cs}}^{\text{macro}}+s_{\text{hard}}^{\text{micro}}+s_{\text{hard}}^{\text{macro}}+s^{\text{success}}$, and the data composition (a 4:3:3 easy:medium:hard ratio). However, reproduction is computationally demanding (368 GPU hours on 16×A100 for the 7B model) and relies on an external formatting model (DeepSeek-V3.2-Exp) to parse outputs, creating potential hidden dependencies on a specific model version and on API availability.

“$r^{\text{sum}}=s_{\text{cs}}^{\text{micro}}+s_{\text{cs}}^{\text{macro}}+s_{\text{hard}}^{\text{micro}}+s_{\text{hard}}^{\text{macro}}+s^{\text{success}}$”

paper · Section 3
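The quoted Sum reward is a plain unweighted sum of five component scores, which a few lines of code make concrete. This is a minimal sketch under stated assumptions: the `PlanScores` container and `sum_reward` function are hypothetical names, reading `cs` as commonsense constraints and `hard` as hard constraints follows TravelPlanner's usual metric naming rather than anything stated in this review, and the comment about each score lying in [0, 1] is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanScores:
    """Component scores for one generated plan (assumed to lie in [0, 1])."""
    cs_micro: float    # s_cs^micro
    cs_macro: float    # s_cs^macro
    hard_micro: float  # s_hard^micro
    hard_macro: float  # s_hard^macro
    success: float     # s^success


def sum_reward(s: PlanScores) -> float:
    """r^sum = s_cs^micro + s_cs^macro + s_hard^micro + s_hard^macro + s^success."""
    return s.cs_micro + s.cs_macro + s.hard_micro + s.hard_macro + s.success


if __name__ == "__main__":
    # Hypothetical plan that partially satisfies constraints but does not fully succeed.
    partial = PlanScores(cs_micro=0.5, cs_macro=0.0, hard_micro=0.25, hard_macro=0.0, success=0.0)
    print(sum_reward(partial))  # 0.75
```

The example shows why this reward counts as dense: a plan that fails overall still earns partial credit from the micro-level terms.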

“GPU Hours are calculated based on 8×A100-80G for 1.5B and 3B models and 16×A100 for the 7B model”

paper · Table 4
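Two of the numbers in this section are easier to plan around once converted. The sketch below turns the reported 368 GPU hours on 16 GPUs into wall-clock time and splits a training set by the 4:3:3 easy:medium:hard ratio. It assumes that "GPU hours" means device count multiplied by wall-clock hours and that the training set has exactly 1000 prompts (the review says approximately 1K); both helper names are hypothetical.

```python
from typing import List, Sequence


def wall_clock_hours(gpu_hours: float, num_gpus: int) -> float:
    """Convert GPU hours to wall-clock hours, assuming all GPUs run in parallel."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return gpu_hours / num_gpus


def split_by_ratio(total: int, ratio: Sequence[int]) -> List[int]:
    """Split `total` items by integer ratio, handing leftovers to the largest remainders."""
    weight = sum(ratio)
    if weight <= 0:
        raise ValueError("ratio must sum to a positive number")
    counts = [total * r // weight for r in ratio]
    # Rank buckets by how much the floor division cut off, largest first.
    by_remainder = sorted(
        range(len(ratio)), key=lambda i: (total * ratio[i]) % weight, reverse=True
    )
    for i in by_remainder[: total - sum(counts)]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    print(wall_clock_hours(368, 16))          # 23.0
    print(split_by_ratio(1000, (4, 3, 3)))    # [400, 300, 300]
```

Under these assumptions, the 7B run corresponds to roughly one day of wall-clock time on the 16-GPU node, and the difficulty mixture is 400 easy, 300 medium, and 300 hard prompts.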

Abstract

Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~ 1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance, and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.
