AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
AgentHER tackles the data waste problem in LLM agent training by adapting Hindsight Experience Replay (HER) from RL to natural-language trajectories. The core insight is that failed trajectories—typically 60–75% of collected data—often represent valid demonstrations for achievable alternative goals. The paper proposes a four-stage pipeline with multi-judge verification that converts discarded failures into SFT and DPO training data, yielding +7.1–11.7 pp gains over success-only fine-tuning across four model families on WebArena and ToolBench.
The paper presents a well-motivated and empirically validated approach to data augmentation for LLM agents. The adaptation of HER to the language domain is novel and practical, with strong results showing consistent gains across model scales (1.5B to 72B parameters) and benchmarks. The multi-judge verification and severity weighting mechanisms effectively reduce label noise from 5.9% to 2.3%. However, the theoretical guarantee (Proposition 3.1) assumes a perfect judge, and the WebArena evaluation protocol involves task-set leakage that may inflate absolute numbers, though the authors acknowledge these limitations and provide mitigating controls.
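To make the multi-judge idea concrete, here is a minimal sketch of one way several judge verdicts could be combined. The review does not say how the paper aggregates judges, so the strict-majority rule, the function names, and the `min_agree` parameter are all assumptions made for illustration. The second helper only restates the arithmetic linking the reported 2.3% label noise to the 97.7% precision figure.

```python
from typing import List, Optional


def multi_judge_accept(votes: List[bool], min_agree: Optional[int] = None) -> bool:
    """Accept a relabeled pair if enough judges approve it.

    Assumption: strict majority by default; the paper's actual
    aggregation rule is not given in this review.
    """
    if not votes:
        return False
    needed = min_agree if min_agree is not None else len(votes) // 2 + 1
    return sum(votes) >= needed


def noise_to_precision(noise_rate: float) -> float:
    """Precision is the complement of the label-noise rate."""
    return 1.0 - noise_rate


if __name__ == "__main__":
    print(multi_judge_accept([True, True, False]))   # two of three approve
    print(multi_judge_accept([True, False, False]))  # one of three approves
    print(noise_to_precision(0.023))
```

Raising `min_agree` to the number of judges would turn this into a unanimity rule, trading recall for precision.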
The empirical gains are robust and consistent across model scales and benchmarks. The ablation studies rigorously validate each architectural decision: multi-judge verification improves precision, severity weighting ($w_i \in [0.3, 1.0]$) adds value over uniform weighting, and confidence filtering with threshold $\theta = 0.5$ is critical (removing it causes −4.1 pp degradation). The cross-benchmark transfer experiment (+9.5 pp on ToolBench when trained on WebArena) provides strong evidence that the method learns generalizable behaviors rather than task-specific memorization.
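The confidence filter and severity weighting described above can be sketched as follows. Only the threshold $\theta = 0.5$ and the weight range $[0.3, 1.0]$ come from the review. The `Relabeled` record, its field names, and the linear severity-to-weight mapping are hypothetical, since the review does not state how the paper computes $w_i$.

```python
from dataclasses import dataclass
from typing import List, Tuple

THETA = 0.5               # confidence threshold reported in the review
W_MIN, W_MAX = 0.3, 1.0   # severity-weight range reported in the review


@dataclass
class Relabeled:
    goal: str          # hindsight goal proposed for a failed trajectory
    confidence: float  # judge confidence in [0, 1] (hypothetical field)
    severity: float    # failure severity in [0, 1], 1 = worst (hypothetical field)


def sample_weight(severity: float) -> float:
    """Assumed linear map: severity 0 gives W_MAX, severity 1 gives W_MIN."""
    s = min(max(severity, 0.0), 1.0)  # clamp to [0, 1]
    return W_MIN + (W_MAX - W_MIN) * (1.0 - s)


def gate_and_weight(items: List[Relabeled],
                    theta: float = THETA) -> List[Tuple[Relabeled, float]]:
    """Drop items below the confidence threshold and attach a loss weight."""
    return [(it, sample_weight(it.severity))
            for it in items if it.confidence >= theta]


if __name__ == "__main__":
    batch = [
        Relabeled("find the cheapest listed item", confidence=0.9, severity=0.0),
        Relabeled("open the settings page", confidence=0.4, severity=0.2),
    ]
    for item, weight in gate_and_weight(batch):
        print(item.goal, weight)
```

Setting `theta=0.0` disables the filter, which corresponds to the ablation the review reports as costing 4.1 pp.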
The primary limitation is task-set leakage in WebArena: training and evaluation use identical task environments, exposing models to HTML structures during training. While mitigated by SFT-Random controls and transfer experiments, a held-out partition would strengthen claims. The data volume asymmetry (AgentHER uses 3,000 failures vs. 500–2,000 successes for SFT-Success) complicates comparisons, though SFT-Random underperforms by ~9 pp. The theoretical analysis relies on an unrealistic 'perfect judge' assumption, and the bound calculation in Remark 1 uses an in-sample proxy rather than strict grounding. Additionally, the 38.7% estimated rate of valid pairs among filtered-out borderline cases suggests the confidence threshold $\theta = 0.5$ may be overly conservative.
The evidence supports the claims within experimental constraints. Comparisons to baselines are rigorous: SFT-Success represents standard practice, Rejection-Sampling isolates filtering versus relabeling value, and SFT-Random controls for data volume. The distinction from concurrent work ECHO (Hu et al., 2025) is clearly articulated—AgentHER performs offline goal relabeling for training, while ECHO optimizes inference-time memory. The per-failure-type analysis shows logical variation: Incomplete and Constraint_Violation yield +11.2 and +9.8 pp respectively, while Tool_Error yields only +2.1 pp, confirming that richer partial trajectories provide more valuable training signal than crashes.
Reproducibility is reasonably strong. The paper provides detailed hyperparameters (LoRA rank $r=16$, $\alpha=32$, learning rates, batch sizes), full prompt templates for all four stages, and a code release. Training uses standard configurations with reported random seeds (42, 1234, 2025) and low variance (std < 0.5 pp). However, reliance on proprietary GPT-4o/4o-mini for relabeling limits full reproduction, though rule-based alternatives are provided for Stages 1–2. Missing details include exact API costs/latency for multi-judge scaling and the precise keyword lexicons for rule-based failure detection.
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience. We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation. The key insight is simple: a trajectory that fails goal A is often a correct demonstration for some achievable alternative goal B. AgentHER realises this idea through a four-stage pipeline -- failure classification, outcome extraction, LLM-guided prompt relabeling with confidence gating, and data packaging -- that converts discarded failures into high-quality SFT, DPO, and ShareGPT training data, with both zero-cost rule-based and LLM-judge implementations. On WebArena (Zhou et al., 2024) and ToolBench (Qin et al., 2024), AgentHER improves over success-only SFT by +7.1-11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), while achieving 2x data efficiency -- matching baseline performance with only 50% of successful demonstrations. Gains are consistent from 1.5B to 72B parameters (+5.8-9.2 pp) and compound under iterative redeployment (+2.1 pp over additional rounds). Human evaluation confirms 97.7% relabeling precision under multi-judge verification.
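The four stages named in the abstract (failure classification, outcome extraction, relabeling with confidence gating, and data packaging) can be illustrated with a toy, rule-based sketch. Everything here is hypothetical: the `Trajectory` type, the keyword rule, the fixed placeholder confidence, and the record layout are stand-ins, and the stub in stage 3 marks where an LLM judge would sit. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Trajectory:
    goal: str
    steps: List[Tuple[str, str]]  # (action, observation) pairs
    success: bool


def classify_failure(traj: Trajectory) -> str:
    """Stage 1 (toy rule): any observation mentioning 'error' marks a tool error."""
    if any("error" in obs.lower() for _, obs in traj.steps):
        return "Tool_Error"
    return "Incomplete"


def extract_outcome(traj: Trajectory) -> Optional[str]:
    """Stage 2 (toy rule): take the last non-error observation as the achieved outcome."""
    for _, obs in reversed(traj.steps):
        if "error" not in obs.lower():
            return obs
    return None


def relabel(outcome: str) -> Tuple[str, float]:
    """Stage 3 (stub): propose a hindsight goal plus a confidence score.

    An LLM judge would go here; the 0.9 is a placeholder, not a real score.
    """
    return f"Reach the state: {outcome}", 0.9


def package_sft(traj: Trajectory, new_goal: str) -> dict:
    """Stage 4: emit a prompt/response record for supervised fine-tuning."""
    return {"prompt": new_goal,
            "response": "\n".join(action for action, _ in traj.steps)}


def hindsight_relabel(traj: Trajectory, theta: float = 0.5) -> Optional[dict]:
    """Run all four stages; return None when nothing usable is recovered."""
    if traj.success:
        return None  # successes need no relabeling
    failure_type = classify_failure(traj)
    outcome = extract_outcome(traj)
    if outcome is None:
        return None
    new_goal, confidence = relabel(outcome)
    if confidence < theta:
        return None  # confidence gate
    record = package_sft(traj, new_goal)
    record["failure_type"] = failure_type
    return record


if __name__ == "__main__":
    failed = Trajectory(
        goal="Buy the cheapest laptop",
        steps=[("search laptops", "results page shown"),
               ("sort by price", "list sorted by price")],
        success=False,
    )
    print(hindsight_relabel(failed))
```

The original goal is never used when building the record: the trajectory is paired only with a goal its observed outcome actually satisfies, which is the core of hindsight relabeling.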