AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

cs.AI cs.CL Liang Ding · Mar 22, 2026
What it does
AgentHER tackles the data waste problem in LLM agent training by adapting Hindsight Experience Replay (HER) from RL to natural-language trajectories. The core insight is that failed trajectories—typically 60–75% of collected data—often...
Why it matters
+7.1–11.7 pp gains over success-only fine-tuning across four model families on WebArena and ToolBench.
Main concern
The paper presents a well-motivated and empirically validated approach to data augmentation for LLM agents. The adaptation of HER to the language domain is novel and practical, with strong results showing consistent gains across model...
AI Review
Plain-language introduction

AgentHER tackles the data waste problem in LLM agent training by adapting Hindsight Experience Replay (HER) from RL to natural-language trajectories. The core insight is that failed trajectories—typically 60–75% of collected data—often represent valid demonstrations for achievable alternative goals. The paper proposes a four-stage pipeline with multi-judge verification that converts discarded failures into SFT and DPO training data, yielding +7.1–11.7 pp gains over success-only fine-tuning across four model families on WebArena and ToolBench.

Critical review
Verdict
Bottom line

The paper presents a well-motivated, empirically validated approach to data augmentation for LLM agents. Adapting HER to the language domain is novel and practical, and the gains are consistent across model scales (1.5B to 72B parameters) and benchmarks. Multi-judge verification reduces label noise from 5.9% to 2.3%, and severity weighting adds further value over uniform weighting. Two caveats remain: the theoretical guarantee (Proposition 3.1) assumes a perfect judge, and the WebArena evaluation protocol involves task-set leakage that may inflate absolute numbers. The authors acknowledge both limitations and provide mitigating controls.

“The current protocol uses the same 812 WebArena tasks for both failure collection and evaluation... This constitutes a form of task-set leakage that may inflate WebArena numbers relative to a truly held-out test partition.”
paper · Section 6
“The theoretical guarantee (Proposition 3.1) assumes a perfect judge; tightening the bound under a noisy-oracle model would close the theory–experiment gap.”
paper · Section 6
What holds up

The empirical gains are robust and consistent across model scales and benchmarks. The ablation studies rigorously validate each architectural decision: multi-judge verification improves precision, severity weighting ($w_i \in [0.3, 1.0]$) adds value over uniform weighting, and confidence filtering with threshold $\theta = 0.5$ is critical (removing it causes −4.1 pp degradation). The cross-benchmark transfer experiment (+9.5 pp on ToolBench when trained on WebArena) provides strong evidence that the method learns generalizable behaviors rather than task-specific memorization.

“Multi-judge reduces noise from 5.9% to 2.3% (−0.8 pp accuracy cost), confirming that the precision gain translates to downstream quality.”
paper · Section 4.5
“AgentHER-MJ achieves a +9.5 pp transfer advantage over SFT-Success when evaluated zero-shot on ToolBench, a completely different benchmark... demonstrating that the model has clearly learned broadly applicable planning and tool-use behaviours rather than rote task patterns.”
paper · Section 4.2
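The confidence gate and severity weighting discussed above can be illustrated with a minimal sketch. Only the threshold $\theta = 0.5$ and the weight range $[0.3, 1.0]$ come from the review; the field names, the linear mapping from severity to weight, the direction of that mapping (more severe failures get lower weight), and the use of `>=` at the threshold are all assumptions made for illustration, not the paper's stated implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

THETA = 0.5                 # confidence threshold reported in the review
W_MIN, W_MAX = 0.3, 1.0     # severity-weight range reported in the review


@dataclass
class Relabeled:
    goal: str          # hindsight goal proposed for a failed trajectory
    confidence: float  # judge confidence in [0, 1] (hypothetical field)
    severity: float    # failure severity in [0, 1], 1 = worst (hypothetical field)


def severity_weight(severity: float) -> float:
    """Map severity to a sample weight in [W_MIN, W_MAX].

    The linear form is an assumption; the source only gives the range.
    """
    s = min(max(severity, 0.0), 1.0)  # clamp to [0, 1]
    return W_MAX - (W_MAX - W_MIN) * s


def gate_and_weight(items: List[Relabeled],
                    theta: float = THETA) -> List[Tuple[Relabeled, float]]:
    """Drop low-confidence relabels, then attach a weight to each survivor."""
    return [(it, severity_weight(it.severity))
            for it in items if it.confidence >= theta]
```

Under these assumptions, a relabel with confidence 0.4 is discarded regardless of severity, while one with confidence 0.9 and severity 0.5 is kept with a weight halfway through the range. Setting `theta=0.0` corresponds to the "no confidence filtering" ablation in spirit only.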
Main concerns

The primary limitation is task-set leakage in WebArena: the same 812 tasks are used for both failure collection and evaluation, so fine-tuned models have already seen the HTML structure of those pages during training. The SFT-Random control and the transfer experiment mitigate this, but a held-out partition would strengthen the claims. The data volume asymmetry (AgentHER uses 3,000 failures vs. 500–2,000 successes for SFT-Success) complicates comparisons, though SFT-Random underperforms by ~9 pp. The theoretical analysis relies on an unrealistic 'perfect judge' assumption, and the bound calculation in Remark 1 uses an in-sample proxy rather than strict grounding. Additionally, annotators rated 38.7% of the 31 pairs filtered out of a 200-pair sample (about 12 pairs) as valid, which suggests the confidence threshold $\theta = 0.5$ may be overly conservative.

“The current protocol uses the same 812 WebArena tasks for both failure collection and evaluation... fine-tuned models have been exposed to the HTML structure, API response patterns, and navigation conventions of those 812 pages during training.”
paper · Section 6
“Among the 31 pairs filtered out in this 200-pair sample, 38.7% were rated valid by annotators, confirming the confidence filter errs on the side of caution.”
paper · Section 5.2
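To put the filtered-pair figure in perspective, the arithmetic can be worked through directly. The counts (31 filtered pairs in a 200-pair sample, 38.7% rated valid) come from the quoted passage; the Wilson score interval is an addition of this sketch, not something the paper reports, and it assumes the annotator ratings behave like independent binomial draws.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


filtered, sample = 31, 200
valid = round(0.387 * filtered)        # 38.7% of 31 pairs, rounded to a count
low, high = wilson_interval(valid, filtered)

print(f"share of sample filtered out: {filtered / sample:.1%}")
print(f"valid among filtered: {valid}/{filtered}")
print(f"approx. 95% interval on that rate: {low:.1%} to {high:.1%}")
```

With only 31 filtered pairs, the interval is wide, so the 38.7% figure supports the qualitative reading that the filter is cautious more than it supports a precise estimate of how many valid pairs are lost.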
Evidence and comparison

The evidence supports the claims within experimental constraints. Comparisons to baselines are rigorous: SFT-Success represents standard practice, Rejection-Sampling isolates filtering versus relabeling value, and SFT-Random controls for data volume. The distinction from concurrent work ECHO (Hu et al., 2025) is clearly articulated—AgentHER performs offline goal relabeling for training, while ECHO optimizes inference-time memory. The per-failure-type analysis shows logical variation: Incomplete and Constraint_Violation yield +11.2 and +9.8 pp respectively, while Tool_Error yields only +2.1 pp, confirming that richer partial trajectories provide more valuable training signal than crashes.

“AgentHER differs in targeting offline training data augmentation: we relabel only the goal (user prompt) while keeping the trajectory unchanged, and output SFT/DPO datasets for fine-tuning rather than inference-time memory; the two approaches are complementary.”
paper · Section 2
“Incomplete (+11.2 pp) and Constraint_Violation (+9.8 pp) benefit the most... Tool_Error yields the least (+2.1 pp) as crashes leave minimal usable signal.”
paper · Section 5.1
Reproducibility

Reproducibility is reasonably strong. The paper provides detailed hyperparameters (LoRA rank $r=16$, $\alpha=32$, learning rates, batch sizes), full prompt templates for all four stages, and a code release. Training uses standard configurations with reported random seeds (42, 1234, 2025) and low variance (all standard deviations below 0.5 pp). However, reliance on proprietary GPT-4o/4o-mini for relabeling limits full reproduction, although rule-based alternatives are provided for Stages 1–2. Missing details include exact API costs and latency for multi-judge scaling and the precise keyword lexicons for rule-based failure detection.

“LoRA rank $r$: 16; LoRA $\alpha$: 32; Learning rate: $2\times 10^{-4}$ (SFT), $5\times 10^{-5}$ (DPO); Epochs: 3 (SFT), 1 (DPO).”
paper · Appendix A
“All fine-tuned models... are trained with 3 independent random seeds (42, 1234, 2025)... All standard deviations are below 0.5 pp.”
paper · Appendix G
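For anyone attempting a reproduction, the quoted settings can be collected into a small configuration object. The values below are taken from the Appendix A and Appendix G quotes; the class name, field names, and the `run_grid` helper are hypothetical, and batch sizes are omitted because the review does not state them. The `lora_scale` property reflects the standard LoRA convention of scaling the update by $\alpha / r$.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FineTuneConfig:
    stage: str              # "sft" or "dpo"
    learning_rate: float
    epochs: int
    lora_rank: int = 16     # LoRA rank r (Appendix A)
    lora_alpha: int = 32    # LoRA alpha (Appendix A)

    @property
    def lora_scale(self) -> float:
        """Standard LoRA scaling factor alpha / r."""
        return self.lora_alpha / self.lora_rank


SFT = FineTuneConfig(stage="sft", learning_rate=2e-4, epochs=3)
DPO = FineTuneConfig(stage="dpo", learning_rate=5e-5, epochs=1)
SEEDS = (42, 1234, 2025)    # seeds reported in Appendix G


def run_grid() -> List[Tuple[FineTuneConfig, int]]:
    """Enumerate (config, seed) pairs; a hypothetical helper for bookkeeping."""
    return [(cfg, seed) for cfg in (SFT, DPO) for seed in SEEDS]
```

Averaging results over the three seeds per configuration would mirror the reporting described in the review, where standard deviations are stated to be below 0.5 pp.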
Abstract

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience. We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation. The key insight is simple: a trajectory that fails goal A is often a correct demonstration for some achievable alternative goal B. AgentHER realises this idea through a four-stage pipeline -- failure classification, outcome extraction, LLM-guided prompt relabeling with confidence gating, and data packaging -- that converts discarded failures into high-quality SFT, DPO, and ShareGPT training data, with both zero-cost rule-based and LLM-judge implementations. On WebArena (Zhou et al., 2024) and ToolBench (Qin et al., 2024), AgentHER improves over success-only SFT by +7.1-11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), while achieving 2x data efficiency -- matching baseline performance with only 50% of successful demonstrations. Gains are consistent from 1.5B to 72B parameters (+5.8-9.2 pp) and compound under iterative redeployment (+2.1 pp over additional rounds). Human evaluation confirms 97.7% relabeling precision under multi-judge verification.
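The four-stage pipeline named in the abstract (failure classification, outcome extraction, relabeling with confidence gating, data packaging) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the keyword rule, the goal template that stands in for an LLM relabeler, the `Judge` callable signature, and the use of the mean judge score as the confidence are all hypothetical. Only the stage order, the failure-type names, the threshold of 0.5, and the "relabel the goal, keep the trajectory unchanged" principle come from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical judge signature: (goal, outcome) -> score in [0, 1].
Judge = Callable[[str, str], float]


@dataclass
class Trajectory:
    goal: str             # original user prompt (the goal that was missed)
    steps: List[str]      # actions/observations rendered as text
    final_output: str     # what the agent actually produced


def classify_failure(traj: Trajectory) -> str:
    """Stage 1: rule-based failure type (the keyword is illustrative only)."""
    text = " ".join(traj.steps).lower()
    return "Tool_Error" if "error" in text else "Incomplete"


def extract_outcome(traj: Trajectory) -> str:
    """Stage 2: summarise what the trajectory actually achieved."""
    return traj.final_output.strip()


def relabel(outcome: str, judges: List[Judge], theta: float = 0.5) -> Optional[str]:
    """Stage 3: propose a hindsight goal and gate it on the mean judge score."""
    new_goal = f"Find and report: {outcome}"   # template stands in for an LLM call
    confidence = sum(j(new_goal, outcome) for j in judges) / len(judges)
    return new_goal if confidence >= theta else None


def package_sft(traj: Trajectory, new_goal: str) -> Dict[str, str]:
    """Stage 4: pair the relabeled goal with the unchanged trajectory."""
    return {
        "prompt": new_goal,
        "response": "\n".join(traj.steps + [traj.final_output]),
        "failure_type": classify_failure(traj),
    }


def agent_her(trajs: List[Trajectory], judges: List[Judge],
              theta: float = 0.5) -> List[Dict[str, str]]:
    """Run all four stages over a batch of failed trajectories."""
    records = []
    for traj in trajs:
        new_goal = relabel(extract_outcome(traj), judges, theta)
        if new_goal is not None:          # gated-out relabels are dropped
            records.append(package_sft(traj, new_goal))
    return records
```

In this sketch, a failed "book the cheapest flight" trajectory whose final output lists three flights would be repackaged as a demonstration of a search-and-report goal, provided the judges' mean score clears the threshold. Packaging DPO pairs is left out because the source does not describe how the rejected side is constructed.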
