Adaptive Robust Estimator for Multi-Agent Reinforcement Learning

cs.AI Zhongyi Li, Wan Tian, Jingyu Chen, Kangyao Huang, Huiming Zhang, Hui Yang, Tao Ren, Jinyang Jiang, Yijie Peng, Yikun Ban, Fuzhen Zhuang · Mar 23, 2026
Plain-language introduction

The paper tackles instability in multi-agent reinforcement learning for LLM reasoning, where noisy, heavy-tailed rewards break standard GRPO batch-mean normalization. It proposes DACR, a structured Answer-Critique-Rewrite protocol with cross-improvement rewards, and ARE, a robust estimator that replaces empirical means with a Median-of-Means variant using adaptive losses. Experiments on mathematical reasoning and aerial vision-language navigation demonstrate improved accuracy and training stability under synthetic noise contamination.
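To make the fragility concrete, the sketch below shows how a single extreme reward flattens mean/std-normalized advantages within a group. The `median_mad_normalize` function is an illustrative stand-in chosen for this example, not the paper's ARE, and the reward values are made up.

```python
import numpy as np

def group_normalize(rewards, eps=1e-8):
    """GRPO-style group normalization: subtract the group mean, divide by the group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def median_mad_normalize(rewards, eps=1e-8):
    """Illustrative robust stand-in (NOT the paper's ARE): median center, MAD scale."""
    r = np.asarray(rewards, dtype=float)
    med = np.median(r)
    mad = np.median(np.abs(r - med))
    # 1.4826 makes the MAD comparable to a standard deviation under Gaussian data.
    return (r - med) / (1.4826 * mad + eps)

clean = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]   # hypothetical binary rewards
noisy = clean + [500.0]                        # one heavy-tailed outlier

adv_clean = group_normalize(clean)
adv_mean = group_normalize(noisy)
adv_med = median_mad_normalize(noisy)

# Advantage gap between a correct (reward 1) and an incorrect (reward 0) sample.
gap_clean = adv_clean[1] - adv_clean[0]
gap_mean = adv_mean[1] - adv_mean[0]
gap_med = adv_med[1] - adv_med[0]
print(gap_clean, gap_mean, gap_med)
```

With the outlier present, the mean/std gap between correct and incorrect samples shrinks to almost nothing, while the median-based version keeps them separated. Two caveats on the stand-in: the outlier sample itself still gets a very large advantage, and with binary rewards the MAD can be exactly zero, so it is not a drop-in replacement.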

Critical review
Verdict
Bottom line

The paper offers a coherent solution to GRPO's fragility under heavy-tailed rewards, combining structured multi-agent interaction (DACR) with rigorous robust statistics (ARE). The theoretical guarantees are solid: consistency under finite-variance assumptions (Theorem 5.3) and under heavy-tailed $(1+\epsilon)$-moment assumptions (Theorem 5.5). Empirical gains across model scales (Table 1) support the claims. However, the reliance on synthetic Cauchy noise injection rather than naturally heavy-tailed distributions limits ecological validity, and the VLN evaluation uses a smaller backbone (Qwen2-VL-2B) than the FlightGPT baseline, confounding direct comparison.

“we inject heavy-tailed, peaked Cauchy noise into the reward model”
Li et al., Sec. 6.1 · Section 6.1 Q1
“due to current constraints, we focused on validating the Adaptive Robust Estimator component... we used a smaller backbone than in (Cai et al., 2025)”
Li et al., Sec. 6.2 · Section 6.2
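For orientation, a synthetic Cauchy contamination of the kind quoted above might look like the following sketch. The contamination fraction `frac`, the `scale` parameter, and the additive form are all assumptions made for illustration; the review notes that the paper does not detail its exact protocol.

```python
import numpy as np

def contaminate_rewards(rewards, frac=0.1, scale=1.0, seed=0):
    """Add Cauchy noise to a random subset of rewards.

    Hypothetical protocol: `frac`, `scale`, and additive noise are assumptions,
    not values or choices reported by the paper.
    """
    rng = np.random.default_rng(seed)
    r = np.asarray(rewards, dtype=float).copy()
    mask = rng.random(r.shape) < frac            # which samples get corrupted
    noise = scale * rng.standard_cauchy(size=r.shape)
    r[mask] += noise[mask]
    return r, mask

rewards = np.ones(1000)
noisy, mask = contaminate_rewards(rewards, frac=0.1, scale=5.0, seed=42)
print(mask.sum(), np.abs(noisy - rewards).max())
```

Because the Cauchy distribution has no finite mean or variance, a handful of corrupted samples can dominate any batch mean, which is the failure mode the experiments are designed to trigger.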
What holds up

The DACR protocol's cross-improvement reward $\Delta_i = R_{\text{rw},-i} - R_{\text{ans},-i}$ provides a principled mechanism for attributing credit when one agent's critique improves its partner's answer. ARE's two-layer robustness, which uses graduated nonconvexity (GNC) within blocks and median aggregation across blocks, is theoretically sound: it achieves $\sqrt{n}$-consistency with asymptotic efficiency $2/\pi$ relative to the sample mean in the finite-variance regime (Theorem 5.6) and sub-Gaussian deviations under only $(1+\epsilon)$ moments (Theorem 5.7). The ablation in Figure 4 indicates that the three-stage interaction itself provides gains (e.g., +12.5% on AMC23 for Qwen2.5-7B), while Figure 5 demonstrates ARE's resilience to both group-level and within-group contamination.

“$\Delta_{1}=R_{\mathrm{rw},2}-R_{\mathrm{ans},2},\qquad\Delta_{2}=R_{\mathrm{rw},1}-R_{\mathrm{ans},1}$”
Li et al., Eq. 2 · Section 2.2
“$\sqrt{n}\,(\widetilde{\mu}-\mu)\ \Rightarrow\ N\!\left(0,\frac{\pi}{2}\sigma^{2}\right)$”
Li et al., Thm 5.6 · Theorem 5.6
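The two quoted formulas can be illustrated with a short sketch. `cross_improvement` follows Eq. 2 as quoted; `median_of_means` is the classical MoM estimator that ARE is said to refine, not ARE itself (the per-block adaptive-loss step is omitted). The function names, the shuffling step, and the example values are assumptions.

```python
import numpy as np

def cross_improvement(r_ans, r_rw):
    """Eq. 2 as quoted: each agent is credited with its partner's gain.

    r_ans, r_rw: pairs (agent 1, agent 2) of answer-stage and rewrite-stage rewards.
    """
    delta_1 = r_rw[1] - r_ans[1]   # agent 1's critique is judged by agent 2's improvement
    delta_2 = r_rw[0] - r_ans[0]   # agent 2's critique is judged by agent 1's improvement
    return delta_1, delta_2

def median_of_means(x, k, seed=0):
    """Classical Median-of-Means: shuffle, split into k blocks, take the median of block means."""
    rng = np.random.default_rng(seed)
    x = rng.permutation(np.asarray(x, dtype=float))
    blocks = np.array_split(x, k)
    return float(np.median([b.mean() for b in blocks]))

# Agent 1 improved from 0 to 1 after agent 2's critique; agent 2 was already correct.
print(cross_improvement(r_ans=(0.0, 1.0), r_rw=(1.0, 1.0)))

# One huge outlier can corrupt at most one block, so the median of block means is unaffected.
x = np.ones(30)
x[0] = 1e9
print(x.mean(), median_of_means(x, k=5))
```

The $\pi/2$ factor in the quoted limit means the estimator's asymptotic variance is about 1.57 times that of the sample mean, i.e. a relative efficiency of $2/\pi \approx 0.64$: the price paid for robustness when the data are actually well behaved.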
Main concerns

The experimental validation of heavy-tailed robustness relies entirely on synthetic noise injection (Cauchy contamination) rather than naturally arising distribution tails, which may not capture the complex dependency structures of real reward model errors. While the ARE estimator is theoretically appealing, its practical implementation requires solving nonconvex subproblems via GNC-IRLS (Appendix E), yet the computational overhead relative to standard batch-mean normalization is not quantified. Furthermore, the theoretical analysis assumes i.i.d. samples, which is violated in the non-stationary, auto-correlated setting of online policy optimization where consecutive batches are highly dependent.

The comparison between DACR+ARE and baselines conflates the interaction protocol with the robust estimator; there is no ablation of ARE within the DACR framework to isolate its specific contribution to the multi-agent gains. The VLN results (Table 2), while positive, show small absolute margins (e.g., SR improvements of 1-2.5 points) and use a smaller model than the baseline, making it unclear whether the gains derive from ARE or simply from different model capacities.

“Optimizing (8) jointly over $(X,(w_{i})_{i=1}^{m})$ without regularization degenerates to $w_{i}=0$ for all $i$”
Li et al., Appendix E · Appendix E
“we used a smaller backbone than in (Cai et al., 2025) (leading to lower absolute performance)”
Li et al., Sec. 6.2 · Section 6.2
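To give a sense of the per-block cost at issue, here is a generic GNC-style IRLS loop for a scalar location estimate. It uses a Geman-McClure surrogate with a geometric continuation schedule, which is a common choice in the GNC literature; the paper's actual loss, exponents, and schedule in Appendix E are not specified in this review, so every parameter below (`c`, `mu_decay`, `tol`) is an assumption.

```python
import numpy as np

def gnc_irls_location(x, c=1.0, mu_decay=1.4, tol=1e-9, max_iter=200):
    """Robust location via IRLS with graduated nonconvexity (Geman-McClure surrogate).

    Generic sketch, not the paper's algorithm. `mu` starts large (nearly convex
    problem) and shrinks to 1 (the target nonconvex loss).
    """
    x = np.asarray(x, dtype=float)
    est = x.mean()                                   # non-robust initial guess
    mu = max(2.0 * np.max((x - est) ** 2) / c**2, 1.0)
    for _ in range(max_iter):
        r2 = (x - est) ** 2
        # Closed-form weights for the surrogate. They minimize a *regularized*
        # objective, which is what prevents the all-zero weights degeneracy.
        w = (mu * c**2 / (r2 + mu * c**2)) ** 2
        new_est = np.sum(w * x) / np.sum(w)          # weighted least-squares step
        converged = abs(new_est - est) < tol
        est = new_est
        if mu <= 1.0 and converged:
            break
        mu = max(mu / mu_decay, 1.0)                 # continuation: increase nonconvexity
    return float(est)

data = [0.9, 1.0, 1.1, 1.05, 0.95, 1000.0]
print(np.mean(data), gnc_irls_location(data))
```

Each call runs dozens of reweighting passes over the block, compared with a single pass for a batch mean, which is why a reported overhead figure would be useful.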
Evidence and comparison

The evidence supports the claim that ARE improves stability under injected noise (Figure 5), but does not establish superiority over simpler robust alternatives (e.g., trimmed means or Huber losses) in the multi-agent RL context. The comparison to GRPO is fair for the mathematical reasoning task, although the baseline hyperparameters (group size, clipping threshold $\epsilon$) are not fully specified. The paper appropriately positions ARE as a refinement of Median-of-Means (MoM) with adaptive losses rather than a wholly novel estimator. The discussion in Appendix B of MoM's breakdown under adversarial contamination applies partially to ARE as well: if an adversary corrupts at least $\lceil k/2\rceil$ of the $k$ blocks, the median fails.

“ARE can be viewed as a principled refinement of the classical Median-of-Means (MoM) estimator”
Li et al., Sec. 3 · Section 3
“it suffices to inject a single extremely large outlier into each of $\lceil k/2\rceil$ distinct blocks to arbitrarily shift the MoM output”
Li et al., Appendix B · Appendix B
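The breakdown condition quoted from Appendix B is easy to reproduce for plain MoM. The sketch below uses fixed contiguous blocks so that an adversary can target them; the block count, block size, and outlier magnitude are arbitrary illustrative choices.

```python
import math
import numpy as np

def mom_contiguous(x, k):
    """Median-of-Means over k fixed, contiguous blocks (no shuffling)."""
    blocks = np.array_split(np.asarray(x, dtype=float), k)
    return float(np.median([b.mean() for b in blocks]))

def contaminate_blocks(x, k, n_blocks, outlier):
    """Place one outlier at the start of each of the first n_blocks blocks.

    Assumes len(x) is divisible by k so block boundaries are easy to compute.
    """
    x = np.asarray(x, dtype=float).copy()
    size = len(x) // k
    for b in range(n_blocks):
        x[b * size] = outlier
    return x

k = 5
clean = np.ones(20)                 # true mean is 1.0
half = math.ceil(k / 2)             # 3 blocks out of 5

few = mom_contiguous(contaminate_blocks(clean, k, half - 1, 1e6), k)
many = mom_contiguous(contaminate_blocks(clean, k, half, 1e6), k)
print(few, many)
```

With two corrupted blocks the median still lands on a clean block mean; with three, the median itself is a corrupted block mean and scales with the outlier. Whether ARE's within-block robust loss would absorb such single-point attacks is exactly the kind of question a comparison against simpler baselines could answer.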
Reproducibility

The authors provide a GitHub repository (https://github.com/bhai114/ARE), but critical implementation details are missing. For ARE, the number of blocks $k$, the GNC continuation schedule parameters (e.g., $\beta$ steps, $p$ or $q$ exponents in Appendix E), and the convergence thresholds for the alternating optimization are not specified. The VLN experiments use "downscaled image resolutions" without stating the target resolution. For the mathematical reasoning experiments, Cauchy noise injection is described qualitatively in Section 6.1 Q3, but the exact contamination protocol (random seeds, noise scale parameters) is not detailed. The GRPO hyperparameters (group size $G$, KL penalty $\beta$, learning rate), which are essential for independent reproduction, are also not reported.

“Our code is available at https://github.com/bhai114/ARE”
“with downscaled image resolutions to fit available compute”
Li et al., Sec. 6.2 · Section 6.2
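One way to make these gaps actionable is a checklist-style configuration object in which every unreported setting starts as `None`. The field names below are hypothetical labels for the items listed above; no values are filled in because none are reported.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class ReproConfig:
    """Settings the review flags as unreported. Field names are hypothetical."""
    are_num_blocks_k: Optional[int] = None
    gnc_schedule: Optional[str] = None           # continuation steps and exponents
    gnc_convergence_tol: Optional[float] = None
    vln_image_resolution: Optional[str] = None
    cauchy_noise_scale: Optional[float] = None
    random_seeds: Optional[tuple] = None
    grpo_group_size: Optional[int] = None
    grpo_kl_penalty: Optional[float] = None
    learning_rate: Optional[float] = None

def missing_fields(cfg):
    """Return the names of all settings that are still unspecified."""
    return [f.name for f in fields(cfg) if getattr(cfg, f.name) is None]

cfg = ReproConfig()
print(missing_fields(cfg))
```

A reproduction attempt would fill these in from the repository or from correspondence with the authors, and `missing_fields` then lists whatever remains unknown.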
Abstract

Multi-agent collaboration has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models, yet it suffers from interaction-level ambiguity that blurs generation, critique, and revision, making credit assignment across agents difficult. Moreover, policy optimization in this setting is vulnerable to heavy-tailed and noisy rewards, which can bias advantage estimation and trigger unstable or even divergent training. To address both issues, we propose a robust multi-agent reinforcement learning framework for collaborative reasoning, consisting of two components: Dual-Agent Answer-Critique-Rewrite (DACR) and an Adaptive Robust Estimator (ARE). DACR decomposes reasoning into a structured three-stage pipeline: answer, critique, and rewrite, while enabling explicit attribution of each agent's marginal contribution to its partner's performance. ARE provides robust estimation of batch experience means during multi-agent policy optimization. Across mathematical reasoning and embodied intelligence benchmarks, even under noisy rewards, our method consistently outperforms the baseline in both homogeneous and heterogeneous settings. These results indicate stronger robustness to reward noise and more stable training dynamics, effectively preventing optimization failures caused by noisy reward signals.
