Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

cs.AI Zhongyi Li, Wan Tian, Yikun Ban, Jinju Chen, Huiming Zhang, Yang Liu, Fuzhen Zhuang · Mar 23, 2026
Plain-language introduction

Collaborative multi-agent LLM systems struggle with credit assignment during RL training: shared terminal rewards obscure individual contributions, encouraging free-riding and high gradient variance. This paper introduces CCPO (Counterfactual Credit Policy Optimization), which estimates each agent's marginal contribution by contrasting actual team performance with counterfactual outcomes—simulating performance without that specific agent. The method targets efficient discrete generation for LLMs and demonstrates improved reasoning accuracy across mathematical and logical benchmarks.

Critical review
Verdict
Bottom line

The paper addresses a genuine bottleneck in multi-agent LLM training and provides a pragmatic solution for sequential (Think–Solve) and parallel (Voting) collaboration topologies. The counterfactual formulation is theoretically grounded for variance reduction (Theorem 5.3), and empirical gains, while modest (1.5–4.6% over ReMA), are consistent across model scales. However, the method is currently limited to terminal reward settings with deterministic aggregation, and the computational overhead of counterfactual sampling in sequential settings—acknowledged but not fully benchmarked—remains a practical concern.

“To construct the counterfactual for Agent 1, we additionally sample $N$ rollouts where Agent 2 answers without access to $y_1$”
paper · Section 4.1
“If $R_{\neg k}$ is closer to $b^{\star}$ than 0 in the weighted mean-square sense... then $\mathrm{Var}(g_{k}(\tau_{k})\Delta_{k}\mid x,\theta_{-k}) \leq \mathrm{Var}(g_{k}(\tau_{k})R(\tau)\mid x,\theta_{-k})$”
paper · Theorem 5.3
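Theorem 5.3 concerns variance, but the claim that the counterfactual signal leaves the gradient unbiased rests on a shorter argument. A sketch of it follows, under two assumptions that are not spelled out in the quoted text: the credit is defined as $\Delta_k = R(\tau) - R_{\neg k}$, and $g_k(\tau_k)$ denotes the score function $\nabla_{\theta_k}\log\pi_{\theta_k}(\tau_k \mid x)$. It also uses the independence assumption stated in Lemma A.1. All expectations are conditional on $(x,\theta_{-k})$.

```latex
% Sketch only. Assumes \Delta_k = R(\tau) - R_{\neg k} and
% g_k(\tau_k) = \nabla_{\theta_k} \log \pi_{\theta_k}(\tau_k \mid x).
\begin{align*}
\mathbb{E}\big[g_k(\tau_k)\,\Delta_k\big]
  &= \mathbb{E}\big[g_k(\tau_k)\,R(\tau)\big]
   - \mathbb{E}\big[g_k(\tau_k)\,R_{\neg k}\big] \\
  &= \mathbb{E}\big[g_k(\tau_k)\,R(\tau)\big]
   - \underbrace{\mathbb{E}\big[g_k(\tau_k)\big]}_{=\,0}\,
     \mathbb{E}\big[R_{\neg k}\big]
   && \text{(independence of } R_{\neg k} \text{ and } \tau_k\text{)} \\
  &= \mathbb{E}\big[g_k(\tau_k)\,R(\tau)\big].
\end{align*}
```

The middle step is where the independence assumption does its work: if $R_{\neg k}$ were correlated with agent $k$'s sampled actions, the product would not factor and the subtracted term would not vanish.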
What holds up

The core insight—that counterfactual baselines provide unbiased, lower-variance gradients than shared rewards—is rigorously justified (Lemma A.1, Theorem 5.3). The voting instantiation is particularly elegant, requiring 'no additional decoding' to compute counterfactual rewards. The global-history-aware normalization using EMA statistics (Eq. 2–4) is a practical stabilization technique for heterogeneous task distributions. The Lazy Agent ablation (Table 3) effectively validates that the collaboration mechanism is actually utilized rather than ignored by the Solver.

“This computation is performed within the same sampling instance and adds negligible overhead”
paper · Section 4.1
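Why the voting case needs no extra decoding can be illustrated with a leave-one-out sketch: the answers already sampled are re-aggregated with one agent's vote removed. The function names, the binary reward, and the tie-breaking rule (highest count, then earliest first appearance) are assumptions made for illustration; the paper's own aggregation rule may differ.

```python
from collections import Counter


def majority_vote(answers):
    """Deterministic majority vote; ties go to the answer that appeared first.

    The tie-breaking rule is an assumption for this sketch.
    """
    if not answers:
        return None
    counts = Counter(answers)
    return max(counts, key=lambda a: (counts[a], -answers.index(a)))


def loo_credits(answers, gold):
    """Leave-one-out credit per agent: R(team) - R(team without agent k).

    Rewards are binary (1 if the voted answer equals `gold`, else 0).
    No new generation is needed: the same sampled answers are reused.
    """
    team_reward = int(majority_vote(answers) == gold)
    credits = []
    for k in range(len(answers)):
        others = answers[:k] + answers[k + 1:]
        counterfactual_reward = int(majority_vote(others) == gold)
        credits.append(team_reward - counterfactual_reward)
    return credits


# Five agents, correct answer "4". The three agents voting "4" are each pivotal.
example_credits = loo_credits(["7", "4", "4", "7", "4"], gold="4")
```

In this example the two agents that answered "7" receive zero credit, while each agent that answered "4" receives credit 1 because removing its vote flips the outcome under the assumed tie-break.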
“the counterfactual reward exhibits a distribution that better aligns with human intuition (J: the score of the two agents' joint answer, S: the score of Agent2's solo answer)”
paper · Figure 4 caption
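The global-history-aware normalization mentioned above can be pictured as a running standardization followed by a bounded squashing. Eq. 2–4 are not reproduced in this review, so the sketch below is only one plausible reading: it keeps exponential moving averages of the credit mean and variance (decay $\lambda=0.99$) and applies tanh shaping ($\alpha=1.0$). The class name, the initial statistics, and the `eps` term are assumptions.

```python
import math


class EmaNormalizer:
    """Running EMA statistics over credits, with tanh-shaped advantages.

    Hypothetical helper: the exact form of the paper's Eq. 2-4 may differ.
    """

    def __init__(self, decay=0.99, alpha=1.0, eps=1e-8):
        self.decay = decay
        self.alpha = alpha
        self.eps = eps
        self.mean = 0.0  # assumed initial value
        self.var = 1.0   # assumed initial value

    def update(self, credits):
        """Blend the current batch's mean and variance into the running stats."""
        batch_mean = sum(credits) / len(credits)
        batch_var = sum((c - batch_mean) ** 2 for c in credits) / len(credits)
        self.mean = self.decay * self.mean + (1 - self.decay) * batch_mean
        self.var = self.decay * self.var + (1 - self.decay) * batch_var

    def advantage(self, credit):
        """Standardize with the global stats, then squash into (-1, 1)."""
        z = (credit - self.mean) / (math.sqrt(self.var) + self.eps)
        return math.tanh(self.alpha * z)


normalizer = EmaNormalizer()
normalizer.update([0.0, 1.0, 1.0, 0.0])
shaped = [normalizer.advantage(c) for c in (0.0, 1.0)]
```

Because the statistics are shared across batches rather than recomputed per prompt, a batch of uniformly easy or uniformly hard tasks does not rescale its own advantages to unit variance, which is presumably the stabilizing effect the review refers to.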
Main concerns

Three limitations stand out. First, the sequential Think–Solve instantiation requires additional sampling of solo trajectories from the Solver ($y_{2,\mathrm{solo}}^{(j)} \sim \pi_{\theta_{2}}(\cdot \mid x)$), effectively doubling generation costs for Agent 1's credit assessment—this overhead is mentioned but not empirically compared against baselines. Second, the theoretical guarantees assume action-independence of counterfactual rewards (Lemma A.1), which may not hold if agents learn policies that implicitly condition on the counterfactual removal. Third, the evaluation scope is narrow: all tasks use binary terminal rewards, leaving unclear whether the method extends to process-level or partial-credit signals; Table 1 also contains missing entries (dashes) that complicate cross-dataset comparisons.

“To construct the counterfactual for Agent 1, we additionally sample $N$ rollouts where Agent 2 answers without access to $y_1$”
paper · Appendix B.2
“Assume that conditioned on $(x,\theta_{-k})$, the random variable $R_{\neg k}$ is independent of agent $k$'s sampled actions in the joint rollout”
paper · Lemma A.1
“-”
paper · Table 1
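The first concern, the cost of the sequential counterfactual, can be made concrete with a small sketch. It assumes Agent 1's credit for each joint rollout is the joint reward minus the mean reward of the Solver's solo rollouts; the review does not state the exact estimator, so this form and the function names are illustrative only.

```python
from statistics import mean


def thinker_credits(joint_rewards, solo_rewards):
    """Credit for Agent 1 per joint rollout: joint reward minus mean solo reward.

    `joint_rewards`: rewards when Agent 2 answers with access to y_1.
    `solo_rewards`: rewards when Agent 2 answers from x alone.
    The estimator is an assumption for this sketch.
    """
    if not joint_rewards or not solo_rewards:
        raise ValueError("both reward lists must be non-empty")
    baseline = mean(solo_rewards)
    return [r - baseline for r in joint_rewards]


def solver_decoding_calls(n_joint, n_solo):
    """Solver generations per prompt: joint rollouts plus solo rollouts."""
    return n_joint + n_solo


joint = [1.0, 0.0, 1.0, 1.0]  # hypothetical binary rewards, N = 4
solo = [0.0, 0.0, 1.0, 0.0]   # hypothetical solo rewards, N = 4
credits = thinker_credits(joint, solo)
calls = solver_decoding_calls(len(joint), len(solo))
```

With $N$ joint and $N$ solo rollouts the Solver decodes $2N$ times per prompt instead of $N$, which is the doubling the review flags as unbenchmarked.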
Evidence and comparison

The evidence supports the claim that CCPO improves over shared-reward baselines and ReMA (Wan et al., 2025). On MATH500 with Qwen2.5-1.5B, CCPO achieves 61.0% versus ReMA's 60.0% and single-agent GRPO's 56.8% (Table 1). The trend holds across Llama and Qwen3-4B models, with CCPO generally outperforming on out-of-distribution sets like AMC23. The comparison appears fair given similar training data regimes (MATH 7.5k), though ReMA's specific architecture is not detailed here. The reward distribution visualization (Figure 4) provides intuitive evidence that counterfactual rewards better discriminate between high and low contributors than shared rewards.

“qwen2.5-1.5b instruct... Ours 61.00... ReMA 60.00... GRPO 56.80”
paper · Table 1
“Experiment 1 achieves higher accuracy than Experiment 2, which demonstrates that the cooperative mechanism is indeed effective”
paper · Section 6.2.1
Reproducibility

The authors provide a GitHub repository and detailed hyperparameters in Appendix C (learning rate $1\times 10^{-6}$, batch size 64, $\epsilon=0.2$). The use of standard models (Qwen2.5, Llama3.1) and datasets (MATH, LogiQA) facilitates reproduction. However, the Solver in Think–Solve mode relies on a gating mechanism (Appendix B.2) built around the trust coefficient $g = \sigma(\eta \cdot \mu_\Delta / \sigma_\Delta)$, and this component is not fully ablated in the main text. While the EMA decay ($\lambda=0.99$) and tanh shaping ($\alpha=1.0$) are reported, the sensitivity of results to these hyperparameters remains unexplored.

“Learning rate: $1\times 10^{-6}$, Batch size: 64, Clip ratio ($\epsilon$): 0.2, EMA decay ($\lambda$): 0.99”
paper · Appendix C Table 4
“Our code is available at https://github.com/bhai114/ccpo”
paper · Abstract
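For readers trying to reproduce the gating step, the trust coefficient $g = \sigma(\eta \cdot \mu_\Delta / \sigma_\Delta)$ is simple to compute once a batch of credits is available. The sketch below takes $\mu_\Delta$ and $\sigma_\Delta$ to be the batch mean and population standard deviation of the credits. That reading, the default `eta=1.0`, the `eps` guard, and the clipping of the logit are all assumptions; the value of $\eta$ and how $g$ enters the Solver's update are not given in this review.

```python
import math
from statistics import mean, pstdev


def trust_gate(credits, eta=1.0, eps=1e-8, max_logit=60.0):
    """Sigmoid of eta * mean / std over a batch of credits.

    `eps` avoids division by zero when all credits are equal, and the logit is
    clipped so math.exp cannot overflow. Both guards are assumptions.
    """
    mu = mean(credits)
    sigma = pstdev(credits)
    logit = eta * mu / (sigma + eps)
    logit = max(-max_logit, min(max_logit, logit))
    return 1.0 / (1.0 + math.exp(-logit))


gate_positive = trust_gate([0.75, -0.25, 0.75, 0.75])   # mostly helpful Thinker
gate_negative = trust_gate([-0.75, 0.25, -0.75, -0.75])  # mostly harmful Thinker
gate_neutral = trust_gate([0.0, 0.0, 0.0])               # no measurable effect
```

Under this reading the gate sits above 0.5 when the Thinker's average credit is positive relative to its spread, below 0.5 when it is negative, and at exactly 0.5 when the credits carry no signal. A sensitivity sweep over $\eta$ would be the natural ablation the review asks for.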
Abstract

Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent-specific learning signals by estimating each agent's marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent's contribution removed, yielding role-sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global-history-aware normalization scheme that calibrates advantages using global rollout statistics. We evaluate CCPO on two collaboration topologies: a sequential Think–Reason dyad and multi-agent voting. Across mathematical and logical reasoning benchmarks, CCPO mitigates free-riding and outperforms strong multi-agent RL baselines, yielding finer-grained and more effective credit assignment for collaborative LLM training. Our code is available at https://github.com/bhai114/ccpo.
