Counterfactual Credit Policy Optimization for Multi-Agent Collaboration
Collaborative multi-agent LLM systems struggle with credit assignment during RL training: a shared terminal reward obscures individual contributions, which encourages free-riding and inflates gradient variance. This paper introduces CCPO (Counterfactual Credit Policy Optimization), which estimates each agent's marginal contribution by contrasting actual team performance with a counterfactual outcome in which that agent's contribution is removed. The method targets efficient discrete generation for LLMs and demonstrates improved reasoning accuracy across mathematical and logical benchmarks.
The paper addresses a genuine bottleneck in multi-agent LLM training and provides a pragmatic solution for sequential (Think–Solve) and parallel (Voting) collaboration topologies. The counterfactual formulation is theoretically grounded for variance reduction (Theorem 5.3), and empirical gains, while modest (1.5–4.6% over ReMA), are consistent across model scales. However, the method is currently limited to terminal reward settings with deterministic aggregation, and the computational overhead of counterfactual sampling in sequential settings—acknowledged but not fully benchmarked—remains a practical concern.
The core insight—that counterfactual baselines provide unbiased, lower-variance gradients than shared rewards—is rigorously justified (Lemma A.1, Theorem 5.3). The voting instantiation is particularly elegant, requiring 'no additional decoding' to compute counterfactual rewards. The global-history-aware normalization using EMA statistics (Eq. 2–4) is a practical stabilization technique for heterogeneous task distributions. The Lazy Agent ablation (Table 3) effectively validates that the collaboration mechanism is actually utilized rather than ignored by the Solver.
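To make the "no additional decoding" point concrete, here is a minimal sketch of leave-one-out credit under majority voting. It is an illustration, not the authors' implementation: the function names, the binary exact-match reward, and the rule that a tied vote yields no consensus (and so zero reward) are all assumptions introduced for this example.

```python
from collections import Counter

def majority_vote(answers):
    """Return the strict plurality answer, or None on a tie or empty input.

    Treating ties as "no consensus" is an assumed rule for this sketch.
    """
    top = Counter(answers).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None
    return top[0][0]

def leave_one_out_credit(answers, gold):
    """Per-agent credit: team reward minus the reward of the vote without that agent.

    Only the answers already sampled are re-aggregated, so no new decoding is needed.
    """
    team_reward = float(majority_vote(answers) == gold)
    credits = []
    for i in range(len(answers)):
        others = answers[:i] + answers[i + 1:]
        counterfactual_reward = float(majority_vote(others) == gold)
        credits.append(team_reward - counterfactual_reward)
    return team_reward, credits

# Three agents vote; the third is wrong but would still receive the shared reward.
team_reward, credits = leave_one_out_credit(["42", "42", "17"], gold="42")
print(team_reward, credits)
```

In this toy case the shared reward is 1.0 for all three agents, while the leave-one-out credit is zero for the agent whose removal would not have changed the outcome, which is the free-riding signal the review describes.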
Three limitations stand out. First, the sequential Think–Solve instantiation requires additional sampling of solo trajectories from the Solver ($y_{2,\mathrm{solo}}^{(j)} \sim \pi_{\theta_{2}}(\cdot \mid x)$), effectively doubling generation costs for Agent 1's credit assessment—this overhead is mentioned but not empirically compared against baselines. Second, the theoretical guarantees assume action-independence of counterfactual rewards (Lemma A.1), which may not hold if agents learn policies that implicitly condition on the counterfactual removal. Third, the evaluation scope is narrow: all tasks use binary terminal rewards, leaving unclear whether the method extends to process-level or partial-credit signals; Table 1 also contains missing entries (dashes) that complicate cross-dataset comparisons.
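The sampling overhead in the first limitation can be sketched as follows. This assumes Agent 1's credit is the team reward minus the mean reward of `k` Solver-only rollouts; the function signature, the callables, and `k` itself are hypothetical stand-ins for this example, not names taken from the paper.

```python
from itertools import cycle
from statistics import mean

def thinker_credit(x, thought, solve_with_thought, solve_solo, reward, k=4):
    """Assumed form of Agent 1's credit in the sequential setting.

    Returns (credit, extra_decodes): team reward minus the mean reward of
    k solo Solver rollouts, plus the number of additional generations used.
    """
    team_reward = reward(x, solve_with_thought(x, thought))
    solo_rewards = [reward(x, solve_solo(x)) for _ in range(k)]
    baseline = mean(solo_rewards)
    return team_reward - baseline, k

# Toy stand-ins: the task is "double x"; the solo Solver is right half the time.
solo_outputs = cycle([4, 5])
credit, extra_decodes = thinker_credit(
    x=2,
    thought="double it",
    solve_with_thought=lambda x, t: 2 * x,
    solve_solo=lambda x: next(solo_outputs),
    reward=lambda x, y: float(y == 2 * x),
    k=4,
)
print(credit, extra_decodes)
```

Each credit estimate costs `k` extra Solver generations on top of the collaborative rollout, which is the overhead the review asks the authors to benchmark.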
The evidence supports the claim that CCPO improves over shared-reward baselines and ReMA (Wan et al., 2025). On MATH500 with Qwen2.5-1.5B, CCPO achieves 61.0% versus ReMA's 60.0% and single-agent GRPO's 56.8% (Table 1). The trend holds across Llama and Qwen3-4B models, with CCPO generally outperforming the baselines on out-of-distribution sets like AMC23. The comparison appears fair given similar training data regimes (MATH 7.5k), though ReMA's specific architecture is not detailed here. The reward distribution visualization (Figure 4) provides intuitive evidence that counterfactual rewards discriminate between high and low contributors better than shared rewards do.
The authors provide a GitHub repository and detailed hyperparameters in Appendix C (learning rate $1\times 10^{-6}$, batch size 64, $\epsilon=0.2$). The use of standard models (Qwen2.5, Llama3.1) and datasets (MATH, LogiQA) facilitates reproduction. However, the Solver in Think–Solve mode relies on gating mechanisms (Appendix B.2) built around the trust coefficient $g = \sigma(\eta \cdot \mu_\Delta / \sigma_\Delta)$, and this added complexity is not fully ablated in the main text. While the EMA decay ($\lambda=0.99$) and tanh shaping ($\alpha=1.0$) are reported, the sensitivity of results to these hyperparameters remains unexplored.
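For readers who want to probe that sensitivity, the pieces named above can be mocked up as below. This is a rough sketch under stated assumptions: the review does not reproduce Eq. 2–4, so the exact update rule (EMA of batch mean and variance, followed by `tanh(alpha * z)`) is a guess at a common form, and the class name, `eps`, and the default `eta=1.0` are hypothetical. Only the gate formula $g = \sigma(\eta \cdot \mu_\Delta / \sigma_\Delta)$ and the values $\lambda=0.99$, $\alpha=1.0$ come from the text.

```python
import math

class EmaNormalizer:
    """Assumed global-history normalizer: EMA statistics plus tanh shaping."""

    def __init__(self, decay=0.99, alpha=1.0, eps=1e-8):
        self.decay, self.alpha, self.eps = decay, alpha, eps
        self.mean, self.var = 0.0, 1.0  # assumed initial statistics

    def update(self, batch):
        """Fold one batch of raw credits into the running mean and variance."""
        m = sum(batch) / len(batch)
        v = sum((c - m) ** 2 for c in batch) / len(batch)
        self.mean = self.decay * self.mean + (1 - self.decay) * m
        self.var = self.decay * self.var + (1 - self.decay) * v

    def normalize(self, credit):
        """Standardize against the running statistics, then squash into (-1, 1)."""
        z = (credit - self.mean) / (math.sqrt(self.var) + self.eps)
        return math.tanh(self.alpha * z)

def trust_gate(mu_delta, sigma_delta, eta=1.0, eps=1e-8):
    """g = sigmoid(eta * mu_delta / sigma_delta), in a numerically stable form.

    mu_delta and sigma_delta are taken as given inputs here; eps guards
    against division by zero and is not part of the stated formula.
    """
    z = eta * mu_delta / (sigma_delta + eps)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))

normalizer = EmaNormalizer(decay=0.99, alpha=1.0)
normalizer.update([0.0, 1.0, 1.0, 0.0])
shaped = normalizer.normalize(1.0)
gate_neutral = trust_gate(0.0, 1.0)
gate_positive = trust_gate(1.0, 1.0)
```

Sweeping `decay`, `alpha`, and `eta` in a harness like this would be one inexpensive way to run the ablation that the review finds missing.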
Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent-specific learning signals by estimating each agent's marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent's contribution removed, yielding role-sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global-history-aware normalization scheme that calibrates advantages using global rollout statistics. We evaluate CCPO on two collaboration topologies: a sequential Think–Reason dyad and multi-agent voting. Across mathematical and logical reasoning benchmarks, CCPO mitigates free-riding and outperforms strong multi-agent RL baselines, yielding finer-grained and more effective credit assignment for collaborative LLM training. Our code is available at https://github.com/bhai114/ccpo.