Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation

cs.SE cs.AI Lingzhe Zhang, Tong Jia, Mingyu Wang, Weijie Hong, Chiming Duan, Minghua He, Rongqian Wang, Xi Peng, Meiling Wang, Gong Zhang, Renhai Chen, Ying Li · Mar 23, 2026
AI Review
Plain-language introduction

This paper addresses the challenge of efficient failure management in LLM-based Multi-Agent Systems (MASs). Existing approaches rely on expensive per-trace reasoning with large judge LLMs, which is slow and unstable. The core contribution is EAGER, a framework that uses unsupervised reasoning-scoped contrastive learning to encode intra-agent and inter-agent dynamics into embeddings, enabling real-time step-wise failure detection and reflexive mitigation guided by historical patterns rather than costly LLM inference.
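
To make the idea of replacing per-trace LLM judging with embedding lookup concrete, here is a minimal sketch. It assumes each reasoning step has already been encoded as a vector and that a store of historical failure embeddings with labels exists. The function names, the labels, and the 0.8 threshold are hypothetical and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def detect_step(step_embedding, history, threshold=0.8):
    """Return (label, similarity) of the closest historical failure.

    history: list of (embedding, failure_label) pairs (hypothetical store).
    The label is None when no stored failure is similar enough.
    """
    best_label, best_sim = None, -1.0
    for emb, label in history:
        sim = cosine(step_embedding, emb)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim >= threshold:
        return best_label, best_sim
    return None, best_sim

# Toy two-dimensional embeddings, purely illustrative.
history = [([1.0, 0.0], "tool_error"), ([0.0, 1.0], "wrong_plan")]
print(detect_step([0.9, 0.1], history))
```

A lookup like this costs one similarity computation per stored pattern, which is the kind of per-step cost the framework relies on instead of a judge-LLM call.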

Critical review
Verdict
Bottom line

EAGER proposes a promising shift from per-trace LLM reasoning to embedding-based retrieval for MAS failure management, supported by a technically sound contrastive learning framework. However, the evaluation is preliminary and lacks critical baselines. While the paper demonstrates that general text embeddings fail on reasoning traces and that failure patterns are concentrated in specific systems, it does not experimentally validate that EAGER outperforms existing methods like RAFFLES or TRAIL in accuracy or efficiency. The moderate diagnosis F1 scores (63-79%) and acknowledged generalization limitations of the lightly fine-tuned model raise concerns about practical deployment readiness.

“our current representation model is only lightly fine-tuned from the Qwen-0.6B-Embedding backbone, which limits its generalization capability.”
paper · Section 4
“Failure Diagnosis ... 63.23% ... 78.76% ... 69.51%”
paper · Table 3
What holds up

The empirical observation that failures in a given MAS are concentrated and recurring is well supported by Table 1, which shows highly distinct failure distributions across AutoGen-Code, RCLAgent, and SWE-Agent. The critique of existing embeddings is backed by Table 2, which shows that Qwen3-0.6B-Embedding and BGE-M3-Embedding reach only 13.3% and 22.2% Recall@10, respectively, on reasoning trace retrieval, confirming the need for specialized representations. The contrastive learning formulation $\mathcal{L}_{\text{total}}=\lambda_{1}\mathcal{L}_{\text{intra}}+\lambda_{2}\mathcal{L}_{\text{inter}}+\lambda_{3}\mathcal{L}_{\text{rank}}$ provides a principled way to capture hierarchical reasoning structure without labeled failure data.

“Qwen3-0.6B-Embedding ... 13.3% ... BGE-M3-Embedding ... 22.2%”
paper · Table 2
“$\mathcal{L}_{\text{total}}=\lambda_{1}\mathcal{L}_{\text{intra}}+\lambda_{2}\mathcal{L}_{\text{inter}}+\lambda_{3}\mathcal{L}_{\text{rank}}$”
paper · Equation 1
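
The weighted objective in Equation 1 can be sketched as follows. The review does not reproduce how the paper defines each term, so this sketch substitutes an InfoNCE-style loss for the intra- and inter-agent terms and a margin ranking loss for the rank term. The temperature, margin, similarity values, and lambda weights are all assumptions.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.1):
    """InfoNCE-style loss: -log softmax of the positive among all candidates."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

def margin_rank(sim_closer, sim_farther, margin=0.1):
    """Hinge loss that is zero once sim_closer beats sim_farther by the margin."""
    return max(0.0, margin - (sim_closer - sim_farther))

def total_loss(l_intra, l_inter, l_rank, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum matching the shape of Equation 1."""
    l1, l2, l3 = lambdas
    return l1 * l_intra + l2 * l_inter + l3 * l_rank

# Illustrative similarities only; the paper's weights are not reported.
l_intra = info_nce(0.9, [0.2, 0.1])
l_inter = info_nce(0.7, [0.3, 0.4])
l_rank = margin_rank(0.8, 0.75)
print(total_loss(l_intra, l_inter, l_rank, lambdas=(1.0, 0.5, 0.5)))
```

Because the three terms can differ in scale, the choice of weights changes which structure the encoder prioritizes, which is why the missing lambda values matter for reproduction.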
Main concerns

The paper calls per-trace reasoning "extremely time-consuming" but reports no latency comparison against judge-LLM baselines, so EAGER's efficiency advantage is asserted rather than measured. The ~5s detection latency (Table 3) is given without request throughput or cost figures to put it in context. Diagnosis accuracy is moderate (F1 63-79%) and is not benchmarked against existing methods such as RAFFLES or AgenTracer. The embedding evaluation uses only 45 traces, too few to support statistically reliable conclusions. The assumption that "semantically similar questions tend to yield reasoning traces with analogous structures" underpins the contrastive approach but is never tested. Finally, the distinction between "general-purpose" and "practical" MASs is asserted without formal criteria, which limits how far the claims generalize.

“While effective, this approach is extremely time-consuming: not only does it analyze each trace individually, but the use of a large judge LLM further increases computational overhead.”
paper · Section 1, Per-Trace Reasoning
“semantically similar questions tend to yield reasoning traces with analogous structures and logical progressions in most cases”
paper · Section 3.2
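
The sample-size concern can be checked with a standard Wilson score interval. This sketch assumes that Recall@10 in Table 2 is the fraction of 45 query traces with a hit, so that 13.3% and 22.2% correspond to 6 and 10 successes. That reading is an assumption and not something the paper states.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumption: 13.3% and 22.2% of 45 traces are 6 and 10 successful queries.
for name, hits in [("Qwen3-0.6B-Embedding", 6), ("BGE-M3-Embedding", 10)]:
    lo, hi = wilson_interval(hits, 45)
    print(f"{name}: {hits}/45, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The width of the printed intervals gives a reader a way to judge how much weight percentages from 45 traces can bear.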
Evidence and comparison

The evidence supports the existence of failure concentration and the inadequacy of general embeddings, but it does not demonstrate EAGER's superiority over the cited alternatives. The critique that judge LLMs are "inherently unstable" is compelling but is not quantified or compared against EAGER's own stability. The related-work comparison is descriptive rather than experimental: despite citing TRAIL, MAST, Who&When, and RAFFLES, the paper provides no head-to-head accuracy or efficiency benchmarks. The task performance improvement in Table 4 (+2-4% Recall) shows that EAGER can enhance RCLAgent, but without ablation studies it is unclear whether the gain comes from detection accuracy or from the reflexive mitigation mechanism.

“these LLMs are inherently unstable. This instability means that the same failure may sometimes be analyzed correctly and sometimes incorrectly.”
paper · Section 1, Neglecting Historical Failure Patterns
“RCLAgent + EAGER ... 30.19% ... 48.65% MRR”
paper · Table 4
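
For readers comparing the Recall and MRR figures quoted here, this is a minimal sketch of the conventional definitions of the two metrics. The paper may compute them differently (for example over root-cause candidates rather than retrieved traces), and the identifiers in the example are made up.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k of one ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def mean_reciprocal_rank(rankings, relevants):
    """Mean over queries of 1 / rank of the first relevant item (0 if none)."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, rel in zip(rankings, relevants):
        rel = set(rel)
        for position, item in enumerate(ranked, start=1):
            if item in rel:
                total += 1.0 / position
                break
    return total / len(rankings)

# Made-up candidate identifiers for two queries.
rankings = [["svc-a", "svc-b", "svc-c"], ["svc-d", "svc-e"]]
relevants = [["svc-b"], ["svc-d"]]
print(recall_at_k(rankings[0], relevants[0], k=1))
print(mean_reciprocal_rank(rankings, relevants))
```

Recall@k rewards any hit within the cutoff, while MRR rewards placing the first correct item early, so the two can move differently between system variants.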
Reproducibility

Reproducibility is severely limited. No code repository or dataset links are provided. The training hyperparameters (the values of $\lambda_1, \lambda_2, \lambda_3$, batch size, epochs) are unspecified. The dataset sizes for the detection and diagnosis experiments are not stated; only the 45 traces used for the embedding evaluation are reported. Hardware specifications and training time are omitted. The reflexive mitigation mechanism is described conceptually ("model-centric reflection" vs "orchestration-centric reflection") but without algorithmic detail on how retrieved historical knowledge concretely guides regeneration or replanning, which makes independent reproduction difficult.

“our current representation model is only lightly fine-tuned from the Qwen-0.6B-Embedding backbone”
paper · Section 4
“When step-wise detection precisely identifies a specific agent's failure, EAGER performs a model-centric reflection... Conversely, when the entire reasoning trace is deemed faulty, EAGER triggers an orchestration-centric reflection”
paper · Section 3.1, Reflexive Mitigation
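
The branching rule in the quoted passage can be sketched as below. Only the two branch conditions come from the quote. The `Detection` record, its field names, and the returned dictionary are hypothetical, and the sketch deliberately stops where the paper stops: it does not say how the retrieved hint is turned into a regenerated step or a new plan.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Detection:
    """Hypothetical detection result; field names are not from the paper."""
    trace_faulty: bool
    failed_agent: Optional[str] = None     # set when one agent's step is pinpointed
    matched_pattern: Optional[str] = None  # retrieved historical failure, if any

def choose_reflection(d: Detection) -> dict:
    """Pick a reflection mode following the rule quoted from Section 3.1."""
    if d.failed_agent is not None:
        # A specific agent's failure was identified: reflect at the model level.
        return {"mode": "model-centric", "target": d.failed_agent,
                "hint": d.matched_pattern}
    if d.trace_faulty:
        # The whole trace is deemed faulty: reflect at the orchestration level.
        return {"mode": "orchestration-centric", "target": "orchestrator",
                "hint": d.matched_pattern}
    return {"mode": "none", "target": None, "hint": None}

print(choose_reflection(Detection(trace_faulty=True, failed_agent="coder",
                                  matched_pattern="missing import")))
```

Everything after this dispatch (the prompts, retries, and replanning logic) is exactly the detail that is missing and would need to be specified for reproduction.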
Abstract

Large Language Models (LLM)-based Multi-Agent Systems (MASs) have emerged as a new paradigm in software system design, increasingly demonstrating strong reasoning and collaboration capabilities. As these systems become more complex and autonomous, effective failure management is essential to ensure reliability and availability. However, existing approaches often rely on per-trace reasoning, which leads to low efficiency, and neglect historical failure patterns, limiting diagnostic accuracy. In this paper, we conduct a preliminary empirical study to demonstrate the necessity, potential, and challenges of leveraging historical failure patterns to enhance failure management in MASs. Building on this insight, we propose EAGER, an efficient failure management framework for multi-agent systems based on reasoning trace representation. EAGER employs unsupervised reasoning-scoped contrastive learning to encode both intra-agent reasoning and inter-agent coordination, enabling real-time step-wise failure detection, diagnosis, and reflexive mitigation guided by historical failure knowledge. Preliminary evaluations on three open-source MASs demonstrate the effectiveness of EAGER and highlight promising directions for future research in reliable multi-agent system operations.
