Graph of States: Solving Abductive Tasks with Large Language Models

cs.AI Yu Luo, Rongchen Gao, Lu Teng, Xidao Wen, Jiamin Jiang, Qingliang Zhang, Yongqian Sun, Shenglin Zhang, Jiasong Feng, Tong Liu, Wenjie Zhang, Dan Pei · Mar 22, 2026
Plain-language introduction

Abductive reasoning, the inference of the most probable hypothesis from incomplete observations, remains a critical gap for LLMs despite advances in deductive and inductive tasks. This paper introduces Graph of States (GoS), a neuro-symbolic framework that structures multi-agent collaboration through a causal graph (encoding belief states) and a state machine (governing navigation). By grounding reasoning in explicit symbolic constraints rather than unstructured context, GoS aims to eliminate four failure modes (Evidence Fabrication, Context Drift, Failed Backtracking, and Early Stopping) that plague Chain-of-Thought and Tree-of-Thought when they are adapted to dynamic, non-monotonic abductive tasks such as medical diagnosis and distributed-systems failure analysis.
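
To make the dual-layer idea concrete, here is a minimal Python sketch of a causal graph paired with a state machine. It is an illustration under stated assumptions, not the paper's implementation: the phase names, the `ALLOWED` transition table, and the `CausalGraph` and `Reasoner` classes are all hypothetical, and the clinical strings in the usage note are invented examples.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Phase(Enum):
    # Hypothetical phases; the paper's actual state names may differ.
    OBSERVE = auto()
    HYPOTHESIZE = auto()
    VERIFY = auto()
    BACKTRACK = auto()
    CONCLUDE = auto()


# Assumed transition table: which phase changes the state machine permits.
ALLOWED = {
    Phase.OBSERVE: {Phase.HYPOTHESIZE},
    Phase.HYPOTHESIZE: {Phase.VERIFY},
    Phase.VERIFY: {Phase.CONCLUDE, Phase.BACKTRACK, Phase.OBSERVE},
    Phase.BACKTRACK: {Phase.HYPOTHESIZE, Phase.OBSERVE},
    Phase.CONCLUDE: set(),
}


@dataclass
class CausalGraph:
    """Belief state: hypotheses linked to the observations that support them."""
    observations: set[str] = field(default_factory=set)
    support: dict[str, set[str]] = field(default_factory=dict)

    def add_observation(self, obs: str) -> None:
        self.observations.add(obs)

    def link(self, hypothesis: str, obs: str) -> None:
        # Refuse edges to evidence that was never recorded, so a hypothesis
        # cannot be backed by a fabricated observation.
        if obs not in self.observations:
            raise ValueError(f"unknown observation: {obs!r}")
        self.support.setdefault(hypothesis, set()).add(obs)


@dataclass
class Reasoner:
    graph: CausalGraph = field(default_factory=CausalGraph)
    phase: Phase = Phase.OBSERVE

    def transition(self, target: Phase) -> None:
        # The state machine rejects any move not listed in ALLOWED.
        if target not in ALLOWED[self.phase]:
            raise ValueError(f"illegal transition {self.phase.name} -> {target.name}")
        self.phase = target
```

In this sketch, calling `link("influenza", "rash")` before `add_observation("rash")` raises an error, and a jump from `HYPOTHESIZE` straight to `CONCLUDE` is rejected. That is one plausible way explicit symbolic constraints could block fabricated evidence and premature stopping.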

Critical review
Verdict
Bottom line

The paper presents a compelling architectural contribution with strong empirical results, though its claim of being "general-purpose" is constrained by evaluation on only two domains. The dual-layer design effectively addresses the identified failure modes, achieving 39.86% exact match (human evaluation) versus 26.09% for the best baseline in medical diagnosis, and 70.67% versus 28.00% in failure diagnosis. However, the framework still relies heavily on domain-specific agent roles and tool definitions, suggesting that transfer to new domains requires substantial engineering rather than being truly plug-and-play.

“GoS... Match... 39.86... Multi/FoT... 26.09”
paper · Table 1
“GoS... Match... 70.67... Multi/FoT... 28.00”
paper · Table 3
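
The size of these margins is easier to judge as absolute and relative gains. The short Python sketch below only redoes arithmetic on the four match rates quoted above; the `gain` helper is a hypothetical name introduced for illustration.

```python
def gain(gos: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = gos - baseline
    relative = absolute / baseline * 100.0
    return round(absolute, 2), round(relative, 1)


# Match rates quoted from Table 1 (medical) and Table 3 (failure diagnosis).
medical = gain(39.86, 26.09)   # about 13.77 points, about 52.8% relative
failure = gain(70.67, 28.00)   # about 42.67 points, about 152.4% relative
```

On these figures the failure-diagnosis improvement is roughly three times the medical one in absolute terms, so the two domains should not be read as equally strong evidence.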
What holds up

The structural critique of deductive frameworks applied to abduction is well-articulated, and the quantitative error analysis rigorously validates GoS's architectural advantages. The framework eliminates Evidence Fabrication entirely (0% vs 22.22% in pooled baselines) while dramatically reducing Context Drift (9.70% vs 41.32%) and Early Stopping (18.75% vs 63.89%). The ablation study confirms that both the causal graph and the state machine are essential components: removing either one drops the match rate in medical diagnosis from 31.88% to 12.32%, which indicates that explicit belief state maintenance is not merely decorative but functionally critical.

“Evidence Fabrication... Baselines (Pooled)... 22.22... GoS... 0”
paper · Table 4
“w/o causal graph... Match... 12.32... w/o state machine... Match... 12.32”
paper · Table 2
Main concerns

The scope is narrow for a "general-purpose" framework, covering only medical diagnosis (150 cases) and distributed system failures (150 incidents), with the latter dataset proprietary and unreleasable. The exclusion of 12/162 cases from DiagnosisArena for containing "factual errors or logical flaws that make the ground truth unreachable" raises questions about dataset curation, though the authors provide detailed case-by-case justifications. The claim of being the "first general-purpose multi-agent reasoning framework tailored for abductive reasoning" is strong given that prior work like MDAgents also uses multi-agent role-based collaboration for medical diagnosis; the distinction hinges on GoS's domain-agnostic symbolic layer, but this theoretical separation is not validated across diverse domains beyond the two tested. Additionally, Wrong Action Selection remains the dominant failure mode (55.90%), indicating that the framework's effectiveness is still bounded by the underlying LLM's domain knowledge.

“Wrong Action Selection... GoS... 55.90”
paper · Appendix D
“excluding 12 cases for containing factual errors or logical flaws”
paper · Appendix C.1
Evidence and comparison

The evidence robustly supports the core claim that symbolic constraints improve abductive reasoning: GoS outperforms eight baselines spanning single/multi-agent and CoT/ToT/GoT/FoT topologies by substantial margins while being cost-efficient ($0.10/case vs $0.94 for Multi/FoT in failure diagnosis). The comparison is fair in that all methods use the same GPT-5.1-2025-11-13 backbone and ReAct-based tool invocation. However, the paper does not compare against neuro-symbolic approaches outside the LLM-reasoning paradigm (e.g., traditional Bayesian networks or abductive logic programming enhanced with LLMs), focusing instead on pure LLM reasoning topologies. The sensitivity analysis showing trade-offs between precision and conservatism via the dual thresholds $\delta$ and $\eta$ provides useful operational guidance.

“GoS... $/case... 0.10... Multi/FoT... $/case... 0.94”
paper · Table 3
“dual-thresholds ($\eta$,$\delta$)... trade-off between precision and conservatism”
paper · Figure 5 caption
Reproducibility

The paper provides concrete hyperparameters: maximum 3 neuro-symbolic interaction iterations, 3 retrieval actions per expert agent (5 for single-agent baselines), and dual-thresholds $\delta$ (confidence gap) and $\eta$ (minimum support evidence) for state transitions. The authors commit to open-source code and prompts at an anonymous repository. However, the distributed systems dataset is confidential and cannot be released due to "company confidentiality, privacy, and compliance requirements," limiting full reproduction of the failure diagnosis experiments. The medical dataset (DiagnosisArena) is public but modified by excluding 12 cases; precise documentation of the tool definitions, agent role prompts, and causal graph update algorithms would be required for independent reproduction.

“maximum number of neuro-symbolic interaction iterations in GoS to 3”
paper · Section 4
“Due to company confidentiality... we cannot publicly release the raw data”
paper · Appendix C.2
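
For readers attempting a reproduction, the hyperparameters listed above could be collected in a configuration object like the Python sketch below. The iteration and retrieval limits come from the review text. Everything else is assumed: the numeric values for `delta` and `eta`, the `GoSConfig` and `may_conclude` names, and the rule that a conclusion requires both thresholds to pass. The paper's actual transition logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoSConfig:
    max_iterations: int = 3        # reported: neuro-symbolic interaction iterations
    retrievals_per_agent: int = 3  # reported: retrieval actions per expert agent
    delta: float = 0.2             # ASSUMED value: required confidence gap
    eta: int = 2                   # ASSUMED value: minimum supporting evidence items


def may_conclude(confidence: dict[str, float],
                 support: dict[str, int],
                 cfg: GoSConfig) -> bool:
    """Assumed gate: allow a conclusion only if the top hypothesis leads the
    runner-up by at least delta and has at least eta supporting items."""
    if not confidence:
        return False
    ranked = sorted(confidence.items(), key=lambda kv: kv[1], reverse=True)
    top, top_conf = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return (top_conf - runner_up) >= cfg.delta and support.get(top, 0) >= cfg.eta
```

Under this reading, raising `delta` or `eta` makes the gate more conservative, which matches the precision-versus-conservatism trade-off the review attributes to the dual thresholds.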
Abstract

Logical reasoning encompasses deduction, induction, and abduction. However, while Large Language Models (LLMs) have effectively mastered the former two, abductive reasoning remains significantly underexplored. Existing frameworks, predominantly designed for static deductive tasks, fail to generalize to abductive reasoning due to unstructured state representation and lack of explicit state control. Consequently, they are inevitably prone to Evidence Fabrication, Context Drift, Failed Backtracking, and Early Stopping. To bridge this gap, we introduce Graph of States (GoS), a general-purpose neuro-symbolic framework tailored for abductive tasks. GoS grounds multi-agent collaboration in structured belief states, utilizing a causal graph to explicitly encode logical dependencies and a state machine to govern the valid transitions of the reasoning process. By dynamically aligning the reasoning focus with these symbolic constraints, our approach transforms aimless, unconstrained exploration into a convergent, directed search. Extensive evaluations on two real-world datasets demonstrate that GoS significantly outperforms all baselines, providing a robust solution for complex abductive tasks. Code repo and all prompts: https://anonymous.4open.science/r/Graph-of-States-5B4E.
