GSEM: Graph-based Self-Evolving Memory for Experience Augmented Clinical Reasoning

cs.AI Xiao Han, Yuzheng Fan, Sendong Zhao, Haochun Wang, Bing Qin · Mar 23, 2026

Plain-language introduction

GSEM addresses the challenge of building structured experience memory for clinical LLM agents. Unlike flat memory banks storing isolated records, it organizes clinical decisions into a dual-layer graph capturing both internal decision structure (entity layer) and inter-experience relational dependencies (experience layer), supporting applicability-aware retrieval and online feedback-driven calibration of node quality and edge weights. Experiments on medical benchmarks report strong improvements over RAG and memory-augmented baselines, achieving 70.90% average accuracy with DeepSeek-V3.2.

Critical review

Verdict

Bottom line

The paper presents a well-motivated framework tackling genuine problems in experience reuse: boundary-aware applicability and relation-aware composition. The dual-layer graph structure is a credible architectural contribution, and the online calibration mechanism updating $Q_i$ and $W_{ij}$ provides a practical way to refine memory without catastrophic forgetting. However, the evaluation remains confined to public benchmarks with limited case counts, and the "self-evolving" aspect, while conceptually sound, is demonstrated over at most 250 test-time updates, leaving open questions about long-term stability and scalability in actual clinical workflows.

“After only 50 updates, self-evolving reduces the number of incorrect cases from 19 to 9 for diagnosis and from 8 to 4 for treatment”

paper · Section 5.3

“we evaluate GSEM on public benchmarks... they do not fully reflect real clinical workflows, interactive decision processes, or deployment constraints”

paper · Section 7

What holds up

The formalization of experiences as tuples $e_i = (c_i, s_i, z_i, Q_i)$ with polarity $z_i \in \{\oplus, \ominus\}$ distinguishing Indications from Contraindications offers a principled way to capture both success and failure patterns. The retrieval mechanism's hybrid design combining entity-based and embedding-based recall is empirically validated through ablations, and the multi-seed graph traversal successfully addresses the "collaboration failure" case shown in Figure 1 where conflicting experiences require joint-use validation.

“$e_i=(c_i, s_i, z_i, Q_i)$ where $z_i\in\{\oplus,\ominus\}$ is a polarity label indicating whether the experience is an Indication or a Contraindication”

paper · Section 3.2
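
The tuple quoted above maps naturally onto a small record type. The following is a minimal sketch rather than the authors' implementation: the review does not spell out what $c_i$ and $s_i$ contain, so treating them as a condition string and a strategy string is an assumption, and the names `Experience`, `Polarity`, and `split_by_polarity` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Polarity(Enum):
    """Polarity label z_i from the quoted tuple."""
    INDICATION = "+"        # z_i = ⊕
    CONTRAINDICATION = "-"  # z_i = ⊖


@dataclass
class Experience:
    condition: str      # c_i (assumed: when the experience applies)
    strategy: str       # s_i (assumed: what to do or what to avoid)
    polarity: Polarity  # z_i
    quality: float      # Q_i, the node quality that is calibrated online


def split_by_polarity(experiences):
    """Group experiences into (indications, contraindications)."""
    indications = [e for e in experiences if e.polarity is Polarity.INDICATION]
    contraindications = [
        e for e in experiences if e.polarity is Polarity.CONTRAINDICATION
    ]
    return indications, contraindications
```

Keeping the polarity as an explicit field, instead of storing successes and failures in separate banks, lets a retriever return both kinds of evidence for the same case and weigh them against each other.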

“score$(e_h,e_j)=\frac{1}{2}(W_{hj}+Q_j)$”

paper · Section 4.2
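
The quoted score is simply the mean of the edge weight $W_{hj}$ and the target node's quality $Q_j$. A minimal sketch of how such a score could rank the neighbors of a seed experience $e_h$ follows; the dictionary layout, the function names, and the example values are illustrative assumptions, not the paper's data structures.

```python
def edge_score(w_hj: float, q_j: float) -> float:
    """score(e_h, e_j) = (W_hj + Q_j) / 2, as quoted from Section 4.2."""
    return 0.5 * (w_hj + q_j)


def rank_neighbors(neighbors: dict) -> list:
    """Order neighbor ids by descending score.

    `neighbors` maps a neighbor id j to a pair (W_hj, Q_j).
    This layout is an assumption made for the example.
    """
    return sorted(
        neighbors,
        key=lambda j: edge_score(*neighbors[j]),
        reverse=True,
    )


# Hypothetical values: a strong edge to a mediocre node versus
# a weaker edge to a high-quality node.
example = {"e1": (0.75, 0.25), "e2": (0.5, 0.75)}
ranking = rank_neighbors(example)  # "e2" scores 0.625, "e1" scores 0.5
```

Because the two terms are weighted equally, a high-quality experience can outrank a more strongly connected but less reliable one, which is the behavior the example illustrates.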

Main concerns

The memory construction process is computationally demanding, requiring $N_{traj}=5$ trajectory samples per training instance plus held-out trials for reliability validation, yet no latency or cost analysis is provided to justify this overhead. More critically, the LLM-guided traversal policy with discrete actions introduces "non-determinism in path selection" that the authors acknowledge as a limitation, creating reliability risks for high-stakes clinical reasoning. The evaluation also does not cleanly separate the benefits of graph structure from those of dense retrieval: Table 3 shows that removing embedding-based recall lowers treatment accuracy by 10.81 points (83.78% vs 94.59%), which suggests that much of the gain comes from strong semantic similarity and that the graph edges may add comparatively little on top of it.

“our retriever relies on an LLM-guided traversal policy with discrete actions... it can be sensitive to prompting and introduce non-determinism in path selection”

paper · Limitations

“w/o embedding_recall: Treat. 83.78 vs GSEM 94.59”

paper · Table 3

Evidence and comparison

Main results in Table 1 support claims of benchmark superiority, with GSEM achieving 94.22% diagnosis and 94.59% treatment accuracy versus 93.62% and 92.57% for the strongest baseline A-Mem. However, comparisons to self-evolving baselines like ReMe and FLEX are somewhat misleading because GSEM's advantage stems primarily from the structured graph retrieval rather than the evolutionary feedback mechanism; the latter shows mixed gains (SE(150) underperforms SE(50) on treatment, 95.95% vs 97.30%). The paper attributes gains to "boundary awareness" and "relation-aware composition," but the ablation suggests strong performance derives mainly from embedding-based recall and the underlying generator capacity (Table 5 shows generator swaps impact accuracy by >20 points while retriever swaps have minimal effect).

“GSEM (ours): Diag 94.22, Treat 94.59 vs A-Mem: Diag 93.62, Treat 92.57”

paper · Table 1

“w/o embedding_recall: 83.78”

paper · Table 3

Reproducibility

The authors provide code availability and disclose hyperparameters including learning rates $\eta_Q=0.1$, $\eta_W=0.05$, edge threshold $\theta_{edge}=0.35$, and rank decay $\rho=0.8$. Complete prompt templates are reproduced in Appendix G. However, reproduction faces substantial barriers: the method relies on large models (DeepSeek-V3.2 671B and Qwen3.5-35B) for generation, retrieval, and judgment, incurring significant compute or API costs. The stochastic nature of trajectory sampling ($N_{traj}=5$) and LLM-guided traversal means results may vary across runs unless temperature=0 is enforced, a detail that is not confirmed. Furthermore, the evolution experiments require hundreds of sequential test-time updates (50-250), making full reproduction expensive.

“$\eta_Q=0.1$ and $\eta_W=0.05$ respectively, with rank decay factor $\rho=0.8$”

paper · Section 5.1
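
To make the role of the two learning rates concrete, here is a minimal sketch of one plausible calibration step. The update rule is an assumption: the review reports only the values $\eta_Q=0.1$ and $\eta_W=0.05$, not the formula, so the code uses a generic "move toward the feedback signal" step with clipping to $[0, 1]$. The rank decay $\rho$ is left out because the review does not say how it enters the update. All function names are hypothetical.

```python
ETA_Q = 0.1   # learning rate for node quality Q_i (value reported in the paper)
ETA_W = 0.05  # learning rate for edge weight W_ij (value reported in the paper)


def clip01(x: float) -> float:
    """Keep a score inside [0, 1]."""
    return max(0.0, min(1.0, x))


def update_quality(q: float, feedback: float, eta: float = ETA_Q) -> float:
    """ASSUMED rule: nudge Q_i toward a feedback signal in [0, 1]."""
    return clip01(q + eta * (feedback - q))


def update_edge(w: float, feedback: float, eta: float = ETA_W) -> float:
    """ASSUMED rule: nudge W_ij toward a feedback signal in [0, 1]."""
    return clip01(w + eta * (feedback - w))


# Hypothetical usage: a retrieved experience helped (feedback = 1.0),
# while pairing it with a neighbor did not (feedback = 0.0).
new_q = update_quality(0.5, 1.0)
new_w = update_edge(0.5, 0.0)
```

Under this assumed rule, a smaller $\eta_W$ means edge weights drift more slowly than node qualities, so a single bad joint use does not immediately sever a relation.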

“Code is available at https://github.com/xhan1022/gsem”

paper · Abstract

Abstract

Clinical decision-making agents can benefit from reusing prior decision experience. However, many memory-augmented methods store experiences as independent records without explicit relational structure, which may introduce noisy retrieval, unreliable reuse, and in some cases even hurt performance compared to direct LLM inference. We propose GSEM (Graph-based Self-Evolving Memory), a clinical memory framework that organizes clinical experiences into a dual-layer memory graph, capturing both the decision structure within each experience and the relational dependencies across experiences, and supporting applicability-aware retrieval and online feedback-driven calibration of node quality and edge weights. Across MedR-Bench and MedAgentsBench with two LLM backbones, GSEM achieves the highest average accuracy among all baselines, reaching 70.90% and 69.24% with DeepSeek-V3.2 and Qwen3.5-35B, respectively. Code is available at https://github.com/xhan1022/gsem.
