GSEM: Graph-based Self-Evolving Memory for Experience Augmented Clinical Reasoning
GSEM addresses the challenge of building structured experience memory for clinical LLM agents. Unlike flat memory banks storing isolated records, it organizes clinical decisions into a dual-layer graph capturing both internal decision structure (entity layer) and inter-experience relational dependencies (experience layer), supporting applicability-aware retrieval and online feedback-driven calibration of node quality and edge weights. Experiments on medical benchmarks report strong improvements over RAG and memory-augmented baselines, achieving 70.90% average accuracy with DeepSeek-V3.2.
The paper presents a well-motivated framework that tackles two genuine problems in experience reuse: boundary-aware applicability and relation-aware composition. The dual-layer graph structure is a credible architectural contribution, and the online calibration mechanism that updates $Q_i$ and $W_{ij}$ provides a practical way to refine memory without catastrophic forgetting. However, the evaluation remains confined to public benchmarks with limited case counts, and the "self-evolving" aspect, while conceptually sound, is demonstrated with at most 250 updates, leaving open questions about long-term stability and scalability in actual clinical workflows.
The formalization of experiences as tuples $e_i = (c_i, s_i, z_i, Q_i)$ with polarity $z_i \in \{\oplus, \ominus\}$ distinguishing Indications from Contraindications offers a principled way to capture both success and failure patterns. The retrieval mechanism's hybrid design combining entity-based and embedding-based recall is empirically validated through ablations, and the multi-seed graph traversal successfully addresses the "collaboration failure" case shown in Figure 1 where conflicting experiences require joint-use validation.
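The experience tuple and the hybrid recall step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the reading of $c_i$ as an applicability condition and $s_i$ as the reusable decision content is an assumption, and the `entities` and `embedding` fields, the `cosine` helper, and the union-based `hybrid_recall` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
import math


class Polarity(Enum):
    INDICATION = "plus"          # z_i = oplus in the review's notation
    CONTRAINDICATION = "minus"   # z_i = ominus in the review's notation


@dataclass
class Experience:
    condition: str       # c_i (assumed: when the experience applies)
    strategy: str        # s_i (assumed: the reusable decision content)
    polarity: Polarity   # z_i
    quality: float       # Q_i
    entities: frozenset  # hypothetical: entities used for entity-based recall
    embedding: tuple     # hypothetical: dense vector for embedding-based recall


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_recall(query_entities, query_embedding, memory, k=2):
    """Return sorted indices recalled by entity overlap or by top-k similarity."""
    entity_hits = {i for i, e in enumerate(memory) if e.entities & query_entities}
    ranked = sorted(
        range(len(memory)),
        key=lambda i: cosine(query_embedding, memory[i].embedding),
        reverse=True,
    )
    return sorted(entity_hits | set(ranked[:k]))


# Toy memory with made-up contents, purely for demonstration.
memory = [
    Experience("c0", "s0", Polarity.INDICATION, 0.5, frozenset({"sepsis"}), (1.0, 0.0)),
    Experience("c1", "s1", Polarity.CONTRAINDICATION, 0.5, frozenset({"renal failure"}), (0.0, 1.0)),
    Experience("c2", "s2", Polarity.INDICATION, 0.5, frozenset({"asthma"}), (0.9, 0.1)),
]
recalled = hybrid_recall({"renal failure"}, (1.0, 0.0), memory, k=1)
```

In this toy case the contraindication at index 1 is recalled only through entity overlap, since its embedding is orthogonal to the query, while index 0 is recalled only through embedding similarity. That is the kind of complementarity a hybrid design is meant to provide.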
The memory construction process is computationally demanding, requiring $N_{traj}=5$ trajectory samples per training instance plus held-out trials for reliability validation, yet no latency or cost analysis is provided to justify this overhead. More critically, the LLM-guided traversal policy with discrete actions introduces "non-determinism in path selection," which the authors acknowledge as a limitation and which creates reliability risks for high-stakes clinical reasoning. The evaluation also conflates the benefits of graph structure with those of dense retrieval: Table 3 shows that removing embedding-based recall lowers treatment accuracy from 94.59% to 83.78%, a 10.81-point drop, suggesting the graph edges provide marginal value over strong semantic similarity.
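To make the missing cost analysis concrete, the construction overhead can be estimated with a back-of-envelope count of LLM rollouts. This is a rough sketch under stated assumptions: only $N_{traj}=5$ comes from the paper as reported here, while the function name, the number of training instances, the number of candidate experiences, and the number of validation trials are hypothetical placeholders.

```python
def construction_rollouts(n_train, n_traj=5, n_candidates=0, n_validation_trials=0):
    """Rough count of LLM rollouts needed to build the memory.

    n_train: training instances (hypothetical placeholder value)
    n_traj: trajectory samples per instance (N_traj = 5 as reported)
    n_candidates, n_validation_trials: held-out reliability checks
        (both hypothetical; the review does not state these numbers)
    """
    sampling = n_train * n_traj
    validation = n_candidates * n_validation_trials
    return sampling + validation


# Placeholder sizes, not figures from the paper.
sampling_only = construction_rollouts(100)
with_validation = construction_rollouts(100, n_candidates=40, n_validation_trials=3)
```

Even with these small placeholder sizes, sampling alone needs five rollouts per training instance before any validation, which is why a reported latency or cost figure would help readers judge the overhead.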
Main results in Table 1 support the claims of benchmark superiority, with GSEM achieving 94.22% diagnosis and 94.59% treatment accuracy versus 92.57% and 87.16% for the strongest baseline, A-Mem. However, comparisons to self-evolving baselines like ReMe and FLEX are somewhat misleading, because GSEM's advantage stems primarily from the structured graph retrieval rather than from the evolutionary feedback mechanism. The latter shows mixed gains: on treatment, SE(150) underperforms SE(50) at 95.95% vs 97.30%. The paper attributes gains to "boundary awareness" and "relation-aware composition," but the ablation suggests strong performance derives mainly from sophisticated embedding-based recall and the underlying generator capacity (Table 5 shows that generator swaps change accuracy by more than 20 points, while retriever swaps have minimal effect).
The authors provide code availability and disclose hyperparameters including learning rates $\eta_Q=0.1$, $\eta_W=0.05$, edge threshold $\theta_{edge}=0.35$, and rank decay $\rho=0.8$. Complete prompt templates are reproduced in Appendix G. However, reproduction faces substantial barriers: the method relies on large proprietary models (DeepSeek-V3.2 671B and Qwen3.5-35B) for generation, retrieval, and judgment, incurring significant API costs. The stochastic nature of trajectory sampling ($N_{traj}=5$) and LLM-guided traversal means results may vary across runs unless deterministic decoding (temperature 0) is enforced, a detail the paper does not confirm. Furthermore, the evolution experiments require 50 to 250 sequential test-time updates, making full reproduction expensive.
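The role of the disclosed learning rates can be illustrated with a small sketch of an online calibration step. The update rule below is an assumption, not the paper's formula: it uses a simple exponential-moving-average form that moves $Q_i$ or $W_{ij}$ a fraction $\eta$ toward a feedback signal in $[0, 1]$. The function names and the `reward` argument are hypothetical.

```python
ETA_Q = 0.1   # node-quality learning rate, as disclosed
ETA_W = 0.05  # edge-weight learning rate, as disclosed


def clip01(x):
    """Keep a score inside [0, 1]."""
    return max(0.0, min(1.0, x))


def update_quality(q, reward, eta=ETA_Q):
    # Assumed rule: move Q_i a fraction eta toward the feedback signal.
    return clip01(q + eta * (reward - q))


def update_edge(w, reward, eta=ETA_W):
    # Assumed rule: the same form for W_ij, with the smaller learning rate.
    return clip01(w + eta * (reward - w))


# Simulate 250 consecutive positive feedback signals on one node.
q = 0.5
for _ in range(250):
    q = update_quality(q, reward=1.0)
```

Under this assumed rule, each update closes 10% of the remaining gap between $Q_i$ and the feedback value, so a node that receives consistently positive feedback saturates near 1 well before 250 updates. Whether the actual rule behaves this way is exactly the kind of long-horizon question the evolution experiments leave open.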
Clinical decision-making agents can benefit from reusing prior decision experience. However, many memory-augmented methods store experiences as independent records without explicit relational structure, which may introduce noisy retrieval, unreliable reuse, and in some cases even hurt performance compared to direct LLM inference. We propose GSEM (Graph-based Self-Evolving Memory), a clinical memory framework that organizes clinical experiences into a dual-layer memory graph, capturing both the decision structure within each experience and the relational dependencies across experiences, and supporting applicability-aware retrieval and online feedback-driven calibration of node quality and edge weights. Across MedR-Bench and MedAgentsBench with two LLM backbones, GSEM achieves the highest average accuracy among all baselines, reaching 70.90% and 69.24% with DeepSeek-V3.2 and Qwen3.5-35B, respectively. Code is available at https://github.com/xhan1022/gsem.