PPGL-Swarm: Integrated Multimodal Risk Stratification and Hereditary Syndrome Detection in Pheochromocytoma and Paraganglioma

cs.CV Zelin Liu, Xiangfu Yu, Jie Huang, Ge Wang, Yizhe Yuan, Zhenyu Yi, Jing Xie, Haotian Jiang, Lichi Zhang · Mar 23, 2026
Plain-language introduction

Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors with 15–25% metastatic risk and poor survival. Manual GAPP scoring for metastatic risk is labor-intensive and subjective, while critical genotype information (e.g., SDHB mutations conferring 35–75% metastatic risk) is often missed in clinical practice. This paper introduces PPGL-Swarm, an agentic diagnostic system that decomposes diagnosis into specialized WSI, gene, and table agents coordinated via reinforcement learning to automate GAPP scoring, predict hereditary mutations (SDHB/VHL/RET) from histology alone, and generate auditable multimodal reports grounded in a structured knowledge graph.

Critical review
Verdict
Bottom line

This is a technically solid contribution to rare-disease AI diagnostics. Decomposing GAPP scoring into micro-tasks handled by specialized agents, with quantitative regression for Ki-67 and cellularity in place of subjective ordinal grading, addresses real clinical pain points. The reinforcement learning framework for coordinating the swarms and the structured knowledge graph for preventing hallucinated genotype associations are sensible architectural choices supported by strong ablation evidence. However, clinical validation is limited to a single-center retrospective cohort, and the paper's claim about "user feedback" is not backed by any described evaluation methodology.

“GAPP scoring demands a high workload for clinician because it requires the manual evaluation of six independent components; key components such as cellularity and Ki-67 are often evaluated with subjective criteria”
paper · Abstract
“Users feedback indicates that they value its quantitative Ki-67/density analysis, integrated clinical knowledge, and genotype-based alerts”
paper · Conclusion
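
To illustrate what replacing subjective ordinal grading with regression could look like, the sketch below maps continuous Ki-67 and cellularity estimates onto 0 to 2 component points. The function names and the cut-off values are assumptions made for illustration and are not taken from the paper; the cut-offs in particular should be checked against the published GAPP criteria before any use.

```python
# Illustrative sketch only: names and cut-offs are assumptions, not the paper's code.

def ordinal_points(value, low_cutoff, high_cutoff):
    """Map a continuous measurement to 0, 1, or 2 points using two cut-offs."""
    if value < low_cutoff:
        return 0
    if value <= high_cutoff:
        return 1
    return 2

# Assumed cut-offs (verify against the GAPP publication before relying on them).
KI67_CUTOFFS = (1.0, 3.0)             # Ki-67 labeling index, percent
CELLULARITY_CUTOFFS = (150.0, 250.0)  # cells per unit area

def quantitative_components(ki67_percent, cells_per_unit):
    """Convert two regressed quantities into their ordinal component points."""
    return {
        "ki67": ordinal_points(ki67_percent, *KI67_CUTOFFS),
        "cellularity": ordinal_points(cells_per_unit, *CELLULARITY_CUTOFFS),
    }

# Hypothetical regression outputs for one slide.
components = quantitative_components(ki67_percent=2.4, cells_per_unit=310.0)
```

The appeal of this design is that the continuous estimate stays available in the report, so a reader can see how close a case sits to a cut-off instead of receiving only the binned grade.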
What holds up

The multi-agent architecture effectively addresses the complexity of multimodal PPGL diagnosis. The WSI swarm’s use of a frozen UNI-v2 encoder with task-specific heads for classification and regression is sound, and the test-time adaptation strategy (LAB color normalization plus AdaBN) demonstrates practical awareness of staining variability across institutions. The mutation prediction heads (SDHB, VHL, RET) enable risk stratification even when genetic testing is unavailable, addressing a genuine access-to-care issue. Ablation studies rigorously validate each component: removing the knowledge graph causes the largest drop in clinical actionability (3.0 → 2.3), while removing RL increases GAPP total MAE (1.2 → 1.5).

“The Gene swarm additionally incorporates three binary mutation prediction heads (SDHB, VHL, RET) built on the same frozen $f_\phi$ and $g_\psi$ from the WSI swarm”
paper · Section 2
“w/o Knowledge Graph ... 2.83 [Overall] ... w/o RL ... 2.85”
paper · Table 3
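
The AdaBN half of the test-time adaptation strategy can be sketched in a few lines. The idea is to keep the learned scale and shift but re-estimate the normalization statistics on unlabeled target-domain data. The example below is a single-channel toy in plain Python with simulated features; the function names and the simulated shift are assumptions, and the paper's actual implementation is not described at this level of detail.

```python
import math
import random
import statistics

def batchnorm(values, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply batch-norm style normalization with the given statistics."""
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

def domain_statistics(values):
    """Estimate the mean and (population) variance of one feature channel."""
    return statistics.fmean(values), statistics.pvariance(values)

random.seed(0)
# Simulated feature channel: the target site is shifted and rescaled
# relative to the source site (a stand-in for staining variability).
source = [random.gauss(0.0, 1.0) for _ in range(2000)]
target = [random.gauss(2.0, 3.0) for _ in range(2000)]

src_mean, src_var = domain_statistics(source)
tgt_mean, tgt_var = domain_statistics(target)

# Without adaptation, target features are normalized with source statistics.
unadapted = batchnorm(target, src_mean, src_var)
# AdaBN-style adaptation: swap in statistics re-estimated on target data.
adapted = batchnorm(target, tgt_mean, tgt_var)
```

After the swap, the target features are again centered with unit scale, which is the distribution the downstream heads were trained to expect. No labels from the new site are needed.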
Main concerns

The primary limitation is the single-center retrospective dataset (n=268 patients, 1,168 WSIs from Ruijin Hospital) with no external validation cohort, which raises serious concerns about generalizability to other populations and staining protocols. The claim of "state-of-the-art performance" rests on score ranges that overlap, with no statistical significance testing reported (e.g., Diagnostic Accuracy 3.6±0.3 versus 3.4±0.3 for TITAN). The comparison also includes general-purpose LLMs (GPT-4o, Claude) that are ill-suited to gigapixel pathology; this flatters the proposed method and distracts from the more meaningful comparison with pathology-specific baselines (SlideChat, TITAN). The conclusion cites unspecified "user feedback", yet no Methods or Results section describes a human-factors evaluation, so the claim is unsupported. Finally, the knowledge graph curation process is underspecified: whether it was built manually or automatically matters for maintenance and generalization.

“Our model achieves state-of-the-art performance in report quality, GAPP accuracy, and mutation prediction accuracy”
paper · Abstract
“Dataset. We collected 268 patients and 1,168 whole slide images (WSIs)”
paper · Section 3
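
The overlap concern can be made concrete with a rough check on the reported summary numbers. The sketch below assumes that the ± values are standard deviations over the five cross-validation folds (n=5 per method); the paper excerpt does not state what they represent, so both that reading and the sample size are assumptions.

```python
import math

def intervals_overlap(mean_a, half_a, mean_b, half_b):
    """Return True if [mean - half, mean + half] ranges intersect."""
    return (mean_a - half_a) <= (mean_b + half_b) and \
           (mean_b - half_b) <= (mean_a + half_a)

def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and degrees of freedom from summary statistics."""
    var_a = sd_a ** 2 / n_a
    var_b = sd_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, dof

# Diagnostic Accuracy: PPGL-Swarm 3.6±0.3 versus TITAN 3.4±0.3.
overlap = intervals_overlap(3.6, 0.3, 3.4, 0.3)
# Assumption: ± is a standard deviation over 5 folds for each method.
t_stat, dof = welch_t(3.6, 0.3, 5, 3.4, 0.3, 5)
```

Under these assumptions the statistic is about 1.05 on 8 degrees of freedom, well below the two-sided 5% critical value of roughly 2.31, so the 0.2-point gap would not be distinguishable from noise. Fold-level scores are not fully independent, so this is only a rough check and not a substitute for a paired test on per-case results.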
Evidence and comparison

The evidence supports the technical efficacy of the swarm architecture relative to the presented baselines, though the comparison mixes unfair matchups (general LLMs on thumbnail patches) with credible pathology competitors (SlideChat, TITAN). The prediction of genotype from histology ($c_m \in [0,1]$) achieves F1=67.8% versus 65.3% for TITAN, which could clinically benefit patients lacking genetic testing access, especially given SDHB’s prognostic importance (35–75% metastatic risk). The ablation studies provide compelling internal validation: replacing the multi-agent architecture with a single agent degrades GAPP MAE from 1.2 to 2.3 and gene mutation F1 from 67.8% to 60.2%. However, the lack of external test sets or reader studies comparing pathologists with versus without AI assistance limits the clinical evidence.

“SDHB mutations, which have been associated with reported metastatic rates of 35–75%”
paper · Abstract
“Single Agent ... 2.3 [GAPP Total MAE] ... 60.2 [Gene Mut. F1]”
paper · Table 4
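
A quick piece of arithmetic on the figures quoted above puts the ablation and baseline gaps on the same scale. The helper name is an assumption; the numbers are the ones reported in the review.

```python
def relative_change(baseline, ablated):
    """Fractional change of the ablated value relative to the baseline."""
    return (ablated - baseline) / baseline

# Single-agent ablation versus the full system.
mae_change = relative_change(1.2, 2.3)      # GAPP total MAE, about +0.917
f1_change = relative_change(67.8, 60.2)     # gene mutation F1, about -0.112
f1_drop_points = 67.8 - 60.2                # absolute drop in F1 points

# Margin over the strongest pathology-specific baseline.
titan_margin_points = 67.8 - 65.3
```

Read this way, removing the multi-agent design nearly doubles the GAPP error and costs about 7.6 F1 points, while the margin over TITAN is about 2.5 F1 points. The internal ablation effect is therefore considerably larger than the external improvement the headline claim rests on.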
Reproducibility

Reproducibility is constrained by the absence of code or data availability statements, a critical gap given the proprietary nature of the medical data and the complexity of the multi-agent pipeline. Implementation details are thorough (UNI-v2 features, TransMIL architecture, Adam optimizer with a learning rate of $10^{-4}$, $\gamma=0.95$, $\lambda=0.97$, $\lambda_1=0.1$, $\lambda_2=0.2$), and five-fold cross-validation reduces the variance of the reported estimates. However, the reliance on Qwen3.5-35B-A3B, a commercial model with potential access restrictions, limits independent replication. The knowledge graph construction methodology is described only at a high level ("nodes encode PPGL-relevant entities"), leaving uncertainty about the curation protocols needed to reproduce the grounding mechanism. No hyperparameter sensitivity analysis is reported.

“The central decision agent was initialised from Qwen3.5-35B-A3B and optimised using policy gradient with $\gamma=0.95$, $\lambda=0.97$, $\lambda_1=0.1$, and $\lambda_2=0.2$”
paper · Section 2
“structured knowledge graph whose nodes encode PPGL-relevant entities, including genetic variants, biochemical phenotypes, and syndromic classifications”
paper · Section 2
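
To show how the reported $\gamma=0.95$ and $\lambda=0.97$ would typically enter a policy-gradient update, here is a minimal sketch of generalized advantage estimation (GAE). Reading $\lambda$ as the GAE parameter is an assumption, since the quoted passage says only "policy gradient". The rewards and value estimates are hypothetical.

```python
def gae_advantages(rewards, values, gamma=0.95, lam=0.97):
    """Generalized advantage estimation for one episode.

    `values` must hold len(rewards) + 1 entries; the last entry is the
    bootstrap value for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical three-step episode: reward arrives only at the end,
# and the value estimates are all zero for simplicity.
rewards = [0.0, 0.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]
advantages = gae_advantages(rewards, values)
```

With both factors close to 1, credit for a final reward decays slowly across earlier steps, which suits a setting where the quality of a report is only known after several tool calls. A sensitivity analysis over these two values is the kind of evidence the review notes is missing.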
Abstract

Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors, of which 15-25% develop metastatic disease with 5-year survival rates reported as low as 34%. PPGL may indicate hereditary syndromes requiring stricter, syndrome-specific treatment and surveillance, but clinicians often fail to recognize these associations in routine care. Clinical practice uses GAPP score for PPGL grading, but several limitations remain for PPGL diagnosis: (1) GAPP scoring demands a high workload for clinician because it requires the manual evaluation of six independent components; (2) key components such as cellularity and Ki-67 are often evaluated with subjective criteria; (3) several clinically relevant metastatic risk factors are not captured by GAPP, such as SDHB mutations, which have been associated with reported metastatic rates of 35-75%. Agent-driven diagnostic systems appear promising, but most lack traceable reasoning for decision-making and do not incorporate domain-specific knowledge such as PPGL genotype information. To address these limitations, we present PPGL-Swarm, an agentic PPGL diagnostic system that generates a comprehensive report, including automated GAPP scoring (with quantified cellularity and Ki-67), genotype risk alerts, and multimodal report with integrated evidence. The system provides an auditable reasoning trail by decomposing diagnosis into micro-tasks, each assigned to a specialized agent. The gene and table agents use knowledge enhancement to better interpret genotype and laboratory findings, and during training we use reinforcement learning to refine tool selection and task assignment.
