Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment
This paper tackles the challenge of automating BT-RADS (Brain Tumor Reporting and Data System) classification for post-treatment glioma MRI surveillance. BT-RADS requires integrating complex information: volumetric tumor changes, medication effects (steroids, bevacizumab), and radiation timing. The authors propose an end-to-end pipeline combining CNN-based tumor segmentation with a multi-agent LLM system to extract clinical variables from unstructured notes and apply algorithmic scoring logic. This matters because manual BT-RADS scoring is error-prone, with prior studies showing substantial inter-reader variability and inconsistent application of clinical context.
The paper presents a technically sound contribution demonstrating that an automated multi-agent system can achieve higher BT-RADS classification accuracy (76.0%) than initial clinical assessments (57.5%), with statistically significant improvement (McNemar P<.001). The system shows particular strength in context-dependent categories requiring clinical reasoning (BT-1a, BT-1b, BT-3a: 87.5%–100% sensitivity) and high positive predictive value for BT-4 (92.9%). However, performance remains comparable to reported human interrater reliability (~80%) rather than clearly exceeding it, and the single-institution retrospective design limits confidence in generalizability. The work is appropriately positioned as decision support rather than replacement for radiologist judgment.
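To make the McNemar comparison concrete, here is a minimal sketch of an exact (binomial) McNemar test. The only quantities taken from the paper are the two correct counts (374 and 283 of 492), which fix the difference between the discordant cells at 91. The split of discordant pairs used below (`b_hypo`, `c_hypo`) is hypothetical and chosen only to satisfy that constraint, since the actual 2×2 table is not given in this review. Whether the authors used the exact or the chi-square form of the test is likewise an assumption.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant cell counts.

    b: cases the system scored correctly and the clinical report did not
    c: cases the clinical report scored correctly and the system did not
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under H0 each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Counts reported in the paper.
system_correct, clinical_correct = 374, 283
net = system_correct - clinical_correct  # equals b - c

# HYPOTHETICAL discordant split, chosen only so that b - c == net.
b_hypo, c_hypo = 130, 39
p = mcnemar_exact(b_hypo, c_hypo)
print(f"b - c = {net}, exact McNemar p = {p:.3g}")
```

Because the test depends only on the discordant pairs, reporting `b` and `c` alongside the P value would let readers verify the result directly.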
The modular multi-agent architecture separates concerns cleanly: an extractor agent handles NLP-based clinical variable identification (steroid status, bevacizumab use, radiation date) with evidence-span linking for verification, while a scorer agent applies deterministic BT-RADS logic. Schema-constrained generation and Pydantic validation make the extraction step transparent and auditable. The use of an open-weight LLM (GPT-oss, 20B parameters) with local inference addresses privacy and cost barriers for clinical deployment. Agreement with the expert reference is substantial (Cohen's κ = 0.708; quadratic weighted κ = 0.803), and the higher weighted value suggests that disagreements tend to fall in neighboring categories, so the system largely respects the ordinal structure of BT-RADS. The root cause analysis provides actionable insights: 44.1% of errors stem from threshold boundary ambiguity, a challenge that is largely irreducible when continuous volumetric data are binned into categories.
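The evidence-span idea can be sketched as follows. This is an illustrative stand-in, not the authors' schema: it uses standard-library dataclasses in place of Pydantic so that it runs without dependencies, and every name in it (`ExtractedField`, `ClinicalVariables`, `STEROID_STATES`, `validate`) is hypothetical. The point is only that each extracted value carries a verbatim quote that can be checked mechanically against the source note before the deterministic scorer sees it.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Hypothetical controlled vocabulary; the paper's actual schema is not published.
STEROID_STATES = {"none", "stable", "increasing", "decreasing", "unknown"}


@dataclass(frozen=True)
class ExtractedField:
    value: str     # normalized value produced by the extractor
    evidence: str  # verbatim quote from the note supporting the value


@dataclass(frozen=True)
class ClinicalVariables:
    steroid_status: ExtractedField
    bevacizumab_status: ExtractedField
    radiation_date: ExtractedField  # ISO 8601 date string


def validate(variables: ClinicalVariables, note: str) -> List[str]:
    """Return a list of problems; an empty list means the extraction passes."""
    problems = []
    if variables.steroid_status.value not in STEROID_STATES:
        problems.append("steroid_status: value outside allowed vocabulary")
    try:
        date.fromisoformat(variables.radiation_date.value)
    except ValueError:
        problems.append("radiation_date: not an ISO date")
    # Evidence-span check: every quote must appear verbatim in the note.
    for name in ("steroid_status", "bevacizumab_status", "radiation_date"):
        if getattr(variables, name).evidence not in note:
            problems.append(f"{name}: evidence span not found in note")
    return problems


note = ("Patient remains on dexamethasone 4 mg daily. "
        "Bevacizumab started last month. Completed radiation on 2022-11-15.")
good = ClinicalVariables(
    ExtractedField("stable", "remains on dexamethasone 4 mg daily"),
    ExtractedField("on", "Bevacizumab started last month"),
    ExtractedField("2022-11-15", "Completed radiation on 2022-11-15"),
)
print(validate(good, note))
```

Rejecting an extraction whose quote cannot be found in the note is a cheap guard against hallucinated variables, which is presumably the motivation for evidence-span linking in the paper.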
Several limitations warrant careful scrutiny. First, the comparison to initial clinical assessments may inflate perceived benefit: routine radiology reports often come from non-neuroradiologists with varying training, whereas the reference standard was a single fellowship-trained neuroradiologist. The system's 76% accuracy, set against roughly 80% human interrater reliability, suggests it is comparable to, but does not surpass, expert-level consistency. Second, the theoretical performance ceiling analysis (Table 4) constructs optimistic scenarios (88.1%, 99.4%) that are not empirically validated; these projections assume perfect extraction and perfect algorithmic logic, neither of which may be achievable. Third, 17 cases were excluded (9 for baseline issues, 8 for segmentation quality) without a sensitivity analysis of their characteristics, which raises selection bias concerns. Fourth, threshold-dependent categories (particularly BT-3b, at 57.1% sensitivity) struggle when FLAIR and enhancement trends diverge, a known limitation of aggregate volumetric approaches that mask lesion-level heterogeneity. The acknowledgment that the BT-3b "enhancement priority rule did not fully align with expert interpretation" signals that the algorithm still needs refinement.
The evidence generally supports the primary claim that multi-agent automation outperforms routine clinical assessment, though the effect size (18.5 percentage points) should be interpreted in the context of the comparison baseline. Performance stratification by category type is well documented and clinically meaningful: high sensitivity for context-dependent categories validates the LLM extraction component, while moderate sensitivity for threshold-dependent categories reflects inherent measurement uncertainty near categorical boundaries. The comparison to Lee et al.'s NLP-only approach (F1 0.68–0.72) is fair but understates the advantage: this work integrates actual imaging volumetrics rather than relying solely on report text. The prior work by Zhang et al. on volumetric analysis alone is appropriately cited as showing "only moderate performance without subcategory differentiation, highlighting the necessity of clinical context integration." However, the paper does not directly compare against the referenced interrater agreement study (Essien et al., Gwet index 0.83) in its results framing. Doing so would make clear that 76% accuracy, while improved over routine practice, likely remains below optimal human agreement, with the caveat that raw accuracy and a chance-corrected agreement index are not measured on the same scale.
Reproducibility is limited by several factors. The source code is "available upon request for academic research" rather than deposited in a public repository with version control. No training data, model weights, or inference pipelines are publicly accessible. Critical implementation details are insufficiently specified: the exact prompt templates for the extractor agent, sampling parameters other than "temperature 0.0," and the specific Pydantic schema constraints are not provided. The CNN segmentation model is described as "previously described and validated" without citation to architecture details or training regimen in the main text. The volumetric decision thresholds (±20%, >40%) are derived from BT-RADS standards, but the handling of edge cases (e.g., a change of exactly 20%) is not specified. Because results depend on GPT-oss (20B), a specific open-weight model, performance may not generalize to other LLM implementations. The single-institution cohort from a high-volume brain tumor center raises questions about generalizability to community practice with different MRI protocols and case distributions.
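The edge-case concern can be illustrated with a small sketch. The ±20% and >40% cut points come from the review above; everything else is an assumption made for illustration, including the function names, the `boundary_is_stable` flag, and the bin labels, which are generic descriptors and not BT-RADS categories. The sketch shows that a change of exactly 20% lands in a different bin depending on an unstated inclusive-versus-exclusive convention.

```python
def percent_change(prior_ml: float, current_ml: float) -> float:
    """Relative volume change in percent, measured against the prior study."""
    if prior_ml <= 0:
        raise ValueError("prior volume must be positive")
    return 100.0 * (current_ml - prior_ml) / prior_ml


def bin_change(pct: float, stable_band: float = 20.0, marked: float = 40.0,
               boundary_is_stable: bool = True) -> str:
    """Bin a percent change; labels are hypothetical, not BT-RADS scores.

    boundary_is_stable controls whether a change of exactly +/- stable_band
    counts as stable (inclusive band) or as a real change (exclusive band).
    """
    if pct > marked:  # the source states this cut point as strictly ">40%"
        return "marked_increase"
    if boundary_is_stable:
        inside = -stable_band <= pct <= stable_band
    else:
        inside = -stable_band < pct < stable_band
    if inside:
        return "stable"
    return "increase" if pct > 0 else "decrease"


pct = percent_change(10.0, 12.0)  # 10 mL -> 12 mL
print(pct, bin_change(pct, boundary_is_stable=True),
      bin_change(pct, boundary_is_stable=False))
```

Stating which convention the scorer agent uses, and how measurement noise near the cut points is handled, would remove one source of ambiguity when others try to reproduce the results.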
The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.
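As a quick check on the reported intervals, the sketch below computes Wilson score intervals for the two accuracy figures. The choice of the Wilson method is an assumption, since the interval method is not stated in the text above, but it appears to reproduce the reported bounds to one decimal place.

```python
from math import sqrt


def wilson_ci(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (default 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Counts taken from the abstract.
sys_lo, sys_hi = wilson_ci(374, 492)  # automated system
cli_lo, cli_hi = wilson_ci(283, 492)  # initial clinical assessments
print(f"system:   {sys_lo:.1%} to {sys_hi:.1%}")
print(f"clinical: {cli_lo:.1%} to {cli_hi:.1%}")
```

The two intervals do not overlap, which is consistent with the reported paired-test result, although the paired McNemar test remains the appropriate comparison for the same 492 examinations.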