Agentic Automation of BT-RADS Scoring: End-to-End Multi-Agent System for Standardized Brain Tumor Follow-up Assessment

cs.CL · cs.MA · Mar 23, 2026

Mohamed Sobhi Jabal (1), Jikai Zhang (2, 3), Dominic LaBella (4), Jessica L. Houk (1), Dylan Zhang (1, 7), Jeffrey D. Rudie (5, 8), Kirti Magudia (1), Maciej A. Mazurowski (1, 2, 6), Evan Calabrese (1, 3)

(1) Duke University Medical Center, Durham NC; (2) Duke University, Durham NC; (3) Duke Center for Artificial Intelligence in Radiology, Durham NC; (4) Duke University Medical Center, Durham NC; (5) University of California San Diego, San Diego CA; (6) Duke University School of Medicine, Durham NC; (7) Santa Clara Valley Medical Center, San Jose CA; (8) Scripps Clinic Medical Group, San Diego CA

AI Review

Plain-language introduction

This paper tackles the challenge of automating BT-RADS (Brain Tumor Reporting and Data System) classification for post-treatment glioma MRI surveillance. BT-RADS requires integrating complex information: volumetric tumor changes, medication effects (steroids, bevacizumab), and radiation timing. The authors propose an end-to-end pipeline combining CNN-based tumor segmentation with a multi-agent LLM system to extract clinical variables from unstructured notes and apply algorithmic scoring logic. This matters because manual BT-RADS scoring is error-prone, with prior studies showing substantial inter-reader variability and inconsistent application of clinical context.

Critical review

Verdict

Bottom line

The paper presents a technically sound contribution: an automated multi-agent system achieved higher BT-RADS classification accuracy (76.0%) than initial clinical assessments (57.5%), a statistically significant improvement (McNemar P<.001). The system is strongest in context-dependent categories that require clinical reasoning (BT-1a, BT-1b, BT-3a: 87.5%–100% sensitivity) and has a high positive predictive value for BT-4 (92.9%). However, its accuracy is comparable to, not clearly above, reported human interrater reliability (~80%), and the single-institution retrospective design limits confidence in generalizability. The work is appropriately positioned as decision support rather than a replacement for radiologist judgment.

“The system achieved 374/492 (76.0%; 95% CI, 72.1%–79.6%) classification accuracy compared with 283/492 (57.5%; 95% CI, 53.1%–61.8%) for initial clinical interpretations”

paper · Results, Section 3.2
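The quoted confidence intervals can be checked from the raw counts alone. The sketch below assumes a Wilson score interval with z = 1.96; the quoted text does not name the interval method, so this choice is an assumption. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts taken from the quoted result: 374/492 (system) and 283/492 (clinical).
system_ci = wilson_interval(374, 492)
clinical_ci = wilson_interval(283, 492)

print(f"system:   {374 / 492:.1%}  CI {system_ci[0]:.1%} to {system_ci[1]:.1%}")
print(f"clinical: {283 / 492:.1%}  CI {clinical_ci[0]:.1%} to {clinical_ci[1]:.1%}")
```

Under this assumption, the computed bounds fall within about 0.1 percentage point of the reported ones, which suggests, but does not confirm, that the authors used a Wilson-type interval.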

“Categories requiring clinical context integration—BT-1a, BT-1b, and BT-3a—achieved high sensitivity (87.5%–100%), while categories determined primarily by volumetric thresholds—BT-2, BT-3c, and BT-4—showed moderate sensitivity (69.2%–74.8%)”

paper · Results, Section 3.3

What holds up

The modular multi-agent architecture separates concerns cleanly: an extractor agent identifies clinical variables (steroid status, bevacizumab use, radiation date) from notes and links each to an evidence span for verification, while a scorer agent applies deterministic BT-RADS logic. Schema-constrained generation with Pydantic validation makes the extraction step transparent and checkable. The use of an open-weight LLM (GPT-oss, 20B parameters) with local inference addresses privacy and cost barriers to clinical deployment. Agreement with the expert reference is substantial (Cohen's κ = 0.708; quadratic weighted κ = 0.803), and the higher weighted value suggests that many disagreements fall between neighboring categories rather than distant ones. The root cause analysis is actionable: 44.1% of errors stem from values close to volumetric thresholds, a difficulty inherent in sorting continuous volumetric data into discrete categories.

“The extractor agent processes clinical notes to identify steroid status, bevacizumab status, and radiation therapy completion date. Schema-constrained generation with Pydantic validation ensures outputs conform to predefined categorical formats”

paper · Methods, Section 2.3
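To illustrate what schema-constrained output with evidence-span checking can look like, here is a minimal sketch. It uses only the standard library (`enum`, `dataclasses`) as a stand-in for Pydantic, and every category value, field name, and the `validate_extraction` helper are hypothetical; the paper's actual schema and prompts are not given in this review.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class SteroidStatus(str, Enum):  # hypothetical category values
    INCREASING = "increasing"
    STABLE_OR_DECREASING = "stable_or_decreasing"
    NONE = "none"
    UNKNOWN = "unknown"


class BevacizumabStatus(str, Enum):  # hypothetical category values
    ONGOING = "ongoing"
    NONE = "none"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class ExtractedVariables:
    steroid_status: SteroidStatus
    bevacizumab_status: BevacizumabStatus
    radiation_end_date: Optional[date]
    evidence: dict  # field name -> text span quoted from the note


def validate_extraction(raw: dict, note: str) -> ExtractedVariables:
    """Reject model output that is off-schema or cites text not in the note."""
    steroid = SteroidStatus(raw["steroid_status"])  # ValueError if off-schema
    bev = BevacizumabStatus(raw["bevacizumab_status"])
    rt_raw = raw.get("radiation_end_date")
    rt_date = date.fromisoformat(rt_raw) if rt_raw else None
    evidence = raw.get("evidence", {})
    for field_name, span in evidence.items():
        if span not in note:  # evidence must be a verbatim quote
            raise ValueError(f"evidence for {field_name!r} not found in note")
    return ExtractedVariables(steroid, bev, rt_date, evidence)


# Made-up note and model output, for illustration only.
note = "Dexamethasone increased to 4 mg BID. Radiation completed 2024-01-15."
raw_output = {
    "steroid_status": "increasing",
    "bevacizumab_status": "none",
    "radiation_end_date": "2024-01-15",
    "evidence": {"steroid_status": "Dexamethasone increased to 4 mg BID"},
}
extracted = validate_extraction(raw_output, note)
print(extracted.steroid_status.value, extracted.radiation_end_date)
```

The design point is that the scorer only ever sees values from a closed set, and a reviewer can trace each value back to a quoted span in the source note.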

“Cohen's κ was 0.708 (95% CI, 0.667–0.750). Quadratic weighted κ was 0.803 (95% CI, 0.724–0.882)”

paper · Results, Section 3.2
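For readers unfamiliar with the two statistics, the sketch below computes unweighted and quadratic weighted Cohen's κ from a confusion matrix. The 3×3 matrix is a made-up toy example with three ordered categories, not the paper's data, and `cohen_kappa` is an illustrative name.

```python
def cohen_kappa(matrix: list[list[int]], quadratic: bool = False) -> float:
    """Cohen's kappa; matrix[i][j] counts reference category i, system category j."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_tot = [sum(row) for row in matrix]
    col_tot = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    def weight(i: int, j: int) -> float:
        # Disagreement weight: 0/1 for plain kappa, squared distance for quadratic.
        if quadratic:
            return ((i - j) / (k - 1)) ** 2
        return 0.0 if i == j else 1.0

    observed = sum(weight(i, j) * matrix[i][j]
                   for i in range(k) for j in range(k)) / n
    expected = sum(weight(i, j) * row_tot[i] * col_tot[j]
                   for i in range(k) for j in range(k)) / n**2
    return 1 - observed / expected


# Hypothetical toy data: every disagreement is between adjacent categories.
toy = [
    [20, 5, 0],
    [4, 15, 3],
    [0, 6, 17],
]
kappa_plain = cohen_kappa(toy)
kappa_quadratic = cohen_kappa(toy, quadratic=True)
print(f"unweighted: {kappa_plain:.3f}, quadratic weighted: {kappa_quadratic:.3f}")
```

When disagreements cluster in adjacent categories, as in the toy matrix, the quadratic weighted value comes out higher than the unweighted one. That is the same direction as the paper's 0.803 versus 0.708, although the paper's category ordering and confusion matrix are not reproduced here.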

“Root cause analysis attributed 52/118 (44.1%) to volumetric threshold errors, where volumetric changes fell near the ±20% stability or 40% worsening cutoffs”

paper · Results, Section 3.6
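The boundary problem is easy to see in code. The sketch below bins a percent volume change using the ±20% and 40% cutoffs quoted above and flags values that sit close to a cutoff. It is not the paper's scoring logic: the trend labels, the inclusive handling of exactly ±20% and 40%, and the 2-point `margin` are all assumptions made for illustration.

```python
def volume_trend(pct_change: float,
                 stable_band: float = 20.0,
                 major_worsening: float = 40.0) -> str:
    """Bin a percent volume change; boundary values fall in the milder bin (assumed)."""
    if pct_change > major_worsening:
        return "worse_over_40"
    if pct_change > stable_band:
        return "worse"
    if pct_change < -stable_band:
        return "improved"
    return "stable"


def near_cutoff(pct_change: float,
                margin: float = 2.0,
                cutoffs: tuple[float, ...] = (-20.0, 20.0, 40.0)) -> bool:
    """True when the change lies within `margin` points of any cutoff."""
    return any(abs(pct_change - c) <= margin for c in cutoffs)


# Two measurements one percentage point apart land in different bins.
for change in (19.5, 20.5):
    print(change, volume_trend(change), "near cutoff:", near_cutoff(change))
```

A flag like `near_cutoff` would let a decision-support tool route borderline cases to a radiologist instead of committing to a category; whether the authors' system does anything similar is not stated in the review.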

Main concerns

Several limitations warrant scrutiny. First, the comparison with initial clinical assessments may inflate the perceived benefit: the routine reports came from radiologists with varying training, including non-neuroradiologists, whereas the reference standard was set by a single board-certified neuroradiologist. The system's 76% accuracy against ~80% reported human interrater reliability suggests that it approaches, but does not surpass, expert-level consistency. Second, the performance ceiling analysis (Table 4) builds optimistic scenarios (88.1%, 99.4%) that are not empirically validated; the projections assume perfect extraction and algorithm logic, which may not be achievable. Third, 17 examinations were excluded (9 for baseline issues, 8 for segmentation quality) without a sensitivity analysis of their characteristics, which raises selection bias concerns. Fourth, threshold-dependent categories, particularly BT-3b at 57.1% sensitivity, struggle when FLAIR and enhancement trends diverge, a known limitation of aggregate volumetric approaches that mask lesion-level heterogeneity. The authors' acknowledgment that the BT-3b "enhancement priority rule did not fully align with expert interpretation" points to algorithmic refinement that remains to be done.

“Initial clinical BT-RADS classifications were obtained from radiology reports generated during routine clinical workflows by radiologists with varying degrees of training and expertise (including both fellowship trained neuroradiologists and non-neuroradiologists)”

paper · Methods, Section 2.4

“Performance ceiling analysis: with perfect extraction, theoretical accuracy was 78.5%; with perfect algorithm logic, 82.8%; with both perfected, 88.1%. The theoretical maximum of 99.4% represents the scenario where only irreducible errors remain”

paper · Results, Section 3.6

“BT-3b had the lowest sensitivity at 12/21 (57.1%), reflecting difficulty in cases where enhancement and FLAIR volume trends diverge”

paper · Results, Section 3.3

“The BT-3b category presented particular difficulty when volumetric components exhibit opposite trends, where the enhancement priority rule did not fully align with expert interpretation”

paper · Discussion

Evidence and comparison

The evidence generally supports the primary claim that multi-agent automation outperforms routine clinical assessment, though the 18.5 percentage point gain should be read in light of the comparison baseline. Performance stratification by category type is well documented and clinically meaningful: high sensitivity for context-dependent categories supports the LLM extraction component, while moderate sensitivity for threshold-dependent categories is consistent with measurement uncertainty near categorical boundaries. The comparison with Lee et al.'s NLP-only approach (F1 0.68–0.72) is fair but understates the advantage, since this work integrates imaging volumetrics rather than relying solely on report text. The prior work by Zhang et al. on volumetric analysis alone is appropriately cited as showing "only moderate performance without subcategory differentiation, highlighting the necessity of clinical context integration." However, the results framing does not directly compare against the referenced interrater agreement study (Essien et al., Gwet index 0.83). Doing so would make clear that 76% accuracy, while better than routine practice, remains below the best reported human agreement.

“Lee and colleagues achieved F1 scores of 0.68–0.72 for NLP-based BT-RADS inference from report text, but required an existing radiologist report. Zhang and colleagues showed that volumetric analysis alone achieved only moderate performance without subcategory differentiation”

paper · Discussion

“The comparable performance of the automated system (76%) and human interrater reliability (~80%) suggests that automated BT-RADS classification is achievable with high accuracy”

paper · Discussion

Reproducibility

Reproducibility is limited by several factors. The source code is "available upon request for academic research" rather than deposited in a public, version-controlled repository. No training data, model weights, or inference pipelines are publicly accessible. Critical implementation details are missing: the exact prompt templates for the extractor agent, sampling parameters other than the stated temperature of 0.0, and the specific Pydantic schema constraints. The CNN segmentation model is described as "previously described and validated" without a citation to its architecture or training regimen in the main text. The volumetric thresholds (±20%, >40%) come from the BT-RADS standard, but the handling of edge cases (e.g., a change of exactly 20%) is not specified. Because the results depend on one specific open-weight model, GPT-oss (20B), performance may not carry over to other LLM implementations. The single-institution cohort from a high-volume brain tumor center also raises questions about generalizability to community practice with different MRI protocols and case distributions.

“The classification system source code is available upon request for academic research”

paper · Data Availability

“Schema-constrained generation with Pydantic validation ensures outputs conform to predefined categorical formats”

paper · Methods, Section 2.3

“Performance reflects a specific LLM configuration (GPT-oss; temperature 0.0; schema constraints) and may not fully generalize to other implementations”

paper · Discussion

Abstract

The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.
