MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management

cs.AI Jack W O'Sullivan, Mohammad Asadi, Lennart Elbe, Akshay Chaudhari, Tahoura Nedaee, Francois Haddad, Michael Salerno, Li Fei-Fei, Ehsan Adeli, Rima Arnaout, Euan A Ashley · Mar 23, 2026

Why it matters

Trained on 13.5 million clinical images and 1.6 million expert-curated Q&A pairs, MARCUS aims to bridge the gap between single-task diagnostic AI and interactive clinical reasoning.

AI Review

Plain-language introduction

MARCUS tackles the bottleneck of human interpretation in cardiovascular diagnosis by creating an agentic, multimodal vision-language model that jointly reasons over raw ECG signals, echocardiogram videos, and cardiac MRI. The core innovation is a hierarchical architecture where modality-specific expert encoders feed into an orchestrating agent that synthesizes findings while resisting 'mirage reasoning'—the tendency of VLMs to confabulate explanations without actually processing the image. Trained on 13.5 million clinical images and 1.6 million expert-curated Q&A pairs, MARCUS aims to bridge the gap between single-task diagnostic AI and interactive clinical reasoning.

Critical review

Verdict

Bottom line

MARCUS delivers strong empirical results: a 34–45 percentage-point accuracy gain over GPT-5 Thinking and Gemini 2.5 Pro on single modalities, and a gap of roughly 42–48 points on multimodal tasks (70% vs. 22–28%). With the full agentic pipeline engaged, the orchestrator's counterfactual probing detects mirage reasoning and brings the mirage rate to 0%, compared with 33–39% for the standalone expert models. However, single-center training (Stanford), a retrospective design, and markedly lower internal echocardiography accuracy (67%) than ECG and CMR accuracy (87–88%) temper claims of clinical readiness.

“When presented with image-absent queries, the ECG, echocardiography, and CMR expert models generated mirages in 33.0%, 38.5%, and 36.4% of cases on average, respectively... when MARCUS was used to process multimodal data and thus had the full agentic pipeline engaged, the orchestrator's counterfactual probes successfully identified all occurrences of mirages... MARCUS achieved a mirage rate of 0% across all modalities”

MARCUS paper · Section 1.6

“MARCUS accuracy: ECG 87–91%, Echo 67–86%, CMR 85–88%, Multimodal 70.0%; GPT-5: 22.5%, Gemini 2.5 Pro: 27.5% on multimodal”

MARCUS paper · Figure 2A
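The multimodal figures quoted above can be turned into absolute and relative gaps with a few lines of arithmetic. The sketch below uses only the three accuracies from the quote; the variable names are illustrative.

```python
# Multimodal accuracies (percent) as quoted from Figure 2A.
marcus_acc = 70.0
frontier_acc = {"GPT-5": 22.5, "Gemini 2.5 Pro": 27.5}

# Absolute gap in percentage points and relative ratio for each frontier model.
gaps = {name: marcus_acc - acc for name, acc in frontier_acc.items()}
ratios = {name: marcus_acc / acc for name, acc in frontier_acc.items()}

for name in frontier_acc:
    print(f"{name}: gap = {gaps[name]:.1f} points, ratio = {ratios[name]:.2f}x")
```

Keeping percentage points and ratios separate matters here: the same result reads as a gap of about 42 to 48 points or as roughly 2.5 to 3.1 times the frontier accuracy, depending on the comparator.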

What holds up

The external validation at UCSF—distinct equipment and patient demographics—demonstrates generalizability with stable ECG (91% vs. 87%) and CMR (85% vs. 88%) performance. The evaluation methodology is rigorous: B-Clean filtering excludes questions answerable without images, ensuring the benchmark measures genuine visual reasoning. The counterfactual probing protocol (three rephrased queries + image-absent probe with Jaccard similarity scoring) provides a reusable framework for detecting hallucination in medical VLMs. The 1.7–3.0× improvement in free-text response quality (Likert scores) suggests clinical utility beyond multiple-choice accuracy.

“Questions that any evaluated model answered correctly without image access were excluded from the primary performance comparisons, following the B-Clean framework... This filtered evaluation retained 60% of the original question set”

MARCUS paper · Section 3.8
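The exclusion rule in the quote above can be illustrated with a minimal sketch. It assumes a hypothetical record layout in which each question stores, per model, whether the image-absent answer was correct; the function name, field names, and toy data are invented for illustration and are not taken from the MARCUS codebase.

```python
from typing import List


def b_clean_filter(questions: List[dict]) -> List[dict]:
    """Keep only questions that no evaluated model answered correctly without the image.

    Assumption: each question carries `blind_correct`, a {model_name: bool}
    mapping of image-absent correctness (hypothetical schema).
    """
    return [q for q in questions if not any(q["blind_correct"].values())]


# Toy data, invented for illustration only.
questions = [
    {"id": "q1", "blind_correct": {"model_a": False, "model_b": False}},
    {"id": "q2", "blind_correct": {"model_a": True, "model_b": False}},
    {"id": "q3", "blind_correct": {"model_a": False, "model_b": False}},
    {"id": "q4", "blind_correct": {"model_a": False, "model_b": True}},
    {"id": "q5", "blind_correct": {"model_a": False, "model_b": False}},
]

kept = b_clean_filter(questions)
retained_fraction = len(kept) / len(questions)
print([q["id"] for q in kept], retained_fraction)
```

A question is dropped as soon as any single model gets it right without the image, which is the strict reading of "any evaluated model" in the quoted passage.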

“External validation: ECG 91% (Stanford 87%), CMR 85% (Stanford 88%); MARCUS significantly outperformed GPT-5 in both environments (P<0.001)”

MARCUS paper · Supplementary Table 3
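The counterfactual probing idea mentioned in this section (rephrased queries plus an image-absent probe, scored with Jaccard similarity) might look roughly like the sketch below. The tokenization, the averaging rule, and the 0.5 threshold are all assumptions made for illustration; the review does not state how the paper combines the scores.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set (an assumed, simple tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def flag_mirage(with_image: List[str], without_image: str, threshold: float = 0.5) -> bool:
    """Flag a likely mirage when the image-absent answer closely matches the
    answers given with the image (hypothetical rule: mean similarity >= threshold)."""
    scores = [jaccard(resp, without_image) for resp in with_image]
    return sum(scores) / len(scores) >= threshold


# Toy responses to three rephrased queries, invented for illustration.
with_image = [
    "sinus rhythm with left bundle branch block",
    "left bundle branch block in sinus rhythm",
    "sinus rhythm, left bundle branch block present",
]
suspicious = flag_mirage(with_image, "sinus rhythm with left bundle branch block")
honest = flag_mirage(with_image, "I cannot assess the rhythm without an ECG image")
print(suspicious, honest)
```

The intuition is that an answer which barely changes when the image is removed was probably not derived from the image in the first place.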

Main concerns

Although the model was validated externally, its training data derive from a single academic center, which risks selection bias. Internal echocardiography accuracy (67.4%) lags ECG and CMR by roughly 20 percentage points. The authors attribute this to the 'operator-dependence of ultrasound', but the gap suggests the visual encoders may not have learned acquisition-invariant features. The Likert evaluation relies partly on LLM-based scoring rather than on expert cardiologists alone, which introduces potential circularity when the systems being compared are themselves LLMs. The multimodal benchmark contains only 38–40 MCQs, a small sample on which to base a 70% accuracy claim. Several citations point to an unpublished 'companion paper' (Mirage) with placeholder arXiv IDs, so the mirage methodology cannot be independently verified.

“Our training data was derived from a single centre... echocardiographic accuracy exceeded large frontier models, it was lower than ECG and CMR. This may reflect the greater heterogeneity and operator-dependence of ultrasound imaging”

MARCUS paper · Section 2 (Discussion)

“Multimodal MCQ Questions: 40 items”

MARCUS paper · Supplementary Table 1
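To make the sample-size concern concrete, the sketch below computes a 95% Wilson score interval for a proportion. It assumes that 70% accuracy on 40 items corresponds to 28 correct answers; the paper's exact counts are not given in this review, so that figure is an assumption.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumption: 28 of 40 multimodal MCQs answered correctly (70%).
low, high = wilson_interval(28, 40)
print(f"95% CI: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 55% to 82%, which is wide enough that the point estimate of 70% should be read with caution.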

Evidence and comparison

The frontier models received the same prompts and images as MARCUS, so the head-to-head comparison is fair in that respect. However, beyond passing mentions, the paper omits comparison against established domain-specific baselines (e.g., EchoNext for ECG, EchoPrime for echocardiography) and focuses only on general-purpose VLMs. The claimed 34–45 percentage-point improvement is measured against GPT-5 and Gemini, which are not fine-tuned on medical data; comparison against domain-specific competitors would likely shrink this margin. The discrepancy between Stanford (67.4%) and UCSF (86.0%) echocardiography accuracy could indicate dataset shift rather than robustness, though the authors interpret it positively.

“MARCUS achieved an accuracy of 87% on the internal Stanford held-out test set and 91% on the external UCSF cohort... frontier model accuracy: 35–48%”

MARCUS paper · Section 1.1

“There was a difference in accuracy between the internal and external cohort for echocardiograms: 67.4% vs 86.0%, P<0.05”

MARCUS paper · Section 1.4

Reproducibility

The paper reports comprehensive hyperparameters for all three training stages (learning rates, batch sizes, GRPO group size 4, KL coefficient 0.01) and training duration estimates (4–72 hours per modality on H100s). Code and model weights are publicly released under an MIT license at https://github.com/AshleyLab/MARCUS, and the benchmark questions are provided. Raw clinical data remain protected under institutional data use agreements, which is appropriate. However, independent reproduction would require substantial compute (8× H100 80GB GPUs for stages 2–3), and the 13.5 million image training dataset is not publicly available, so full reproduction is out of reach for most researchers. The LLM judge used for Likert scoring is not specified, which makes the subjective quality metrics hard to reproduce.

“Stage 1: LR 2.0e-4, 4–8 H100s, 3 epochs; Stage 2 (SFT): LR 1.0e-5, 8 H100s, 5 epochs; Stage 3 (GRPO): LR 1.0e-6, group size 4, KL coeff 0.01, 8 H100s, 15 epochs”

MARCUS paper · Supplementary Table 3 (Hyperparameters)
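For readers who want the quoted schedule in machine-readable form, here is one way it could be written down. The numeric values come from the quote above; the class name, field names, and stage labels are hypothetical and do not reflect the configuration format used in the MARCUS repository.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class StageConfig:
    """One training stage; field names are illustrative, values are from the quote."""
    name: str
    learning_rate: float
    num_h100s: Tuple[int, int]  # (min, max) GPUs reported for the stage
    epochs: int
    grpo_group_size: Optional[int] = None  # only reported for the GRPO stage
    kl_coeff: Optional[float] = None       # only reported for the GRPO stage


STAGES = [
    StageConfig("stage1", 2.0e-4, (4, 8), 3),
    StageConfig("stage2_sft", 1.0e-5, (8, 8), 5),
    StageConfig("stage3_grpo", 1.0e-6, (8, 8), 15, grpo_group_size=4, kl_coeff=0.01),
]

total_epochs = sum(stage.epochs for stage in STAGES)
for stage in STAGES:
    print(stage)
print("total epochs across stages:", total_epochs)
```

Written this way, the schedule makes one pattern easy to see: the learning rate drops by roughly an order of magnitude at each stage while the epoch count rises.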

“All source code for MARCUS is publicly available at https://github.com/AshleyLab/MARCUS under an MIT license... Raw clinical imaging data from Stanford and UCSF are protected under institutional data use agreements and are not publicly available”

MARCUS paper · Data availability statement

Abstract

Cardiovascular disease remains the leading cause of global mortality, with progress hindered by human interpretation of complex cardiac tests. Current AI vision-language models are limited to single-modality inputs and are non-interactive. We present MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals), an agentic vision-language system for end-to-end interpretation of electrocardiograms (ECGs), echocardiograms, and cardiac magnetic resonance imaging (CMR) independently and as multimodal input. MARCUS employs a hierarchical agentic architecture comprising modality-specific vision-language expert models, each integrating domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5 million images (0.25M ECGs, 1.3M echocardiogram images, 12M CMR images) and our novel expert-curated dataset spanning 1.6 million questions, MARCUS achieves state-of-the-art performance surpassing frontier models (GPT-5 Thinking, Gemini 2.5 Pro Deep Think). Across internal (Stanford) and external (UCSF) test cohorts, MARCUS achieves accuracies of 87-91% for ECG, 67-86% for echocardiography, and 85-88% for CMR, outperforming frontier models by 34-45% (P<0.001). On multimodal cases, MARCUS achieved 70% accuracy, nearly triple that of frontier models (22-28%), with 1.7-3.0x higher free-text quality scores. Our agentic architecture also confers resistance to mirage reasoning, whereby vision-language models derive reasoning from unintended textual signals or hallucinated visual content. MARCUS demonstrates that domain-specific visual encoders with an agentic orchestrator enable multimodal cardiac interpretation. We release our models, code, and benchmark open-source.
