MARCUS: An agentic, multimodal vision-language model for cardiac diagnosis and management
MARCUS tackles the bottleneck of human interpretation in cardiovascular diagnosis by creating an agentic, multimodal vision-language model that jointly reasons over raw ECG signals, echocardiogram videos, and cardiac MRI. The core innovation is a hierarchical architecture where modality-specific expert encoders feed into an orchestrating agent that synthesizes findings while resisting 'mirage reasoning'—the tendency of VLMs to confabulate explanations without actually processing the image. Trained on 13.5 million clinical images and 1.6 million expert-curated Q&A pairs, MARCUS aims to bridge the gap between single-task diagnostic AI and interactive clinical reasoning.
MARCUS delivers strong empirical results with a 34–45 percentage point accuracy gain over GPT-5 Thinking and Gemini 2.5 Pro across single modalities, widening to >40 points on multimodal tasks (70% vs. 22–28%). The agentic orchestrator successfully detects and mitigates mirage reasoning, achieving a 0% mirage rate through counterfactual probing compared to 33–38% for standalone expert models. However, the single-center training (Stanford), retrospective design, and markedly lower echocardiography accuracy (67% internal) compared to ECG/CMR (87–88%) temper claims of clinical readiness.
The external validation at UCSF—distinct equipment and patient demographics—demonstrates generalizability with stable ECG (91% vs. 87%) and CMR (85% vs. 88%) performance. The evaluation methodology is rigorous: B-Clean filtering excludes questions answerable without images, ensuring the benchmark measures genuine visual reasoning. The counterfactual probing protocol (three rephrased queries + image-absent probe with Jaccard similarity scoring) provides a reusable framework for detecting hallucination in medical VLMs. The 1.7–3.0× improvement in free-text response quality (Likert scores) suggests clinical utility beyond multiple-choice accuracy.
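The counterfactual probing idea above can be illustrated with a minimal Python sketch. It assumes word-level token sets and a flagging threshold of 0.5. Both are illustrative choices that the review does not specify, and the function names (`jaccard`, `mirage_flag`) are hypothetical rather than taken from the MARCUS code base. The intuition: if the answer produced without an image overlaps heavily with the answers produced with the image, the model was probably not relying on the image.

```python
import re


def tokens(text):
    """Lowercased word set; a stand-in for whatever tokenization the authors use."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty answers are treated as identical
    return len(ta & tb) / len(ta | tb)


def mirage_flag(with_image_answers, no_image_answer, threshold=0.5):
    """Flag a possible mirage when the image-absent answer is, on average,
    at least `threshold` similar to the image-present answers.

    `threshold` is an assumed value for illustration only.
    """
    if not with_image_answers:
        raise ValueError("need at least one image-present answer")
    scores = [jaccard(ans, no_image_answer) for ans in with_image_answers]
    return sum(scores) / len(scores) >= threshold


# Three rephrased queries answered with the image, plus one image-absent probe.
rephrased = [
    "severe aortic stenosis present",
    "aortic stenosis is severe",
    "there is severe aortic stenosis",
]
suspicious = mirage_flag(rephrased, "severe aortic stenosis present")
grounded = mirage_flag(rephrased, "I cannot answer without the image")
```

In this toy case the first probe is flagged because the model gives the same finding with no image, while the second is not because the model declines to answer.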
Training data derives from a single academic center, so selection bias remains a risk even with external validation. Echocardiography accuracy (67.4% internal) lags ECG and CMR by ~20 percentage points. The authors attribute this to the 'operator-dependence of ultrasound', but the gap suggests the visual encoders may not have learned acquisition-invariant features. The Likert evaluation relies partially on LLM-based scoring (not just expert cardiologists), introducing potential circularity when comparing against other LLMs. The multimodal benchmark relies on only 38–40 MCQs, a small sample for claims of 70% accuracy. Several citations reference an unpublished 'companion paper' (Mirage) with placeholder arXiv IDs, making independent verification of the mirage methodology impossible.
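To make the small-sample concern concrete, the following sketch computes a 95% Wilson score interval for a binomial proportion. It assumes 28 correct answers out of 40 questions, which is one way to arrive at 70% on the upper end of the 38–40 range. The exact counts are not given in the review, so treat these numbers as illustrative.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Assumed counts: 28 of 40 multimodal MCQs answered correctly (70%).
low, high = wilson_interval(28, 40)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 55% to 82%. That still sits above the 22–28% reported for the frontier models, but it leaves the point estimate itself loosely constrained.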
The frontier models used for comparison received identical prompts and images, satisfying fair comparison criteria. However, the paper omits comparison against established medical imaging baselines (e.g., EchoNext for ECG, EchoPrime for echocardiography) beyond passing mentions, focusing only on general-purpose VLMs. The claimed 34–45 percentage point improvement is over GPT-5 and Gemini, but these models are not fine-tuned on medical data; comparing against domain-specific competitors would likely shrink this margin. The discrepancy between Stanford (67.4%) and UCSF (86.0%) echo accuracy could indicate dataset shift rather than robustness, though the authors interpret it positively.
The paper provides comprehensive hyperparameters across three training stages (learning rates, batch sizes, GRPO group size 4, KL coefficient 0.01) and training duration estimates (4–72 hours per modality on H100s). The code and model weights are publicly released under the MIT license at https://github.com/AshleyLab/MARCUS, and the benchmark questions are provided. Raw clinical data appropriately remains protected under IRB. However, independent reproduction would require substantial computational resources (8× H100 80GB GPUs for stages 2–3), and the 13.5-million-image dataset is not publicly available, making full reproduction impossible for most researchers. The LLM judge used for Likert scoring is not specified, raising reproducibility concerns for subjective quality metrics.
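For readers unfamiliar with the two GRPO hyperparameters mentioned above, here is a minimal sketch of how a group size and a KL coefficient typically enter the method. It follows the commonly described GRPO formulation (rewards normalized within a group of sampled responses, with a KL penalty toward a reference policy). It is not confirmed to match the authors' implementation, and the reward values, the use of population standard deviation, and the function names are assumptions made for illustration.

```python
from statistics import mean, pstdev

GROUP_SIZE = 4   # GRPO group size reported in the review
KL_COEF = 0.01   # KL coefficient reported in the review


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each response's advantage is its reward minus the group mean, divided by
    the group standard deviation (population std is an assumed choice here).
    """
    if len(rewards) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} rewards, got {len(rewards)}")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def penalized_objective(policy_term, kl_estimate):
    """Subtract the KL penalty, weighted by KL_COEF, from the policy term."""
    return policy_term - KL_COEF * kl_estimate


# Hypothetical rewards for four sampled answers to one question (1 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

With a group of only four samples, a question that every sample answers correctly (or incorrectly) yields all-zero advantages and therefore contributes no learning signal, which is one reason the group size is worth reporting.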
Cardiovascular disease remains the leading cause of global mortality, with progress hindered by human interpretation of complex cardiac tests. Current AI vision-language models are limited to single-modality inputs and are non-interactive. We present MARCUS (Multimodal Autonomous Reasoning and Chat for Ultrasound and Signals), an agentic vision-language system for end-to-end interpretation of electrocardiograms (ECGs), echocardiograms, and cardiac magnetic resonance imaging (CMR) independently and as multimodal input. MARCUS employs a hierarchical agentic architecture comprising modality-specific vision-language expert models, each integrating domain-trained visual encoders with multi-stage language model optimization, coordinated by a multimodal orchestrator. Trained on 13.5 million images (0.25M ECGs, 1.3M echocardiogram images, 12M CMR images) and our novel expert-curated dataset spanning 1.6 million questions, MARCUS achieves state-of-the-art performance surpassing frontier models (GPT-5 Thinking, Gemini 2.5 Pro Deep Think). Across internal (Stanford) and external (UCSF) test cohorts, MARCUS achieves accuracies of 87-91% for ECG, 67-86% for echocardiography, and 85-88% for CMR, outperforming frontier models by 34-45% (P<0.001). On multimodal cases, MARCUS achieved 70% accuracy, nearly triple that of frontier models (22-28%), with 1.7-3.0x higher free-text quality scores. Our agentic architecture also confers resistance to mirage reasoning, whereby vision-language models derive reasoning from unintended textual signals or hallucinated visual content. MARCUS demonstrates that domain-specific visual encoders with an agentic orchestrator enable multimodal cardiac interpretation. We release our models, code, and benchmark open-source.