Deriving Health Metrics from the Photoplethysmogram: Benchmarks and Insights from MIMIC-III-Ext-PPG

cs.LG eess.SP Mohammad Moulaeifard, Philip J. Aston, Peter H. Charlton, Nils Strodthoff · Mar 23, 2026

Plain-language introduction

This paper establishes a comprehensive benchmark for photoplethysmography (PPG)-based clinical prediction using the large-scale MIMIC-III-Ext-PPG dataset, evaluating multi-task learning across arrhythmia classification (13 classes) and physiological regression (blood pressure, heart rate, respiratory rate). The core contribution is demonstrating robust atrial fibrillation detection (AUROC 0.96) with strong cross-dataset generalizability, alongside the first systematic assessment of fine-grained arrhythmia classification from PPG alone. It matters because PPG sensors are ubiquitous in wearables and ICUs, yet standardized, large-scale, multi-task benchmarks have been lacking, hindering meaningful algorithm comparison and clinical deployment.

Critical review

Verdict

Bottom line

The paper delivers a rigorous and valuable benchmarking study leveraging over 6.3 million PPG segments from 6,189 ICU patients, with strong experimental design including external validation and extensive subgroup stratification. The AF detection performance (AUROC 0.96–0.97) is compelling and well-supported. However, the authors' interpretation that performance variations across demographic subgroups reflect "population-specific waveform differences rather than systematic bias" lacks causal evidence and could obscure potential model inequities. Furthermore, while the authors candidly report wide limits of agreement for blood pressure estimation (approximately $\pm$40 mmHg for SBP), the clinical utility of these regression results remains questionable for patient monitoring.

“These variations appear to reflect population-specific waveform differences rather than systematic bias in model behavior.”

paper · Abstract

“Performance was worst for SBP, with LoAs of approximately $\pm$40 mmHg, which is considerably larger than the interquartile range of SBP values in the dataset (median 121.3 mmHg, IQR 105.6–139.6 mmHg).”

paper · Section III-B
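
To make the limits-of-agreement (LoA) figure concrete, the following is a minimal sketch of a Bland-Altman style calculation, assuming the conventional definition of LoA as bias $\pm$ 1.96 standard deviations of the errors. The paper's exact procedure is not reproduced here; the function name and the SBP values are hypothetical and purely illustrative.

```python
import statistics


def limits_of_agreement(reference, estimate):
    """Return (bias, lower, upper): mean error and 95% limits of agreement."""
    diffs = [e - r for r, e in zip(reference, estimate)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the errors
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical SBP values in mmHg, for illustration only (not from the paper).
reference_sbp = [110.0, 125.0, 140.0, 95.0, 132.0, 118.0]
estimated_sbp = [121.0, 119.0, 122.0, 113.0, 127.0, 131.0]

bias, lower, upper = limits_of_agreement(reference_sbp, estimated_sbp)
print(f"bias = {bias:.1f} mmHg, LoA = [{lower:.1f}, {upper:.1f}] mmHg")

# Under the assumed definition, a LoA half-width of 40 mmHg implies an
# error standard deviation of roughly 40 / 1.96 mmHg.
implied_sd = 40.0 / 1.96

# Width of the quoted SBP interquartile range (105.6 to 139.6 mmHg).
iqr_width = 139.6 - 105.6
print(f"implied error SD = {implied_sd:.1f} mmHg, IQR width = {iqr_width:.1f} mmHg")
```

Read this way, the full LoA band (about 80 mmHg) is more than twice the width of the quoted interquartile range, which is why the quoted passage treats the SBP estimates as unreliable.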

What holds up

The multi-task benchmarking framework is methodologically sound, with clear task definitions (AF, sinus vs. atrial arrhythmia, and 13-class comprehensive classification) and thorough subgroup analyses across blood pressure, heart rate, BMI, gender, and ethnicity. The cross-dataset validation on the Liu et al. dataset demonstrates excellent AF detection generalizability (AUROC 0.97), supporting the robustness of the approach. The authors are also transparent about weak results, both on rare rhythms such as sinus arrhythmia (SARRH, AUROC 0.61) and on the limited reliability of BP estimation, which strengthens the scientific credibility of the work.

“Cross-dataset validation demonstrates excellent generalizability for AF detection (AUROC = 0.97).”

paper · Abstract

“SARRH ... 0.61 ... JR ... 0.62”

paper · Table II
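
For readers weighing the AUROC values quoted above, the sketch below computes AUROC from its rank-based definition: the probability that a randomly chosen positive segment receives a higher score than a randomly chosen negative one, with ties counted as one half. Under this reading, 0.5 corresponds to chance, so values near 0.61 indicate only weak separation. The labels and scores are hypothetical, and this is not the evaluation code used in the paper.

```python
def auroc(labels, scores):
    """AUROC as P(score of random positive > score of random negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = rhythm present) and model scores.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.3, 0.2, 0.1]

print(f"AUROC = {auroc(labels, scores):.3f}")
```

The pairwise loop is quadratic in the number of segments; it is meant to expose the definition, not to scale to millions of segments.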

Main concerns

The primary concern is the clinical interpretation of blood pressure estimation results. While the authors appropriately label BP prediction "unreliable" due to wide limits of agreement ($\pm$40 mmHg for SBP), presenting these as benchmark baselines without stronger warnings about potential patient safety risks in deployment contexts is problematic. The attribution of subgroup performance disparities solely to physiological waveform differences, rather than considering dataset representation bias or algorithmic fairness issues, constitutes an unsupported causal leap. Additionally, the study is limited to ICU populations with 30-second segments, restricting generalizability to ambulatory settings where motion artifacts dominate and paroxysmal arrhythmias may be missed.

“cuffless BP estimation based on PPG alone using these techniques is currently unreliable.”

paper · Section IV

“our observation is limited to ICU patients, whose pathophysiology may limit generalizability to ambulatory or healthy populations. Also, the use of the 30-second segment length ... may be insufficient for detecting paroxysmal arrhythmias.”

paper · Section IV

Evidence and comparison

The evidence supports the claim that the reported AF detection performance holds up on a large, diverse population, in contrast to algorithms validated only on small cohorts. The comparison is fair: prior studies achieve AUROC $\sim$0.99 on $N<50$ patients, whereas the authors report $\sim$0.95 on 6,189 patients. However, the authors themselves acknowledge that comparing blood pressure estimation against prior work is problematic because dataset splits and preprocessing pipelines differ. A notable omission is the lack of comparison against simple non-deep-learning baselines (e.g., peak counting for HR or traditional signal processing for RR), which would clarify the marginal value added by deep learning architectures for these specific regression tasks.

“While high AUROC ($\sim$0.99–1.0) is often achieved on limited cohorts (e.g., N<50 ...), performance often drops significantly when scaling up (e.g., AUROC 0.72 ...).”

paper · Section IV

“All model comparisons that do not rely on an identical dataset and splits have to be taken with a grain of salt.”

paper · Section IV
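
As an illustration of the kind of simple non-deep-learning baseline this section finds missing, here is a minimal peak-counting sketch for heart rate. It is not taken from the paper or its code. The 125 Hz sampling rate, the synthetic two-harmonic waveform, and the function name are assumptions made for the example, and a real PPG signal would need filtering and artifact handling first.

```python
import math
import statistics


def heart_rate_from_peaks(signal, fs):
    """Estimate HR in beats/min from the median interval between local maxima."""
    mean = sum(signal) / len(signal)
    # A peak is a sample above the signal mean that exceeds its left
    # neighbour and is not exceeded by its right neighbour.
    peaks = [
        i
        for i in range(1, len(signal) - 1)
        if signal[i] > mean and signal[i - 1] < signal[i] >= signal[i + 1]
    ]
    if len(peaks) < 2:
        raise ValueError("need at least two peaks to estimate heart rate")
    intervals = [b - a for a, b in zip(peaks, peaks[1:])]
    return 60.0 * fs / statistics.median(intervals)


fs = 125          # assumed sampling rate in Hz
duration_s = 30   # segment length discussed in the review
true_hr = 72.0    # beats/min used to synthesise the test signal
f0 = true_hr / 60.0

# Synthetic pulse-like waveform: fundamental plus a weaker second harmonic.
ppg = [
    math.sin(2 * math.pi * f0 * n / fs) + 0.3 * math.sin(4 * math.pi * f0 * n / fs)
    for n in range(fs * duration_s)
]

estimated_hr = heart_rate_from_peaks(ppg, fs)
print(f"estimated HR = {estimated_hr:.1f} beats/min")
```

Reporting the error of such a baseline alongside the deep learning results would show how much of the HR accuracy comes from the learned model and how much from the task simply being easy on clean segments.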

Reproducibility

Reproducibility is strong: both the dataset (MIMIC-III-Ext-PPG via PhysioNet) and source code (GitHub) are publicly available. Experimental details are comprehensive, including the 10-fold stratified split (7/1/2 train/val/test), hyperparameters (batch size 512, learning rate 0.001, 50 epochs, AdamW optimizer), and preprocessing (NeuroKit2 ppg_clean with Butterworth bandpass 0.5–8 Hz). The only minor gaps are the absence of specified random seeds and exact computational environment details, though the provided code should enable independent reproduction given sufficient computational resources to process 6.3 million segments.

“We leveraged the provided 10 folds... The first seven folds were used for training, the eighth fold for validation, and the ninth and tenth folds for testing... effective batch size of 512... learning rates were set to 0.001... trained for 50 epochs... AdamW optimizer.”

paper · Section II-B
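
The quoted split and training settings can be summarised in a short sketch. It assumes each segment carries a 1-based fold index from 1 to 10, which is an assumption about the data layout. The function name, the configuration keys, and the records are hypothetical and are not taken from the authors' repository.

```python
def assign_split(fold):
    """Map a 1-based fold index to the quoted scheme: 1-7 train, 8 val, 9-10 test."""
    if not 1 <= fold <= 10:
        raise ValueError(f"fold must be in 1..10, got {fold}")
    if fold <= 7:
        return "train"
    if fold == 8:
        return "val"
    return "test"


# Hyperparameters as quoted in the review; the key names are hypothetical.
config = {
    "batch_size": 512,
    "learning_rate": 1e-3,
    "epochs": 50,
    "optimizer": "AdamW",
}

# Hypothetical records: (segment_id, fold), two segments per fold.
records = [(f"seg{i:03d}", (i % 10) + 1) for i in range(20)]

splits = {"train": [], "val": [], "test": []}
for segment_id, fold in records:
    splits[assign_split(fold)].append(segment_id)

print({name: len(ids) for name, ids in splits.items()})
```

Fixing the assignment by fold index rather than re-sampling is what makes results comparable across studies that use the same dataset.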

“The source code underlying our investigations is available at https://github.com/AI4HealthUOL/MIMIC-III-Ext-PPG_benchmarking.”

paper · Code Availability

Abstract

Photoplethysmography (PPG) is one of the most widely captured biosignals for clinical prediction tasks, yet PPG-based algorithms are typically trained on small-scale datasets of uncertain quality, which hinders meaningful algorithm comparisons. We present a comprehensive benchmark for PPG-based clinical prediction using the MIMIC-III-Ext-PPG dataset, establishing baselines across the full spectrum of clinically relevant applications: multi-class heart rhythm classification, and regression of physiological parameters including respiratory rate (RR), heart rate (HR), and blood pressure (BP). Most notably, we provide the first comprehensive assessment of PPG for general arrhythmia detection beyond atrial fibrillation (AF) and atrial flutter (AFLT), with performance stratified by BP, HR, and demographic subgroups. Using established deep learning architectures, we achieved strong performance for AF detection (AUROC = 0.96) and accurate physiological parameter estimation (RR MAE: 2.97 bpm; HR MAE: 1.13 bpm; SBP/DBP MAE: 16.13/8.70 mmHg). Cross-dataset validation demonstrates excellent generalizability for AF detection (AUROC = 0.97), while clinical subgroup analysis reveals marked performance differences across subgroups by BP, HR, and demographic strata. These variations appear to reflect population-specific waveform differences rather than systematic bias in model behavior. This framework establishes the first integrated benchmark for multi-task PPG-based clinical prediction, demonstrating that PPG signals can effectively support multiple simultaneous monitoring tasks and providing essential baselines for future algorithm development.
