Respiratory Status Detection with Video Transformers

cs.CV Thomas Savage, Evan Madill · Mar 22, 2026

Why it matters

An F1 score of 0.81 is achieved, though on only 7 test videos from 3 participants.

AI Review

Plain-language introduction

This paper investigates whether video transformers can detect respiratory distress from video recordings of post-exercise recovery. The authors frame the problem as a temporal ordering task—predicting which of two clips shows greater shortness of breath—and propose augmenting ViViT with Lie Relative Encodings (LieRE) and Motion-Guided Masking (MGM). An F1 score of 0.81 is achieved, though on only 7 test videos from 3 participants.
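To make the temporal ordering formulation concrete, the sketch below builds labeled clip pairs from a single recovery video. It is an illustration under stated assumptions, not the authors' pipeline: the function name `make_ordering_pairs`, the clip length, the use of non-overlapping clips, and the label convention (1 when the first clip in the pair is the earlier one, assumed to show more shortness of breath) are all hypothetical.

```python
import random
from itertools import combinations

def make_ordering_pairs(n_frames, clip_len, seed=0):
    """Split a video into non-overlapping clips and pair them for temporal ordering.

    Returns (clip_a, clip_b, label) tuples. Each clip is a (start, end) frame range,
    and label is 1 if clip_a comes earlier in the recording than clip_b.
    Assumption: earlier in the recovery means more shortness of breath.
    """
    rng = random.Random(seed)
    clips = [(s, s + clip_len) for s in range(0, n_frames - clip_len + 1, clip_len)]
    pairs = []
    for earlier, later in combinations(clips, 2):
        # Randomly swap presentation order so the label is not always 1.
        if rng.random() < 0.5:
            pairs.append((earlier, later, 1))
        else:
            pairs.append((later, earlier, 0))
    return pairs

pairs = make_ordering_pairs(n_frames=900, clip_len=150)
print(len(pairs))  # 6 clips give 15 pairs
```

Because every pair comes from the same video, the number of pairs grows quadratically with the number of clips, while the number of independent participants stays fixed.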

Critical review

Verdict

Bottom line

The paper presents a promising direction but suffers from a severely limited evaluation scale and a domain mismatch. While the technical approach combines recent advances (LieRE positional encodings + MGM) in a sensible way, the experimental validation on only 7 test videos from 3 healthy volunteers, out of just 52 included participants, provides insufficient evidence for the claimed clinical applicability. The shift from temporal ordering to actual respiratory distress detection in hospitalized patients remains a significant, unvalidated leap.

“In total, 75 participants submitted video recordings. Fifty-two participants were included in the final analysis.”

Savage & Madill, Sec. 3.4

“The test set consisted of seven videos from three participants.”

Savage & Madill, Sec. 3.5

What holds up

The embedding-based comparison strategy demonstrates merit, achieving F1=0.75 compared to 0.58–0.69 for two-tower cross-attention approaches, while requiring substantially less training time. The addition of LieRE and MGM yields incremental improvements (F1 rising from 0.75 to 0.81), consistent with the claimed benefits of these methods for spatiotemporal modeling. The temporal-ordering formulation is a clever solution to the lack of fine-grained labels.

“Embedding dist achieved F1 0.75, TT-Full achieved 0.58, TT-CLS-Token achieved 0.69. LieRE + MGM achieved F1 0.81.”

Savage & Madill, Table 1
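The general idea of an embedding-based comparison can be sketched as follows: each clip is encoded independently, and the pair is scored from the two embeddings alone. This is a minimal sketch, not the paper's architecture. The mean-pooling `encode_clip` is a hypothetical stand-in for a video encoder such as ViViT, and the linear scoring on the embedding difference, the function name `first_clip_more_distressed`, and the toy weights are assumptions made purely for illustration.

```python
import math

def encode_clip(clip):
    """Hypothetical stand-in for a video encoder: mean-pools per-frame feature lists."""
    n = len(clip)
    return [sum(frame[i] for frame in clip) / n for i in range(len(clip[0]))]

def first_clip_more_distressed(clip_a, clip_b, weights, bias=0.0):
    """Score a pair from the difference of independently computed embeddings.

    Returns the estimated probability that clip_a shows more shortness of breath.
    """
    emb_a, emb_b = encode_clip(clip_a), encode_clip(clip_b)
    logit = bias + sum(w * (a - b) for w, a, b in zip(weights, emb_a, emb_b))
    return 1.0 / (1.0 + math.exp(-logit))

# Toy per-frame "features" for two clips of two frames each.
clip_a = [[0.9, 0.2], [0.7, 0.4]]
clip_b = [[0.1, 0.2], [0.3, 0.4]]
weights = [2.0, 0.0]  # toy values; these would be learned in practice

p_ab = first_clip_more_distressed(clip_a, clip_b, weights)
p_ba = first_clip_more_distressed(clip_b, clip_a, weights)
```

In a design like this each clip passes through the encoder once, so embeddings can be cached and reused across many pairs. That is one plausible source of reduced training cost relative to a two-tower cross-attention model, though the source does not confirm the reason.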

Main concerns

The test set (7 videos, 3 participants) is critically small for drawing statistical conclusions. The roughly 31% exclusion rate (23 of 75 participants excluded due to insufficient resolution, poor framing, or inconclusive subjective assessment of breathlessness) introduces selection bias. The use of healthy volunteers recovering from exercise, rather than patients with pathological respiratory distress, creates a substantial domain gap; the authors acknowledge perspiration as a confounding signal that may correlate with exercise recovery but not with clinical dyspnea.

The claim of fully automated processing is undermined by participants positioning the camera themselves for each recording. Critically, the paper lacks any comparison of the video transformer approach against simpler baselines (e.g., optical flow statistics, 2D CNNs, or even frame-level ResNet features), making it impossible to assess whether the architectural complexity is warranted. Accuracy drops to 55–65% when the two clips differ only slightly in respiratory status (Section 4), suggesting limited sensitivity for subtle but clinically relevant changes.

“Participants implicitly curated their recordings by positioning the camera and keeping their face and upper body in view. As a result, our evaluation does not fully represent a passive, continuous monitoring setting.”

Savage & Madill, Sec. 5.1

“when clips reflected only small differences in respiratory status, accuracy ranged from 55–65%”

Savage & Madill, Sec. 4 (Figure 1 analysis)
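To illustrate why a test set of this size limits statistical conclusions, the sketch below computes a Wilson score interval for a proportion measured on 7 units. The 6-of-7 outcome is hypothetical, and treating the 7 test videos as independent units is an assumption; the paper's metric may be computed over clip pairs, which are not independent when drawn from the same video.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical outcome: 6 of 7 test videos handled correctly.
low, high = wilson_interval(successes=6, n=7)
print(f"{low:.2f} to {high:.2f}")
```

Even with a strong hypothetical result, the interval runs from roughly 0.49 to 0.97, so it cannot separate near-chance from near-perfect performance.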

Evidence and comparison

The comparison to Nawaz et al. is fair in noting the limitations of manual cropping, but the authors fail to compare their method against the actual clinical standard (human clinician assessment) or against alternative video architectures such as TimeSformer, or even frame-level models. The F1 of 0.81 lacks context: without the class balance of the 7-video test set, the figure is hard to interpret and could be misleading. The relationship between temporal ordering performance and actual respiratory distress detection in clinical populations remains theoretical; the authors correctly note this as future work, but the title and framing suggest clinical applicability that is not demonstrated.

“A key limitation of the Nawaz et al. approach was its reliance on manual video cropping and clinician-guided positioning prior to recording. In contrast, our experiment focuses on fully automated video processing without manual cropping.”

Savage & Madill, Sec. 2.2
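The dependence of F1 on class balance can be shown with a short worked example. A classifier that predicts the positive class for every input has recall 1 and precision equal to the positive rate p, so its F1 is 2p / (1 + p). The positive rates below are illustrative only; the class balance of the paper's test set is not known here, and nothing in this sketch implies the reported model behaves this way.

```python
def f1_always_positive(positive_rate):
    """F1 of a classifier that predicts the positive class for every example.

    Precision equals the positive rate and recall is 1, so F1 = 2p / (1 + p).
    """
    precision, recall = positive_rate, 1.0
    return 2 * precision * recall / (precision + recall)

for p in (0.5, 0.6, 0.7, 0.8):
    print(f"positive rate {p:.1f} -> F1 {f1_always_positive(p):.2f}")
```

Under this formula a trivial predictor already reaches an F1 of about 0.67 on a balanced set and about 0.81 when roughly 68% of examples are positive, which is why an F1 value needs the class balance, or a chance-level baseline, reported alongside it.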

Reproducibility

Reproducibility is severely limited. No code, dataset, or trained model weights are made available. Hyperparameters are sparsely documented: training was capped at 5 epochs and under 100 hours, but the learning rate, batch size, optimizer, and warmup schedule are omitted. The LieRE implementation details (number of learned matrices, initialization) are unspecified. The participant exclusion criteria rely on the subjective judgment of a single author ('author TS could not reliably determine the presence of shortness of breath'), which is neither replicable nor objective. The weak supervision approach, which splits each video into a first third and a last third, introduces a sensitivity to clip duration thresholds that is not characterized.

“author TS could not reliably determine the presence of shortness of breath after comparing the first 10 seconds of the video to the last 10 seconds”

Savage & Madill, Sec. 3.4

“training for each model was capped at <100 hours and a maximum of 5 epochs”

Savage & Madill, Sec. 3.3
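The sensitivity to clip duration mentioned above can be illustrated with a small sketch of position-based weak labeling. The exact procedure in the paper is not specified here, so everything in this example is an assumption: the function name `thirds_labels`, the rule that a clip must lie fully inside the first or last third to receive a label, and the 900-frame video length.

```python
def thirds_labels(n_frames, clip_len):
    """Assign weak labels by position within the video.

    Clips fully inside the first third get label 1 (assumed more short of breath),
    clips fully inside the last third get label 0, and clips that touch the
    middle third are dropped. All of these rules are illustrative assumptions.
    """
    first_end = n_frames // 3
    last_start = n_frames - n_frames // 3
    labeled = []
    for start in range(0, n_frames - clip_len + 1, clip_len):
        end = start + clip_len
        if end <= first_end:
            labeled.append(((start, end), 1))
        elif start >= last_start:
            labeled.append(((start, end), 0))
    return labeled

for clip_len in (100, 150, 200):
    print(clip_len, len(thirds_labels(n_frames=900, clip_len=clip_len)))
```

Under these assumptions a 900-frame video yields 6, 4, and 2 labeled clips for clip lengths of 100, 150, and 200 frames, so the amount of usable training data shifts substantially with a parameter the paper does not characterize.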

Abstract

Recognition of respiratory distress through visual inspection is a life-saving clinical skill. Clinicians can detect early signs of respiratory deterioration, creating a valuable window for earlier intervention. In this study, we evaluate whether recent advances in video transformers can enable Artificial Intelligence systems to recognize the signs of respiratory distress from video. We collected videos of healthy volunteers recovering after strenuous exercise and used the natural recovery of each participant's respiratory status to create a labeled dataset for respiratory distress. Splitting the video into short clips, with earlier clips corresponding to more shortness of breath, we designed a temporal ordering challenge to assess whether an AI system can detect respiratory distress. We found a ViViT encoder augmented with Lie Relative Encodings (LieRE) and Motion Guided Masking, combined with an embedding-based comparison strategy, can achieve an F1 score of 0.81 on this task. Our findings suggest that modern video transformers can recognize subtle changes in respiratory mechanics.