A Two-stage Transformer Framework for Temporal Localization of Distracted Driver Behaviors

cs.CV cs.AI Gia-Bao Doan, Nam-Khoa Huynh, Minh-Nhat-Huy Ho, Khanh-Thanh-Khoa Nguyen, Thanh-Hai Le · Mar 22, 2026

Plain-language introduction

This paper addresses temporal action localization (TAL) for distracted driver behaviors in untrimmed in-cabin videos, a critical task for intelligent transportation systems. The authors propose a two-stage framework combining VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module for multi-scale temporal modeling. The work targets deployment scenarios such as fleet management and transportation safety checkpoints, aiming to balance accuracy against computational constraints.
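To make the multi-scale idea concrete, here is a minimal sketch of an SPPF-style block applied along the time axis. It is an illustration under stated assumptions, not the authors' implementation: it follows the cascaded max-pooling pattern of YOLOv5's SPPF (kernel 5, stride 1, outputs concatenated), omits the learned projection layers that surround the pooling in YOLOv5, and uses the hypothetical names `max_pool_time` and `sppf_1d`. The review does not say how the paper adapts SPPF to 1D temporal features.

```python
from typing import List

# T time steps, each holding a C-dimensional feature vector.
Sequence = List[List[float]]


def max_pool_time(x: Sequence, kernel: int = 5) -> Sequence:
    """Stride-1 max pooling along time; borders are truncated so length T is kept.

    Truncating the window at the borders gives the same result as padding
    with negative infinity before taking the maximum.
    """
    half = kernel // 2
    t_len = len(x)
    pooled = []
    for t in range(t_len):
        window = x[max(0, t - half): min(t_len, t + half + 1)]
        # Per-channel maximum over the time steps in the window.
        pooled.append([max(channel) for channel in zip(*window)])
    return pooled


def sppf_1d(x: Sequence, kernel: int = 5) -> Sequence:
    """Cascade three poolings and concatenate all four stages per time step.

    With kernel 5, the three stages see temporal windows of 5, 9, and 13
    steps, so the output mixes short and long temporal context.
    Assumption: learned projections before/after the pooling are omitted.
    """
    y1 = max_pool_time(x, kernel)
    y2 = max_pool_time(y1, kernel)
    y3 = max_pool_time(y2, kernel)
    return [a + b + c + d for a, b, c, d in zip(x, y1, y2, y3)]


if __name__ == "__main__":
    # Toy input: 10 time steps, 1 channel, values increasing with time.
    features = [[float(t)] for t in range(10)]
    out = sppf_1d(features)
    print(len(out), len(out[0]))  # time length preserved, channels x4
```

Reusing one small kernel three times reaches the same receptive fields as three parallel poolings with kernels 5, 9, and 13, at lower cost. That is the "Fast" part of SPPF.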

Critical review

Verdict

Bottom line

The framework achieves strong performance on the AI City Challenge 2024 Track 3 dataset (peak mAP of $92.67\%$) but is an incremental contribution: it assembles existing components (VideoMAE, AMA, and YOLOv5's SPPF) with minimal architectural novelty. The SPPF enhancement consistently improves detection metrics across configurations, but the paper's computational-efficiency claims are undermined by the best result's reliance on a ViT-Giant backbone that requires $1584.06$ GFLOPs per segment and $131.83$ hours of training time.

“the ViT-Giant + SPPF model achieves a peak mAP of 92.67%”

paper · Abstract

“ViT-Giant requires 1584.06 GFLOPs per segment—more than 15× the cost of ViT-Base”

paper · Section 4.2.1
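The cost figures quoted in this review can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers that appear in the review and the abstract. It assumes that the per-epoch times of 31.35 and 169.21 minutes refer to ViT-Base and ViT-Giant respectively, which the quoted sentence implies but does not state outright.

```python
# Figures as quoted from the paper (abstract and Section 4.2.1).
GFLOPS_GIANT = 1584.06      # ViT-Giant, per segment
GFLOPS_BASE = 101.85        # ViT-Base, per segment
EPOCH_MIN_SMALL = 31.35     # minutes per epoch (assumed: ViT-Base)
EPOCH_MIN_LARGE = 169.21    # minutes per epoch (assumed: ViT-Giant)
TOTAL_HOURS_GIANT = 131.83  # total training time reported for ViT-Giant

flops_ratio = GFLOPS_GIANT / GFLOPS_BASE
epoch_ratio = EPOCH_MIN_LARGE / EPOCH_MIN_SMALL
total_days_giant = TOTAL_HOURS_GIANT / 24.0

print(f"compute ratio (Giant / Base): {flops_ratio:.2f}x")
print(f"per-epoch time ratio:         {epoch_ratio:.2f}x")
print(f"total Giant training time:    {total_days_giant:.2f} days")
```

The compute ratio comes out at about 15.55, consistent with the "more than 15×" claim. The 131.83 hours correspond to roughly 5.5 days, which matches "more than five days".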

What holds up

The empirical validation of the SPPF module provides solid evidence that multi-scale temporal aggregation improves localization robustness, particularly for variable-duration actions such as reaching behind or normal driving. The systematic comparison between ViT-Base and ViT-Giant backbones transparently characterizes the accuracy-efficiency trade-off, showing that SPPF gains hold across both scales with nearly $3\%$ mAP improvement for ViT-Giant and reduced variance across runs.

“SPPF improves mean mAP by nearly 3% while simultaneously leading to a pronounced reduction in variance across runs”

paper · Section 4.2.2
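Localization quality in TAL is commonly scored by matching each predicted segment to a ground-truth segment via temporal IoU (tIoU), and mAP is then computed at one or more tIoU thresholds. A minimal sketch of the tIoU computation follows. Treating the paper's mAP as tIoU-based is an assumption, since this review does not state the exact evaluation protocol, and the name `temporal_iou` is illustrative.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, truth: Segment) -> float:
    """Intersection over union of two time intervals; 0.0 if they do not overlap."""
    intersection = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # A prediction shifted by half the action length overlaps by one third.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

Because tIoU normalizes by segment length, the same absolute boundary error costs more on short actions than on long ones. This is one reason variable-duration classes are a useful stress test for multi-scale temporal modeling.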

Main concerns

The efficiency claims sit uneasily with the paper's reliance on a billion-parameter model unsuitable for real-time deployment: the ViT-Giant backbone consumes more than $15\times$ the compute of the Base variant ($1584.06$ vs. $101.85$ GFLOPs per segment). The evaluation omits contemporary TAL methods such as ActionFormer and TriDet, which are cited in related work, so the reported performance is difficult to place against the state of the art. Furthermore, the ensemble strategy described in Section 3.2.6 is not evaluated in the results tables, leaving it unclear whether the reported metrics come from single models or ensembles.

“The average training time per epoch increases from 31.35 minutes to 169.21 minutes, with total training time expanding from roughly one day to more than five days”

paper · Section 4.2.1

“In this work, we employ an ensemble strategy that integrates predictions from two complementary feature extractors—VideoMAE and VideoMAE V2”

paper · Section 3.2.6

Evidence and comparison

Internal comparisons (SPPF vs. Identity neck, ViT-Giant vs. ViT-Base) are supported by multiple training runs with reported standard deviations, though the absence of statistical significance testing weakens the comparative claims. The AI City Challenge 2024 Track 3 dataset provides a realistic testbed with $16$ action classes across three camera angles; however, evaluation on a single dataset limits generalization claims. Class-wise analysis reveals limitations on fine-grained facial actions (yawning, talking, singing), where the model struggles to distinguish visually similar behaviors; the authors acknowledge that pure RGB approaches may be insufficient for these categories.

“the model struggles with subtle, localized behaviors such as 'Yawning,' 'Talking,' or 'Singing.' These actions rely heavily on facial micro-expressions and mouth movements, which may be overshadowed by global body features”

paper · Section 4.3
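The significance testing noted as missing above could be as simple as Welch's t-test on the per-run mAP values of two configurations. The sketch below computes the t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The run values are placeholders invented for illustration and are not taken from the paper.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    # Squared standard errors of the two sample means (sample variance / n).
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t_stat, dof


if __name__ == "__main__":
    # PLACEHOLDER per-run mAP values, NOT results from the paper.
    runs_with_sppf = [0.91, 0.92, 0.93]
    runs_identity = [0.87, 0.90, 0.91]
    t_stat, dof = welch_t(runs_with_sppf, runs_identity)
    print(f"t = {t_stat:.3f}, dof = {dof:.2f}")
```

A p-value then follows from the Student t distribution with `dof` degrees of freedom. In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the p-value directly. With only a handful of runs per configuration the test has little power, so reporting the raw per-run values alongside it would still help readers.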

Reproducibility

Experimental details are thorough: the AI City Challenge dataset is publicly available, hyperparameters are specified (AdamW optimizer, learning rates $10^{-3}$/$10^{-4}$, cosine schedule), and hardware is documented (NVIDIA RTX A5000, $24$ GB VRAM). However, code availability is not explicitly stated, and the batch size of $1$ with unspecified gradient accumulation may hinder reproduction on different hardware configurations. The authors note using a fixed random seed and enabling CUDA deterministic mode, but the specific seed value is omitted, and the preprocessing scripts referenced in Section 3.1.3 are not provided.

“A fixed random seed is used across all experiments. CUDA deterministic mode is enabled”

paper · Section 4.1.1

“Batch size: 1 (GPU memory constrained)”

paper · Section 4.1.1
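The gradient-accumulation detail matters because it sets the effective batch size. The toy sketch below uses a one-parameter least-squares model with hypothetical function names. It shows that averaging gradients from `accum_steps` micro-batches of size 1 reproduces the gradient of one batch of that size. It illustrates the general technique only and says nothing about the paper's actual training loop.

```python
from typing import Sequence


def grad_single(w: float, x: float, y: float) -> float:
    """Gradient of the squared error (w*x - y)^2 with respect to w, one sample."""
    return 2.0 * (w * x - y) * x


def grad_full_batch(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Gradient of the mean squared error over the whole batch."""
    return sum(grad_single(w, x, y) for x, y in zip(xs, ys)) / len(xs)


def grad_accumulated(
    w: float, xs: Sequence[float], ys: Sequence[float], accum_steps: int
) -> float:
    """Accumulate scaled size-1 micro-batch gradients before a single update."""
    total = 0.0
    for x, y in zip(xs[:accum_steps], ys[:accum_steps]):
        # Scaling each micro-batch by 1/accum_steps yields the batch mean.
        total += grad_single(w, x, y) / accum_steps
    return total


if __name__ == "__main__":
    xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 0.5
    print(grad_full_batch(w, xs, ys), grad_accumulated(w, xs, ys, accum_steps=4))
```

If the number of accumulation steps is not reported, the effective batch size behind the stated learning rates is unknown, and someone reproducing the work on different hardware cannot tell whether their optimizer sees comparable gradient noise. The equivalence shown here holds for losses averaged over samples; layers that compute batch-dependent statistics would behave differently at batch size 1.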

Abstract

The identification of hazardous driving behaviors from in-cabin video streams is essential for enhancing road safety and supporting the detection of traffic violations and unsafe driver actions. However, current temporal action localization techniques often struggle to balance accuracy with computational efficiency. In this work, we develop and evaluate a temporal action localization framework tailored for driver monitoring scenarios, particularly suitable for periodic inspection settings such as transportation safety checkpoints or fleet management assessment systems. Our approach follows a two-stage pipeline that combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector, enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Experimental results reveal a distinct trade-off between model capacity and efficiency. At the feature extraction stage, the ViT-Giant backbone delivers higher representations with 88.09% Top-1 test accuracy, while the ViT-based variant proves to be a practical alternative, achieving 82.55% accuracy with significantly lower computational fine-tuning costs (101.85 GFLOPs/segment compared to 1584.06 GFLOPs/segment for Giant). In the downstream localization task, the integration of SPPF consistently improves performance across all configurations. Notably, the ViT-Giant + SPPF model achieves a peak mAP of 92.67%, while the lightweight ViT-based configuration maintains robust results.
