A Two-stage Transformer Framework for Temporal Localization of Distracted Driver Behaviors
This paper addresses temporal action localization (TAL) for distracted driver behaviors in untrimmed in-cabin videos, a critical task for intelligent transportation systems. The authors propose a two-stage framework combining VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module for multi-scale temporal modeling. The work targets deployment scenarios such as fleet management and transportation safety checkpoints, aiming to balance accuracy against computational constraints.
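The review does not reproduce the detector's exact neck, so the following is only a rough sketch of the multi-scale idea. It assumes the YOLOv5 SPPF pattern (three serial stride-1 max-pools with kernel size 5, concatenated with the input) is applied along the time axis. The names `max_pool_1d` and `sppf_1d` are hypothetical, the 1x1 convolutions that wrap the pooling in YOLOv5 are omitted, and one scalar per time step stands in for a feature vector.

```python
from typing import List


def max_pool_1d(seq: List[float], kernel: int = 5) -> List[float]:
    """Stride-1 max pooling with 'same' length; out-of-range positions are ignored."""
    half = kernel // 2
    n = len(seq)
    return [max(seq[max(0, t - half):min(n, t + half + 1)]) for t in range(n)]


def sppf_1d(seq: List[float], kernel: int = 5) -> List[List[float]]:
    """Concatenate the input with three serially pooled copies, per time step.

    Pooling serially with kernel 5 gives effective temporal windows of
    5, 9, and 13 steps, which is the multi-scale effect SPPF is used for.
    """
    p1 = max_pool_1d(seq, kernel)
    p2 = max_pool_1d(p1, kernel)
    p3 = max_pool_1d(p2, kernel)
    return [list(column) for column in zip(seq, p1, p2, p3)]


if __name__ == "__main__":
    # Toy activation: a single spike at time step 6 in a 13-step sequence.
    toy = [0.0] * 13
    toy[6] = 9.0
    for t, features in enumerate(sppf_1d(toy)):
        print(t, features)
```

In this toy case a step far from the spike sees it only in the widest-window channel, while a step next to it sees it in all pooled channels, which is one way to picture why such a neck could help with actions of varying duration.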
The framework achieves strong performance on the AI City Challenge 2024 Track 3 dataset (peak mAP of $92.67\%$) but represents an incremental contribution that assembles existing components (VideoMAE, AMA, and YOLOv5's SPPF) with minimal architectural novelty. While the SPPF enhancement consistently improves detection metrics across configurations, the paper's claims regarding computational efficiency are undermined by reliance on a ViT-Giant backbone requiring $1584.06$ GFLOPs per segment and $131.83$ hours of training time.
The empirical validation of the SPPF module provides solid evidence that multi-scale temporal aggregation improves localization robustness, particularly for variable-duration actions such as reaching behind or normal driving. The systematic comparison between ViT-Base and ViT-Giant backbones transparently characterizes the accuracy-efficiency trade-off, showing that SPPF gains hold across both scales with nearly $3\%$ mAP improvement for ViT-Giant and reduced variance across runs.
The efficiency claims conflict with the paper's reliance on billion-parameter models unsuitable for real-time deployment: the ViT-Giant backbone requires roughly $15.6\times$ the compute of the Base variant ($1584.06$ vs. $101.85$ GFLOPs per segment). The framework lacks comparison with contemporary TAL methods such as ActionFormer or TriDet (cited in related work but absent from evaluation), making performance claims difficult to contextualize against the state of the art. Furthermore, the ensemble strategy described in Section 3.2.6 is not evaluated in the results tables, raising questions about whether reported metrics use single models or ensembles.
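The compute gap can be checked directly from the figures reported in the abstract. The snippet below is a small arithmetic sketch; the constant and function names are illustrative, and the "GFLOPs per accuracy point" figure is an ad hoc summary introduced here, not a metric used by the authors.

```python
# Figures quoted in the paper's abstract (feature-extraction stage).
GIANT_GFLOPS = 1584.06  # ViT-Giant, GFLOPs per segment
BASE_GFLOPS = 101.85    # ViT-Base, GFLOPs per segment
GIANT_TOP1 = 88.09      # ViT-Giant, Top-1 test accuracy (%)
BASE_TOP1 = 82.55       # ViT-Base, Top-1 test accuracy (%)


def compute_ratio() -> float:
    """How many times more compute ViT-Giant needs per segment."""
    return GIANT_GFLOPS / BASE_GFLOPS


def accuracy_gain() -> float:
    """Top-1 accuracy gain of ViT-Giant over ViT-Base, in percentage points."""
    return GIANT_TOP1 - BASE_TOP1


def extra_gflops_per_point() -> float:
    """Ad hoc summary: extra GFLOPs per segment paid per accuracy point gained."""
    return (GIANT_GFLOPS - BASE_GFLOPS) / accuracy_gain()


if __name__ == "__main__":
    print(f"compute ratio: {compute_ratio():.2f}x")
    print(f"accuracy gain: {accuracy_gain():.2f} points")
    print(f"extra GFLOPs per point: {extra_gflops_per_point():.1f}")
```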
Internal comparisons (SPPF vs. Identity neck, ViT-Giant vs. ViT-Base) are well-supported by multiple training runs with standard deviation reporting, though the absence of statistical significance testing weakens comparative claims. The AI City Challenge 2024 Track 3 dataset provides a realistic testbed with $16$ action classes across three camera angles; however, single-dataset evaluation limits generalization claims. Class-wise analysis reveals fundamental limitations on fine-grained facial actions (yawning, singing), where the model struggles to distinguish visually similar behaviors; the authors acknowledge that pure RGB approaches may be insufficient for these categories.
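One lightweight way to add the missing significance test is an exact permutation test on per-run mAP values. The sketch below is an assumption about how such a check might look, not the authors' procedure. The function name `permutation_test` is hypothetical, and the two score lists are placeholder numbers chosen for illustration, not results from the paper.

```python
from itertools import combinations
from statistics import mean
from typing import Sequence


def permutation_test(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided permutation test on the difference in group means.

    Enumerates every way to split the pooled scores into groups of the
    original sizes and returns the fraction of splits whose absolute mean
    difference is at least as large as the observed one.
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))
    extreme = 0
    splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        splits += 1
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    return extreme / splits


if __name__ == "__main__":
    # Placeholder per-run scores, NOT taken from the paper.
    with_sppf = [0.92, 0.91, 0.93]
    with_identity = [0.89, 0.90, 0.88]
    print(permutation_test(with_sppf, with_identity))
```

With three runs per arm, for example, there are only 20 possible splits, so the smallest attainable two-sided p-value is 0.1 even when the groups are perfectly separated. The number of runs therefore matters as much as the reported standard deviations.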
Experimental details are thorough: the AI City Challenge dataset is publicly available, hyperparameters are specified (AdamW optimizer, learning rates $10^{-3}$/$10^{-4}$, cosine schedule), and hardware is documented (NVIDIA RTX A5000, $24$ GB VRAM). However, code availability is not explicitly stated, and the batch size of $1$ with unspecified gradient accumulation may hinder reproduction on different hardware configurations. Although the authors note using a fixed random seed and enabling CUDA deterministic mode, the specific seed value is omitted, and the preprocessing scripts referenced in Section 3.1.3 are not provided.
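The gradient-accumulation gap matters because, with a micro-batch size of 1, the effective batch size is set entirely by the number of accumulation steps. The sketch below illustrates this on a toy one-parameter least-squares model. The model, data, and function names are all hypothetical and unrelated to the paper's training code.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def grad_mse(w: float, batch: List[Sample]) -> float:
    """Gradient w.r.t. w of the mean squared error of y ~ w * x over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, samples: List[Sample], accum_steps: int) -> float:
    """Accumulate batch-size-1 gradients, each scaled by 1 / accum_steps."""
    total = 0.0
    for sample in samples[:accum_steps]:
        total += grad_mse(w, [sample]) / accum_steps
    return total


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]  # toy data
    w = 0.5
    print(grad_mse(w, data))             # one step with batch size 4
    print(accumulated_grad(w, data, 4))  # four accumulated steps, batch size 1
    print(accumulated_grad(w, data, 2))  # different accumulation, different update
```

Accumulating over four steps reproduces the batch-of-four gradient, whereas accumulating over two does not, so a reader who guesses the accumulation count wrong would be training with a different effective batch size than the authors.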
The identification of hazardous driving behaviors from in-cabin video streams is essential for enhancing road safety and supporting the detection of traffic violations and unsafe driver actions. However, current temporal action localization techniques often struggle to balance accuracy with computational efficiency. In this work, we develop and evaluate a temporal action localization framework tailored for driver monitoring scenarios, particularly suitable for periodic inspection settings such as transportation safety checkpoints or fleet management assessment systems. Our approach follows a two-stage pipeline that combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector, enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Experimental results reveal a distinct trade-off between model capacity and efficiency. At the feature extraction stage, the ViT-Giant backbone delivers stronger representations with 88.09% Top-1 test accuracy, while the ViT-Base variant proves to be a practical alternative, achieving 82.55% accuracy at a significantly lower fine-tuning compute cost (101.85 GFLOPs/segment compared to 1584.06 GFLOPs/segment for Giant). In the downstream localization task, the integration of SPPF consistently improves performance across all configurations. Notably, the ViT-Giant + SPPF model achieves a peak mAP of 92.67%, while the lightweight ViT-Base configuration maintains robust results.