Benchmarking Recurrent Event-Based Object Detection for Industrial Multi-Class Recognition on MTEvent
This paper evaluates whether recurrent temporal modeling helps event-based object detection in industrial settings. The authors benchmark ReYOLOv8s (a recurrent ConvLSTM-augmented detector) against a vanilla YOLOv8s baseline on MTEvent, an industrial warehouse/factory dataset with 17 classes and severe class imbalance. The key question is whether memory across temporal clip lengths (3-21 frames) improves detection over single-window baselines.
The paper delivers a narrowly scoped but useful benchmark establishing that recurrent processing provides a modest benefit for industrial event-based detection: a 9.6% relative mAP50 gain over the non-recurrent baseline (0.260 to 0.285). The strongest finding is that GEN1 pretraining (driving data) substantially outperforms scratch training (0.329 vs 0.285 mAP50), while PEDRo pretraining falls below scratch (0.251), suggesting that the choice of source domain matters more than merely sharing the event modality. However, absolute performance remains weak (mAP50 below 0.33), and the study design has significant limitations that prevent stronger conclusions.
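The relative changes behind these comparisons can be checked directly from the validation mAP50 values quoted in the abstract. The following is a minimal sketch; the function and dictionary names are illustrative and do not come from the paper.

```python
def relative_change(new: float, ref: float) -> float:
    """Return the change of `new` relative to `ref`, as a fraction of `ref`."""
    return (new - ref) / ref


# Validation mAP50 values as quoted in the abstract.
BASELINE = 0.260      # non-recurrent YOLOv8s
SCRATCH_C21 = 0.285   # recurrent, trained from scratch, clip length 21
GEN1_C21 = 0.329      # recurrent, GEN1-initialized, clip length 21
PEDRO = 0.251         # recurrent, PEDRo-initialized

changes = {
    "scratch C21 vs baseline": relative_change(SCRATCH_C21, BASELINE),
    "GEN1 C21 vs scratch C21": relative_change(GEN1_C21, SCRATCH_C21),
    "PEDRo vs scratch C21": relative_change(PEDRO, SCRATCH_C21),
}

for label, value in changes.items():
    print(f"{label}: {value:+.1%}")
```

The first entry reproduces the 9.6% figure. The other two come out at roughly +15.4% and -11.9% relative to the scratch C21 model, which shows that the 9.6% is a relative improvement rather than a gain in absolute mAP50 points.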
Three findings appear well supported by the evidence: (1) recurrence provides modest but consistent gains over single-window detection when trained from scratch; (2) pretraining effects are highly domain-dependent, with GEN1 initialization helping substantially and PEDRo initialization hurting relative to scratch; and (3) the failure mode analysis identifies class imbalance and human-object interaction as persistent challenges. The controlled comparison within a fixed architecture family (non-recurrent vs. recurrent YOLOv8s) isolates the temporal modeling variable effectively.
The paper has three critical weaknesses. First, the statistical reliability is questionable: the authors explicitly state that all results come from single training runs, so run-to-run variance is unknown and differences on the order of 0.01 mAP50 cannot be distinguished from noise. Second, the non-monotonic scratch performance (C7 better than C11, then C21 best) is explained only speculatively, without supporting analysis. This pattern is consistent with training instability rather than a true architectural limitation. Third, the architectural scope is narrow: there are no comparisons to transformers (RVT), no ablation of where ConvLSTM is placed in the backbone, and no analysis of why PEDRo pretraining fails (feature visualization or probing would help). The conclusion admits these limitations but does not temper the claims proportionally.
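To make the first weakness concrete, the following sketch shows one way two configurations could be compared if each were trained with several random seeds. It computes Welch's t statistic from per-seed mAP50 lists. The function names are hypothetical, and no per-seed values are shown because the paper reports none.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed scores.

    Needs at least two runs; statistics.stdev raises StatisticsError otherwise.
    """
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(runs_a: Sequence[float], runs_b: Sequence[float]) -> float:
    """Welch's t statistic for the difference in mean score between A and B."""
    mean_a, sd_a = summarize_runs(runs_a)
    mean_b, sd_b = summarize_runs(runs_b)
    # Standard error of the difference, allowing unequal variances.
    std_err = math.sqrt(sd_a ** 2 / len(runs_a) + sd_b ** 2 / len(runs_b))
    if std_err == 0.0:
        raise ValueError("both configurations have zero variance across runs")
    return (mean_a - mean_b) / std_err
```

With a single run per configuration, as in the paper, the standard deviation is undefined and this statistic cannot be computed at all, which is the core of the reliability concern.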
The evidence supports the central comparative claims about recurrence and pretraining, but the comparisons to related work are limited. The authors treat RVT (Gehrig & Scaramuzza 2023) as out of scope because of its different input processing, yet this exclusion prevents readers from knowing whether the 0.329 mAP50 result is competitive. The MTEvent dataset lacks standardized splits, forcing the authors to use a custom 60/13/2 scene split with only 2 test scenes, which is too few for meaningful test-set reporting. The class imbalance problem is discussed descriptively, but no standard mitigation techniques (focal loss, resampling) are attempted, leaving open whether the failure modes are dataset-inherent or method-inherent.
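For reference, focal loss (one of the mitigations the paper does not try) down-weights examples the classifier already gets right, so rare classes and hard examples contribute more to the gradient. The sketch below is the standard binary form for a single prediction, written in plain Python. It is not taken from the paper, and the defaults gamma=2.0 and alpha=0.25 are the commonly used values, assumed here for illustration.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for one predicted probability `p` and binary label `y`.

    FL = -alpha_t * (1 - p_t)**gamma * log(p_t), where p_t is the probability
    assigned to the true label and alpha_t is the class-balancing weight.
    """
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction is penalized far less than an uncertain one.
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.3, 1)
print(f"easy example: {easy:.5f}, hard example: {hard:.5f}")
```

Setting gamma to 0 recovers alpha-weighted cross-entropy, so a with/without comparison would be a cheap ablation for separating dataset-inherent from method-inherent failures.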
Reproducibility is partially addressed but has gaps. The authors provide training hyperparameters (AdamW, lr0=0.00181, cosine decay, batch size 16, image size 320) and specify event preprocessing (5 bins, 50 ms windows, 256×320), enabling basic replication. However, the MTEvent dataset uses a non-standard custom split (60/13/2 scenes), and the paper does not mention a code release. The ReYOLOv8s architecture is described at a high level (ConvLSTM at intermediate stages) but lacks specifics on which stages, hidden dimensionality, or state initialization. No random seed information or variance estimates are provided. Full reproduction would require contacting the authors for code and exact data splits.
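The reported and missing details can be laid out as a simple manifest. This is a sketch only: the class and field names are hypothetical and do not reflect the authors' configuration format. Values are those listed in this review, and `None` marks items the paper does not specify.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class ReproManifest:
    # Reported in the paper.
    optimizer: Optional[str] = "AdamW"
    lr0: Optional[float] = 0.00181
    lr_schedule: Optional[str] = "cosine"
    batch_size: Optional[int] = 16
    image_size: Optional[int] = 320
    event_bins: Optional[int] = 5
    event_window_ms: Optional[float] = 50.0
    event_resolution: Optional[Tuple[int, int]] = (256, 320)
    scene_split: Optional[Tuple[int, int, int]] = (60, 13, 2)
    # Not reported (hypothetical field names).
    convlstm_stages: Optional[Tuple[int, ...]] = None
    convlstm_hidden_dim: Optional[int] = None
    state_init: Optional[str] = None
    random_seed: Optional[int] = None
    code_url: Optional[str] = None


def missing_fields(manifest: ReproManifest) -> List[str]:
    """List the names of manifest fields that are still unspecified."""
    return [f.name for f in fields(manifest) if getattr(manifest, f.name) is None]


print(missing_fields(ReproManifest()))
```

The unspecified entries correspond to the gaps listed above. The scene identifiers behind the 60/13/2 split would also need to be published, since the counts alone do not determine which scenes fall in each partition.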
Event cameras are attractive for industrial robotics because they provide high temporal resolution, high dynamic range, and reduced motion blur. However, most event-based object detection studies focus on outdoor driving scenarios or limited class settings. In this work, we benchmark recurrent ReYOLOv8s on MTEvent for industrial multi-class recognition and use a non-recurrent YOLOv8s variant as a baseline to analyze the effect of temporal memory. On the MTEvent validation split, the best scratch recurrent model (C21) reaches 0.285 mAP50, corresponding to a 9.6% relative improvement over the non-recurrent YOLOv8s baseline (0.260). Event-domain pretraining has a stronger effect: GEN1-initialized fine-tuning yields the best overall result of 0.329 mAP50 at clip length 21, and unlike scratch training, GEN1-pretrained models improve consistently with clip length. PEDRo initialization drops to 0.251, indicating that mismatched source-domain pretraining can be less effective than training from scratch. Persistent failure modes are dominated by class imbalance and human-object interaction. Overall, we position this work as a focused benchmarking and analysis study of recurrent event-based detection in industrial environments.