LiFR-Seg: Anytime High-Frame-Rate Segmentation via Event-Guided Propagation
This paper introduces a new computer vision task called Anytime Interframe Semantic Segmentation: predicting dense semantic segmentation at arbitrary timestamps between low-frame-rate (LFR) RGB frames using only a past frame and asynchronous event data. The core idea is feature propagation via event-driven motion fields rather than direct multi-modal fusion. The method is motivated by the perceptual gaps that LFR cameras create in high-speed autonomous driving scenarios, where critical events (e.g., a pedestrian entering the vehicle's path) may be missed between frames.
The paper presents a well-motivated and technically sound contribution. The core claim, that an LFR RGB + event system can match high-frame-rate (HFR) RGB performance, is supported by strong empirical results on DSEC (73.82% vs 73.91% mIoU, a gap of 0.09 percentage points). The method's three key components (event-driven motion field with learned confidence, uncertainty-guided Softmax Splatting, and temporal memory) form a coherent architecture. The anytime capability is convincingly demonstrated through controlled experiments on the synthetic SHF-DSEC dataset across varying temporal gaps.
The feature-level warping design choice is well-validated through ablation (Table 4: 73.82% for feature warping vs 72.37% for image warping and 71.63% for segmentation warping). The uncertainty-aware mechanism, which uses a learned log-precision map $S$, consistently improves results (+1.08% on DSEC, Table 3). The memory module shows compelling gains for long temporal gaps (+2.22% at 800ms, Table 5). The zero-shot night evaluation (41.86% vs HFR's 41.83%) is a particularly strong result, demonstrating the robustness of event-based motion cues when RGB degrades.
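To make the splatting step concrete, the following is a minimal sketch of softmax-weighted forward warping, not the paper's implementation. It assumes scalar features on a small grid, nearest-pixel targets instead of bilinear kernels, and a per-pixel `score` that plays the role of a confidence; how the paper actually maps its log-precision map $S$ to splatting weights is not stated in this review. The function name `softmax_splat` and all arguments are hypothetical.

```python
import math

def softmax_splat(feat, flow, score):
    """Forward-warp a 2D grid of scalar features with softmax-weighted aggregation.

    feat[y][x]  : scalar feature at the source pixel
    flow[y][x]  : (dx, dy) displacement from source to target
    score[y][x] : importance score; higher scores dominate collisions
    Assumption: nearest-pixel splatting (a real model would splat bilinearly).
    """
    h, w = len(feat), len(feat[0])
    num = [[0.0] * w for _ in range(h)]  # sum of exp(score) * feature
    den = [[0.0] * w for _ in range(h)]  # sum of exp(score)
    # Subtract the global max score so exp() cannot overflow.
    m = max(max(row) for row in score)
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            tx, ty = int(round(x + dx)), int(round(y + dy))
            if 0 <= tx < w and 0 <= ty < h:  # drop pixels leaving the frame
                wgt = math.exp(score[y][x] - m)
                num[ty][tx] += wgt * feat[y][x]
                den[ty][tx] += wgt
    # Target pixels that received nothing stay at 0.0 (holes).
    return [[num[y][x] / den[y][x] if den[y][x] > 0 else 0.0
             for x in range(w)] for y in range(h)]

# Two source pixels collide at x=1; the third leaves the frame.
feat = [[1.0, 5.0, 9.0]]
flow = [[(1, 0), (0, 0), (5, 0)]]
equal = softmax_splat(feat, flow, [[0.0, 0.0, 0.0]])
confident = softmax_splat(feat, flow, [[10.0, 0.0, 0.0]])
```

With equal scores the colliding features are averaged; raising the score of the first pixel lets it dominate the collision, which is the behaviour an uncertainty-aware weighting is meant to provide.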
The HFR 'upper bound' uses the same SegFormer-B2 backbone, but this may not represent the true ceiling: an HFR-specific architecture could perform better. The impressive DSEC-Night result (surpassing HFR) relies on zero-shot evaluation where the HFR model was presumably not fine-tuned on night data, making the comparison potentially unfair. The synthetic SHF-DSEC dataset uses events simulated via threshold-based triggering, which may not accurately capture the noise characteristics of real event cameras.
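The concern about simulated events is easier to see with the idealized model written out. The sketch below assumes the common contrast-threshold formulation, in which a pixel fires an event each time its log intensity moves by a fixed threshold from a reference level; the simulator actually used for SHF-DSEC is not described here, so `simulate_events`, `threshold`, and `eps` are hypothetical.

```python
import math

def simulate_events(intensities, timestamps, threshold=0.3, eps=1e-3):
    """Generate (timestamp, polarity) events for a single pixel.

    Assumption: idealized contrast-threshold model with no noise, no
    refractory period, and the same threshold for both polarities.
    """
    events = []
    ref = math.log(intensities[0] + eps)  # reference log intensity
    for t, val in zip(timestamps[1:], intensities[1:]):
        cur = math.log(val + eps)
        # Fire one event per threshold crossing, moving the reference each time.
        while abs(cur - ref) >= threshold:
            pol = 1 if cur > ref else -1
            ref += pol * threshold
            events.append((t, pol))
    return events

# Brightness doubles at t=2, then drops to a quarter of that at t=3.
events = simulate_events([1.0, 1.0, 2.0, 0.5], [0, 1, 2, 3])
static = simulate_events([1.0, 1.0, 1.0], [0, 1, 2])
```

Nothing in this model produces background activity, hot pixels, or per-pixel threshold variation, which are the kinds of real-sensor effects the review suggests the synthetic benchmark may be missing.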
The claim of being the 'first' anytime-capable causal method (Figure 3d vs 3a-c) is technically accurate but relies on a specific task formulation. Video frame interpolation methods like TimeLens-XL are dismissed as 'non-causal,' yet the 'causality' constraint itself is largely task-defined rather than arising from fundamental physical limitations.
The comparison to fusion methods (CMNeXt, EISNet) is fair and the performance gap is substantial (+3.69% over CMNeXt on DSEC). However, the interpolation baseline (TLX + Seg.) is arguably weaker than necessary; hybrid approaches that interpolate features rather than pixels might perform better. The claim that interpolation suffers from a 'PSNR-mIoU Paradox' is interesting but underexplored, since only one interpolation method is tested. The M3ED results show robustness to dynamic ego-motion but cover limited platform diversity (only drone and quadruped).
The paper commits to releasing code and datasets upon acceptance. Training hyperparameters are specified: AdamW optimizer with lr=1e-4, weight decay=5e-3, polynomial decay, 200 epochs, batch size 4 on 2×RTX 4090 GPUs. The backbone (SegFormer-B2) and optical flow architecture (RAFT-like) are standard. Key missing details: exact architecture of ScoreNet (only described as 'composition of three main stages'), whether events from $t-\Delta t$ to $t$ are actually used (mentioned in §3.1 formulation but implementation details focus on $E_{t \to t+\delta t}$), and the temporal span of the memory bank. The DSEC dataset is public but requires preprocessing for segmentation labels.
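For readers trying to reproduce the training setup, a polynomial learning-rate decay can be sketched as follows. The base rate of 1e-4 and the 200-epoch horizon come from the reported hyperparameters; the exponent `power`, the floor `min_lr`, and per-epoch (rather than per-iteration) stepping are assumptions, since the review does not state them.

```python
def poly_lr(step, total_steps, base_lr=1e-4, power=1.0, min_lr=0.0):
    """Polynomial decay from base_lr at step 0 to min_lr at total_steps.

    Assumption: power=1.0 (linear decay); values such as 0.9 are also common.
    """
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    frac = 1.0 - step / total_steps        # remaining fraction of training
    return min_lr + (base_lr - min_lr) * frac ** power

# Learning rate sampled every 50 epochs over a 200-epoch run.
schedule = [poly_lr(epoch, 200) for epoch in range(0, 201, 50)]
```

The exponent matters for reproduction: with the same base rate, a different `power` changes how much of training is spent at small learning rates.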
Dense semantic segmentation in dynamic environments is fundamentally limited by the low-frame-rate (LFR) nature of standard cameras, which creates critical perceptual gaps between frames. To solve this, we introduce Anytime Interframe Semantic Segmentation: a new task for predicting segmentation at any arbitrary time using only a single past RGB frame and a stream of asynchronous event data. This task presents a core challenge: how to robustly propagate dense semantic features using a motion field derived from sparse and often noisy event data, all while mitigating feature degradation in highly dynamic scenes. We propose LiFR-Seg, a novel framework that directly addresses these challenges by propagating deep semantic features through time. The core of our method is an uncertainty-aware warping process, guided by an event-driven motion field and its learned, explicit confidence. A temporal memory attention module further ensures coherence in dynamic scenarios. We validate our method on the DSEC dataset and a new high-frequency synthetic benchmark (SHF-DSEC) we contribute. Remarkably, our LFR system achieves performance (73.82% mIoU on DSEC) that is statistically indistinguishable from an HFR upper-bound (within 0.09%) that has full access to the target frame. This work presents a new, efficient paradigm for achieving robust, high-frame-rate perception with low-frame-rate hardware. Project Page: https://candy-crusher.github.io/LiFR_Seg_Proj/#; Code: https://github.com/Candy-Crusher/LiFR-Seg.git.