GenOpticalFlow: A Generative Approach to Unsupervised Optical Flow Learning
Optical flow estimation traditionally requires expensive ground-truth annotations or relies on unreliable brightness constancy assumptions that fail under occlusion and illumination changes. This paper introduces GenOpticalFlow, a framework that synthesizes perfectly aligned training pairs by using monocular depth estimates to generate pseudo-optical flow, then conditioning a latent diffusion model to render corresponding next frames. The core innovation is converting unsupervised optical flow learning into a supervised training paradigm using synthetic data with geometrically consistent motion fields, potentially eliminating the need for manual annotation at scale.
The paper presents a creative generative approach that successfully bridges the gap between unsupervised and supervised optical flow estimation. By leveraging pretrained depth models and diffusion-based generation, the authors demonstrate consistent improvements across seven diverse architectures. However, the method's effectiveness is bounded by the accuracy of monocular depth estimation and the limited diversity of simulated camera motions (restricted to horizontal translations), which may not capture the full complexity of real-world dynamics.
The optical-flow-aware coordinate embedding mechanism (Eq. 1) effectively enforces geometric alignment by decomposing flow into canonical and warped coordinate systems $C_t, C_{t+1}$, achieving PSNR 17.85 and SSIM 0.552 on KITTI 2015 compared to 15.68/0.481 for the strongest baseline (StereoDiffusion). The inconsistent pixel filtering strategy (Eq. 10) demonstrably improves robustness, reducing EPE from 9.29 to 8.18 when using threshold $Z=30$ (Table 4). Most compelling is the consistent cross-architecture improvement: across RAFT, FlowFormer, FlowFormer++, GMFlowNet, FlowDiffuser, and WAFT, the framework reduces average EPE by 1.49 and Fl-all by 7.00 (Table 2).
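For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how EPE and Fl-all are conventionally computed. It assumes the standard KITTI outlier rule (endpoint error above 3 px and above 5% of the ground-truth magnitude); the function name `epe_and_fl_all` and the array layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def epe_and_fl_all(flow_pred, flow_gt, valid=None):
    """Return (EPE, Fl-all in percent) for flow fields of shape (H, W, 2).

    `valid` is an optional boolean mask of shape (H, W) selecting pixels
    that have ground truth; all pixels are used when it is omitted.
    """
    flow_pred = np.asarray(flow_pred, dtype=np.float64)
    flow_gt = np.asarray(flow_gt, dtype=np.float64)

    # Per-pixel endpoint error: Euclidean distance between flow vectors.
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    # Ground-truth flow magnitude, used for the relative outlier threshold.
    mag = np.linalg.norm(flow_gt, axis=-1)

    if valid is None:
        valid = np.ones(err.shape, dtype=bool)

    # KITTI convention (assumed here): a pixel is an outlier when its error
    # exceeds both 3 px and 5% of the ground-truth magnitude.
    outlier = (err > 3.0) & (err > 0.05 * mag)

    epe = float(err[valid].mean())
    fl_all = float(100.0 * outlier[valid].mean())
    return epe, fl_all
```

Under this definition, the reported average reduction of 1.49 in EPE is in pixels, and the reduction of 7.00 in Fl-all is in percentage points of outlier pixels.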
The approach exhibits a critical dependency chain: errors in monocular depth estimation ($D_t = \mathcal{D}(I_t)$ in Eq. 3) propagate directly into the synthetic flow fields $\tilde{F}_{t\to t+1}$, yet the depth model remains fixed without domain adaptation. The virtual camera transformation (Sec. 3.3) is artificially constrained to random horizontal translations $d \sim U(0.8, 1.2)$, ignoring rotational motion, dynamic objects, and non-rigid deformations that dominate real optical flow. Furthermore, the synthetic dataset size ($N=5,000$) appears insufficient compared to typical optical flow training corpora, and generation quality metrics (PSNR < 21) indicate significant pixel-level misalignment persists. The inconsistent pixel filtering inevitably excises valid but challenging regions, potentially creating a biased training distribution that avoids hard cases like occlusions.
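To make the dependency on depth concrete, here is a hedged sketch of how a purely horizontal camera translation turns a depth map into a flow field. It is not the paper's Eq. 3 or its Sec. 3.3 implementation. It assumes a pinhole camera, a static scene, and metric depth, and it treats the sampled value `d` as a translation magnitude `tx`; the focal length `fx` and the function name are hypothetical.

```python
import numpy as np

def pseudo_flow_horizontal(depth, fx, tx):
    """Flow induced by translating a pinhole camera by `tx` along +x.

    depth: (H, W) array of positive depths (assumed metric for this sketch).
    fx:    focal length in pixels (hypothetical parameter).
    tx:    camera translation along the x axis, in the same units as depth.
    Returns an (H, W, 2) array holding (u, v) per pixel.
    """
    depth = np.asarray(depth, dtype=np.float64)
    # A static point at depth Z shifts by -fx * tx / Z pixels horizontally,
    # so any error in Z maps directly into an error in the flow.
    u = -fx * tx / np.maximum(depth, 1e-6)
    # A pure horizontal translation produces no vertical motion.
    v = np.zeros_like(u)
    return np.stack([u, v], axis=-1)

# Illustrative usage: sample the translation from U(0.8, 1.2) as in the review.
rng = np.random.default_rng(0)
tx_sample = rng.uniform(0.8, 1.2)
demo_flow = pseudo_flow_horizontal(np.full((4, 4), 10.0), fx=700.0, tx=tx_sample)
```

Under these assumptions every flow vector has zero vertical component and a magnitude inversely proportional to depth. This illustrates why the synthetic motion fields cannot represent rotation, independently moving objects, or non-rigid deformation.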
The evidence convincingly demonstrates that synthetic data improves upon unsupervised pretraining across multiple architectures (Table 2), supported by ablations validating coordinate embeddings and cross-view attention (Table 3). However, the evaluation omits direct comparison to fully supervised upper bounds or recent semi-supervised methods using limited real labels (e.g., FlowDA, CLIP-flow cited in Sec. 1). The comparison to GenStereo and GenWarp in next-frame generation (Table 1) is somewhat unfair, as these methods target stereo/view synthesis rather than temporally adjacent frames. The depth model ablation (Table 5) shows a negligible difference between Depth Anything V1 (19.245 PSNR) and V2 (19.243 PSNR), with V1 marginally ahead, suggesting the framework may be insensitive to depth quality or hitting a ceiling imposed by the generation module.
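The size of the PSNR gap in the depth ablation can be put in perspective with a short calculation. The sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE) and assumes 8-bit images (`max_val=255`). The helper names are illustrative, and nothing here reproduces the paper's evaluation code.

```python
import numpy as np

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    a = np.asarray(img_a, dtype=np.float64)
    b = np.asarray(img_b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

def mse_ratio_from_psnr_gap(gap_db):
    """Ratio of mean squared errors implied by a PSNR difference in dB."""
    return float(10.0 ** (gap_db / 10.0))

# Gap between the two PSNR values quoted from Table 5.
ratio = mse_ratio_from_psnr_gap(19.245 - 19.243)
```

A gap of 0.002 dB corresponds to an MSE ratio of roughly 1.0005, that is, a difference of about 0.05% in mean squared error, which is unlikely to be distinguishable from run-to-run variation.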
Reproducibility is currently blocked by the 'code will be released upon acceptance' policy, with no public repository or synthetic data available at submission. The method requires substantial computational resources: fine-tuning Stable Diffusion V1.5 on VKITTI2/TartanAir for 3 epochs using A100/H100 GPUs (Appendix A.1), followed by inference with Depth Anything V2 and GenWarp checkpoints. While hyperparameters are documented (AdamW, lr=$1.0\times 10^{-5}$, DDIM scheduler, batch size 2 per GPU), precise reproduction depends on undisclosed implementation details of the 'differentiable NVS warping module' and the specific GenWarp checkpoint (multi1) used for correspondence computation.
Optical flow estimation is a fundamental problem in computer vision, yet the reliance on expensive ground-truth annotations limits the scalability of supervised approaches. Although unsupervised and semi-supervised methods alleviate this issue, they often suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios. To overcome these limitations, we introduce GenOpticalFlow, a novel framework that synthesizes large-scale, perfectly aligned frame-flow data pairs for supervised optical flow training without human annotations. Specifically, our method leverages a pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. This process enables the creation of abundant, high-quality synthetic data with precise motion correspondence. Furthermore, we propose an inconsistent pixel filtering strategy that identifies and removes unreliable pixels in generated frames, effectively enhancing fine-tuning performance on real-world datasets. Extensive experiments on KITTI2012, KITTI2015, and Sintel demonstrate that GenOpticalFlow achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, highlighting its potential as a scalable and annotation-free solution for optical flow learning. We will release our code upon acceptance.