Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows

cs.LG cs.CE Janne Perini, Rafael Bischof, Moab Arar, Ayça Duran, Michael A. Kraus, Siddhartha Mishra, Bernd Bickel · Mar 22, 2026
AI Review
Plain-language introduction

WinDiNet repurposes the LTX-Video latent diffusion transformer as a fast, differentiable surrogate for urban wind flow simulation, addressing the prohibitive cost of time-resolved CFD in design exploration. By fine-tuning the 2B-parameter video model on 10,000 2D incompressible CFD simulations over procedurally generated building layouts, the authors achieve sub-second generation of 112-frame rollouts while enabling end-to-end gradient-based optimization of building positions for pedestrian wind comfort.

Critical review
Verdict
Bottom line

The paper presents a compelling demonstration of transfer learning from natural video to scientific simulation, showing that a properly adapted pretrained video model can outperform neural operators designed specifically for PDE solving. The systematic ablation of LoRA versus full fine-tuning, text versus scalar conditioning, and VAE adaptation strategies provides valuable methodological guidance for the field. The inverse optimization experiments convincingly demonstrate practical utility, though the 2D simplification and lack of extensive real-world validation temper the immediate applicability.

“identifies a configuration that outperforms purpose-built neural PDE solvers”
paper · Abstract
“The dataset, code and model weights will be released upon publication”
paper · Section 1
What holds up

The VAE adaptation study is rigorous: fine-tuning the decoder with physics-informed losses (divergence and no-penetration penalties) reduces reconstruction VRMSE by 62% relative to the frozen baseline, and the resulting configuration beats the best neural operator baseline (RNO) by 7.6% in VRMSE and 15% in MAE. The scalar conditioning mechanism, which bypasses the text encoder and injects simulation parameters directly via Fourier features, proves superior to text prompts for continuous physical quantities. The differentiable pipeline enables meaningful layout optimization, reducing the share of dangerous wind speeds (>15 m/s) from 2.6% to 0.2%, as confirmed by ground-truth CFD.

“Dec. FT Physics outperforms the best baseline (RNO) by 7.6% in VRMSE and 15% in MAE”
paper · Section 4.3
“dangerous wind speeds above 15 m/s drop from 2.6% to 0.2%”
paper · Section 5
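To make the scalar conditioning idea concrete, here is a minimal sketch of a Fourier-feature embedding for one continuous simulation parameter. The function name, the number of frequencies, the power-of-two frequency spacing, and the choice of inlet speed and angle as inputs are all assumptions for illustration; the paper's actual embedding may differ.

```python
import math

def fourier_features(x, num_freqs=4):
    """Embed a scalar x (assumed normalized to [0, 1]) as sin/cos features.

    Frequencies 2^k * pi for k = 0..num_freqs-1 are an illustrative choice,
    not taken from the paper.
    """
    feats = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * x
        feats.append(math.sin(angle))
        feats.append(math.cos(angle))
    return feats

# Hypothetical conditioning vector: normalized inlet speed and inlet angle.
cond = fourier_features(0.5) + fourier_features(0.25)
print(len(cond))  # 2 parameters * 4 frequencies * (sin, cos) = 16 values
```

Unlike a tokenized text prompt, nearby parameter values map to nearby feature vectors, which is one plausible reason this kind of embedding suits continuous physical quantities.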
Main concerns

The evaluation relies almost exclusively on synthetic procedural data, with only a preliminary qualitative test on four real urban configurations shown in Appendix F, raising substantial questions about out-of-distribution generalization to irregular real-world geometries. The 2D incompressible Euler formulation neglects vertical wind components, turbulent dispersion, and 3D building effects that are critical for accurate pedestrian comfort assessment. While the physics-informed losses improve VAE reconstruction, they are not applied end-to-end through the diffusion model itself, meaning the generative process lacks explicit constraints enforcing incompressibility or no-penetration conditions during sampling.

“preliminary test of out-of-distribution generalization”
paper · Appendix F
“The current formulation operates in 2D. An extension to 3D is relatively straightforward”
paper · Section 6
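For readers unfamiliar with the two constraints named above, the following sketch shows what divergence and no-penetration penalties could look like on a single 2D velocity frame. It is an assumption-laden illustration, not the paper's loss: the finite-difference stencil, the uniform grid spacing, and the simplification of no-penetration to "zero velocity inside building cells" are all choices made here for brevity.

```python
import numpy as np

def physics_penalties(u, v, solid, dx=1.0):
    """Return (divergence penalty, no-penetration penalty) for one 2D frame.

    u, v  : x- and y-velocity arrays of shape (H, W)
    solid : boolean mask of shape (H, W), True inside buildings
    dx    : grid spacing (assumed uniform and equal in both directions)
    """
    # Central differences in the interior, one-sided differences at the edges.
    du_dx = np.gradient(u, dx, axis=1)
    dv_dy = np.gradient(v, dx, axis=0)
    div = du_dx + dv_dy
    fluid = ~solid
    # Incompressibility: mean squared divergence over fluid cells.
    div_pen = float(np.mean(div[fluid] ** 2))
    # No-penetration (simplified): velocity should vanish inside solid cells.
    wall_pen = float(np.mean(u[solid] ** 2 + v[solid] ** 2)) if solid.any() else 0.0
    return div_pen, wall_pen

# Toy check on an 8x8 grid: a rigid rotation (u = -y, v = x) is divergence-free,
# but it flows straight through the hypothetical 2x2 building in the middle.
ys, xs = np.mgrid[0:8, 0:8].astype(float)
solid = np.zeros((8, 8), dtype=bool)
solid[3:5, 3:5] = True
div_pen, wall_pen = physics_penalties(-ys, xs, solid)
```

Terms of this kind constrain whatever they are applied to. If they enter only the decoder's reconstruction loss, as the review describes, nothing forces the latents produced during diffusion sampling to decode to fields that satisfy them.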
Evidence and comparison

The comparison against six neural operator baselines (U-Net, Poseidon, AFNO, FNO, OFormer, RNO) is comprehensive and fair regarding metric computation (evaluated on fluid pixels only), though the baselines operate at different temporal resolutions due to memory constraints. The authors acknowledge that autoregressive baselines struggle with error accumulation over long rollouts, whereas WinDiNet generates all frames jointly in a single denoising pass, which is a significant architectural advantage for temporal consistency. Crucially, the inverse optimization results are validated against ground-truth CFD simulations, confirming that improvements discovered via the surrogate translate to the true physics rather than exploiting model artifacts.

“All metrics computed on fluid pixels only”
paper · Table 2
“autoregressive models... struggle to maintain accuracy over long rollouts when trained from scratch on limited domain-specific data”
paper · Section 1
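The "fluid pixels only" convention matters for how the numbers should be read, so a short sketch may help. It assumes the common variance-scaled definition of VRMSE (root of mean squared error divided by the target's variance); the paper's exact averaging over frames and channels is not stated in this review and may differ. The function name and `eps` value are illustrative.

```python
import numpy as np

def masked_metrics(pred, target, fluid, eps=1e-7):
    """VRMSE and MAE computed over fluid pixels only.

    pred, target : arrays of identical shape, e.g. (T, H, W)
    fluid        : boolean mask broadcastable to that shape, True where fluid
    eps          : small constant guarding against zero target variance
    """
    mask = np.broadcast_to(fluid, target.shape)
    p, t = pred[mask], target[mask]
    mse = np.mean((p - t) ** 2)
    var = np.mean((t - t.mean()) ** 2)
    vrmse = float(np.sqrt(mse / (var + eps)))
    mae = float(np.mean(np.abs(p - t)))
    return vrmse, mae
```

Under this definition, a model that always predicts the mean of the target scores a VRMSE of about 1, which gives the reported relative improvements a natural reference point. Errors inside building footprints do not affect either metric.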
Reproducibility

The authors commit to releasing the dataset, code, and model weights upon publication, which would significantly aid reproduction. Training hyperparameters are reported in detail, including AdamW with learning rates of $10^{-4}$ (LoRA) and $10^{-5}$ (full fine-tuning), a batch size of 64, and cosine schedules. However, the CFD solver description lacks numerical discretization details (grid resolution, time-stepping scheme, CFL condition), and the procedural generation algorithm for building layouts is not fully specified, which may complicate exact reproduction of the training distribution. The reliance on a specific 2B-parameter foundation model (LTX-Video) may also limit accessibility compared to smaller task-specific architectures.

“Training uses AdamW (lr=10^{-4}, cosine schedule), batch size 64 for 2,000 steps”
paper · Section 4.2
“The dataset, code and model weights will be released upon publication”
paper · Section 1
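As a reading aid for the quoted schedule, this sketch computes a cosine-annealed learning rate from the reported base rate of $10^{-4}$ over 2,000 steps. Warmup and the final learning rate are not mentioned in the quoted text, so the absence of warmup and `min_lr=0.0` are assumptions, as is the function name.

```python
import math

def cosine_lr(step, total_steps=2000, base_lr=1e-4, min_lr=0.0):
    """Cosine-annealed learning rate without warmup.

    base_lr and total_steps follow the values quoted from the paper;
    min_lr and the lack of warmup are assumptions made for this sketch.
    """
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, midpoint, and end of training.
schedule = [cosine_lr(s) for s in (0, 1000, 2000)]
```

The rate starts at the base value, passes through half of it at the midpoint, and decays to `min_lr` at the final step.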
Abstract

Designing urban spaces that provide pedestrian wind comfort and safety requires time-resolved Computational Fluid Dynamics (CFD) simulations, but their current computational cost makes extensive design exploration impractical. We introduce WinDiNet (Wind Diffusion Network), a pretrained video diffusion model that is repurposed as a fast, differentiable surrogate for this task. Starting from LTX-Video, a 2B-parameter latent video transformer, we fine-tune on 10,000 2D incompressible CFD simulations over procedurally generated building layouts. A systematic study of training regimes, conditioning mechanisms, and VAE adaptation strategies, including a physics-informed decoder loss, identifies a configuration that outperforms purpose-built neural PDE solvers. The resulting model generates full 112-frame rollouts in under a second. As the surrogate is end-to-end differentiable, it doubles as a physics simulator for gradient-based inverse optimization: given an urban footprint layout, we optimize building positions directly through backpropagation to improve wind safety as well as pedestrian wind comfort. Experiments on single- and multi-inlet layouts show that the optimizer discovers effective layouts even under challenging multi-objective configurations, with all improvements confirmed by ground-truth CFD simulations.
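The safety figure quoted in the review (share of wind speeds above 15 m/s) is a hard threshold count, which has zero gradient almost everywhere. One common way to make such a quantity usable in gradient-based optimization is to replace the step with a sigmoid. The sketch below illustrates that idea only; whether the paper uses this relaxation is not stated here, and the function names and the `sharpness` parameter are hypothetical. NumPy is used just to show the formulas; an actual pipeline would need an autodiff framework.

```python
import numpy as np

def exceedance_fraction(speed, fluid, threshold=15.0):
    """Fraction of fluid samples whose wind speed exceeds threshold (m/s).

    speed : array of shape (H, W) or (T, H, W)
    fluid : boolean mask of shape (H, W), True where there is fluid
    """
    return float(np.mean(speed[..., fluid] > threshold))

def soft_exceedance(speed, fluid, threshold=15.0, sharpness=1.0):
    """Smooth stand-in for exceedance_fraction: a sigmoid replaces the hard step.

    sharpness (in 1/(m/s)) is a hypothetical tuning knob: larger values track
    the hard count more closely but give smaller gradients away from the threshold.
    """
    z = sharpness * (speed[..., fluid] - threshold)
    # 0.5 * (1 + tanh(z / 2)) equals the logistic sigmoid and avoids overflow.
    return float(np.mean(0.5 * (1.0 + np.tanh(0.5 * z))))

# Toy frame: three fluid cells at 10, 20 and 16 m/s, one building cell.
speed = np.array([[10.0, 20.0], [16.0, 5.0]])
fluid = np.array([[True, True], [True, False]])
hard = exceedance_fraction(speed, fluid)
soft = soft_exceedance(speed, fluid, sharpness=50.0)
```

With a large `sharpness` the smooth value approaches the hard count; in practice the final result would still be reported with the hard threshold on ground-truth CFD, as the abstract describes.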
