Climate Prompting: Generating the Madden-Julian Oscillation using Video Diffusion and Low-Dimensional Conditioning

cs.CV Sulian Thual, Feiyang Cai, Jingjing Wang, Feng Luo · Mar 23, 2026

Plain-language introduction

This paper proposes a conditional video diffusion model trained on ERA5 reanalysis to synthesize the Madden-Julian Oscillation (MJO)—the dominant mode of tropical intraseasonal variability. The core innovation is "climate prompting," where low-dimensional physical indices (MJO phase/amplitude via RMM-PCs, seasonal cycles, ENSO state) serve as conditioning tokens to generate physically consistent high-dimensional atmospheric fields. The work bridges the gap between interpretable low-order climate theory and high-resolution generative models, enabling controlled experiments like perpetual MJOs or isolated seasonal modulations for hypothesis testing.
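
As a rough illustration of what an idealized "perpetual MJO" prompt could look like, the sketch below builds a constant-amplitude, steadily rotating (pc1, pc2) pair. The function name, the 50-day period, the amplitude of 1.5, and the rotation direction are all assumptions chosen for illustration. How the authors encode the seasonal cycle and ENSO state is not described here, so those inputs are left out.

```python
import numpy as np

def perpetual_mjo_conditioning(n_days, period_days=50.0, amplitude=1.5):
    """Return (pc1, pc2) for a constant-amplitude index that rotates steadily.

    Hypothetical helper: period, amplitude, and rotation direction are
    illustrative assumptions, not values taken from the paper.
    """
    t = np.arange(n_days)                      # daily time axis
    phase = 2.0 * np.pi * t / period_days      # one full rotation per period
    pc1 = amplitude * np.cos(phase)
    pc2 = amplitude * np.sin(phase)
    return pc1, pc2

pc1, pc2 = perpetual_mjo_conditioning(n_days=365)
index_amplitude = np.hypot(pc1, pc2)           # constant by construction
```

Feeding such a pair to the model in place of the observed, intermittent index is the kind of controlled experiment the paper calls climate prompting.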

Critical review

Verdict

Bottom line

The paper presents a compelling proof of concept for using conditional video diffusion as a physics-inspired generative tool for climate phenomena. The "climate prompting" paradigm, in which idealized MJO scenarios are generated through controlled low-dimensional conditioning, is innovative and scientifically useful for deconstructing MJO dynamics. The model captures key MJO characteristics, including eastward propagation, quadrupole structure, and convectively coupled equatorial waves in wavenumber-frequency spectra. However, the work is limited by circular conditioning (the PCs used as inputs are derived from the same fields the model outputs), minimal quantitative validation beyond spectral analysis, and acknowledged biases in representing CC-Kelvin and MRG waves. Reproducibility is only partly addressed, and key details remain underspecified.

“The model replicates intermittent MJO sequences as in the original record, which amplitude and phase closely follow the conditionings pc1, pc2”

Thual et al. (this paper) · Section 2.1

“the model is biased compared with the ERA5 record as it shows for example less pronounced equatorial CC-Kelvin and MRG waves”

Thual et al. (this paper) · Section 2.1

What holds up

The fundamental concept of using low-dimensional climate indices as diffusion conditioning variables is sound and well-motivated. The wavenumber-frequency analysis (Fig. 2) credibly demonstrates that the model captures the essential spectral signature of the MJO ($k=1$-$3$, $\omega=0.01$-$0.03$ cpd) plus embedded convectively coupled Kelvin, Rossby, and MRG waves. The ensemble sampling analysis (Fig. 3) effectively illustrates the model's stochastic diversity, showing realistic seasonal variations in ensemble spread and freely generated equatorial waves as sample deviations. The progressive experimental design (isolated MJO $\rightarrow$ seasonal modulation $\rightarrow$ ENSO modulation) logically demonstrates how conditioning variables interact to shape MJO characteristics.

“The model is able to reproduce the prominent MJO signal (k=1-3, w=0.01-0.03 i.e. 30-90 days) as well as other prominent equatorial waves”

Thual et al. (this paper) · Figure 2 caption

“The ensemble standard deviation depicts the sample diversity: it shows marked seasonal variations with a maximum in boreal winter in the western to central Pacific warm pool region”

Thual et al. (this paper) · Section 2.2
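
For readers unfamiliar with the wavenumber-frequency diagnostic discussed above, here is a minimal sketch of the core computation: a 2D FFT over time and longitude. It is not the authors' pipeline. Published analyses typically add tapering, segment averaging, symmetric/antisymmetric separation, and background normalization, all omitted here. The function name and the synthetic test wave are assumptions, as is the convention that the longitude index increases eastward.

```python
import numpy as np

def wavenumber_frequency_power(field):
    """Raw 2D power spectrum of a (n_time, n_lon) field sampled daily
    on a full longitude circle. Hypothetical helper for illustration."""
    n_time, n_lon = field.shape
    anomalies = field - field.mean(axis=0, keepdims=True)   # remove time mean
    coeffs = np.fft.fft2(anomalies) / (n_time * n_lon)
    power = np.abs(coeffs) ** 2                              # indexed [freq, wavenumber]
    freqs = np.fft.fftfreq(n_time, d=1.0)                    # cycles per day
    wavenums = np.fft.fftfreq(n_lon, d=1.0 / n_lon)          # integer zonal wavenumbers
    return wavenums, freqs, power

# Synthetic wave: zonal wavenumber 2, 50-day period, moving toward
# increasing longitude index (assumed to be eastward).
n_time, n_lon = 400, 64
t = np.arange(n_time)[:, None]
x = np.arange(n_lon)[None, :]
field = np.cos(2.0 * np.pi * (2.0 * x / n_lon - t / 50.0))

k, w, power = wavenumber_frequency_power(field)
i_w, i_k = np.unravel_index(np.argmax(power), power.shape)
peak_k, peak_w = k[i_k], w[i_w]
```

With NumPy's FFT sign convention, a wave of the form cos(kx - wt) puts its power where frequency and wavenumber have opposite signs, so the peak lands at |k| = 2 and |w| = 0.02 cpd, inside the MJO band quoted above.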

Main concerns

The most critical issue is circularity in the conditioning: the RMM-UBC index (pc1, pc2) is derived directly from the model output fields (UBC and OLR) via projection onto empirical orthogonal functions, yet these same PCs serve as conditioning inputs. As acknowledged, "the principal components deduced from the sampled video closely match the conditionings" because they are computed from the generated fields themselves. This tautology makes it hard to judge how independent the generated structures truly are from the prompting. The validation relies almost exclusively on wavenumber-frequency spectra and composites rather than rigorous statistical metrics (e.g., MJO skill scores, bivariate correlation, amplitude and error distributions). Training details are also a concern: 20,000 steps with a condition dropout of 0.1 appears minimal for such high-dimensional spatiotemporal data, yet no training curves or convergence diagnostics are provided. The claim that generating a 60-year record takes about 30 minutes and is "on par with intermediate complexity models" lacks comparative citations or standardized benchmarking.

“The principal components deduced directly from the sample fields match the conditionings (gray dashed lines in a)”

Thual et al. (this paper) · Section 2.1

“Training steps: 20,000”

Thual et al. (this paper) · Table 3

“in our setup it takes around 30 minutes to generate a 60 years MJO record, which is roughly on par with intermediate complexity models in terms of computing time”

Thual et al. (this paper) · Section 4
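
The circularity concern can be made concrete with a toy example. In the sketch below, the two EOF patterns are random orthonormal stand-ins and the "generated" field is synthetic, so nothing here reproduces the paper's data or model. It only shows that recovering the conditioning PCs by projection says nothing about the part of the field orthogonal to the EOFs, which is where independent validation would have to look.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_grid = 200, 128

# Stand-in EOFs: two orthonormal spatial patterns (hypothetical, not the RMM-UBC EOFs).
eofs, _ = np.linalg.qr(rng.standard_normal((n_grid, 2)))    # shape (n_grid, 2)

# Conditioning index: a steadily rotating (pc1, pc2) pair.
phase = 2.0 * np.pi * np.arange(n_days) / 50.0
pcs_in = np.stack([np.cos(phase), np.sin(phase)], axis=1)   # shape (n_days, 2)

# Toy "generated" field: the EOF part plus arbitrary structure that is
# orthogonal to both EOFs.
extra = rng.standard_normal((n_days, n_grid))
extra -= (extra @ eofs) @ eofs.T

fields = pcs_in @ eofs.T + extra

# Diagnosing the PCs from the output returns the conditioning exactly,
# whatever `extra` contains.
pcs_out = fields @ eofs
residual = fields - pcs_out @ eofs.T
```

Here `pcs_out` equals `pcs_in` by construction, while `residual` carries all of the structure the conditioning did not dictate. A check on the residual (spectra, composites, or comparison with ERA5) would be less exposed to the tautology than a check on the PCs.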

Evidence and comparison

The spectral evidence (Figs. 2, 4, 6) supporting MJO and equatorial wave capture is visually convincing but lacks quantitative rigor. The comparison to ERA5 is qualitative rather than statistical—no RMSE, anomaly correlation, or MJO-specific metrics (RMM bivariate correlation, amplitude ratio) are reported. The paper positions itself against "traditional statistical methods" and "low-order models" but does not provide direct quantitative comparisons to these baselines. The cited related work in diffusion models for climate (Stock et al. 2024, Ren et al. 2025, Price et al. 2025) is appropriate, though the novelty relative to these approaches is primarily the specific focus on MJO low-dimensional conditioning rather than architectural advances. The claim that low-dimensional conditioning "decouples processes" is partially supported by the seasonal and ENSO modulation experiments (Figs. 6-7), though the interpretability benefit over simple compositing of observed data remains arguable.

“The generated MJOs capture key features including composites, power spectra and multiscale structures including convectively coupled waves, despite some bias”

Thual et al. (this paper) · Abstract

“The present method may outperform traditional statistical methods (e.g. MJO composites, principal components) at reconstructing details embedded within the MJO”

Thual et al. (this paper) · Section 4
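
The MJO-specific metrics named above are simple to compute once two index time series are available. The sketch below uses the usual definitions from MJO forecast verification (bivariate correlation, bivariate RMSE, and a mean amplitude ratio). The function names and the synthetic inputs are assumptions, and how best to apply these scores to a conditioned generator is left open.

```python
import numpy as np

def bivariate_correlation(ref, gen):
    """Bivariate correlation between two (n_time, 2) index series."""
    num = np.sum(ref * gen)
    den = np.sqrt(np.sum(ref ** 2)) * np.sqrt(np.sum(gen ** 2))
    return num / den

def bivariate_rmse(ref, gen):
    """Root-mean-square distance in the (pc1, pc2) plane."""
    return np.sqrt(np.mean(np.sum((ref - gen) ** 2, axis=1)))

def amplitude_ratio(ref, gen):
    """Mean index amplitude of gen relative to ref."""
    amp_ref = np.hypot(ref[:, 0], ref[:, 1])
    amp_gen = np.hypot(gen[:, 0], gen[:, 1])
    return amp_gen.mean() / amp_ref.mean()

# Synthetic reference: unit-amplitude index rotating once every 50 days.
phase = 2.0 * np.pi * np.arange(200) / 50.0
ref = np.stack([np.cos(phase), np.sin(phase)], axis=1)

damped = 0.8 * ref                                          # right phase, weak amplitude
shifted = np.stack([-np.sin(phase), np.cos(phase)], axis=1) # quarter-cycle phase error

cor_damped = bivariate_correlation(ref, damped)
cor_shifted = bivariate_correlation(ref, shifted)
ratio_damped = amplitude_ratio(ref, damped)
rmse_damped = bivariate_rmse(ref, damped)
```

The damped series keeps a perfect correlation but shows up in the amplitude ratio and RMSE, while the phase-shifted series has the right amplitude and zero correlation, which is why the scores are usually reported together.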

Reproducibility

Reproducibility is partially addressed but has significant gaps. The architecture (U-Net with transformers, spatial/temporal attention) and key hyperparameters (4 hierarchy levels, 8 attention heads, 250 DDIM steps, $\eta=1$) are documented in Table 3. Data sources (WeatherBench2 ERA5, NOAA ERSSTv5) and preprocessing (Butterworth high-pass filter at 120 days, climatology removal) are specified. However, critical reproducibility barriers include: no code repository URL provided (only "adapted from Bastek et al. 2023"), no random seed specification for the stochastic DDIM sampling with $\eta=1$, no training/validation loss curves to assess convergence, and undisclosed optimizer details (learning rate, schedule, batch size). The 20,000-step training count seems low for 16-frame sequences of 4 fields at 64×16 spatial resolution. It suggests either a small effective training set or undertraining that the reported validation does not reveal, and because the batch size is undisclosed, the number of passes over the data cannot be determined. The "Brick-Wall Denoising" method (stride 3) lacks pseudocode or sufficient detail for independent implementation.

“Training steps: 20,000...Sampling method: DDIM...DDIM timesteps: 250...DDIM eta (η): 1”

Thual et al. (this paper) · Table 3

“The video diffusion model code (pytorch) is adapted from Bastek et al. 2023”

Thual et al. (this paper) · Methods - Data availability

“Brick-Wall Denoising uses a stride of 3 frames”

Thual et al. (this paper) · Methods - Video Diffusion Model
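
To see why the missing seed matters, recall that in the standard DDIM formulation the noise injected at each sampling step scales with $\eta$: it vanishes for $\eta=0$ and is nonzero for $\eta=1$. The sketch below computes that per-step noise level and shows one way to make the draws repeatable. The linear beta schedule, the 1000 training timesteps, the tensor layout, and the helper names are assumptions, since the review does not report them. Only the 250 sampling steps, $\eta=1$, 16 frames, 4 fields, and the 64×16 grid come from the text above.

```python
import numpy as np

# Assumed noise schedule (not reported in the review): 1000 steps, linear betas.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

# 250 evenly spaced sampling timesteps in ascending order.
timesteps = np.linspace(0, 999, 250).round().astype(int)

def ddim_sigmas(alpha_bar, timesteps, eta):
    """Standard deviation of the noise added at each DDIM step."""
    a_t = alpha_bar[timesteps[1:]]        # noisier end of each step
    a_prev = alpha_bar[timesteps[:-1]]    # cleaner end of each step
    return eta * np.sqrt((1.0 - a_prev) / (1.0 - a_t)) * np.sqrt(1.0 - a_t / a_prev)

sig_stochastic = ddim_sigmas(alpha_bar, timesteps, eta=1.0)
sig_deterministic = ddim_sigmas(alpha_bar, timesteps, eta=0.0)

def step_noise(seed, step, shape=(16, 4, 16, 64)):
    """Repeatable noise for one step; assumed layout (frames, fields, lat, lon)."""
    rng = np.random.default_rng([seed, step])
    return rng.standard_normal(shape)
```

With $\eta=1$ every step draws fresh noise, so two runs agree only if the seeds (and the order of draws) are recorded. Reporting them alongside Table 3 would let others reproduce specific samples rather than only their statistics.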

Abstract

Generative Deep Learning is a powerful tool for modeling of the Madden-Julian oscillation (MJO) in the tropics, yet its relationship to traditional theoretical frameworks remains poorly understood. Here we propose a video diffusion model, trained on atmospheric reanalysis, to synthetize long MJO sequences conditioned on key low-dimensional metrics. The generated MJOs capture key features including composites, power spectra and multiscale structures including convectively coupled waves, despite some bias. We then prompt the model to generate more tractable MJOs based on intentionally idealized low-dimensional conditionings, for example a perpetual MJO, an isolated modulation by seasons and/or the El Nino-Southern Oscillation, and so on. This enables deconstructing the underlying processes and identifying physical drivers. The present approach provides a practical framework for bridging the gap between low-dimensional MJO theory and high-resolution atmospheric complexity and will help tropical atmosphere prediction.
