Uncertainty Quantification for Distribution-to-Distribution Flow Matching in Scientific Imaging

cs.LG Dongxia Wu, Yuhui Zhang, Serena Yeung-Levy, Emma Lundberg, Emily B. Fox · Mar 23, 2026

AI Review

Plain-language introduction

This paper addresses uncertainty quantification (UQ) for distribution-to-distribution flow matching, a setting where models map between well-defined source and target distributions (e.g., unperturbed to drug-treated cell images) rather than noise-to-data. The authors propose Bayesian Stochastic Flow Matching (BSFM), which combines Stochastic Flow Matching (SFM) for capturing aleatoric uncertainty via learnable diffusion terms, with MCD-Antithetic—a scalable Bayesian method using Monte Carlo Dropout and antithetic sampling—to decompose total uncertainty into aleatoric and epistemic components for reliable out-of-distribution (OOD) detection in scientific imaging.

Critical review

Verdict

Bottom line

The paper presents a principled and well-executed study on UQ for distribution-to-distribution generation, a relatively underexplored area. The marginal-preserving SDE derivation for SFM is theoretically sound, and the extensive multi-dataset evaluation (BBBC021, JUMP, fMRI) demonstrates consistent improvements in both generation quality under distribution shifts and OOD detection. However, the experimental design raises concerns: the OOD ground truth is curated via filtering that removes 'misleadingly in-distribution' samples based on prediction error thresholds, which may inflate reported AUROC scores. Additionally, the necessity of flipping the sign on epistemic uncertainty scores (using $-\mathrm{tr}(\hat{E})$ due to 'model collapse') suggests fundamental challenges with the Bayesian approximation that are not fully resolved.

What holds up

The marginal-preserving stochastic flow derivation using the Fokker-Planck equation represents solid theoretical grounding. As stated in Section 4.1, the SDE $d{\bm{x}}_t = ({\bm{v}}_\theta({\bm{x}}_t,t,c) - \frac{1}{2}\sigma_t^2 s_\phi({\bm{x}}_t,t,c))dt + \sigma_t d{\bm{W}}_t$ 'shares identical marginals' while introducing controlled stochasticity. The empirical results in Table 1 show consistent FID improvements across all five distribution shift scenarios, with particularly strong gains on severe shifts (e.g., BBBC021 Unseen Perturbations: 103.73 → 33.29). The MCD-Antithetic method achieves state-of-the-art OOD detection performance in Table 2, with AUROC reaching 0.8071 for unseen perturbations versus 0.6849 for SWAG.

“the corresponding SDE with drift correction... shares identical marginals p_t(x_t|c) for all t in [0,1]”

paper, Section 4.1 · Eq. 8
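
To make the marginal-preservation claim concrete, here is a minimal Euler–Maruyama sketch of the SDE quoted above, run on a toy problem rather than on the paper's model. The function name `euler_maruyama`, the step count, and the toy choices of `v`, `s`, and `sigma` are all illustrative assumptions. The review does not state the paper's sign convention for $s_\phi$; for the quoted drift to cancel the diffusion term in the forward Fokker-Planck equation, the sketch assumes $s$ plays the role of $-\nabla \log p_t$. In the toy case every marginal is $\mathcal{N}(0, 1)$, so $v = 0$ and $s(x) = x$.

```python
import numpy as np

def euler_maruyama(x0, v, s, sigma, n_steps=100, rng=None):
    """Integrate dx = (v(x,t) - 0.5*sigma(t)^2 * s(x,t)) dt + sigma(t) dW on t in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        sig = sigma(t)
        drift = v(x, t) - 0.5 * sig**2 * s(x, t)   # drift-corrected velocity
        noise = rng.standard_normal(x.shape)
        x = x + drift * dt + sig * np.sqrt(dt) * noise
    return x

# Toy check: p_t = N(0, 1) for all t, so the deterministic flow is v = 0.
# ASSUMPTION: s stands for -grad log p_t, which equals x for N(0, 1).
rng = np.random.default_rng(0)
x0 = rng.standard_normal(20000)
x1 = euler_maruyama(
    x0,
    v=lambda x, t: np.zeros_like(x),
    s=lambda x, t: x,
    sigma=lambda t: 1.0,
    n_steps=100,
    rng=rng,
)
print("mean:", x1.mean(), "variance:", x1.var())
```

If the marginals are preserved, the printed mean and variance should stay close to 0 and 1 even though each trajectory is now random. In the paper's setting `v` and `s` would be learned networks conditioned on `c`, which this sketch omits.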

“SFM achieves 33.29 FID vs CellFlux 103.73 on BBBC021 Unseen Perturbations”

paper, Section 5.3 · Table 1

Main concerns

First, the OOD detection evaluation protocol is problematic. The authors filter OOD samples using 'prediction error measured in the feature space' or SSIM, keeping only 'high-error cases' as OOD (Appendix C). This artificially selects for 'easy' OOD samples and removes ambiguous cases, potentially inflating AUROC metrics. Second, the epistemic uncertainty signal requires an ad-hoc sign flip: 'Our choice of -tr instead of tr is motivated by... model collapse leading to overconfidence OOD'. This counter-intuitive behavior—where epistemic uncertainty decreases under distribution shifts—indicates the MC-Dropout approximation may not reliably capture model uncertainty. Third, Proposition 4.1 assumes Lipschitz continuity and Gaussian posterior concentration for the MAP approximation, assumptions unlikely to hold for deep U-Nets in practice. Finally, the computational cost remains substantial: even 'sample-efficient' MCD-Antithetic requires 32 forward passes (8 dropout × 4 SDE samples) per image.

“We filter samples using prediction error... and only flag high-error cases as OOD”

paper, Appendix C · Filtering Strategy

“Our choice of -tr instead of tr is motivated by the well-documented fact that... model collapse leading to overconfidence OOD”

paper, Section 5.2 · Training and evaluation details
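
For readers unfamiliar with the aleatoric/epistemic split discussed above, the following sketch shows one standard way to obtain it from an 8 × 4 grid of samples using the law of total variance. It is an assumption that this matches the paper's estimator: the review only quotes the score $-\mathrm{tr}(\hat{E})$, so the function name `decompose_uncertainty`, the feature dimension `D`, and the synthetic samples are hypothetical.

```python
import numpy as np

def decompose_uncertainty(samples):
    """Split total variance into aleatoric and epistemic traces.

    samples: array of shape (M, K, D) with M dropout masks,
             K SDE draws per mask, and D output features.
    """
    samples = np.asarray(samples, dtype=float)
    per_mask_mean = samples.mean(axis=1)          # (M, D): E[x | mask]
    per_mask_var = samples.var(axis=1)            # (M, D): Var[x | mask]
    aleatoric = per_mask_var.mean(axis=0).sum()   # tr of E_mask[Var(x | mask)]
    epistemic = per_mask_mean.var(axis=0).sum()   # tr of Var_mask(E[x | mask])
    return aleatoric, epistemic

# Synthetic stand-in for model outputs (NOT data from the paper).
rng = np.random.default_rng(0)
M, K, D = 8, 4, 16                                # 8 dropout masks x 4 SDE draws
mask_means = rng.normal(scale=0.5, size=(M, 1, D))
samples = mask_means + rng.normal(scale=1.0, size=(M, K, D))

aleatoric, epistemic = decompose_uncertainty(samples)
total = samples.reshape(M * K, D).var(axis=0).sum()
ood_score = -epistemic                            # sign-flipped score, as quoted above
print(aleatoric, epistemic, total)
```

With equal group sizes and population variances, the two traces sum exactly to the total variance. The sketch treats the K draws per mask as plain samples and ignores any antithetic pairing, which would change the within-mask variance estimate.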

Evidence and comparison

The evidence supports the claim that SFM improves generalization, with consistent FID/KID gains across all scenarios in Table 1. The comparison to baselines (CellFlux, UNSB, SDEdit) is fair and comprehensive. However, the OOD detection benchmarking relies on self-selected splits that favor high prediction error, raising questions about generalization to unfiltered OOD data. The comparison between SWAG, Laplace Approximation, and MCD-Antithetic is valuable, though the antithetic sampling benefits are presented without variance estimates or statistical significance tests. The observation that 'epistemic uncertainty performs better on scenarios with severe distribution shifts, whereas aleatoric uncertainty performs better on slight distribution shifts' is well-supported by Table 2 results and provides useful practical guidance, though the underlying mechanism (collapse of conditional variability vs. model agreement) warrants deeper theoretical analysis.

“We observe a consistent pattern that epistemic uncertainty performs better on scenarios with severe distribution shifts (Unseen Pert., Intensity Shift), whereas aleatoric uncertainty performs better on slight distribution shifts (Unseen Cell Lines, Unseen Plates)”

paper, Section 5.3 · Main Results
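
The missing variance estimates noted above could be supplied in several ways; one common option is a stratified percentile bootstrap over the evaluation set. The sketch below is an illustration on synthetic scores, not a reconstruction of the paper's evaluation. The helper names `auroc` and `bootstrap_auroc_ci` and all numbers are assumptions.

```python
import numpy as np

def auroc(scores, labels):
    """P(score of a random OOD sample > score of a random ID sample); ties count half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)       # True = OOD, False = ID
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

def bootstrap_auroc_ci(scores, labels, n_boot=500, alpha=0.05, rng=None):
    """Point estimate plus a percentile bootstrap interval for AUROC."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos_idx, neg_idx = np.flatnonzero(labels), np.flatnonzero(~labels)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample each class separately so both are always present.
        p = rng.choice(pos_idx, size=pos_idx.size, replace=True)
        n = rng.choice(neg_idx, size=neg_idx.size, replace=True)
        idx = np.concatenate([p, n])
        stats[b] = auroc(scores[idx], labels[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auroc(scores, labels), lo, hi

# Synthetic anomaly scores (NOT results from the paper).
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(1.0, 1.0, 300)])
labels = np.concatenate([np.zeros(300, dtype=bool), np.ones(300, dtype=bool)])
point, lo, hi = bootstrap_auroc_ci(scores, labels, rng=rng)
print(f"AUROC = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Reporting an interval like this for each method would show whether gaps such as the one between MCD-Antithetic and SWAG exceed resampling noise.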

Reproducibility

The paper provides architectural details (U-Net with IMPA conditioning) and training hyperparameters (100-200 epochs, 2 NVIDIA H100 GPUs), but lacks a public code repository reference or explicit statement of code availability. The datasets (BBBC021, JUMP, ds000228 fMRI) are public, enabling reproduction. However, critical implementation details—such as the specific dropout rates for MC-Dropout, the noise schedule $\sigma_t$ for the SDE, and the exact filtering thresholds ($\mu - 0.5\sigma$, $\mu + 0.5\sigma$) for OOD sample selection—are either unspecified or buried in appendices. The antithetic sampling implementation requires careful handling of Brownian motion negation, which is described but not pseudocoded. Reproduction would be feasible for an expert practitioner but challenging without released code.
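
Since the Brownian-motion negation is not pseudocoded, here is a hedged sketch of what an antithetic pair could look like under Euler–Maruyama: both trajectories share the same increments, with the sign flipped for the second. The function name `antithetic_pair` and the linear toy drift are assumptions chosen so that the effect is easy to verify. In the paper's setting `drift` would be the full learned drift, including the score correction.

```python
import numpy as np

def antithetic_pair(x0, drift, sigma, n_steps=100, rng=None):
    """Simulate two SDE trajectories driven by +dW and -dW respectively."""
    rng = np.random.default_rng() if rng is None else rng
    x_plus = np.array(x0, dtype=float)
    x_minus = x_plus.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        dw = np.sqrt(dt) * rng.standard_normal(x_plus.shape)  # shared increment
        x_plus = x_plus + drift(x_plus, t) * dt + sigma(t) * dw
        x_minus = x_minus + drift(x_minus, t) * dt - sigma(t) * dw  # negated noise
    return x_plus, x_minus

# Toy example with a linear drift, where the noise cancels exactly in the pair mean.
rng = np.random.default_rng(0)
n_steps = 100
x0 = rng.standard_normal(5)
x_plus, x_minus = antithetic_pair(
    x0, drift=lambda x, t: -x, sigma=lambda t: 1.0, n_steps=n_steps, rng=rng
)
pair_mean = 0.5 * (x_plus + x_minus)
noise_free = x0 * (1 - 1 / n_steps) ** n_steps   # Euler solution of dx = -x dt
print(np.max(np.abs(pair_mean - noise_free)))
```

For a linear drift the pair mean coincides with the noise-free Euler solution, which is the variance reduction in its cleanest form. With a nonlinear network drift the cancellation is only partial, so the benefit has to be measured empirically.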

“we adopt a U-Net based architecture... train models for 100 epochs on BBBC021, and 200 epochs on JUMP and fMRI using 2 NVIDIA H100 GPUs”

paper, Section 5.2 · Implementation

“we remove ID samples with distances greater than mu-0.5 sigma and remove OOD samples with distances smaller than mu+0.5 sigma”

paper, Appendix C · Filtering
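
The quoted filtering rule can be sketched as follows, together with a toy illustration of why it may raise AUROC. Several things here are assumptions: the quote does not say which samples $\mu$ and $\sigma$ are computed over (the sketch uses the pooled distances), the anomaly score is simulated as distance plus noise, and the names `filter_by_distance` and `auroc` are hypothetical. None of the numbers relate to the paper's results.

```python
import numpy as np

def auroc(scores, labels):
    """P(score of a random OOD sample > score of a random ID sample); ties count half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

def filter_by_distance(dist, is_ood, mu, sd, half_width=0.5):
    """Return a keep-mask: drop ID samples above mu - 0.5*sd and OOD samples below mu + 0.5*sd."""
    dist = np.asarray(dist, dtype=float)
    is_ood = np.asarray(is_ood, dtype=bool)
    drop_id = ~is_ood & (dist > mu - half_width * sd)
    drop_ood = is_ood & (dist < mu + half_width * sd)
    return ~(drop_id | drop_ood)

# Synthetic prediction-error distances and anomaly scores (toy data only).
rng = np.random.default_rng(0)
n = 2000
is_ood = np.concatenate([np.zeros(n, dtype=bool), np.ones(n, dtype=bool)])
dist = np.concatenate([rng.normal(0.0, 1.0, n), rng.normal(1.0, 1.0, n)])
score = dist + rng.normal(0.0, 1.0, 2 * n)   # ASSUMPTION: score correlates with distance

mu, sd = dist.mean(), dist.std()             # ASSUMPTION: pooled statistics
keep = filter_by_distance(dist, is_ood, mu, sd)

auroc_before = auroc(score, is_ood)
auroc_after = auroc(score[keep], is_ood[keep])
print(f"kept {keep.sum()} of {keep.size}; AUROC {auroc_before:.3f} -> {auroc_after:.3f}")
```

In this toy setup the filter removes exactly the overlapping region between the two classes, so any score correlated with the filtering distance looks better afterwards. How strongly this applies to the paper depends on how correlated its uncertainty scores are with the feature-space error used for filtering.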

Abstract

Distribution-to-distribution generative models support scientific imaging tasks ranging from modeling cellular perturbation responses to translating medical images across conditions. Trustworthy generation requires both reliability (generalization across labs, devices, and experimental conditions) and accountability (detecting out-of-distribution cases where predictions may be unreliable). Uncertainty quantification (UQ) based approaches serve as promising candidates for these tasks, yet UQ for distribution-to-distribution generative models remains underexplored. We present a unified UQ framework, Bayesian Stochastic Flow Matching (BSFM), that disentangles aleatoric and epistemic uncertainty. The Stochastic Flow Matching (SFM) component augments deterministic flows with a diffusion term to improve model generalization to unseen scenarios. For UQ, we develop a scalable Bayesian approach -- MCD-Antithetic -- that combines Monte Carlo Dropout with sample-efficient antithetic sampling to produce effective anomaly scores for out-of-distribution detection. Experiments on cellular imaging (BBBC021, JUMP) and brain fMRI (Theory of Mind) across diverse scenarios show that SFM improves reliability while MCD-Antithetic enhances accountability.
