Generalized Discrete Diffusion from Snapshots

stat.ML cs.AI cs.CL cs.LG Oussama Zekri, Théo Uscidda, Nicolas Boullé, Anna Korba · Mar 22, 2026
Plain-language introduction

Discrete diffusion models have been limited to simplistic noising schemes like uniform corruption or masking, restricting their ability to leverage semantic structure in large vocabularies. This paper introduces GDDS (Generalized Discrete Diffusion from Snapshots), a framework supporting arbitrary continuous-time Markov chain noising processes via exact uniformization-based sampling and a tractable snapshot-level ELBO. The work achieves state-of-the-art results on large-scale language modeling tasks, claiming to surpass autoregressive baselines for the first time at this scale.

Critical review
Verdict
Bottom line

GDDS offers a rigorous mathematical unification of discrete diffusion modeling. The interpolating matrix formulation $K_t = \alpha_t I_m + (1-\alpha_t)\Pi_t$ in Eq. 6 genuinely encompasses prior approaches, while the uniformization algorithm enables exact sampling without matrix exponentiation. The snapshot ELBO provides a clean training objective compatible with standard bidirectional transformers. However, the headline claim of beating autoregressive models requires qualification: the AR baseline uses causal attention whereas GDDS uses bidirectional DDiT, giving the latter an inherent advantage for perplexity metrics. While the paper notes this architectural difference in Section 5, framing the result as purely a diffusion-versus-autoregressive victory is misleading.

“Diffusion models use a DDiT backbone with bidirectional attention and time conditioning... the autoregressive (AR) baseline uses the same backbone with causal self-attention”
What holds up

The theoretical framework is sound and significant. The Proposition in Section 3.1 showing that any rate matrix $Q_t$ can be represented via the interpolating form $K_t = \alpha_t I_m + (1-\alpha_t)\Pi_t$ is mathematically rigorous and genuinely unifies prior masked and uniform diffusion schemes. The uniformization-based Algorithm 1 for exact forward sampling is elegant and practical, requiring only column access to the rate matrix rather than expensive matrix exponentials. The snapshot ELBO derived in Section 4.3, optimizing $\mathcal{L}_{x_0}^{\text{snap}} = \int_0^1 \mathbb{E}[-\log \mu_\theta(x_t,t)_{[x_0]}] dt$, correctly aligns the training objective with the mean parametrization and standard transformer architectures.

“$K_t := \alpha_t I_m + (1-\alpha_t)\Pi_t$”
“$\mathcal{L}_{x_0}^{\mathrm{snap}}(\theta)=\int_{0}^{1}\mathbb{E}_{x_{t}\sim q_{t}(\cdot\mid x_{0})}\left[-\log\mu_{\theta}(x_{t},t)_{[x_{0}]}\right]\,\mathrm{d}t$”
Generalized Discrete Diffusion from Snapshots · Section 4.3, Proposition [Snapshot ELBO]
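The two formulas quoted above can be made concrete with a small sketch. It assumes a column-stochastic convention (column $x_0$ of $K_t$ is read as $q_t(\cdot \mid x_0)$), a mixing matrix $\Pi$ held fixed in time, and a linear schedule $\alpha_t = 1 - t$; none of these choices is confirmed by the review. All function names, including `toy_denoiser` as a stand-in for $\mu_\theta$, are hypothetical.

```python
import math
import random


def kernel_column(alpha, Pi, x0):
    """Column x0 of K = alpha * I + (1 - alpha) * Pi, read as a distribution over x_t."""
    m = len(Pi)
    return [alpha * (1.0 if i == x0 else 0.0) + (1.0 - alpha) * Pi[i][x0]
            for i in range(m)]


def uniform_mixing(m):
    """Pi for uniform corruption: every column is the uniform distribution."""
    return [[1.0 / m] * m for _ in range(m)]


def absorbing_mixing(m, mask):
    """Pi for masking: every column puts all of its mass on the mask state."""
    return [[1.0 if i == mask else 0.0 for _ in range(m)] for i in range(m)]


def snapshot_loss(x0, mu, alpha_fn, Pi, n_samples, rng):
    """Monte Carlo estimate of the integral over t of E[-log mu(x_t, t)[x0]]."""
    m = len(Pi)
    total = 0.0
    for _ in range(n_samples):
        t = rng.random()                              # t ~ Uniform[0, 1)
        probs = kernel_column(alpha_fn(t), Pi, x0)    # assumed q_t(. | x0)
        xt = rng.choices(range(m), weights=probs)[0]  # a single snapshot x_t
        total += -math.log(mu(xt, t)[x0])
    return total / n_samples


def toy_denoiser(m, mask):
    """Hypothetical stand-in for mu_theta under masking: copy a visible token,
    guess uniformly over non-mask tokens when the input is the mask."""
    def mu(xt, t):
        if xt != mask:
            return [1.0 if i == xt else 0.0 for i in range(m)]
        return [0.0 if i == mask else 1.0 / (m - 1) for i in range(m)]
    return mu


if __name__ == "__main__":
    m, mask = 4, 3
    rng = random.Random(0)
    estimate = snapshot_loss(0, toy_denoiser(m, mask), lambda t: 1.0 - t,
                             absorbing_mixing(m, mask), 20000, rng)
    # Under these assumptions the exact value is 0.5 * log(m - 1).
    print(estimate, 0.5 * math.log(m - 1))
```

Swapping `absorbing_mixing` for `uniform_mixing` changes only $\Pi$, which is the sense in which a single interpolating form can cover both masked and uniform corruption. The toy denoiser is only valid for the masking case, since it assigns zero probability to a corrupted visible token.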
Main concerns

The primary concern is the architectural mismatch in the autoregressive comparison. As stated in Section 5, the AR baseline uses causal self-attention while diffusion models use bidirectional attention. Bidirectional models inherently achieve lower perplexity than causal models with comparable parameters because they access future context during training, making the comparison in Table 2 (PPL 8.98 vs 20.49) not apples-to-apples. Additionally, while the semantic-informed kernel (Gauss) shows impressive results in Table 3, the computational overhead of KNN-based sampling with $k=64$ neighbors per token is not quantified against simpler uniform/mask baselines in terms of wall-clock training time or memory. Finally, the information-calibration decomposition in Section 4.3 assumes the snapshot objective minimizes the calibration gap $\mathrm{Cal}_\theta^s$, but does not empirically verify this holds across arbitrary noising processes.

“$\Delta^{NLL}_{\theta}=\underbrace{H(x_{0}\mid s)-H(x_{0}\mid\omega)}_{\mathrm{IPG}\geq 0}+\underbrace{\,\mathrm{Cal}_{\theta}^{s}-\,\mathrm{Cal}_{\theta}^{\omega}}_{\mathrm{CG}}$”
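The sign annotation on the first term of the quoted decomposition follows from a standard information-theoretic identity. The sketch below assumes that the snapshot $s$ is a deterministic function of the full path $\omega$ (for instance, $s$ is the state of $\omega$ at one time); this is a reading of the notation, not a restatement of the paper's proof.

```latex
% Assumption: s is determined by \omega, so conditioning on (\omega, s) equals conditioning on \omega.
\begin{aligned}
\mathrm{IPG}
  &= H(x_0 \mid s) - H(x_0 \mid \omega) \\
  &= H(x_0 \mid s) - H(x_0 \mid \omega, s) \\
  &= I(x_0 ; \omega \mid s) \;\geq\; 0 .
\end{aligned}
```

The calibration gap $\mathrm{CG} = \mathrm{Cal}_\theta^s - \mathrm{Cal}_\theta^\omega$ has no such sign guarantee, which is why the review asks for empirical verification of that term.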
Evidence and comparison

The experimental evidence strongly supports superiority over existing discrete diffusion methods. Table 1 shows GDDS Absorb achieving 1.16 BPC on Text8 versus MDM's 1.58, and Table 2 reports OWT validation perplexity of 7.65 for GDDS Gauss versus 31.03 for MDM and 36.82 for UDLM. Zero-shot transfer results in Table 3 demonstrate that semantic noising provides consistent generalization benefits across seven downstream datasets. However, although prior work such as GIDD (von Rütte et al., 2025) and kinetic-optimal flow matching (Shaul et al., 2024) is discussed in Section 3.1, the empirical comparisons focus mainly on MDM and UDLM. The lack of wall-clock time comparisons for the semantic kernel (which requires KNN lookups) versus standard kernels limits assessment of computational efficiency claims.

“GDDS Gauss (Ours) ... $\leq 7.65$ vs AR (retrain) ... $20.49$”
“we propose velocity formulas that can be applied to any given probability path”
Reproducibility

The paper provides detailed algorithms (1, 2, 3) and reports hyperparameters in Appendix C. The authors commit to releasing code and provide a project page. The main barrier to reproduction would be computational resources: 500k training steps on OpenWebText requires significant GPU time. The semantic kernel implementation using KNN with $k=64$ neighbors is specified, though the exact memory requirements for storing embeddings for vocabularies of size 50,257 are not detailed. The uniformization sampling method (Algorithm 1) is efficient and requires only Poisson sampling and column-wise transitions, making it reproducible without specialized hardware beyond standard deep learning infrastructure.
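The storage question raised above can be bounded with simple arithmetic. The sketch below uses the vocabulary size 50,257 and $k=64$ from the review; the 4-byte entry size and the embedding dimension of 768 are assumptions chosen for illustration, not values reported in the paper, and the function names are hypothetical.

```python
def knn_table_bytes(vocab_size, k, bytes_per_entry=4):
    """Bytes for one (vocab_size x k) table, e.g. neighbour indices or weights."""
    return vocab_size * k * bytes_per_entry


def embedding_bytes(vocab_size, dim, bytes_per_entry=4):
    """Bytes for a (vocab_size x dim) embedding matrix."""
    return vocab_size * dim * bytes_per_entry


def dense_kernel_bytes(vocab_size, bytes_per_entry=4):
    """Bytes for a dense (vocab_size x vocab_size) transition matrix."""
    return vocab_size * vocab_size * bytes_per_entry


def to_mib(n_bytes):
    """Convert bytes to mebibytes."""
    return n_bytes / 2**20


if __name__ == "__main__":
    vocab, k = 50257, 64
    assumed_dim = 768  # assumption: not stated in the review
    print("kNN table (MiB):   ", round(to_mib(knn_table_bytes(vocab, k)), 1))
    print("embeddings (MiB):  ", round(to_mib(embedding_bytes(vocab, assumed_dim)), 1))
    print("dense kernel (MiB):", round(to_mib(dense_kernel_bytes(vocab)), 1))
```

Under these assumptions a sparse neighbour table is three orders of magnitude smaller than a dense transition matrix, which is consistent with the review's point that column access, rather than full matrices, is what keeps the method practical.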

“Sample number of jumps $N_t\sim\mathrm{Poisson}(\bar{f}(t))$ ... Sample jump $z_k\sim F_{T_k}(\cdot,z_{k-1})$”
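The two quoted steps can be illustrated with a minimal uniformization sketch. It covers only the time-homogeneous case, where the Poisson mean reduces to $\lambda t$ with $\lambda \geq \max_j |Q_{jj}|$ and the jump kernel is $F = I + Q/\lambda$; the paper's Algorithm 1 indexes the kernel by jump times $T_k$, which this simplification drops. The column convention (entry `Q[i][j]` is the rate from state `j` to state `i`) and all function names are assumptions.

```python
import math
import random


def sample_poisson(mean, rng):
    """Poisson(mean) draw: count unit-rate exponential arrivals in [0, mean]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def uniformization_sample(x0, Q, t, rng):
    """Sample x_t given x_0 for a time-homogeneous rate matrix Q (columns sum to 0)."""
    m = len(Q)
    lam = max(-Q[j][j] for j in range(m))  # uniformization rate
    if lam == 0.0:
        return x0                           # no state ever jumps
    z = x0
    for _ in range(sample_poisson(lam * t, rng)):
        # Column z of F = I + Q / lam; only this one column of Q is needed.
        col = [(1.0 if i == z else 0.0) + Q[i][z] / lam for i in range(m)]
        z = rng.choices(range(m), weights=col)[0]
    return z


def absorbing_rates(m, mask, rate=1.0):
    """Toy rate matrix: every non-mask state jumps to the mask at the given rate."""
    Q = [[0.0] * m for _ in range(m)]
    for j in range(m):
        if j != mask:
            Q[j][j] = -rate
            Q[mask][j] = rate
    return Q


if __name__ == "__main__":
    rng = random.Random(1)
    Q = absorbing_rates(3, 2)
    draws = [uniformization_sample(0, Q, 1.0, rng) for _ in range(20000)]
    # For this toy chain the token survives unmasked with probability exp(-t).
    print(draws.count(0) / len(draws), math.exp(-1.0))
```

No matrix exponential is formed: each jump reads one column of $Q$, which matches the column-access property the review highlights.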
Abstract

We introduce Generalized Discrete Diffusion from Snapshots (GDDS), a unified framework for discrete diffusion modeling that supports arbitrary noising processes over large discrete state spaces. Our formulation encompasses all existing discrete diffusion approaches, while allowing significantly greater flexibility in the choice of corruption dynamics. The forward noising process relies on uniformization and enables fast arbitrary corruption. For the reverse process, we derive a simple evidence lower bound (ELBO) based on snapshot latents, instead of the entire noising path, that allows efficient training of standard generative modeling architectures with clear probabilistic interpretation. Our experiments on large-vocabulary discrete generation tasks suggest that the proposed framework outperforms existing discrete diffusion methods in terms of training efficiency and generation quality, and beats autoregressive models for the first time at this scale. We provide the code along with a blog post on the project page: https://oussamazekri.fr/gdds.
