Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison

cs.CL Caio Vicentino · Mar 23, 2026
Plain-language introduction

This paper addresses a key gap in language model research by conducting what it presents as the first tightly controlled comparison between autoregressive (AR) and masked diffusion language models (MDLM). The author trains both models on identical data (50M tokens from TinyStories), identical compute budget (20K steps, batch size 32), and identical hardware (NVIDIA H100), isolating the generation paradigm as the sole variable. The work is significant because prior studies compared these paradigms at different scales or with different datasets, making it impossible to attribute observed differences to the generation paradigm itself.

Critical review
Verdict
Bottom line

The paper presents a valuable controlled experiment that successfully isolates the effect of generation paradigm on training dynamics and output diversity. The finding that MDLM achieves near-parity in training throughput (95.5% of AR speed) challenges common assumptions about diffusion training costs, and the quantitative diversity analysis over 1,000 samples reveals a striking structural difference: AR exhibits severe prefix mode collapse (99.8% same first word) while MDLM generates 93.4% unique 5-word openings. However, the comparison is complicated by a 31.6% parameter count difference favoring MDLM, and the small scale (123-163M parameters, single dataset) limits generalizability.

“MDLM requiring only 4.7% more wall-clock time, countering the perception that diffusion training is substantially more expensive”
paper · Abstract
“99.8% of AR samples begin with the same word ("Once"), and only 3.3% have a unique 5-word opening. In contrast, 36.1% of MDLM samples begin with a unique first word, and 93.4% have a unique 5-word opening”
paper · Section 4.3
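
The opening-diversity figures quoted above can be reproduced in spirit with a few lines of Python. This is a minimal sketch, not the paper's evaluation script: it assumes whitespace tokenization and reads "unique opening" as "an opening shared with no other sample", which is one plausible interpretation. The function names are hypothetical.

```python
from collections import Counter

def distinct_n(samples, n):
    """Distinct-n: distinct n-grams divided by total n-grams over all samples."""
    ngrams = []
    for text in samples:
        tokens = text.split()  # assumption: whitespace tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def unique_opening_rate(samples, k):
    """Fraction of samples whose first k words appear in no other sample."""
    openings = [tuple(text.split()[:k]) for text in samples]
    counts = Counter(openings)
    return sum(1 for opening in openings if counts[opening] == 1) / len(samples)

def top_first_word_share(samples):
    """Most common first word and the fraction of samples that start with it."""
    firsts = [text.split()[0] for text in samples if text.split()]
    word, count = Counter(firsts).most_common(1)[0]
    return word, count / len(firsts)

# Toy inputs for illustration only; these are not samples from the paper.
toy = [
    "Once upon a time there was a cat",
    "Once upon a time there was a dog",
    "Tom found a red ball outside",
]
print(top_first_word_share(toy))
print(unique_opening_rate(toy, 5))
print(distinct_n(toy, 2))
```

Under this reading, a high share for the top first word combined with a low unique-opening rate is the signature of the prefix collapse reported for AR.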
What holds up

The experimental design is exemplary in its control: identical data, compute steps, batch size, sequence length, optimizer settings, and hardware are used for both models. The training throughput measurement is rigorous, showing MDLM at 48,343 tok/s versus AR at 50,620 tok/s, and the convergence dynamics analysis (Table 3) clearly demonstrates AR overfitting by step 14K while MDLM continues improving monotonically through step 20K. The diversity metrics over 1,000 generated samples (Distinct-n, Self-BLEU, unique openings) provide quantitative evidence for the claimed diversity-fluency trade-off.

“Training time: AR 107.9 min, MDLM 113.0 min. Throughput: AR 50,620 tok/s, MDLM 48,343 tok/s”
paper · Table 2
“AR overfits after step 14K (val loss 1.589 → 1.622); MDLM continues improving monotonically (val loss 3.952 → 3.412)”
paper · Table 3
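
The throughput and wall-clock figures can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in this review and the abstract; it assumes each step consumes batch size × sequence length tokens with no padding, which is an assumption rather than something the review states.

```python
# Figures quoted from Table 2 and the abstract.
ar_tok_s, mdlm_tok_s = 50_620, 48_343
ar_min, mdlm_min = 107.9, 113.0
steps, batch_size, seq_len = 20_000, 32, 512
dataset_tokens = 50_000_000

throughput_ratio = mdlm_tok_s / ar_tok_s   # MDLM speed relative to AR
time_overhead = mdlm_min / ar_min - 1      # extra wall-clock time for MDLM

# Assumption: every position in every batch counts as one processed token.
tokens_seen = steps * batch_size * seq_len
implied_ar_min = tokens_seen / ar_tok_s / 60
implied_mdlm_min = tokens_seen / mdlm_tok_s / 60
passes_over_data = tokens_seen / dataset_tokens

print(f"MDLM throughput relative to AR: {throughput_ratio:.1%}")
print(f"MDLM wall-clock overhead:       {time_overhead:.1%}")
print(f"Tokens processed:               {tokens_seen:,}")
print(f"Implied minutes (AR, MDLM):     {implied_ar_min:.1f}, {implied_mdlm_min:.1f}")
print(f"Passes over the 50M tokens:     {passes_over_data:.2f}")
```

The implied training times match the reported 107.9 and 113.0 minutes, so the throughput and wall-clock numbers are internally consistent. Under the same assumption, both models make roughly six and a half passes over the data, which fits with AR beginning to overfit before step 20K.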
Main concerns

The most significant limitation is the parameter count mismatch: MDLM has 162.7M parameters versus AR's 123.6M (31.6% larger) due to timestep conditioning modules. This asymmetry complicates attribution of the slower MDLM convergence: does it reflect regularization from masking, or simply that a larger model needs more steps to fit? The small scale (single dataset, single random seed, 50M tokens) is acknowledged but still limiting. Additionally, validation losses are not comparable ($\mathcal{L}_{AR}$ measures next-token prediction while $\mathcal{L}_{MDLM}$ measures masked token prediction), which the paper notes but does not fully address as a limitation for comparing model quality. The generation quality analysis, while quantitative in its diversity metrics, lacks human evaluation or automated quality scores (perplexity on held-out text, MAUVE scores) beyond the qualitative examples shown.

“The MDLM model has 31.6% more parameters than the AR model (162.7M vs. 123.6M). This difference is not an oversight but reflects inherent architectural requirements”
paper · Section 3.2
“Val losses use different objectives and are not cross-comparable”
paper · Section 4.2
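
To make the non-comparability concrete, the two objectives can be written side by side. This is a generic sketch: the weight $w(t)$ and the distribution over $t$ depend on the noise schedule, and the paper's exact weighting is not reproduced in this review, so the MDLM form below is an assumption about its general shape rather than the author's formula.

```latex
\begin{align}
\mathcal{L}_{AR}   &= -\frac{1}{L} \sum_{i=1}^{L} \log p_\theta\left(x_i \mid x_{<i}\right) \\
\mathcal{L}_{MDLM} &= \mathbb{E}_{t,\, x_t}\left[\, w(t) \sum_{i \,:\, x_{t,i} = \texttt{[MASK]}} -\log p_\theta\left(x_i \mid x_t\right) \right]
\end{align}
```

Here $x_t$ is the sequence with a schedule-dependent fraction of tokens masked. $\mathcal{L}_{AR}$ scores every position given only its left context, while $\mathcal{L}_{MDLM}$ scores only masked positions given a partially masked two-sided context and reweights them by $w(t)$. The two numbers therefore live on different scales, and a lower value in one column says nothing about the other.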
Evidence and comparison

The evidence supports the specific claims about throughput parity and diversity metrics, but comparisons to related work could be strengthened. The paper cites Sahoo et al. (2024) for MDLM methodology but does not verify that its implementation matches the original (e.g., the cosine schedule $\gamma(t)=1-\cos^2(\pi t/2)$ versus the linear schedule in some diffusion variants). The claim that prior studies use different scales is accurate: Sahoo et al. compare against GPT-2 at different parameter counts, and Nie et al. (2025) compare against LLaMA3 with different data. However, the paper does not engage with SEDD (Lou et al., 2024), which reported improved perplexity over GPT-2 but with architectural differences that might confound SEDD's own comparison.

“Prior comparisons between AR and diffusion language models exist but are not tightly controlled. Sahoo et al. (2024) compare against GPT-2 but at different parameter counts”
paper · Section 2.3
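
The schedule question is easy to illustrate numerically. The sketch below evaluates the cosine schedule quoted above next to a linear one; reading $\gamma(t)$ as the masking probability at time $t$ is an assumption made here for illustration, and the function names are hypothetical.

```python
import math

def cosine_gamma(t):
    """Cosine schedule as quoted in the review: 1 - cos^2(pi * t / 2)."""
    return 1.0 - math.cos(math.pi * t / 2) ** 2

def linear_gamma(t):
    """Linear schedule used in some diffusion variants."""
    return t

# Assumption: gamma(t) is interpreted as the fraction of tokens masked at time t.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  cosine={cosine_gamma(t):.3f}  linear={linear_gamma(t):.3f}")
```

Both schedules agree at $t=0$, $t=0.5$, and $t=1$, but the cosine form masks less than the linear one for $t<0.5$ and more for $t>0.5$. If training samples $t$ uniformly, the two schedules expose the model to different mixes of masking rates, which is why an unverified schedule choice matters for the comparison.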
Reproducibility

Reproducibility is excellent: all code, trained checkpoints (PyTorch), data pipelines, and evaluation scripts are released at https://github.com/caiovicentino/arche. The paper provides detailed hyperparameters (AdamW lr=3e-4, wd=0.01), hardware specifications (NVIDIA H100 80GB), and exact compute costs (~\$70 total). However, the MDLM sampler uses several heuristic choices (temperature annealing 1.2→0.5, repetition penalty 1.3, confidence-based unmasking with 100 steps) that the author notes were "not extensively tuned," potentially affecting the diversity-fluency trade-off results. Using only a single random seed for all experiments is a minor weakness.

“All code, data pipelines, trained checkpoints, and evaluation scripts released at https://github.com/caiovicentino/arche”
paper · Section 5
“Single seed. All experiments use one random seed. Multiple seeds with confidence intervals would strengthen all claims”
paper · Section 5.4 Limitations
Abstract

We present a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models. Both models are trained on identical data (50M tokens from TinyStories), identical compute budget (20,000 steps, batch size 32, sequence length 512), and identical hardware (NVIDIA H100 80GB), isolating the generation paradigm as the sole variable. We report three findings. First, both paradigms achieve comparable training throughput (~50K tokens/second), with MDLM requiring only 4.7% more wall-clock time. Second, AR converges faster and begins overfitting by step 14,000, while MDLM converges more slowly and is still improving at step 20,000, suggesting different compute-optimal training regimes. Third, quantitative diversity analysis over 1,000 generated samples reveals a structural diversity-fluency trade-off: AR produces fluent but repetitive outputs (99.8% begin with the same word), while MDLM generates more diverse narratives (93.4% unique 5-word openings, higher Distinct-n, lower Self-BLEU), at the cost of occasional grammatical inconsistencies. All code, trained checkpoints, and data pipelines are released for reproducibility.
