Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison
This paper addresses a key gap in language model research by conducting the first tightly controlled comparison between autoregressive (AR) and masked diffusion language models (MDLM). The author trains both models on identical data (50M tokens from TinyStories), with an identical compute budget (20K steps, batch size 32) and identical hardware (NVIDIA H100), isolating the generation paradigm as the sole variable. The work is significant because prior studies compared these paradigms at different scales or with different datasets, making it impossible to attribute observed differences to the generation paradigm itself.
The paper presents a valuable controlled experiment that largely isolates the effect of generation paradigm on training dynamics and output diversity. The finding that MDLM achieves near-parity in training throughput (95.5% of AR speed) challenges common assumptions about diffusion training costs, and the quantitative diversity analysis over 1,000 samples reveals a striking structural difference: AR exhibits severe prefix mode collapse (99.8% same first word) while MDLM generates 93.4% unique 5-word openings. However, the comparison is complicated by MDLM having 31.6% more parameters than AR, and the small scale (123.6M and 162.7M parameters, single dataset) limits generalizability.
The experimental design is exemplary in its control: identical data, compute steps, batch size, sequence length, optimizer settings, and hardware are used for both models. The training throughput measurement is rigorous, showing MDLM at 48,343 tok/s versus AR at 50,620 tok/s, and the convergence dynamics analysis (Table 3) clearly demonstrates AR overfitting by step 14K while MDLM continues improving monotonically through step 20K. The diversity metrics over 1,000 generated samples (Distinct-n, Self-BLEU, unique openings) provide quantitative evidence for the claimed diversity-fluency trade-off.
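To make the diversity metrics named above concrete, the following is a minimal sketch of Distinct-n, the unique-opening rate, and the share of the most common first word. It assumes whitespace tokenization and is not taken from the paper's evaluation scripts; the function names `distinct_n`, `unique_opening_rate`, and `top_first_word_share` are hypothetical.

```python
from collections import Counter


def distinct_n(samples, n):
    """Unique n-grams divided by total n-grams, pooled over all samples."""
    total = 0
    unique = set()
    for text in samples:
        tokens = text.split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def unique_opening_rate(samples, k=5):
    """Fraction of samples whose first k tokens form a distinct opening."""
    if not samples:
        return 0.0
    openings = {tuple(text.split()[:k]) for text in samples}
    return len(openings) / len(samples)


def top_first_word_share(samples):
    """Share of non-empty samples that begin with the most common first word."""
    first_words = [text.split()[0] for text in samples if text.split()]
    if not first_words:
        return 0.0
    _, count = Counter(first_words).most_common(1)[0]
    return count / len(first_words)
```

Under this reading, the reported 99.8% for AR corresponds to `top_first_word_share`, and the reported 93.4% for MDLM corresponds to `unique_opening_rate` with `k=5`. The paper's exact tokenization and pooling may differ.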
The most significant limitation is the parameter count mismatch: MDLM has 162.7M parameters versus AR's 123.6M (31.6% larger) due to timestep conditioning modules. This asymmetry complicates attribution of the slower MDLM convergence: does it reflect regularization from masking, or simply that a larger model needs more steps to fit? The small scale (single dataset, single random seed, 50M tokens) is acknowledged but still limiting. Additionally, validation losses are not comparable ($\mathcal{L}_{AR}$ measures next-token prediction while $\mathcal{L}_{MDLM}$ measures masked token prediction), which the paper notes but doesn't fully address as a limitation for comparing model quality. The generation quality analysis, while quantitative in diversity metrics, lacks human evaluation or automated quality scores (perplexity on held-out text, MAUVE scores) beyond the qualitative examples shown.
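The headline percentages can be checked directly from the figures quoted in this review. The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# Figures as quoted in the review (parameters in millions, throughput in tokens/second).
ar_params_m = 123.6
mdlm_params_m = 162.7
ar_tok_s = 50_620
mdlm_tok_s = 48_343

# How much larger MDLM is than AR, in percent.
param_excess_pct = (mdlm_params_m / ar_params_m - 1) * 100

# MDLM throughput as a percentage of AR throughput.
throughput_ratio_pct = mdlm_tok_s / ar_tok_s * 100

# Extra wall-clock time MDLM needs to process the same number of tokens.
extra_time_pct = (ar_tok_s / mdlm_tok_s - 1) * 100

print(f"MDLM parameter excess: {param_excess_pct:.1f}%")
print(f"MDLM throughput relative to AR: {throughput_ratio_pct:.1f}%")
print(f"MDLM extra wall-clock time: {extra_time_pct:.1f}%")
```

The three values round to 31.6%, 95.5%, and 4.7%, so the review's parameter and throughput figures are consistent with the abstract's wall-clock claim.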
The evidence supports the specific claims about throughput parity and diversity metrics, but comparisons to related work could be strengthened. The paper cites Sahoo et al. (2024) for MDLM methodology, but doesn't verify whether the author's implementation matches the original (e.g., cosine schedule $\gamma(t)=1-\cos^2(\pi t/2)$ versus the linear schedule in some diffusion variants). The claim that prior studies use different scales cites Sahoo et al. comparing against GPT-2 at different parameter counts and Nie et al. (2025) comparing against LLaMA3 with different data, which is accurate. However, the paper doesn't engage with SEDD (Lou et al., 2024), which reported improved perplexity over GPT-2 but with architectural differences that might confound that comparison.
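For readers unfamiliar with the schedule in question, the sketch below evaluates $\gamma(t)=1-\cos^2(\pi t/2)$ and uses it to mask a token sequence. It assumes $\gamma(t)$ is the per-token masking probability at time $t \in [0, 1]$ and that tokens are masked independently; the review does not state either, and `mask_rate`, `apply_mask`, and the `[MASK]` string are hypothetical names, not the paper's code.

```python
import math
import random


def mask_rate(t):
    """Cosine schedule gamma(t) = 1 - cos^2(pi * t / 2), equal to sin^2(pi * t / 2)."""
    return 1.0 - math.cos(math.pi * t / 2.0) ** 2


def apply_mask(tokens, t, mask_token="[MASK]", rng=None):
    """Replace each token with mask_token independently with probability gamma(t).

    Assumption: independent per-token masking, as in generic masked diffusion.
    """
    rng = rng or random.Random(0)
    rate = mask_rate(t)
    return [mask_token if rng.random() < rate else tok for tok in tokens]


if __name__ == "__main__":
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"t={t:.2f}  gamma={mask_rate(t):.3f}")
    print(apply_mask("once upon a time there was a cat".split(), t=0.5))
```

A linear schedule would instead use `rate = t`. Comparing the two rates at the same `t` shows where they diverge, which is the kind of implementation detail the review asks the author to confirm.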
Reproducibility is excellent: all code, trained checkpoints (PyTorch), data pipelines, and evaluation scripts are released at https://github.com/caiovicentino/arche. The paper provides detailed hyperparameters (AdamW lr=3e-4, wd=0.01), hardware specifications (NVIDIA H100 80GB), and exact compute costs (~\$70 total). However, the MDLM sampler uses several heuristic choices (temperature annealing 1.2→0.5, repetition penalty 1.3, confidence-based unmasking with 100 steps) that the author notes were "not extensively tuned," potentially affecting the diversity-fluency trade-off results. Using only a single random seed for all experiments is a minor weakness.
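To show where these sampler heuristics enter, the sketch below implements a temperature schedule from 1.2 to 0.5 over 100 steps and a confidence-based choice of positions to unmask. The linear interpolation, the top-k selection rule, and the names `annealed_temperature` and `pick_positions_to_unmask` are assumptions for illustration; the review does not specify how the released sampler implements either.

```python
def annealed_temperature(step, num_steps=100, t_start=1.2, t_end=0.5):
    """Temperature at a given step, linearly interpolated (assumed form).

    Returns t_start at step 0 and t_end at step num_steps - 1.
    """
    if num_steps <= 1:
        return t_end
    frac = step / (num_steps - 1)
    return t_start + frac * (t_end - t_start)


def pick_positions_to_unmask(confidences, masked_positions, k):
    """Return the k still-masked positions with the highest confidence.

    confidences[i] is the model's confidence in its prediction at position i.
    The result is sorted by position for readability.
    """
    ranked = sorted(masked_positions, key=lambda i: confidences[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 99):
        print(step, round(annealed_temperature(step), 3))
    print(pick_positions_to_unmask([0.1, 0.9, 0.4, 0.8], [0, 2, 3], k=2))
```

Each of these choices (schedule shape, endpoints, number of positions unmasked per step) shifts the balance between diversity and fluency, which is why an untuned sampler weakens conclusions drawn from the generated samples.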
We present a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models. Both models are trained on identical data (50M tokens from TinyStories), identical compute budget (20,000 steps, batch size 32, sequence length 512), and identical hardware (NVIDIA H100 80GB), isolating the generation paradigm as the sole variable. We report three findings. First, both paradigms achieve comparable training throughput (~50K tokens/second), with MDLM requiring only 4.7% more wall-clock time. Second, AR converges faster and begins overfitting by step 14,000, while MDLM converges more slowly and is still improving at step 20,000, suggesting different compute-optimal training regimes. Third, quantitative diversity analysis over 1,000 generated samples reveals a structural diversity-fluency trade-off: AR produces fluent but repetitive outputs (99.8% begin with the same word), while MDLM generates more diverse narratives (93.4% unique 5-word openings, higher Distinct-n, lower Self-BLEU), at the cost of occasional grammatical inconsistencies. All code, trained checkpoints, and data pipelines are released for reproducibility.