Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization

cs.LG Weilin Wan, Jingtao Han, Weizhong Zhang, Cheng Jin · Mar 23, 2026

Plain-language introduction

This paper tackles the combinatorial explosion in Mixture-of-Experts (MoE) architecture design, where existing scaling-law studies either add so many variables that the fits become unreliable or isolate MoE components and ignore global interactions. The authors propose a holistic framework that uses algebraic constraints and a rank-preserving property of the hidden dimension $d$ to collapse the search cost from $\mathcal{O}(n^{16})$ to a two-phase search costing $\mathcal{O}(n^3)+\mathcal{O}(n^2)$. They derive closed-form scaling laws that map compute budgets to optimal configurations across $10^{18}$ to $3 \times 10^{20}$ FLOPs and find that the band of near-optimal architectures widens at larger scales, which gives practitioners actionable guidance for resource-efficient MoE deployment.
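To give a feel for the size of that reduction, the following is a minimal arithmetic sketch. It assumes, purely for illustration, that $n$ is the number of candidate values tried per architectural dimension and that the asymptotic counts can be read as exact run counts with constant factors ignored; the function names are hypothetical and not taken from the paper.

```python
# Illustrative arithmetic only: "n" is the number of candidate values tried
# per architectural dimension, and constant factors are ignored (assumption).

def joint_grid_runs(n: int, dims: int = 16) -> int:
    """Training runs needed for an exhaustive grid over `dims` dimensions."""
    return n ** dims


def two_phase_runs(n: int) -> int:
    """Runs for a 3-dimensional phase followed by a 2-dimensional phase."""
    return n ** 3 + n ** 2


if __name__ == "__main__":
    for n in (3, 5, 8):
        joint = joint_grid_runs(n)
        phased = two_phase_runs(n)
        print(f"n={n}: joint grid ~{joint:.3e} runs, two-phase {phased} runs")
```

Even at three candidate values per dimension, the joint grid would need tens of millions of training runs, whereas the two-phase count stays in the tens.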

Critical review

Verdict

Bottom line

The paper makes a solid methodological contribution by deriving the first holistic MoE scaling laws that account for interactions between global architectural parameters (hidden dimension, layer counts) and MoE-specific factors (expert granularity). The dimension decomposition strategy is rigorous, and the empirical validation across six compute scales and over 670 models provides reasonable support for the claimed scaling laws. However, it remains uncertain how reliably the laws extrapolate to future trillion-parameter scales, given the limited number of training runs relative to the number of fitted coefficients, and the fixed sparsity configuration ($N_e=288, K=8$) may limit applicability to alternative routing strategies.

“training over 400 language models ranging from 70 million to over 16 billion parameters”

Hoffmann et al., arXiv:2203.15556 · Abstract

“Directly fitting scaling laws across such a high-dimensional space would demand a prohibitive experimental cost, scaling at an impractical rate of $\mathcal{O}(n^{16})$”

Wan et al., Sec. 4.2 · Section 4.2

What holds up

The dimension decomposition methodology is elegant and sound. By applying algebraic constraints to fix hardware-dependent parameters (heads, sequence length) and leveraging a rank-preserving property of the hidden dimension $d$, the authors reduce an intractable 16-dimensional problem to a two-phase search. The rank-preserving assumption is empirically validated with strong statistical evidence (Pearson $r=0.9793$), justifying the use of a median proxy for $d$ during the initial search phase. The resulting scaling laws for $M/N_a$ and $d$ provide deterministic, closed-form recipes for architecture design, and the empirical finding that near-optimal bands widen at larger scales offers practically useful engineering flexibility.

“Pearson $r$ Value 0.9793, $p$-value $7.89 \times 10^{-7}$”

Wan et al., Table 3 · Table 3

“the relative performance rankings of distinct architectural configurations... remain remarkably consistent across a wide range of valid choices for the hidden dimension $d$”

Wan et al., Sec. 4.3.1 · Section 4.3.1
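A rank-preservation check of this kind can be sketched as follows. This is a minimal illustration, not the authors' procedure: the loss values are invented, the helper names are hypothetical, and the rank function assumes no tied values. It compares how the same set of configurations ranks at two different values of $d$, using Pearson's $r$ on the raw losses and Spearman's $\rho$ on the ranks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """Rank 1 = smallest value. Assumes no ties (illustrative simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical final losses for the same six configurations, trained at
# two different hidden dimensions. These numbers are made up.
loss_d_small = [2.91, 2.85, 2.88, 2.97, 2.83, 2.94]
loss_d_large = [2.80, 2.75, 2.77, 2.86, 2.72, 2.84]

if __name__ == "__main__":
    print("Pearson r:   ", round(pearson(loss_d_small, loss_d_large), 4))
    print("Spearman rho:", round(spearman(loss_d_small, loss_d_large), 4))
```

If the ordering of configurations is stable across $d$, the rank correlation stays close to 1, which is what would justify searching the other dimensions at a single proxy value of $d$.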

Main concerns

The primary limitation is the mismatch between experimental scale and ambition: the authors acknowledge that training $\sim$670 models may not guarantee robust power-law coefficients for extrapolation, especially compared to Hoffmann et al.'s 400+ models for just 5 coefficients. A methodological circularity arises because the initial $(C,M,D)$ scaling laws were fitted using heuristic architectures, while the subsequently discovered optimal $(N_a, N, d)$ configurations may deviate from these heuristics; an iterative refinement that would close this loop was omitted due to resource constraints. Furthermore, the entire study fixes $N_e=288$ experts and $K=8$ active experts, so the laws may not transfer to different sparsity regimes. Finally, the evaluation relies solely on pre-training cross-entropy loss, leaving open whether these configurations optimize downstream reasoning versus memorization, which prior work suggests may scale differently for MoE.

“the total number of experiments ($\sim$670 models) may not fully guarantee the robustness of all fitted coefficients, particularly for the power-law relationships that govern extrapolation to larger scales”

Wan et al., Sec. 7 · Section 7, Limitations

“All experiments use $N_e=288$ and $K=8$... the derived scaling laws may not directly transfer to substantially different $(N_e,K)$ configurations”

Wan et al., Sec. 7 · Section 7, Limitations

“Evaluation limited to pre-training loss... MoE scaling behaviors can differ significantly across downstream tasks”

Wan et al., Sec. 7 · Section 7, Limitations
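The extrapolation concern can be made concrete with a small sketch. It assumes a generic power law $y = a \cdot C^{b}$ fitted by ordinary least squares in log-log space, which is a common approach but not necessarily the fitting procedure used in the paper. The six budgets, the coefficients, and the 2% perturbation are all hypothetical, chosen only to span the studied compute range.

```python
from math import exp, log


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    a = exp(my - b * mx)
    return a, b


def predict(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** b


# Hypothetical compute budgets (FLOPs) inside the studied range, with
# synthetic targets generated from y = 0.5 * C**0.25 (made-up coefficients).
budgets = [1e18, 3e18, 1e19, 3e19, 1e20, 3e20]
clean = [0.5 * c ** 0.25 for c in budgets]

# Same data, but the largest-budget measurement is off by 2% (assumed noise).
noisy = clean[:-1] + [clean[-1] * 1.02]

a_clean, b_clean = fit_power_law(budgets, clean)
a_noisy, b_noisy = fit_power_law(budgets, noisy)

if __name__ == "__main__":
    target = 1e23  # far outside the fitted range
    ratio = predict(a_noisy, b_noisy, target) / predict(a_clean, b_clean, target)
    print(f"exponent: clean={b_clean:.4f}, noisy={b_noisy:.4f}")
    print(f"prediction ratio at C={target:.0e}: {ratio:.4f}")
```

Because a shift in the fitted exponent compounds with distance from the data, a small error at one end of the fitted range produces a larger relative error far beyond it, which is why the ratio of runs to fitted coefficients matters for extrapolation.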

Evidence and comparison

The evidence adequately supports the central methodological claims. The reduction from $\mathcal{O}(n^{16})$ follows logically from the algebraic constraints in Eq. (3) and (4), and the rank-preserving property is statistically validated rather than assumed. The comparison with related work is fair and well-contextualized: the authors correctly identify the gap between studies like Clark et al. (2022) that add MoE variables to scaling laws and those like Ludziejewski et al. (2024) that fix non-MoE factors. They accurately note that existing approaches either require prohibitive experimental budgets or ignore interactions between global and local architectural dimensions, positioning their holistic approach as the necessary middle ground.

“Constrained by experimental budgets, existing MoE scaling studies have generally adopted one of two strategies... This paper identifies and addresses the critical research space in the gap between these two strategies”

Wan et al., Sec. 1 · Section 1

“All computed correlation measures (Pearson $r$, Spearman $\rho$, and Kendall $\tau$) are exceptionally high and statistically significant, with $p$-values well below conventional thresholds”

Wan et al., Sec. 4.3.1 · Section 4.3.1

Reproducibility

Reproducibility is partially addressed but insufficient for full independent verification. The paper provides exhaustive layer-wise FLOPs formulas in Appendix A, detailed hyperparameter settings in Appendix F.1 (including fixed values for $N_e$, $K$, sequence length $S=8192$, and learning rate schedules), and uses the open-source Megatron-LM framework. However, the training dataset is only vaguely described as a 'high-quality curated text corpus' without public links, exact data mixtures, or preprocessing code. No code repository URL or model checkpoints are provided in the text. Without access to the exact data distribution and training scripts, independent reproduction of the specific scaling laws would be difficult, though the methodological framework itself is described clearly enough to be replicated with different data.

“The exhaustive layer-wise analytical formulas used to compute $M$ for a certain MoE configuration”

Wan et al., Appendix A · Appendix A

“all models utilize a high-quality curated text corpus from online sources... Tokenization is performed using a tokenizer with a fixed vocabulary size of 152064”

Wan et al., Sec. 3.4 · Section 3.4

Abstract

Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. Existing MoE scaling studies are constrained by experimental budgets to either augment scaling formulas with extra MoE variables, risking unreliable fits, or fix all non-MoE factors, ignoring global interactions. We propose a reusable framework for holistic MoE architectural optimization that bridges this gap. We first show that FLOPs per token alone is an inadequate fairness metric for MoE models because differing computational densities across layer types can inflate parameters without proportional compute cost, and establish a joint constraint triad of FLOPs per token, active parameters, and total parameters. We then reduce the 16-dimensional architectural search space to two sequential low-dimensional phases through algebraic constraints and a rank-preserving property of the hidden dimension. Validated across hundreds of MoE models spanning six orders of magnitude in compute, our framework yields robust scaling laws that map any compute budget to a complete, optimal MoE architecture. A key finding is that the near-optimal configuration band widens with scale, giving practitioners quantitative flexibility to balance scaling law recommendations against infrastructure constraints.
