Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization
This paper tackles the combinatorial explosion in Mixture-of-Experts (MoE) architecture design, where existing scaling-law studies either add too many variables to fit reliably or isolate MoE components while ignoring global interactions. The authors propose a holistic framework that uses algebraic constraints and a rank-preserving property of the hidden dimension $d$ to collapse the search space from $\mathcal{O}(n^{16})$ to two manageable sequential searches costing $\mathcal{O}(n^3)+\mathcal{O}(n^2)$. They derive closed-form scaling laws mapping compute budgets to optimal configurations across $10^{18}$ to $3 \times 10^{20}$ FLOPs and find that near-optimal architectural bands widen at larger scales, which gives practitioners actionable guidance for resource-efficient MoE deployment.
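To give a sense of what the reduction buys, the following sketch counts grid evaluations before and after the decomposition. It assumes a plain grid with the same number of candidate values `n` per dimension; the value `n = 4` and the function names are illustrative and do not come from the paper.

```python
def grid_evaluations(n: int, dims: int) -> int:
    """Size of a full grid with n candidate values along each of dims axes."""
    return n ** dims


def two_phase_evaluations(n: int) -> int:
    """Cost of a 3-dimensional search followed by a 2-dimensional search."""
    return grid_evaluations(n, 3) + grid_evaluations(n, 2)


n = 4  # hypothetical number of candidate values per dimension
full = grid_evaluations(n, 16)
reduced = two_phase_evaluations(n)
print(f"full grid: {full}, two-phase: {reduced}")
```

Even with only four candidates per dimension, the full grid holds billions of configurations while the two-phase search needs 80, which is why the decomposition is what makes the study feasible at all.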
The paper makes a solid methodological contribution by deriving the first holistic MoE scaling laws that account for interactions between global architectural parameters (hidden dimension, layer counts) and MoE-specific factors (expert granularity). The dimension decomposition strategy is rigorous, and the empirical validation across six compute scales and over 670 models provides reasonable support for the claimed scaling laws. However, the extrapolation reliability to future trillion-parameter scales remains uncertain given the limited experimental budget relative to fitted coefficients, and the fixed sparsity configuration ($N_e=288, K=8$) may limit applicability to alternative routing strategies.
The dimension decomposition methodology is elegant and sound. By applying algebraic constraints to fix hardware-dependent parameters (heads, sequence length) and leveraging a rank-preserving property of the hidden dimension $d$, the authors reduce an intractable 16-dimensional problem to a two-phase search. The rank-preserving assumption is empirically validated with strong statistical evidence (Pearson $r=0.9793$), justifying the use of a median proxy for $d$ during the initial search phase. The resulting scaling laws for $M/N_a$ and $d$ provide deterministic, closed-form recipes for architecture design, and the empirical finding that near-optimal bands widen at larger scales offers practically useful engineering flexibility.
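One plausible way to check a rank-preserving property of this kind is to train the same set of candidate configurations at two values of $d$ and compare the resulting losses. The sketch below does that with a Pearson correlation and a rank comparison. It is an assumption about the shape of the test, not the authors' code, and all loss values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """Rank of each entry, 0 = smallest. Ties are not handled."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


# Hypothetical losses of five candidate configurations, trained once at the
# median proxy value of d and once at a different d.
loss_at_proxy_d = [2.91, 2.87, 2.85, 2.88, 2.94]
loss_at_other_d = [2.89, 2.84, 2.83, 2.86, 2.93]

r = pearson(loss_at_proxy_d, loss_at_other_d)
same_ordering = ranks(loss_at_proxy_d) == ranks(loss_at_other_d)
print(f"Pearson r = {r:.4f}, ordering preserved: {same_ordering}")
```

If the ordering of configurations is stable across $d$, the best configuration found at the proxy value remains the best at other values, which is the property the first search phase depends on.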
The primary limitation is experimental scale versus ambition: the authors acknowledge that training $\sim$670 models may not guarantee robust power-law coefficients for extrapolation, especially compared to Hoffmann et al., who trained 400+ models to fit just 5 coefficients. A methodological circularity also arises because the initial $(C,M,D)$ scaling laws were fitted using heuristic architectures, while the subsequently discovered optimal $(N_a, N, d)$ configurations may deviate from these heuristics; an iterative refinement that would close this loop was omitted due to resource constraints. Furthermore, the entire study fixes $N_e=288$ experts and $K=8$ active experts, so the laws may not transfer to different sparsity regimes. Finally, the evaluation relies solely on pre-training cross-entropy loss, leaving open whether these configurations optimize downstream reasoning as well as memorization, which prior work suggests may scale differently for MoE.
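The concern about coefficient robustness is easier to see with the fitting procedure in hand. The sketch below fits a generic power law $y = a\,C^{b}$ by least squares in log-log space. The functional form, the six budget values, and the synthetic targets are assumptions chosen for illustration; the paper's actual laws and data are not reproduced here.

```python
import math


def fit_power_law(cs, ys):
    """Fit y = a * c**b by ordinary least squares on (log c, log y)."""
    lx = [math.log(c) for c in cs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) / sum(
        (x - mx) ** 2 for x in lx
    )
    a = math.exp(my - b * mx)
    return a, b


# Six hypothetical compute budgets inside the range the review mentions.
budgets = [1e18, 3e18, 1e19, 3e19, 1e20, 3e20]
# Synthetic, noise-free targets generated from a known law (a=0.5, b=0.3).
targets = [0.5 * c ** 0.3 for c in budgets]

a_hat, b_hat = fit_power_law(budgets, targets)
print(f"a = {a_hat:.4f}, b = {b_hat:.4f}")
```

With six budgets, each law is fitted to six points. On noise-free data the exponent is recovered exactly, but with real training noise a small error in $b$ is multiplied by the log-distance to the target budget, which is why extrapolation far beyond $3 \times 10^{20}$ FLOPs deserves caution.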
The evidence adequately supports the central methodological claims. The reduction from $\mathcal{O}(n^{16})$ follows logically from the algebraic constraints in Eq. (3) and (4), and the rank-preserving property is statistically validated rather than assumed. The comparison with related work is fair and well-contextualized: the authors correctly identify the gap between studies like Clark et al. (2022) that add MoE variables to scaling laws and those like Ludziejewski et al. (2024) that fix non-MoE factors. They accurately note that existing approaches either require prohibitive experimental budgets or ignore interactions between global and local architectural dimensions, positioning their holistic approach as the necessary middle ground.
Reproducibility is partially addressed but insufficient for full independent verification. The paper provides exhaustive layer-wise FLOPs formulas in Appendix A, detailed hyperparameter settings in Appendix F.1 (including fixed values for $N_e$, $K$, sequence length $S=8192$, and learning rate schedules), and uses the open-source Megatron-LM framework. However, the training dataset is only vaguely described as a 'high-quality curated text corpus' without public links, exact data mixtures, or preprocessing code. No code repository URL or model checkpoints are provided in the text. Without access to the exact data distribution and training scripts, independent reproduction of the specific scaling laws would be difficult, though the methodological framework itself is described clearly enough to be replicated with different data.
Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. Existing MoE scaling studies are constrained by experimental budgets to either augment scaling formulas with extra MoE variables, risking unreliable fits, or fix all non-MoE factors, ignoring global interactions. We propose a reusable framework for holistic MoE architectural optimization that bridges this gap. We first show that FLOPs per token alone is an inadequate fairness metric for MoE models because differing computational densities across layer types can inflate parameters without proportional compute cost, and establish a joint constraint triad of FLOPs per token, active parameters, and total parameters. We then reduce the 16-dimensional architectural search space to two sequential low-dimensional phases through algebraic constraints and a rank-preserving property of the hidden dimension. Validated across hundreds of MoE models spanning six orders of magnitude in compute, our framework yields robust scaling laws that map any compute budget to a complete, optimal MoE architecture. A key finding is that the near-optimal configuration band widens with scale, giving practitioners quantitative flexibility to balance scaling law recommendations against infrastructure constraints.
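The gap between active and total parameters that motivates the constraint triad can be illustrated with the sparsity setting used in the study ($N_e=288$, $K=8$). The sketch below assumes each expert is a bias-free two-matrix feed-forward block of size `d x d_ff`; that expert shape and the dimensions are hypothetical and are not taken from the paper.

```python
def expert_params(d: int, d_ff: int) -> int:
    """Parameters of one expert, assuming an up- and a down-projection, no biases."""
    return 2 * d * d_ff


def moe_layer_params(d: int, d_ff: int, n_experts: int, top_k: int):
    """Return (total, active) expert parameters of one MoE layer."""
    per_expert = expert_params(d, d_ff)
    return n_experts * per_expert, top_k * per_expert


# N_e and K follow the review; d and d_ff are illustrative values.
total, active = moe_layer_params(d=1024, d_ff=512, n_experts=288, top_k=8)
print(f"total: {total}, active: {active}, ratio: {total / active}")
```

Under this assumption only 8 of the 288 experts contribute to per-token compute, so the layer stores 36 times more expert parameters than it uses per token. Two models with equal FLOPs per token can therefore differ widely in total size, which is the reason a single-metric comparison is inadequate.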