On the Role of Batch Size in Stochastic Conditional Gradient Methods
This paper studies how batch size and sequence length should scale with the total token budget in stochastic conditional gradient methods for LLM training. Under a $\mu$-Kurdyka-{\L}ojasiewicz ($\mu$-KL) condition, the authors derive a BST (Batch-Sequence-Token) scaling rule $BS \asymp T^{2/3}$ that predicts three distinct regimes: noise-dominated, batch-independent optimal, and iteration-starved. The theory yields actionable guidelines for adaptive batch size scheduling and is validated on NanoGPT models up to 1B parameters.
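As a rough illustration of what the rule $BS \asymp T^{2/3}$ implies in practice, the sketch below turns a token budget into a tokens-per-step target. It assumes that $T$ is the total token budget and that $BS$ is the product of batch size $B$ and sequence length $S$. The constant `c` and all function names are hypothetical: the rule fixes only the exponent, and the prefactor depends on problem constants that must be estimated separately.

```python
def bst_tokens_per_step(token_budget: float, c: float = 1.0) -> float:
    """Tokens per step B*S suggested by B*S = c * T**(2/3).

    `c` is a hypothetical, problem-dependent prefactor; the scaling rule
    itself only determines the exponent 2/3.
    """
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    return c * token_budget ** (2.0 / 3.0)


def split_batch(tokens_per_step: float, seq_len: int) -> int:
    """Number of sequences B per step for a fixed sequence length S."""
    return max(1, round(tokens_per_step / seq_len))


def num_steps(token_budget: float, batch: int, seq_len: int) -> int:
    """Optimizer steps available when each step consumes B*S tokens."""
    return int(token_budget // (batch * seq_len))


if __name__ == "__main__":
    for budget in (1e9, 8e9):
        bs = bst_tokens_per_step(budget)
        b = split_batch(bs, seq_len=1024)
        print(budget, bs, b, num_steps(budget, b, 1024))
```

Under this rule an 8x larger budget calls for 4x more tokens per step, since $8^{2/3} = 4$, so the number of steps grows only as $T^{1/3}$.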
The paper presents a theoretically grounded framework for batch size scaling that bridges optimization theory and large-scale practice. The derivation of the $T^{2/3}$ scaling law from first principles under the $\mu$-KL condition is rigorous, and the empirical validation on language model training supports the predicted non-monotone relationship between batch size and optimization error. However, the practical applicability depends on estimating problem-dependent constants ($L$, $\mu$, $\rho$) that may vary across architectures, and the validation is limited to a single model family (NanoGPT) with a specific optimizer (Scion).
The convergence analysis under the $\mu$-KL condition (Theorem 4.1) correctly captures the tension between stochastic noise reduction and iteration budget constraints. The identification of three distinct scaling regimes, namely noise-dominated, batch-independent optimal (flat), and iteration-starved (degradation), is empirically validated in Tables 1-2 and Figure 3, which show that batch sizes beyond the BST threshold indeed hurt performance under fixed token budgets. The distinction between $\mu$P's local stability guarantees and the global trajectory efficiency characterized here is conceptually sharp.
The $\mu$-KL assumption (Assumption 3.3) is verified only for a 124M parameter model and only holds reliably when training loss falls below 5 (Figure 1), leaving its validity for other architectures, scales, or training phases unclear. The power-law estimation of problem constants $L$, $\mu$, and $\rho$ (Section 6.4) requires extensive auxiliary experiments that may be as expensive as the hyperparameter search the theory aims to avoid. Additionally, the variance assumption $\sigma^2 = \sigma_\star^2/(BS)$ (Assumption 3.4) is tested only for batch sizes up to 4096, leaving the theory's predictions for extreme-scale training unvalidated.
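To make concrete what testing Assumption 3.4 amounts to, the sketch below simulates the idealized case behind $\sigma^2 = \sigma_\star^2/(BS)$ and checks the exponent with a log-log fit. It is not the paper's measurement procedure: it assumes independent Gaussian per-token noise, and the function names, batch sizes, and trial count are illustrative choices. Real tokens within a sequence are correlated, which is one reason the assumption needs empirical verification at larger scales.

```python
import math
import random
import statistics


def minibatch_mean_variance(n_tokens, sigma_star=1.0, trials=2000, rng=None):
    """Empirical variance of the mean of n_tokens i.i.d. noise samples."""
    rng = rng or random.Random(0)
    means = []
    for _ in range(trials):
        total = sum(rng.gauss(0.0, sigma_star) for _ in range(n_tokens))
        means.append(total / n_tokens)
    return statistics.pvariance(means)


def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Tokens per step (B*S) at which the noise variance is measured.
sizes = [4, 16, 64, 256]
rng = random.Random(0)
variances = [minibatch_mean_variance(n, rng=rng) for n in sizes]

# A slope near -1 is what sigma^2 = sigma_star^2 / (B*S) predicts.
slope = loglog_slope(sizes, variances)
print(slope)
```

A fitted slope that departs clearly from -1 at large $BS$ would indicate that the $1/(BS)$ variance model no longer holds in that range.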
The evidence supports the central claims within the experimental domain: the fitted power laws for variance (Figure 2) align with Assumption 3.4, and the restarting strategy (Figure 3) outperforms fixed $\mu$P baselines. The comparison to $\mu$P (Yang et al.) correctly identifies that $\mu$P ensures per-step stability but does not prescribe global scaling with token budget, which this work addresses. However, the paper does not compare against other adaptive batch size heuristics (e.g., Smith et al.'s linear scaling rules) in the main experiments, limiting claims of practical superiority.
The experiments use the publicly available modded-nanogpt codebase and FineWeb dataset, and the paper provides specific hyperparameters in Appendix A. However, reproducing the exact results requires fitting power-law models to estimate $L$, $\mu$, and $\rho$ for new architectures (Section 6.4). No automated procedure is provided for this step, and it depends on robust linear regression with Huber loss, a fit that may be sensitive to training dynamics. No code repository URL is provided in the text, and several experimental details (e.g., exact optimizer configurations for Scion radii) rely on external references to Pethick et al. [2025a].
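For readers unfamiliar with the kind of fit described here, the following is a minimal sketch of a Huber-robust power-law fit in log-log space, implemented with iteratively reweighted least squares (IRLS). It is a stand-in and not the authors' procedure: the function names, the threshold `delta`, the iteration count, and the synthetic data are all assumptions made for illustration.

```python
import math


def huber_line_fit(xs, ys, delta=1.0, iters=50):
    """Fit y = a + b*x under Huber loss via IRLS.

    Points with |residual| <= delta keep weight 1; larger residuals are
    down-weighted by delta / |residual|.
    """
    w = [1.0] * len(xs)
    a = b = 0.0
    for _ in range(iters):
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        b = sxy / sxx
        a = my - b * mx
        residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
        w = [1.0 if abs(r) <= delta else delta / abs(r) for r in residuals]
    return a, b


def fit_power_law(ns, vals, delta=0.1):
    """Fit vals ~ c * ns**p; returns (c, p)."""
    a, p = huber_line_fit([math.log(n) for n in ns],
                          [math.log(v) for v in vals], delta=delta)
    return math.exp(a), p


# Synthetic data following 3 * n**(-0.5), with the last point corrupted.
ns = [2 ** k for k in range(8)]
vals = [3.0 * n ** -0.5 for n in ns]
vals[-1] *= 10.0

c_huber, p_huber = fit_power_law(ns, vals, delta=0.1)
# An infinite threshold keeps every weight at 1, i.e. plain least squares.
c_ols, p_ols = fit_power_law(ns, vals, delta=float("inf"))
print(p_huber, p_ols)
```

In this toy case the Huber fit stays much closer to the true exponent than plain least squares, but the result still depends on `delta`, which illustrates why the sensitivity of such fits is a reproducibility concern.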
We study the role of batch size in stochastic conditional gradient methods under a $\mu$-Kurdyka-{\L}ojasiewicz ($\mu$-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy but, beyond a critical threshold, the benefits saturate and can eventually degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees. Experiments on NanoGPT are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.