On the Role of Batch Size in Stochastic Conditional Gradient Methods

cs.LG, math.OC, stat.ML · Rustem Islamov, Roman Machacek, Aurelien Lucchi, Antonio Silveti-Falls, Eduard Gorbunov, Volkan Cevher · Mar 22, 2026
Plain-language introduction

This paper studies how batch size and sequence length should scale with the total token budget in stochastic conditional gradient methods for LLM training. Under a $\mu$-Kurdyka-Łojasiewicz condition, the authors derive a BST (Batch-Sequence-Token) scaling rule $BS \asymp T^{2/3}$ that predicts three distinct regimes: noise-dominated, batch-independent optimal, and iteration-starved. The theory yields actionable guidelines for adaptive batch size scheduling and is validated on NanoGPT models up to 1B parameters.
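
The rule $BS \asymp T^{2/3}$ fixes only the growth rate, not the constant in front of it. As a rough illustration, the sketch below assumes a hypothetical proportionality constant `c` (in practice it would depend on the problem constants and is not given here) and computes the tokens per step together with the implied iteration count $K = T/(BS)$, which then grows like $T^{1/3}$. The function name `bst_schedule` is an assumption made for this example.

```python
def bst_schedule(token_budget: float, c: float = 1.0) -> tuple:
    """Return (tokens_per_step, num_steps) under BS = c * T^(2/3).

    `c` is a hypothetical constant chosen for illustration; the scaling
    rule itself only determines the exponent.
    """
    if token_budget <= 0 or c <= 0:
        raise ValueError("token_budget and c must be positive")
    tokens_per_step = c * token_budget ** (2.0 / 3.0)  # B * S
    num_steps = token_budget / tokens_per_step         # K = T / (B * S)
    return tokens_per_step, num_steps


if __name__ == "__main__":
    for budget in (1e9, 8e9):
        bs, k = bst_schedule(budget)
        print(f"T={budget:.0e}  BS~{bs:.3g}  K~{k:.3g}")
```

Under this rule, an 8× larger token budget implies roughly 4× more tokens per step and 2× more optimizer steps.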

Critical review
Verdict
Bottom line

The paper presents a theoretically grounded framework for batch size scaling that bridges optimization theory and large-scale practice. The derivation of the $T^{2/3}$ scaling law from first principles under the $\mu$-KL condition is rigorous, and the empirical validation on language model training supports the predicted non-monotone relationship between batch size and optimization error. However, the practical applicability depends on estimating problem-dependent constants ($L$, $\mu$, $\rho$) that may vary across architectures, and the validation is limited to a single model family (NanoGPT) with a specific optimizer (Scion).

“Then, the output of Algorithm 1 after K iterations satisfies $\mathbb{E}[f(x_{K})-f^{\star}]\leq\varepsilon$”
paper · Corollary 4.1 (BST Scaling Rule)
“Three regimes emerge: (i) a noise-dominated regime where increasing BS improves performance, (ii) an intermediate regime where the best achievable error is essentially independent of BS, and (iii) a large-batch regime where performance deteriorates as BS grows under a fixed token budget.”
paper · Section 4
What holds up

The convergence analysis under the $\mu$-KL condition (Theorem 4.1) correctly captures the tension between stochastic noise reduction and iteration budget constraints. The identification of three distinct scaling regimes—noise-dominated, flat optimal, and degradation—is empirically validated in Tables 1-2 and Figure 3, showing that batch sizes beyond the BST threshold indeed hurt performance under fixed token budgets. The distinction between $\mu$P's local stability guarantees and the global trajectory efficiency characterized here is conceptually sharp.

“Let Assumptions (A1), (A2), (A3), and (3.4) hold... Then, the output of Algorithm 1 after K iterations satisfies $\mathbb{E}[f(x_{K})-f^{\star}]\leq\varepsilon$”
paper · Theorem 4.1
“We observe that once the batch size (or sequence length) is sufficiently large, the optimal Frank-Wolfe stepsize stabilizes at $3.6\cdot 10^{-4}$... This aligns with Corollary 4.1, which shows that there exists a batch-independent regime.”
paper · Section 6.3
Main concerns

The $\mu$-KL assumption (Assumption 3.3) is verified only for a 124M parameter model and only holds reliably when training loss falls below 5 (Figure 1), leaving its validity for other architectures, scales, or training phases unclear. The power-law estimation of problem constants $L$, $\mu$, and $\rho$ (Section 6.4) requires extensive auxiliary experiments that may be as expensive as the hyperparameter search the theory aims to avoid. Additionally, the variance assumption $\sigma^2 = \sigma_\star^2/(BS)$ (Assumption 3.4) is tested only for batch sizes up to 4096, leaving the theory's predictions for extreme-scale training unvalidated.

“The points with a loss below 5 fit a linear function well, with a slope equal to $\mu$”
paper · Figure 1
“We conduct the estimation procedure for several model configurations and fit a shifted power law... for the problem constants $L, \mu, \rho$”
paper · Section 6.4
“Note that in our experiments, the remaining hyperparameters... were adopted directly from Pethick et al. (2025a) without additional tuning”
paper · Limitations and Future Work
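
One way to probe Assumption 3.4 beyond the tested range would be to regress log-variance on $\log(BS)$ and check that the slope stays close to $-1$. The sketch below does this with ordinary least squares on synthetic numbers; the helper `loglog_slope`, the value of `sigma_star_sq`, and the data are illustrative assumptions, not the paper's estimation procedure.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    return sxy / sxx


if __name__ == "__main__":
    sigma_star_sq = 4.0  # hypothetical noise constant
    tokens_per_step = [2 ** k for k in range(8, 13)]  # 256 ... 4096
    # Synthetic variances that follow sigma^2 = sigma_star^2 / (BS) exactly.
    variances = [sigma_star_sq / bs for bs in tokens_per_step]
    print(round(loglog_slope(tokens_per_step, variances), 6))
```

On measured gradient variances, a fitted slope that drifts away from $-1$ at large $BS$ would indicate where the assumption stops being a good description.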
Evidence and comparison

The evidence supports the central claims within the experimental domain: the fitted power laws for variance (Figure 2) align with Assumption 3.4, and the restarting strategy (Figure 3) outperforms fixed $\mu$P baselines. The comparison to $\mu$P (Yang et al.) correctly identifies that $\mu$P ensures per-step stability but does not prescribe global scaling with token budget, which this work addresses. However, the paper does not compare against other adaptive batch size heuristics (e.g., Smith et al.'s linear scaling rules) in the main experiments, limiting claims of practical superiority.

“The $\mu$P framework... achieves the worst performance. This result demonstrates the limitation of the $\mu$P framework, which ignores changes of batch size and sequence length”
paper · Figure 3
“Unlike prior hyperparameter transfer works that focus on local, per-step stability governing early training behavior, we analyze SCG methods under a $\mu$-KL condition and derive convergence guarantees that explicitly depend on the batch size B, sequence length S, and total token budget T”
paper · Section 2
Reproducibility

The experiments use the publicly available modded-nanogpt codebase and FineWeb dataset, and the paper provides specific hyperparameters in Appendix A. However, extending the results to new architectures requires fitting power-law models to estimate $L$, $\mu$, and $\rho$ (Section 6.4). That procedure is not automated, and it relies on robust linear regression with a Huber loss, which may be sensitive to training dynamics. No code repository URL is provided in the text, and several experimental details (e.g., the exact optimizer configurations for Scion radii) rely on external references to Pethick et al. (2025a).

“Details are given in Appendix A... For Scion, we adopt the recommended operator norms... radius $\eta=3000$ for sign-updated layers and $\eta=50$ for matrix-type layers”
paper · Appendix A
“We conduct the estimation procedure... and fit a shifted power law... The estimation procedure is carried out as follows: Smoothness constant L... KL condition constant $\mu$... Norm-equivalence constant $\rho$”
paper · Section 6.4
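
To make the robust-regression step concrete, here is a minimal hand-rolled sketch of a Huber-loss line fit using iteratively reweighted least squares. It is not the paper's code: the function name `huber_line_fit`, the threshold `delta`, and the toy data are assumptions for illustration, and `delta` must be chosen in the units of `y`.

```python
def huber_line_fit(xs, ys, delta=1.0, iters=50):
    """Fit y = a + b*x with Huber weights via iteratively reweighted least squares.

    Assumes the xs are not all identical (otherwise the slope is undefined).
    """
    weights = [1.0] * len(xs)
    a = b = 0.0
    for _ in range(iters):
        sw = sum(weights)
        mx = sum(w * x for w, x in zip(weights, xs)) / sw
        my = sum(w * y for w, y in zip(weights, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
        b = sxy / sxx
        a = my - b * mx
        # Huber weights: 1 inside the delta band, delta / |residual| outside it.
        residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
        weights = [1.0 if abs(r) <= delta else delta / abs(r) for r in residuals]
    return a, b


if __name__ == "__main__":
    xs = list(range(10))
    ys = [2.0 * x + 1.0 for x in xs]
    ys[-1] = 100.0  # one gross outlier
    intercept, slope = huber_line_fit(xs, ys)
    print(f"intercept={intercept:.3f} slope={slope:.3f}")
```

The outlier's influence is bounded by `delta`, so the fitted slope stays near the slope of the clean points, whereas an unweighted least-squares fit would be pulled strongly toward the outlier. The result still depends on the choice of `delta`, which is one reason such fits can be sensitive in practice.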
Abstract

We study the role of batch size in stochastic conditional gradient methods under a $\mu$-Kurdyka-Łojasiewicz ($\mu$-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy but, beyond a critical threshold, the benefits saturate and can eventually degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees. Experiments on NanoGPT are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.
