Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

cs.LG · Huanran Chen, Huaqing Zhang, Xiao Li, Yinpeng Dong, Ke Shen, Jun Zhu · Apr 10, 2026
AI Review
Plain-language introduction

This paper investigates the geometric structure of converged states in LLM pretraining, asking whether models converge to a common minimizer across data sources or merely a minimizer of the summed loss. The authors hypothesize that the "closeness" of task-specific minima correlates with downstream generalization, and propose the Nexus optimizer to maximize gradient similarity as a tractable proxy for closeness. Their core finding—that identical pretraining loss can mask vastly different downstream performance depending on the implicit bias toward geometric closeness—challenges the prevailing reliance on pretraining loss as the sole evaluation metric.
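
The link between closeness and gradient similarity can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes two hypothetical quadratic losses of the form 0.5 * ||theta - center||^2, with made-up centers standing in for task-specific minima, and compares the cosine similarity of their gradients at the same point when the minima are close versus distant.

```python
import math

def grad(theta, center):
    # Gradient of the toy loss 0.5 * ||theta - center||^2 (an assumed loss shape).
    return [t - c for t, c in zip(theta, center)]

def cosine(u, v):
    # Cosine similarity between two gradient vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

theta = [2.0, 2.0]  # current parameters (illustrative values)

# Hypothetical task-specific minima: one close pair, one distant pair.
close_minima = ([0.0, 0.0], [0.2, 0.0])
distant_minima = ([0.0, 0.0], [4.0, 0.0])

sim_close = cosine(grad(theta, close_minima[0]), grad(theta, close_minima[1]))
sim_distant = cosine(grad(theta, distant_minima[0]), grad(theta, distant_minima[1]))

print(f"close minima:   cosine = {sim_close:.4f}")
print(f"distant minima: cosine = {sim_distant:.4f}")
```

In this toy setting the gradients are nearly parallel when the minima are close and orthogonal when they are far apart, which is the intuition behind using gradient similarity as a proxy. Real LLM losses are not isotropic quadratics, so this is only a picture of the idea, not evidence for it.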

Critical review
Verdict
Bottom line

The paper presents a compelling theoretical framework connecting geometric closeness of minima to generalization, backed by strong empirical results across scales from 130M to 3B parameters. The Nexus algorithm, which uses a dual-loop normalized SGD mechanism to approximate Hessian-gradient products and maximize gradient similarity, demonstrates consistent downstream improvements without compromising pretraining loss. However, the theoretical analysis relies on quadratic and strongly convex approximations that may not fully capture non-convex deep learning landscapes, and the "same loss" claim allows for small but non-zero differences that complicate the causal interpretation.

“Nexus strictly satisfies the 'same pretraining loss' condition, showing an immaterial difference of 0.004 compared to the baseline.”
Chen et al., Sec. 4.2
“This finding challenges the reliance on pretraining loss as the sole proxy for model evaluation and demonstrates the importance of implicit biases in unlocking downstream generalization.”
Chen et al., Abstract
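
For readers unfamiliar with dual-loop schemes, the following is a minimal sketch of one generic way such a mechanism can be arranged. It is not the paper's Algorithms 1-3: the toy quadratic losses, the cycling order over data sources, the step sizes, and the use of the inner-loop displacement as a pseudo-gradient are all assumptions in the style of Reptile-like methods, where sequential inner steps are known to introduce Hessian-gradient cross terms into the outer update.

```python
import math

# Two toy "data sources", each a quadratic loss 0.5 * sum(h_i * (x_i - c_i)^2).
# Curvatures and centers are made-up illustration values, not from the paper.
TASKS = [
    {"h": [1.0, 4.0], "c": [1.0, 0.0]},
    {"h": [4.0, 1.0], "c": [0.0, 1.0]},
]

def loss(theta, task):
    return 0.5 * sum(h * (t - c) ** 2 for h, t, c in zip(task["h"], theta, task["c"]))

def grad(theta, task):
    return [h * (t - c) for h, t, c in zip(task["h"], theta, task["c"])]

def normalize(g, eps=1e-12):
    # Rescale a gradient to (approximately) unit norm.
    norm = math.sqrt(sum(x * x for x in g))
    return [x / (norm + eps) for x in g]

def outer_step(theta, k_inner, inner_lr, outer_lr):
    # Inner loop: K sequential normalized-SGD steps on a cloned copy,
    # cycling through the data sources (assumed schedule).
    phi = list(theta)
    for k in range(k_inner):
        g = normalize(grad(phi, TASKS[k % len(TASKS)]))
        phi = [p - inner_lr * x for p, x in zip(phi, g)]
    # Outer loop: treat the accumulated displacement as a pseudo-gradient.
    pseudo_grad = [(t - p) / inner_lr for t, p in zip(theta, phi)]
    return [t - outer_lr * d for t, d in zip(theta, pseudo_grad)]

def total_loss(theta):
    return sum(loss(theta, task) for task in TASKS)

theta = [3.0, -2.0]
start = total_loss(theta)
for _ in range(200):
    theta = outer_step(theta, k_inner=4, inner_lr=0.05, outer_lr=0.01)
end = total_loss(theta)

print(f"summed loss: {start:.3f} -> {end:.3f}")
```

With `k_inner=1` the pseudo-gradient reduces to a single normalized gradient, so any cross-task interaction in this sketch comes only from taking several inner steps in sequence.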
What holds up

The geometric intuition is rigorously formalized through Theorems 2.2 and 2.3, which bound downstream loss in terms of closeness for quadratic and strongly convex cases. The engineering adaptation in Algorithm 3 is practical and introduces negligible overhead. Empirical results show consistent improvements: on the 3B model, Nexus reduces the OOD validation loss by 0.012 and yields significant accuracy gains on complex reasoning benchmarks, including a +15.0% improvement on GSM8k, +8.0% on MATH, and +4.0% on HumanEval (Table 1).

“Specifically, it reduces the OOD validation loss by 0.012 and yields significant accuracy gains on complex reasoning benchmarks, including a +15.0% improvement on GSM8k”
Chen et al., Sec. 4.2
Main concerns

Several limitations temper the claims. First, the "same pretraining loss" condition actually permits small differences ($\Delta<0.01$), and in the 3B experiments Nexus reaches a slightly lower pretraining loss than AdamW (1.602 vs 1.606, a gap of 0.004). This makes it harder to attribute the improvements purely to implicit bias rather than to marginal loss reduction. Second, the theoretical guarantees assume quadratic or locally strongly convex losses (Theorems 2.2, 2.3), which may not hold in the complex loss landscapes of LLMs. Third, the method is incompatible with the Muon optimizer, suggesting sensitivity to the base optimizer choice. Finally, the comparison with related work on flat minima (SAM, ASAM) is limited, leaving open questions about how closeness relates to flatness in practice.

“Nexus currently remains incompatible with the Muon optimizer.”
Chen et al., Limitations · Section 6
“AdamW Pretrain. Loss: 1.606; Nexus Pretrain. Loss: 1.602”
Chen et al., Table 1 (3B model)
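
The arithmetic behind the first concern can be checked directly from the figures quoted in this review. The snippet below only restates those numbers; the variable names are illustrative, and the ratio at the end is a rough comparison of magnitudes, not a causal argument.

```python
ADAMW_PRETRAIN_LOSS = 1.606  # Table 1, 3B model, as quoted in this review
NEXUS_PRETRAIN_LOSS = 1.602  # Table 1, 3B model, as quoted in this review
SAME_LOSS_TOLERANCE = 0.01   # the Delta < 0.01 threshold cited in this review
OOD_LOSS_REDUCTION = 0.012   # OOD validation loss reduction reported in Sec. 4.2

# Gap in pretraining loss between the two optimizers (rounded to table precision).
gap = round(ADAMW_PRETRAIN_LOSS - NEXUS_PRETRAIN_LOSS, 3)

# Does the gap fall inside the "same pretraining loss" tolerance?
within_tolerance = abs(gap) < SAME_LOSS_TOLERANCE

# How large is the reported OOD gain relative to the pretraining gap?
ratio = OOD_LOSS_REDUCTION / gap

print(f"pretraining gap = {gap}, within tolerance = {within_tolerance}")
print(f"OOD reduction / pretraining gap = {ratio:.1f}")
```

The gap sits inside the tolerance but is not zero, and the reported OOD reduction is about three times larger, so the small pretraining advantage may explain some of the downstream gain without obviously explaining all of it.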
Evidence and comparison

The evidence generally supports the core hypothesis that closeness improves generalization, with consistent gains across model scales (130M to 3B) and tasks (MMLU, GSM8k, MATH, HumanEval). The ablation studies effectively isolate the contribution of the multi-step inner loop, showing that $K=1$ (equivalent to normalized gradients) fails to provide benefits, while $K>1$ succeeds. As stated in Section 5.2: "This ablation is strictly equivalent to executing the Nexus algorithm with an inner loop step count of $K=1$... when $K=1$, the coefficient of the gradient similarity regularizer becomes strictly zero." However, the lack of comparison with other multi-task gradient alignment methods and the reliance on a proprietary pretraining dataset limit the assessment of novelty and reproducibility.

“This ablation is strictly equivalent to executing the Nexus algorithm with an inner loop step count of $K=1$... when $K=1$, the coefficient of the gradient similarity regularizer $\frac{K-1}{4K}$ becomes strictly zero”
Chen et al., Sec. 5.2
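
The quoted coefficient $\frac{K-1}{4K}$ is easy to tabulate. The sketch below evaluates it exactly for a few inner-loop step counts; the function name and the chosen values of $K$ are illustrative and not taken from the paper.

```python
from fractions import Fraction

def similarity_coefficient(k):
    # Coefficient (K - 1) / (4K) of the gradient similarity regularizer,
    # as quoted from Sec. 5.2. K is the inner-loop step count.
    if k < 1:
        raise ValueError("K must be a positive integer")
    return Fraction(k - 1, 4 * k)

coeffs = {k: similarity_coefficient(k) for k in (1, 2, 4, 8, 16)}

for k, c in coeffs.items():
    print(f"K={k:>2}: {c} = {float(c):.4f}")
```

The coefficient is exactly zero at $K=1$, consistent with the ablation, and rises toward (but never reaches) 1/4 as $K$ grows, so most of the regularization strength is already present at modest $K$.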
Reproducibility

The paper provides detailed algorithms (Algorithms 1-3) and hyperparameter descriptions, but the primary pretraining dataset is proprietary ("in-house"), with only limited experiments on public datasets. No code repository is mentioned. The engineering adaptation uses standard techniques (inner model cloning, gradient accumulation), but reproducing the 3B parameter experiments would require substantial compute resources and access to the specific data mixture. As noted in Section 4.1, the corpus is "strictly cleaned to ensure no data contamination regarding the evaluated benchmarks," but this proprietary nature limits independent verification.

“We utilize an in-house pretraining dataset... This corpus is: (1) strictly cleaned to ensure no data contamination regarding the evaluated benchmarks or distillation data”
Chen et al., Sec. 4.1
Abstract

Pretraining is the cornerstone of Large Language Models (LLMs), dominating the vast majority of computational budget and data to serve as the primary engine for their capabilities. During pretraining, LLMs acquire foundational knowledge from unprecedentedly massive and diverse data sources, encompassing a vast array of domains such as general language, mathematics, code, and complex reasoning. In this work, we investigate an interesting geometric question regarding the converged state of pretraining: Does the model converge to a common minimizer across all data sources (the "close" case in the paper's illustration figure), or merely a minimizer of the summed loss (the "distant" case in the same figure)? We hypothesize that the geometric "closeness" of task-specific minima is intrinsically linked to downstream generalization. We reveal that standard optimizers (e.g., AdamW) often converge to points where task-specific minima are distant from each other. To address this, we propose the Nexus optimizer, which encourages the closeness of these minima by maximizing gradient similarity during optimization. Experiments across models ranging from 130M to 3B parameters, various data mixtures and hyperparameter schedules, show that Nexus significantly boosts downstream performance, despite achieving the same pretraining loss (see the paper's benchmark figure). Notably, on the 3B model, Nexus reduces the out-of-distribution loss by 0.012 and yields up to a 15.0% accuracy improvement on complex reasoning tasks (e.g., GSM8k). This finding challenges the reliance on pretraining loss as the sole proxy for model evaluation and demonstrates the importance of implicit biases in unlocking downstream generalization.
