Model Evolution Under Zeroth-Order Optimization: A Neural Tangent Kernel Perspective

cs.LG Chen Zhang, Yuxin Cheng, Chenchen Ding, Shuqi Wang, Jingreng Lei, Runsheng Yu, Yik-Chung WU, Ngai Wong · Mar 22, 2026
Local to this browser
What it does
Zeroth-order (ZO) optimization enables memory-efficient training via forward-only gradient estimation, but its stochastic nature obscures training dynamics compared to well-characterized first-order (FO) methods. This paper introduces the...
Why it matters
This paper introduces the Neural Zeroth-order Kernel (NZK) to describe model evolution in function space under ZO updates, proving that the expected NZK remains time-invariant for linear models and depends explicitly on the moments of...
Main concern
The paper makes a solid theoretical contribution by adapting NTK theory to ZO optimization through the NZK framework. The derivation that $\mathbb{E}_{\bm{\zeta}, \bm{z}}\bm{K}_{\bm{\zeta}, \bm{z}}(\bm{x}_{i}, \bm{x}_{j})$ stays constant...
Community signal
0
0 up · 0 down
Sign in to vote with arrows
AI Review AI reviewed
Plain-language introduction

Zeroth-order (ZO) optimization enables memory-efficient training via forward-only gradient estimation, but its stochastic nature obscures training dynamics compared to well-characterized first-order (FO) methods. This paper introduces the Neural Zeroth-order Kernel (NZK) to describe model evolution in function space under ZO updates, proving that the expected NZK remains time-invariant for linear models and depends explicitly on the moments of random perturbation directions. The work extends to linearized neural networks and proposes using a single shared random vector to accelerate convergence, with experiments on synthetic and real-world datasets (MNIST, CIFAR-10, Tiny ImageNet) validating the theoretical predictions.

Critical review
Verdict
Bottom line

The paper makes a solid theoretical contribution by adapting NTK theory to ZO optimization through the NZK framework. The derivation that $\mathbb{E}_{\bm{\zeta}, \bm{z}}\bm{K}_{\bm{\zeta}, \bm{z}}(\bm{x}_{i}, \bm{x}_{j})$ stays constant during training (Theorem 1) provides the first clear function-space characterization of ZO dynamics, and the closed-form evolution under squared loss (Eq. 7) parallels classical NTK results. However, the scope is restricted to linear models and linearized (infinite-width) networks, leaving practical finite-width architectures outside the theoretical guarantees, and the analysis assumes squared loss without addressing other common objectives.

“the corresponding NZK stays constant throughout training in its expected sense, with its entries closely tied to the distribution of the random direction vectors”
Zhang et al., Theorem 1/Eq. 6 · Section 3.1, Theorem 1
“[f_{\theta_{t}}(\bm{x}_{i})]_{N} = (\bm{I}_{N} - (\bm{I}_{N} - \eta\bm{\bar{\mathcal{K}}}_{\bm{\zeta}, \bm{z}})^{t}) [f^{*}(\bm{x}_{i})]_{N} + (\bm{I}_{N} - \eta\bm{\bar{\mathcal{K}}}_{\bm{\zeta}, \bm{z}})^{t} [f_{\theta_{0}}(\bm{x}_{i})]_{N}”
Zhang et al., Eq. 7 · Section 3.1
What holds up

The NZK definition (Eq. 5) and the proof of time-invariance for its expectation (Theorem 1) are rigorous and well-executed, successfully decomposing the kernel into terms involving $\sigma^{2}_{\bm{\zeta}}$, $\sigma^{2}_{\bm{z}}$, $\mu_{\bm{\zeta}}$, and $\mu_{\bm{z}}$. Corollary 2's result that sharing random vectors ($\bm{\zeta} = \bm{z}$) scales the expected kernel by $(d+2)\sigma_{\bm{z}}^{4}$ (Eq. 9) offers a concrete, theoretically-grounded acceleration strategy. The experiments on 2D linear models (Figure 1) cleanly validate that identical sampling accelerates convergence without requiring increased variance, matching the theoretical predictions.

“If $\bm{\zeta} = \bm{z} \sim \mathcal{N}(\bm{0}, \sigma_{\bm{z}}^{2}\bm{I}_{d})$, we have $\bm{\mathcal{K}}_{\bm{\zeta}, \bm{z}}(\bm{x}_{i}, \bm{x}_{j}) = (d+2)\sigma_{\bm{z}}^{4} \langle \bm{x}_{i}, \bm{x}_{j} \rangle$”
Zhang et al., Corollary 2/Eq. 9 · Section 3.1, Corollary 2
“Identical denotes that $\bm{\zeta}$ remains the same as $\bm{z}$ ... this memory-saving method can also speed up convergence without necessitating an increase in variance”
Zhang et al., Figure 1 caption · Section 4, Figure 1
Main concerns

The theoretical analysis is limited to linear models and linearized neural networks, which assumes infinite width and effectively linearizes the dynamics. While Appendix A.5 discusses homogeneous activation properties to bridge the gap to finite-width networks, this remains a heuristic extension rather than a proven result for standard architectures. The closed-form dynamics rely exclusively on squared loss ($\mathcal{L}(f, f^{*}) = \frac{1}{2}(f - f^{*})^{2}$), with no characterization for cross-entropy or other losses dominant in deep learning. Furthermore, the expectation over random directions assumes sufficiently large batch sizes for the law of large numbers to apply, yet the paper does not quantify finite-sample variance or how practical batch sizes affect the NZK approximation.

“However, practical neural networks are generally of finite width. To reconcile the disparity between theoretical infinite-width neural networks and real-world finite-width ones, the property of homogeneous activation ... can be pivotal”
Zhang et al. · Appendix A.5
“In the case of squared loss, represented as $\mathcal{L}(f, f^{*}) = \frac{1}{2}(f - f^{*})^{2}$”
Zhang et al. · Appendix A.2
Evidence and comparison

The experimental validation appropriately focuses on confirming theoretical predictions rather than claiming state-of-the-art performance. Figure 2 demonstrates that kernel gradient ZO with shared random directions outperforms traditional parametric ZO on CIFAR-10 and Tiny ImageNet, supporting the practical utility of the NZK perspective. However, comparisons to other ZO acceleration techniques (e.g., variance reduction methods) are absent. The paper adequately cites NTK foundations (Jacot et al., 2018; Lee et al., 2019) and positions NZK as the function-space analogue for zeroth-order methods, though it would benefit from deeper discussion of how NZK relates to other recent ZO convergence analyses in parameter space.

“The loss plots shown in Figure 3 tell that updating with traditional parametric gradient ZO converges more slowly than FO, which in turn performs worse than using kernel gradients ZO”
Zhang et al. · Section 4, Figure 2
“NTK theory, however, relies on exact gradients and cannot be directly applied to ZO”
Zhang et al. · Section 1
Reproducibility

Reproducibility is facilitated by the availability of code (link announced in the abstract and introduction). The main text provides key hyperparameters: learning rate $\eta = 1e-3$, perturbation scale $\epsilon = 1e-3$, and 10,000 random samples for kernel estimation in the 2D experiments. However, architectural specifications for the linearized neural networks (width, depth, activation functions) used in CIFAR-10 and Tiny ImageNet experiments are not detailed in the main text, though referenced to Appendix C. The theoretical derivations in Appendices A and B are self-contained and verifiable, with proofs relying on standard trace operations (Lemma 6) and expectations of Gaussian random variables.

“Codes are available at the [LINK]”
Zhang et al. · Abstract
“we maintain a fixed learning rate $\eta = 1e-3$ and iterations for both FO and ZO optimization. For ZO optimization, we set $\epsilon = 1e-3$ ... and estimate $\bm{\mathcal{K}}_{\bm{\zeta}, \bm{z}}(\bm{x}_{i}, \bm{x}_{j})$ by averaging over 10,000 random samples”
Zhang et al. · Section 4
Abstract

Zeroth-order (ZO) optimization enables memory-efficient training of neural networks by estimating gradients via forward passes only, eliminating the need for backpropagation. However, the stochastic nature of gradient estimation significantly obscures the training dynamics, in contrast to the well-characterized behavior of first-order methods under Neural Tangent Kernel (NTK) theory. To address this, we introduce the Neural Zeroth-order Kernel (NZK) to describe model evolution in function space under ZO updates. For linear models, we prove that the expected NZK remains constant throughout training and depends explicitly on the first and second moments of the random perturbation directions. This invariance yields a closed-form expression for model evolution under squared loss. We further extend the analysis to linearized neural networks. Interpreting ZO updates as kernel gradient descent via NZK provides a novel perspective for potentially accelerating convergence. Extensive experiments across synthetic and real-world datasets (including MNIST, CIFAR-10, and Tiny ImageNet) validate our theoretical results and demonstrate acceleration when using a single shared random vector.

Challenge the Review

Pick a starting point or write your own. Challenges run in the background, so you can keep reading while the AI investigates.

No challenges yet. Disagree with the review? Ask the AI to revisit a specific claim.