Model Evolution Under Zeroth-Order Optimization: A Neural Tangent Kernel Perspective
Zeroth-order (ZO) optimization enables memory-efficient training via forward-only gradient estimation, but its stochastic nature obscures training dynamics compared to well-characterized first-order (FO) methods. This paper introduces the Neural Zeroth-order Kernel (NZK) to describe model evolution in function space under ZO updates, proving that the expected NZK remains time-invariant for linear models and depends explicitly on the moments of random perturbation directions. The work extends to linearized neural networks and proposes using a single shared random vector to accelerate convergence, with experiments on synthetic and real-world datasets (MNIST, CIFAR-10, Tiny ImageNet) validating the theoretical predictions.
The paper makes a solid theoretical contribution by adapting NTK theory to ZO optimization through the NZK framework. The derivation that $\mathbb{E}_{\bm{\zeta}, \bm{z}}\bm{K}_{\bm{\zeta}, \bm{z}}(\bm{x}_{i}, \bm{x}_{j})$ stays constant during training (Theorem 1) provides the first clear function-space characterization of ZO dynamics, and the closed-form evolution under squared loss (Eq. 7) parallels classical NTK results. However, the scope is restricted to linear models and linearized (infinite-width) networks, leaving practical finite-width architectures outside the theoretical guarantees, and the analysis assumes squared loss without addressing other common objectives.
The NZK definition (Eq. 5) and the proof of time-invariance for its expectation (Theorem 1) are rigorous and well-executed, successfully decomposing the kernel into terms involving $\sigma^{2}_{\bm{\zeta}}$, $\sigma^{2}_{\bm{z}}$, $\mu_{\bm{\zeta}}$, and $\mu_{\bm{z}}$. Corollary 2's result that sharing random vectors ($\bm{\zeta} = \bm{z}$) scales the expected kernel by $(d+2)\sigma_{\bm{z}}^{4}$ (Eq. 9) offers a concrete, theoretically-grounded acceleration strategy. The experiments on 2D linear models (Figure 1) cleanly validate that identical sampling accelerates convergence without requiring increased variance, matching the theoretical predictions.
The theoretical analysis is limited to linear models and linearized neural networks, which assumes infinite width and effectively linearizes the dynamics. While Appendix A.5 discusses homogeneous activation properties to bridge the gap to finite-width networks, this remains a heuristic extension rather than a proven result for standard architectures. The closed-form dynamics rely exclusively on squared loss ($\mathcal{L}(f, f^{*}) = \frac{1}{2}(f - f^{*})^{2}$), with no characterization for cross-entropy or other losses dominant in deep learning. Furthermore, the expectation over random directions assumes sufficiently large batch sizes for the law of large numbers to apply, yet the paper does not quantify finite-sample variance or how practical batch sizes affect the NZK approximation.
The experimental validation appropriately focuses on confirming theoretical predictions rather than claiming state-of-the-art performance. Figure 2 demonstrates that kernel gradient ZO with shared random directions outperforms traditional parametric ZO on CIFAR-10 and Tiny ImageNet, supporting the practical utility of the NZK perspective. However, comparisons to other ZO acceleration techniques (e.g., variance reduction methods) are absent. The paper adequately cites NTK foundations (Jacot et al., 2018; Lee et al., 2019) and positions NZK as the function-space analogue for zeroth-order methods, though it would benefit from deeper discussion of how NZK relates to other recent ZO convergence analyses in parameter space.
Reproducibility is facilitated by the availability of code (link announced in the abstract and introduction). The main text provides key hyperparameters: learning rate $\eta = 1e-3$, perturbation scale $\epsilon = 1e-3$, and 10,000 random samples for kernel estimation in the 2D experiments. However, architectural specifications for the linearized neural networks (width, depth, activation functions) used in CIFAR-10 and Tiny ImageNet experiments are not detailed in the main text, though referenced to Appendix C. The theoretical derivations in Appendices A and B are self-contained and verifiable, with proofs relying on standard trace operations (Lemma 6) and expectations of Gaussian random variables.
Zeroth-order (ZO) optimization enables memory-efficient training of neural networks by estimating gradients via forward passes only, eliminating the need for backpropagation. However, the stochastic nature of gradient estimation significantly obscures the training dynamics, in contrast to the well-characterized behavior of first-order methods under Neural Tangent Kernel (NTK) theory. To address this, we introduce the Neural Zeroth-order Kernel (NZK) to describe model evolution in function space under ZO updates. For linear models, we prove that the expected NZK remains constant throughout training and depends explicitly on the first and second moments of the random perturbation directions. This invariance yields a closed-form expression for model evolution under squared loss. We further extend the analysis to linearized neural networks. Interpreting ZO updates as kernel gradient descent via NZK provides a novel perspective for potentially accelerating convergence. Extensive experiments across synthetic and real-world datasets (including MNIST, CIFAR-10, and Tiny ImageNet) validate our theoretical results and demonstrate acceleration when using a single shared random vector.
Pick a starting point or write your own. Challenges run in the background, so you can keep reading while the AI investigates.
No challenges yet. Disagree with the review? Ask the AI to revisit a specific claim.