Gradient Descent with Projection Finds Over-Parameterized Neural Networks for Learning Low-Degree Polynomials with Nearly Minimax Optimal Rate

stat.ML cs.LG math.ST stat.TH Yingzhen Yang, Ping Li · Mar 22, 2026
Local to this browser
What it does
This paper studies nonparametric regression for learning degree-$k_0$ spherical polynomials on the unit sphere $\mathbb{S}^{d-1}$ using over-parameterized two-layer neural networks. The authors propose a novel Gradient Descent with...
Why it matters
The authors propose a novel Gradient Descent with Projection (GDP) algorithm that constrains learning to the top $r_0 = \Theta(d^{k_0})$ eigenspaces of the Neural Tangent Kernel (NTK). The main result establishes a nearly minimax optimal...
Main concern
The paper delivers a technically sophisticated analysis demonstrating that projected gradient descent can achieve nearly optimal statistical rates for learning low-degree spherical polynomials. The GDP algorithm explicitly exploits the...
Community signal
0
0 up · 0 down
Sign in to vote with arrows
AI Review AI reviewed
Plain-language introduction

This paper studies nonparametric regression for learning degree-$k_0$ spherical polynomials on the unit sphere $\mathbb{S}^{d-1}$ using over-parameterized two-layer neural networks. The authors propose a novel Gradient Descent with Projection (GDP) algorithm that constrains learning to the top $r_0 = \Theta(d^{k_0})$ eigenspaces of the Neural Tangent Kernel (NTK). The main result establishes a nearly minimax optimal risk bound of order $\log(4/\delta) \cdot \Theta(d^{k_0}/n)$, improving the sample complexity from previous polynomial-in-$1/\varepsilon$ rates to linear $1/\varepsilon$ scaling.

Critical review
Verdict
Bottom line

The paper delivers a technically sophisticated analysis demonstrating that projected gradient descent can achieve nearly optimal statistical rates for learning low-degree spherical polynomials. The GDP algorithm explicitly exploits the low-rank structure of the target function class by projecting onto the principal $r_0$-dimensional subspace of the RKHS, yielding the first nearly minimax optimal risk bound with algorithmic guarantees for this problem. As stated in Theorem 3.1, the trained network achieves "a nearly optimal rate of the nonparametric regression risk of the order $\log(4/\delta) \cdot \Theta(d^{k_0}/n)$ with probability at least $1-\delta$."

“a nearly optimal rate of the nonparametric regression risk of the order log(4/δ) · Θ(d^{k_0}/n) with probability at least 1 − δ”
Yang and Li, Theorem 3.1 · Section 3.1
What holds up

The sample complexity improvement is substantial and well-documented: the paper achieves $n \asymp \Theta(\log(4/\delta)\cdot d^{k_0}/\varepsilon)$ compared to the prior "representative sample complexity $\Theta(d^{k_0} \max\{\varepsilon^{-2}, \log d\})$" required by Nichani et al. (2022). The decomposition of the network function into a structured component in $\mathcal{H}_K(B_h) \cap \mathcal{H}_{S,r_0}$ and a small $L_\infty$ error (Theorem 4.2) provides a rigorous foundation for the analysis. The adaptive degree selection algorithm (Algorithm 2) is a practical theoretical contribution that maintains the optimal rate without prior knowledge of $k_0$.

“sample complexity of n ≍ Θ(log(4/δ)·d^{k_0}/ε) ... in contrast with the representative sample complexity Θ(d^{k_0} max{ε^{-2}, log d})”
Yang and Li, Abstract · Abstract
Main concerns

The analysis relies on a non-standard "augmented feature" map $F(W(0),x)$ added to the network in equation (1), which ensures the NTK has strictly positive eigenvalues for odd-degree spherical harmonics; otherwise standard ReLU NTKs have zero eigenvalues for degrees $2t+1$ with $t \geq 1$ (Section 2.1). This architectural modification is crucial for the theory but represents a departure from standard practice. Additionally, the projection step requires the eigendecomposition of the $n \times n$ empirical NTK matrix, incurring $\Theta(n^3)$ cost. The width requirement is extremely demanding: Theorem 3.1 requires "the network width $m$ satisfies $m \gtrsim (n/d^{k_0})^{25/2} d^{5/2}$," which implies an impractical polynomial degree of 12.5. Finally, the sample size condition $n \geq \Theta(\log(4/\delta)\cdot d^{2k_0})$ is quadratic in the effective dimension.

“the network width m satisfies m ≳ (n/d^{k_0})^{25/2} d^{5/2}”
Yang and Li, Theorem 3.1 · Theorem 3.1, Equation (10)
“Such additional feature map ensures that the NTK associated with the network (1) is a PSD kernel K, to be defined in (2), and all of its eigenvalues are strictly positive”
Yang and Li, Section 2.1 · Section 2.1
Evidence and comparison

The comparison with prior work in Table 1 accurately situates the contribution alongside Nichani et al. (2022), Damian et al. (2022), and Ghorbani et al. (2021). The paper correctly notes that its rate is "nearly unimprovable since the trained network renders a nearly optimal rate ... the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{k_0})$ is $\Theta(d^{k_0}/n)$" (Raskutti et al., 2012). While this lower bound is for kernel methods, the paper demonstrates that the neural network trained via GDP achieves this bound, effectively matching the performance of the optimal kernel estimator while using a gradient-based training procedure.

“nearly unimprovable since ... the minimax optimal rate for the regression risk with a kernel of rank Θ(d^{k_0}) is Θ(d^{k_0}/n)”
Yang and Li, Abstract · Abstract
Reproducibility

Algorithm 1 and Algorithm 2 are specified with clear update rules and hyperparameters ($\eta = \Theta(1)$, $T \asymp n/d^{k_0}$). However, reproducibility faces practical barriers: no code repository is mentioned, and the projection operator $P^{(r_0)} = U\Sigma^{(r)}U^\top$ requires computing eigenvectors of the empirical NTK matrix $K_n$, which is $\Theta(n^3)$. The width requirement "$m \gtrsim (n/d^{k_0})^{25/2} d^{5/2}$" is extraordinarily large and likely impossible to satisfy in practice. The initialization scheme using symmetric pairs with opposite signs is clearly specified but the overall experimental cost remains prohibitive.

“m ≳ (n/d^{k_0})^{25/2} d^{5/2}”
Yang and Li, Theorem 3.1 · Theorem 3.1
Abstract

We study the problem of learning a low-degree spherical polynomial of degree $k_0 = \Theta(1) \ge 1$ defined on the unit sphere in $\RR^d$ by training an over-parameterized two-layer neural network with augmented feature in this paper. Our main result is the significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\eps \in (0, \Theta(d^{-k_0})]$, an over-parameterized two-layer neural network trained by a novel Gradient Descent with Projection (GDP) requires a sample complexity of $n \asymp \Theta( \log(4/\delta) \cdot d^{k_0}/\eps)$ with probability $1-\delta$ for $\delta \in (0,1)$, in contrast with the representative sample complexity $\Theta(d^{k_0} \max\set{\eps^{-2},\log d})$. Moreover, such sample complexity is nearly unimprovable since the trained network renders a nearly optimal rate of the nonparametric regression risk of the order $\log({4}/{\delta}) \cdot \Theta(d^{k_0}/{n})$ with probability at least $1-\delta$. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{k_0})$ is $\Theta(d^{k_0}/{n})$, so that the rate of the nonparametric regression risk of the network trained by GDP is nearly minimax optimal. In the case that the ground truth degree $k_0$ is unknown, we present a novel and provable adaptive degree selection algorithm which identifies the true degree and achieves the same nearly optimal regression rate. To the best of our knowledge, this is the first time that a nearly optimal risk bound is obtained by training an over-parameterized neural network with a popular activation function (ReLU) and algorithmic guarantee for learning low-degree spherical polynomials. Due to the feature learning capability of GDP, our results are beyond the regular Neural Tangent Kernel (NTK) limit.

Challenge the Review

Pick a starting point or write your own. Challenges run in the background, so you can keep reading while the AI investigates.

No challenges yet. Disagree with the review? Ask the AI to revisit a specific claim.