Gradient Descent with Projection Finds Over-Parameterized Neural Networks for Learning Low-Degree Polynomials with Nearly Minimax Optimal Rate

stat.ML cs.LG math.ST stat.TH Yingzhen Yang, Ping Li · Mar 22, 2026

What it does

Why it matters

The authors propose a novel Gradient Descent with Projection (GDP) algorithm that constrains learning to the top $r_0 = \Theta(d^{k_0})$ eigenspaces of the Neural Tangent Kernel (NTK). The main result establishes a nearly minimax optimal...

Main concern

Community signal

0 up · 0 down

AI Review AI reviewed

Plain-language introduction

This paper studies nonparametric regression for learning degree-$k_0$ spherical polynomials on the unit sphere $\mathbb{S}^{d-1}$ using over-parameterized two-layer neural networks. The authors propose a novel Gradient Descent with Projection (GDP) algorithm that constrains learning to the top $r_0 = \Theta(d^{k_0})$ eigenspaces of the Neural Tangent Kernel (NTK). The main result establishes a nearly minimax optimal risk bound of order $\log(4/\delta) \cdot \Theta(d^{k_0}/n)$, improving the sample complexity from previous polynomial-in-$1/\varepsilon$ rates to linear $1/\varepsilon$ scaling.

Critical review

Verdict

Bottom line

The paper delivers a technically sophisticated analysis demonstrating that projected gradient descent can achieve nearly optimal statistical rates for learning low-degree spherical polynomials. The GDP algorithm explicitly exploits the low-rank structure of the target function class by projecting onto the principal $r_0$-dimensional subspace of the RKHS, yielding the first nearly minimax optimal risk bound with algorithmic guarantees for this problem. As stated in Theorem 3.1, the trained network achieves "a nearly optimal rate of the nonparametric regression risk of the order $\log(4/\delta) \cdot \Theta(d^{k_0}/n)$ with probability at least $1-\delta$."

“a nearly optimal rate of the nonparametric regression risk of the order log(4/δ) · Θ(d^{k_0}/n) with probability at least 1 − δ”

Yang and Li, Theorem 3.1 · Section 3.1

What holds up

The sample complexity improvement is substantial and well-documented: the paper achieves $n \asymp \Theta(\log(4/\delta)\cdot d^{k_0}/\varepsilon)$ compared to the prior "representative sample complexity $\Theta(d^{k_0} \max\{\varepsilon^{-2}, \log d\})$" required by Nichani et al. (2022). The decomposition of the network function into a structured component in $\mathcal{H}_K(B_h) \cap \mathcal{H}_{S,r_0}$ and a small $L_\infty$ error (Theorem 4.2) provides a rigorous foundation for the analysis. The adaptive degree selection algorithm (Algorithm 2) is a practical theoretical contribution that maintains the optimal rate without prior knowledge of $k_0$.

“sample complexity of n ≍ Θ(log(4/δ)·d^{k_0}/ε) ... in contrast with the representative sample complexity Θ(d^{k_0} max{ε^{-2}, log d})”

Yang and Li, Abstract · Abstract

Main concerns

The analysis relies on a non-standard "augmented feature" map $F(W(0),x)$ added to the network in equation (1), which ensures the NTK has strictly positive eigenvalues for odd-degree spherical harmonics; otherwise standard ReLU NTKs have zero eigenvalues for degrees $2t+1$ with $t \geq 1$ (Section 2.1). This architectural modification is crucial for the theory but represents a departure from standard practice. Additionally, the projection step requires the eigendecomposition of the $n \times n$ empirical NTK matrix, incurring $\Theta(n^3)$ cost. The width requirement is extremely demanding: Theorem 3.1 requires "the network width $m$ satisfies $m \gtrsim (n/d^{k_0})^{25/2} d^{5/2}$," which implies an impractical polynomial degree of 12.5. Finally, the sample size condition $n \geq \Theta(\log(4/\delta)\cdot d^{2k_0})$ is quadratic in the effective dimension.

“the network width m satisfies m ≳ (n/d^{k_0})^{25/2} d^{5/2}”

Yang and Li, Theorem 3.1 · Theorem 3.1, Equation (10)

“Such additional feature map ensures that the NTK associated with the network (1) is a PSD kernel K, to be defined in (2), and all of its eigenvalues are strictly positive”

Yang and Li, Section 2.1 · Section 2.1

Evidence and comparison

The comparison with prior work in Table 1 accurately situates the contribution alongside Nichani et al. (2022), Damian et al. (2022), and Ghorbani et al. (2021). The paper correctly notes that its rate is "nearly unimprovable since the trained network renders a nearly optimal rate ... the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{k_0})$ is $\Theta(d^{k_0}/n)$" (Raskutti et al., 2012). While this lower bound is for kernel methods, the paper demonstrates that the neural network trained via GDP achieves this bound, effectively matching the performance of the optimal kernel estimator while using a gradient-based training procedure.

“nearly unimprovable since ... the minimax optimal rate for the regression risk with a kernel of rank Θ(d^{k_0}) is Θ(d^{k_0}/n)”

Yang and Li, Abstract · Abstract

Reproducibility

Algorithm 1 and Algorithm 2 are specified with clear update rules and hyperparameters ($\eta = \Theta(1)$, $T \asymp n/d^{k_0}$). However, reproducibility faces practical barriers: no code repository is mentioned, and the projection operator $P^{(r_0)} = U\Sigma^{(r)}U^\top$ requires computing eigenvectors of the empirical NTK matrix $K_n$, which is $\Theta(n^3)$. The width requirement "$m \gtrsim (n/d^{k_0})^{25/2} d^{5/2}$" is extraordinarily large and likely impossible to satisfy in practice. The initialization scheme using symmetric pairs with opposite signs is clearly specified but the overall experimental cost remains prohibitive.

“m ≳ (n/d^{k_0})^{25/2} d^{5/2}”

Yang and Li, Theorem 3.1 · Theorem 3.1

Abstract

We study the problem of learning a low-degree spherical polynomial of degree $k_0 = \Theta(1) \ge 1$ defined on the unit sphere in $\RR^d$ by training an over-parameterized two-layer neural network with augmented feature in this paper. Our main result is the significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\eps \in (0, \Theta(d^{-k_0})]$, an over-parameterized two-layer neural network trained by a novel Gradient Descent with Projection (GDP) requires a sample complexity of $n \asymp \Theta( \log(4/\delta) \cdot d^{k_0}/\eps)$ with probability $1-\delta$ for $\delta \in (0,1)$, in contrast with the representative sample complexity $\Theta(d^{k_0} \max\set{\eps^{-2},\log d})$. Moreover, such sample complexity is nearly unimprovable since the trained network renders a nearly optimal rate of the nonparametric regression risk of the order $\log({4}/{\delta}) \cdot \Theta(d^{k_0}/{n})$ with probability at least $1-\delta$. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{k_0})$ is $\Theta(d^{k_0}/{n})$, so that the rate of the nonparametric regression risk of the network trained by GDP is nearly minimax optimal. In the case that the ground truth degree $k_0$ is unknown, we present a novel and provable adaptive degree selection algorithm which identifies the true degree and achieves the same nearly optimal regression rate. To the best of our knowledge, this is the first time that a nearly optimal risk bound is obtained by training an over-parameterized neural network with a popular activation function (ReLU) and algorithmic guarantee for learning low-degree spherical polynomials. Due to the feature learning capability of GDP, our results are beyond the regular Neural Tangent Kernel (NTK) limit.

Challenge the Review

Pick a starting point or write your own. Challenges run in the background, so you can keep reading while the AI investigates.

Challenges are public to read, but only signed-in members can post them. Your challenge text is stored with your account for moderation, but usernames are not shown in the public thread.

No challenges yet. Disagree with the review? Ask the AI to revisit a specific claim.