Rethinking Plasticity in Deep Reinforcement Learning

cs.LG cs.AI Zhiqiang He · Mar 22, 2026
Plain-language introduction

This paper reframes plasticity loss in deep reinforcement learning as an optimization pathology rather than a degradation of capacity. Its core claim, the Optimization-Centric Plasticity (OCP) hypothesis, is that optima reached on previous tasks become poor local optima for new tasks and trap the parameters there. The authors prove that neuron dormancy is mathematically equivalent to a zero-gradient state (together with at least one zero activation) and show that plasticity recovers when the new task differs sufficiently from the old one, suggesting that networks retain their capacity but cannot use it within the task-specific optimization landscape.

Critical review
Verdict
Bottom line

The paper offers a compelling theoretical reframing of plasticity loss through the lens of optimization dynamics rather than descriptive metrics. The equivalence between dormancy and zero-gradient states provides a rigorous foundation, and the task-specific recovery experiments strongly support the claim that capacity remains intact but is trapped. However, the work is undermined by a mismatch between theory (which assumes smooth activations) and experiments (which use ReLU), along with some experimental designs that conflate dormancy with architectural changes.

“plasticity loss arises because optimal points from previous tasks become poor local optima for new tasks, trapping parameters during task transitions and hindering subsequent learning”
paper · Abstract
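To make the trapping mechanism in this quote concrete, here is a one-dimensional toy sketch. The loss function, the assumed old task, the learning rate, and all names (`loss_b`, `grad_b`, `descend`) are hypothetical illustrations and are not taken from the paper. The old task's optimum at theta = 1 lies inside the basin of a shallow local minimum of the new task, so plain gradient descent started there never reaches the deeper minimum.

```python
# Toy illustration only: these losses are invented, not taken from the paper.

def loss_b(theta):
    """New-task loss: a tilted double well with one deep and one shallow minimum."""
    return (theta ** 2 - 1.0) ** 2 + 0.3 * theta

def grad_b(theta):
    """Analytic derivative of loss_b."""
    return 4.0 * theta * (theta ** 2 - 1.0) + 0.3

def descend(theta, lr=0.05, steps=500):
    """Plain gradient descent on the new task from a given starting point."""
    for _ in range(steps):
        theta -= lr * grad_b(theta)
    return theta

# The old task is assumed to be (theta - 1)^2, so its optimum is theta = 1.
old_task_optimum = 1.0
trapped = descend(old_task_optimum)  # stays in the shallow right-hand well
fresh = descend(-1.0)                # a start in the other basin finds the deep well

print(f"from old optimum: theta={trapped:.3f}, loss={loss_b(trapped):.3f}")
print(f"from other basin: theta={fresh:.3f}, loss={loss_b(fresh):.3f}")
```

In this toy setting the parameter itself is perfectly capable of reaching the better solution. Only the starting point inherited from the old task prevents it, which is the intuition the hypothesis relies on.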
What holds up

The theoretical framework linking dormancy to zero gradients is cleanly derived. Theorem 1 establishes that a neuron with dormancy index $s_{l,i}=0$ must have zero gradient $\nabla h_{l,i}(\mathbf{x})=0$ for all $\mathbf{x}\in D$, and conversely, zero gradient plus a single zero activation implies dormancy. This is validated empirically in Figure 4, which shows strong correlation between dormant neurons and zero-gradient neurons, with the OverLeap metric confirming that "once a neuron enters a dormant or zero-gradient state, it becomes irrecoverable." The task-switching experiment (Section 3.2) provides the strongest evidence: networks with high dormancy rates performed comparably to randomly initialized networks when switched to a "significantly different" regression task, directly supporting the claim that "plasticity loss is not a fundamental issue."

“Then the following two statements are equivalent: (A) Dormancy: $s_{l,i}=0$... (B) Zero Gradient (on $D$) plus at least one zero activation”
paper · Theorem 1
“networks trained for 20 million steps with a high proportion of dormant neurons... exhibited similar learning performance on the regression task... indicating that plasticity loss is not a fundamental issue”
paper · Section 3.2
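The relationship between statements (A) and (B) can be checked on a tiny hand-built ReLU layer. This is a sketch under assumptions: the review does not quote the paper's definition of $s_{l,i}$, so the code uses the normalized mean absolute activation from Sokar et al.; it takes the ReLU derivative at zero to be 0; and the weights, biases, and data are made up. The third neuron shows why (B) needs the extra "at least one zero activation" condition: a constant, positive neuron has zero input gradient but is not dormant.

```python
# Sketch under assumptions: dormancy index = normalized mean |activation|
# (Sokar et al. style); layer weights and data are invented for illustration.

def relu(z):
    return z if z > 0.0 else 0.0

def pre_activation(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def dormancy_indices(weights, biases, data):
    """s_i = mean_x |h_i(x)| divided by the layer-wide average of that mean."""
    mean_abs = [
        sum(abs(relu(pre_activation(w, b, x))) for x in data) / len(data)
        for w, b in zip(weights, biases)
    ]
    layer_mean = sum(mean_abs) / len(mean_abs)
    return [m / layer_mean if layer_mean > 0 else 0.0 for m in mean_abs]

def has_zero_input_gradient(w, b, data):
    """True if grad_x h(x) = 0 on every sample (convention: relu'(0) = 0)."""
    if not any(w):
        return True  # constant neuron: the gradient w.r.t. x is zero everywhere
    return all(pre_activation(w, b, x) <= 0.0 for x in data)

def has_zero_activation(w, b, data):
    return any(relu(pre_activation(w, b, x)) == 0.0 for x in data)

data = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]
weights = [[1.0, 0.5], [-1.0, -1.0], [0.0, 0.0]]
biases = [0.0, -5.0, 0.5]

s = dormancy_indices(weights, biases, data)
dormant = [i for i, si in enumerate(s) if si == 0.0]                    # statement (A)
zero_grad = [i for i, (w, b) in enumerate(zip(weights, biases))
             if has_zero_input_gradient(w, b, data)]
condition_b = [i for i in zero_grad
               if has_zero_activation(weights[i], biases[i], data)]     # statement (B)

print("dormant:", dormant, "zero-grad:", zero_grad, "condition B:", condition_b)
```

On this toy layer the dormant set and the condition-(B) set coincide, while the plain zero-gradient set is strictly larger.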
Main concerns

A critical gap separates theory and practice. Theorem 1 assumes continuously differentiable activations (Assumption 1: $h_{l,i}\in C^{1}$), yet the experiments use ReLU, which is not differentiable at zero, so the theorem as stated does not directly cover the networks that were actually trained. The introductory experiment (Figure 1) comparing PPO-ReLU with PPO-No-Act is also misleading: removing the activation functions eliminates the network's nonlinearity entirely, so the observation that "higher dormancy ratio corresponds to a faster convergence rate" conflates dormancy with a change in architectural capacity rather than isolating plasticity effects. Finally, the third contribution claims "significant reduction of plasticity loss with gradient-free optimization methods," but no such experiments appear in the provided text, which suggests either missing sections or an unsubstantiated claim.

“$h_{l,i}(\mathbf{x})$ is continuously differentiable with respect to $\mathbf{x}$, i.e., $h_{l,i}\in C^{1}(\mathbb{R}^{n_{l}})$”
paper · Assumption 1
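The mismatch with Assumption 1 can be seen numerically. The sketch below compares one-sided finite-difference slopes of ReLU at zero with those of softplus, which is used here only as an example of a $C^{1}$ activation. Softplus, the step size `eps`, and the helper name `one_sided_slopes` are assumptions of this illustration, not choices made in the paper.

```python
import math

def relu(z):
    return max(0.0, z)

def softplus(z):
    """A smooth (C^1) activation used for comparison: log(1 + exp(z))."""
    return math.log1p(math.exp(z))

def one_sided_slopes(f, z0=0.0, eps=1e-6):
    """Finite-difference slopes approaching z0 from the left and from the right."""
    left = (f(z0) - f(z0 - eps)) / eps
    right = (f(z0 + eps) - f(z0)) / eps
    return left, right

relu_left, relu_right = one_sided_slopes(relu)
soft_left, soft_right = one_sided_slopes(softplus)

print("ReLU slopes at 0:    ", relu_left, relu_right)
print("softplus slopes at 0:", soft_left, soft_right)
```

The ReLU slopes disagree (0 from the left, 1 from the right), so no single derivative exists at zero, whereas the softplus slopes agree up to the finite-difference error.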
“PPO-No-Act lacks activation functions, while PPO-ReLu uses ReLu. PPO-ReLu achieves faster convergence with higher dormancy rates”
paper · Figure 1 caption
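The reason the PPO-No-Act comparison changes capacity, and not just dormancy, is that stacked linear layers without an activation collapse into a single linear map. A minimal sketch follows, with made-up weight matrices and biases omitted for brevity (with biases the composition is a single affine map instead).

```python
# Two stacked linear layers with no activation equal one linear layer.
# W1, W2, and x are arbitrary illustrative values.

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

W1 = [[1.0, -2.0], [0.5, 3.0], [2.0, 0.0]]   # 2 inputs -> 3 hidden units
W2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]     # 3 hidden units -> 2 outputs
x = [3.0, -1.0]

deep_no_act = matvec(W2, matvec(W1, x))       # "No-Act" network
collapsed = matvec(matmul(W2, W1), x)         # single equivalent linear layer
deep_relu = matvec(W2, [max(0.0, h) for h in matvec(W1, x)])

print("no activation:", deep_no_act)
print("collapsed:    ", collapsed)
print("with ReLU:    ", deep_relu)
```

The first two outputs are identical, while the ReLU network computes a different function of the same weights, so the two agents in Figure 1 differ in more than their dormancy ratio.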
Evidence and comparison

The paper effectively critiques existing descriptive metrics, namely dormant neurons (Sokar et al.), effective rank (Gulcehre et al.), and loss landscape characteristics (Lyle et al.), arguing they "fail to explain the underlying optimization dynamics." The OCP hypothesis also offers an explanation for why parameter constraints (weight clipping, regenerative regularization) improve plasticity: they "prevent deep entrenchment in local optima." However, the paper lacks direct experimental comparison with recent plasticity restoration methods such as Continual Backpropagation or ReDo, which would strengthen claims about practical superiority. The evidence for "diverse non-stationary scenarios" is thin, with HalfCheetah-v4 dominating the presented results.

“While these metrics are merely a summary of certain phenomena, they do not fully explain the fundamental mechanisms driving plasticity loss”
paper · Section 1
“regularization methods that constrain the parameter space can effectively mitigate plasticity loss: by preventing parameters from becoming too deeply entrenched in local optima for any single task”
paper · Section 3.1
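The two parameter constraints named above can be sketched in a few lines. This is an illustration under assumptions rather than the paper's implementation: regenerative regularization is taken to mean an L2 penalty toward the initial parameters, and the function names, learning rate, penalty coefficient, and clipping bound are all hypothetical.

```python
# Sketch of two parameter constraints named in the review. The update rule,
# coefficient values, and function names are illustrative assumptions.

def sgd_step_l2_init(params, grads, init_params, lr=0.1, reg=0.01):
    """One SGD step on: task loss + reg * ||params - init_params||^2."""
    return [
        p - lr * (g + 2.0 * reg * (p - p0))
        for p, g, p0 in zip(params, grads, init_params)
    ]

def clip_weights(params, bound=1.0):
    """Weight clipping: project each parameter back into [-bound, bound]."""
    return [max(-bound, min(bound, p)) for p in params]

init = [0.1, -0.2]
params = [3.0, -0.2]   # first weight has drifted far from its initial value
grads = [0.0, 0.0]     # zero task gradient, e.g. at a stationary point

pulled = sgd_step_l2_init(params, grads, init)
clipped = clip_weights(params)

print("after L2-init step:", pulled)
print("after clipping:    ", clipped)
```

Even with a zero task gradient, the penalty keeps pulling the drifted weight back toward its initial value, and clipping bounds how far it can travel. Both limit how deeply the parameters can settle into one task's optimum.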
Reproducibility

Hyperparameters are documented in Appendix B in reasonable detail: PPO with an annealed learning rate starting at $1\times 10^{-4}$, weight decay $1\times 10^{-4}$, $\gamma=0.99$, GAE $\lambda=0.95$, 32 minibatches, clipping coefficient 0.2, and value loss coefficient 0.5. The regression task is explicitly defined by Equation 15 in Appendix A, which gives the full functional form $y=2.5X_{0}-1.2X_{1}^{2}+\dots+\varepsilon$. However, no code repository URL is provided, and the criteria for classifying neurons as dormant (the threshold on the dormancy index) or zero-gradient (the numerical tolerance) are not explicitly stated in the main text. The claim of validation across "diverse non-stationary scenarios" is also hard to verify, because the provided figures focus primarily on HalfCheetah-v4.

“We have chosen to use the PPO algorithm... annealed learning rate starting at $1\times 10^{-4}$ and a weight decay of $1\times 10^{-4}$... gamma discount factor of 0.99 and Generalized Advantage Estimation (GAE) with $\lambda=0.95$”
paper · Appendix B
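For anyone attempting a reproduction, the reported values can be collected into a configuration like the sketch below. The numeric values are those cited in this review; the key names, the `annealed_lr` helper, and the linear shape of the schedule are assumptions, since the quoted text says only that the learning rate is annealed.

```python
# Numeric values are those cited in the review; the key names and the linear
# annealing schedule are assumptions made for this sketch.
PPO_CONFIG = {
    "learning_rate": 1e-4,   # initial value, annealed during training
    "weight_decay": 1e-4,
    "gamma": 0.99,           # discount factor
    "gae_lambda": 0.95,      # Generalized Advantage Estimation lambda
    "num_minibatches": 32,
    "clip_coef": 0.2,        # PPO clipping coefficient
    "vf_coef": 0.5,          # value loss coefficient
}

def annealed_lr(update, total_updates, initial_lr=PPO_CONFIG["learning_rate"]):
    """Linear decay from initial_lr to 0 (assumed schedule shape)."""
    frac = 1.0 - update / total_updates
    return initial_lr * frac

print(annealed_lr(0, 100), annealed_lr(50, 100), annealed_lr(100, 100))
```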
“$y=2.5X_{0}-1.2X_{1}^{2}+0.8\sin(X_{2})+1.5\cos(X_{3})+0.7X_{4}X_{5}-0.3X_{6}^{3}+e^{-0.1X_{7}^{2}}+1.1X_{8}-0.5X_{9}^{2}+0.9\tanh(X_{10})+0.2X_{11}^{2}-0.6\sqrt{|X_{12}|}+0.5X_{13}X_{14}-0.4X_{15}+0.3X_{16}+\varepsilon$”
paper · Appendix A, Eq. 15
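Equation 15 translates directly into code. The function body below follows the quoted formula term by term; the sampling at the end is only a usage sketch, and the standard-normal features and the noise scale are assumptions, because the input and noise distributions are not quoted here.

```python
import math
import random

def regression_target(x, noise=0.0):
    """Eq. 15 as quoted above; x is a sequence of 17 features X_0..X_16."""
    return (
        2.5 * x[0]
        - 1.2 * x[1] ** 2
        + 0.8 * math.sin(x[2])
        + 1.5 * math.cos(x[3])
        + 0.7 * x[4] * x[5]
        - 0.3 * x[6] ** 3
        + math.exp(-0.1 * x[7] ** 2)
        + 1.1 * x[8]
        - 0.5 * x[9] ** 2
        + 0.9 * math.tanh(x[10])
        + 0.2 * x[11] ** 2
        - 0.6 * math.sqrt(abs(x[12]))
        + 0.5 * x[13] * x[14]
        - 0.4 * x[15]
        + 0.3 * x[16]
        + noise
    )

# Assumption: standard-normal features and small Gaussian noise.
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(17)]
y = regression_target(sample, noise=rng.gauss(0.0, 0.1))
print(y)
```

At the all-zeros input only the cosine and exponential terms contribute, giving 1.5 + 1 = 2.5, which is a quick sanity check on the transcription.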
Abstract

This paper investigates the fundamental mechanisms driving plasticity loss in deep reinforcement learning (RL), a critical challenge where neural networks lose their ability to adapt to non-stationary environments. While existing research often relies on descriptive metrics like dormant neurons or effective rank, these summaries fail to explain the underlying optimization dynamics. We propose the Optimization-Centric Plasticity (OCP) hypothesis, which posits that plasticity loss arises because optimal points from previous tasks become poor local optima for new tasks, trapping parameters during task transitions and hindering subsequent learning. We theoretically establish the equivalence between neuron dormancy and zero-gradient states, demonstrating that the absence of gradient signals is the primary driver of dormancy. Our experiments reveal that plasticity loss is highly task-specific; notably, networks with high dormancy rates in one task can achieve performance parity with randomly initialized networks when switched to a significantly different task, suggesting that the network's capacity remains intact but is inhibited by the specific optimization landscape. Furthermore, our hypothesis elucidates why parameter constraints mitigate plasticity loss by preventing deep entrenchment in local optima. Validated across diverse non-stationary scenarios, our findings provide a rigorous optimization-based framework for understanding and restoring network plasticity in complex RL domains.
