Rethinking Plasticity in Deep Reinforcement Learning
This paper reframes plasticity loss in deep reinforcement learning as an optimization pathology rather than capacity degradation. The core claim, dubbed the Optimization-Centric Plasticity (OCP) hypothesis, is that parameters become trapped at optima from previous tasks, which then act as poor local optima for new tasks. The authors prove that neuron dormancy is mathematically equivalent to zero-gradient states and show that plasticity recovers when tasks differ sufficiently, suggesting that networks retain their capacity but are inhibited by the task-specific optimization landscape.
The paper offers a compelling theoretical reframing of plasticity loss through the lens of optimization dynamics rather than descriptive metrics. The equivalence between dormancy and zero-gradient states provides a rigorous foundation, and the task-specific recovery experiments strongly support the claim that capacity remains intact but is trapped. However, the work is undermined by a mismatch between theory (which assumes smooth activations) and experiments (which use ReLU), along with some experimental designs that conflate dormancy with architectural changes.
The theoretical framework linking dormancy to zero gradients is cleanly derived. Theorem 1 establishes that a neuron with dormancy index $s_{l,i}=0$ must have zero gradient $\nabla h_{l,i}(\mathbf{x})=0$ for all $\mathbf{x}\in D$, and conversely, zero gradient plus a single zero activation implies dormancy. This is validated empirically in Figure 4, which shows strong correlation between dormant neurons and zero-gradient neurons, with the OverLeap metric confirming that "once a neuron enters a dormant or zero-gradient state, it becomes irrecoverable." The task-switching experiment (Section 3.2) provides the strongest evidence: networks with high dormancy rates performed comparably to randomly initialized networks when switched to a "significantly different" regression task, directly supporting the claim that "plasticity loss is not a fundamental issue."
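The forward direction of Theorem 1 can be illustrated numerically. The sketch below is not the authors' code: it assumes the dormancy index is the mean absolute activation of a neuron normalized by the layer average (the definition used by Sokar et al.), takes the gradient with respect to the input $\mathbf{x}$, and uses a hypothetical one-layer ReLU network on a toy dataset. The names `X`, `W`, `b`, `dormancy_index`, and `grad_norm` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset D with non-negative inputs, and one ReLU layer.
X = rng.uniform(0.0, 1.0, size=(256, 4))
W = rng.normal(size=(4, 8))
b = rng.normal(size=8)

# Neuron 0 is forced dead: non-positive weights and a negative bias give a
# negative pre-activation for every non-negative input.
W[:, 0] = -np.abs(W[:, 0])
b[0] = -1.0
# Neuron 1 is forced alive so the layer average used for normalization is > 0.
W[:, 1] = np.abs(W[:, 1])
b[1] = 1.0

pre = X @ W + b              # pre-activations, shape (256, 8)
act = np.maximum(pre, 0.0)   # ReLU activations

# Assumed dormancy index: mean |activation| per neuron over D,
# divided by the average of that quantity across the layer.
mean_abs = np.abs(act).mean(axis=0)
dormancy_index = mean_abs / mean_abs.mean()

# Gradient of each activation w.r.t. the input: relu'(pre) * W[:, i].
relu_grad = (pre > 0.0).astype(float)                 # (256, 8)
input_grad = relu_grad[:, None, :] * W[None, :, :]    # (256, 4, 8)
# Largest gradient norm over the dataset, per neuron.
grad_norm = np.linalg.norm(input_grad, axis=1).max(axis=0)

print("dormancy index:", np.round(dormancy_index, 3))
print("max grad norm: ", np.round(grad_norm, 3))
```

In this toy setting the dead neuron has a dormancy index of exactly zero and a zero gradient at every sample, while the live neuron has neither. The gradient with respect to the weights, `relu'(pre) * x`, vanishes for the dead neuron for the same reason.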
A critical gap exists between theory and practice: Theorem 1 assumes continuously differentiable activations (Assumption 1: $h_{l,i}\in C^{1}$), yet the experiments use ReLU, which is not differentiable at zero. As stated, the theorem therefore does not cover the ReLU networks on which dormancy is actually measured. The introductory experiment (Figure 1) comparing PPO-ReLU with PPO-No-Act is also misleading: removing activation functions eliminates network nonlinearity entirely, so the observation that "higher dormancy ratio corresponds to a faster convergence rate" conflates dormancy with architectural capacity rather than isolating plasticity effects. Additionally, the third contribution claims "significant reduction of plasticity loss with gradient-free optimization methods," but no such experiments appear in the provided text, suggesting either missing sections or unsubstantiated claims.
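The concern about the PPO-No-Act comparison rests on a standard fact: stacked affine layers with no activation collapse into a single affine map. A minimal sketch, with hypothetical layer sizes and random weights rather than the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer network with NO activation between the layers.
W1, b1 = rng.normal(size=(4, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 2)), rng.normal(size=2)
x = rng.normal(size=(32, 4))


def two_layer_no_act(inputs):
    """Forward pass through both affine layers without a nonlinearity."""
    return (inputs @ W1 + b1) @ W2 + b2


# Equivalent single affine layer: W_eq = W1 W2, b_eq = b1 W2 + b2.
W_eq = W1 @ W2
b_eq = b1 @ W2 + b2
collapsed = x @ W_eq + b_eq

max_diff = np.abs(two_layer_no_act(x) - collapsed).max()
print("max abs difference:", max_diff)
```

The two outputs agree up to floating-point error, so the no-activation variant is a linear model of its inputs regardless of depth. Any difference in convergence speed relative to PPO-ReLU may therefore reflect this change in function class as much as a change in dormancy.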
The paper effectively critiques existing descriptive metrics—dormant neurons (Sokar et al.), effective rank (Gulcehre et al.), and loss landscape characteristics (Lyle et al.)—arguing they "fail to explain the underlying optimization dynamics." The OCP hypothesis successfully explains why parameter constraints (weight clipping, regenerative regularization) improve plasticity: they "prevent deep entrenchment in local optima." However, the paper lacks direct experimental comparison to recent plasticity restoration methods like Continual Backpropagation or ReDO, which would strengthen claims about practical superiority. The evidence for "diverse non-stationary scenarios" is thin, with HalfCheetah-v4 dominating the presented results.
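For readers unfamiliar with the two parameter constraints mentioned above, the following sketch shows one common form of each. It assumes weight clipping means an elementwise projection onto $[-c, c]$ and regenerative regularization means an L2 penalty toward the initial parameters; the paper's exact variants are not given in the provided text. The function names, `bound`, and `strength` are hypothetical.

```python
import numpy as np


def clip_weights(params, bound):
    """Weight clipping: project every parameter into [-bound, bound]."""
    return np.clip(params, -bound, bound)


def regenerative_penalty(params, init_params, strength):
    """L2 penalty on the distance from the initial parameters (assumed form)."""
    return strength * np.sum((params - init_params) ** 2)


def regenerative_grad(params, init_params, strength):
    """Gradient of the penalty above with respect to params."""
    return 2.0 * strength * (params - init_params)


rng = np.random.default_rng(0)
theta_init = rng.normal(scale=0.1, size=5)
# Parameters that have drifted far from initialization on an earlier task.
theta = theta_init + np.array([3.0, -4.0, 0.5, 0.0, 2.0])

clipped = clip_weights(theta, bound=1.0)

# One gradient step on the penalty alone pulls theta back toward theta_init.
lr, strength = 0.1, 0.5
theta_next = theta - lr * regenerative_grad(theta, theta_init, strength)

print("drift before:", np.linalg.norm(theta - theta_init))
print("drift after: ", np.linalg.norm(theta_next - theta_init))
```

Both operations bound how far parameters can move from a reference region, which is consistent with the review's reading that they "prevent deep entrenchment in local optima."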
Hyperparameters are documented in Appendix B with reasonable detail: PPO with learning rate $1\times 10^{-4}$, $\gamma=0.99$, GAE $\lambda=0.95$, 32 minibatches, clipping coefficient 0.2, and value loss coefficient 0.5. The regression task is explicitly defined by Equation 15 in Appendix A with the exact functional form including $y=2.5X_{0}-1.2X_{1}^{2}+\dots+\varepsilon$. However, no code repository URL is provided, and the criteria for classifying neurons as dormant (thresholds for dormancy index) or zero-gradient (numerical tolerance) are not explicitly stated in the main text. The claim of validation across "diverse non-stationary scenarios" is not supported by the provided figures, which focus primarily on HalfCheetah-v4.
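The reported settings can be gathered in one place, which also makes the undocumented ones explicit. This is only a sketch: the values come from the review above, while the class name, the field names, and the two `None` placeholders are hypothetical and not taken from the paper or any released code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PPOConfig:
    # Values reported in Appendix B, as summarized in the review.
    learning_rate: float = 1e-4
    gamma: float = 0.99
    gae_lambda: float = 0.95
    num_minibatches: int = 32
    clip_coef: float = 0.2
    vf_coef: float = 0.5
    # Not stated in the main text; deliberately left unset rather than guessed.
    dormancy_threshold: Optional[float] = None
    zero_grad_tolerance: Optional[float] = None


config = PPOConfig()
unspecified = [name for name, value in vars(config).items() if value is None]
print("settings still needed to reproduce the dormancy analysis:", unspecified)
```

Listing the missing fields this way shows what a reproduction would still need from the authors before the dormant and zero-gradient neuron counts could be recomputed.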
This paper investigates the fundamental mechanisms driving plasticity loss in deep reinforcement learning (RL), a critical challenge where neural networks lose their ability to adapt to non-stationary environments. While existing research often relies on descriptive metrics like dormant neurons or effective rank, these summaries fail to explain the underlying optimization dynamics. We propose the Optimization-Centric Plasticity (OCP) hypothesis, which posits that plasticity loss arises because optimal points from previous tasks become poor local optima for new tasks, trapping parameters during task transitions and hindering subsequent learning. We theoretically establish the equivalence between neuron dormancy and zero-gradient states, demonstrating that the absence of gradient signals is the primary driver of dormancy. Our experiments reveal that plasticity loss is highly task-specific; notably, networks with high dormancy rates in one task can achieve performance parity with randomly initialized networks when switched to a significantly different task, suggesting that the network's capacity remains intact but is inhibited by the specific optimization landscape. Furthermore, our hypothesis elucidates why parameter constraints mitigate plasticity loss by preventing deep entrenchment in local optima. Validated across diverse non-stationary scenarios, our findings provide a rigorous optimization-based framework for understanding and restoring network plasticity in complex RL domains.