dynActivation: A Trainable Activation Family for Adaptive Nonlinearity
dynActivation addresses the rigidity of fixed activation functions by introducing per-layer trainable scalars that interpolate between a base nonlinearity and a linear path. The method adds only two parameters per layer ($\alpha_i$ and $\beta_i$) via $f_i(x) = \text{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, allowing nonlinearity to be allocated adaptively across depth. Reported results include gains on vision benchmarks (up to +14.02% over static Mish on CIFAR-10), robustness to extreme depth scaling (at least 95.3% accuracy across a 1-to-75-layer MNIST sweep), and faster convergence (24% convergence-AUC reduction relative to Mish), though LLM perplexity gains vanish in long-run training.
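To make the formula concrete, the following is a minimal scalar sketch in plain Python, not the authors' implementation. The function names `mish` and `dyn_activation` are hypothetical, and Mish is used as the base only because it is the variant the review discusses most; in an actual network $\alpha_i$ and $\beta_i$ would be trainable per-layer parameters rather than function arguments.

```python
import math


def mish(x: float) -> float:
    """Mish base activation: x * tanh(softplus(x))."""
    # Stable softplus: log(1 + e^x) = max(x, 0) + log1p(e^{-|x|})
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return x * math.tanh(softplus)


def dyn_activation(x: float, alpha: float, beta: float, base_act=mish) -> float:
    """f(x) = BaseAct(x) * (alpha - beta) + beta * x for one layer's scalars.

    Hypothetical helper for illustration; alpha and beta stand in for the
    learned per-layer scalars described in the paper.
    """
    return base_act(x) * (alpha - beta) + beta * x


# alpha = 1, beta = 0 recovers the base activation unchanged.
assert dyn_activation(-1.5, alpha=1.0, beta=0.0) == mish(-1.5)

# alpha == beta removes the nonlinear term, leaving the linear map beta * x.
assert dyn_activation(-1.5, alpha=0.5, beta=0.5) == 0.5 * -1.5
```

The two assertions mark the endpoints of the interpolation: the unmodified base nonlinearity and a purely linear layer. Intermediate settings of the two scalars mix the two paths.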
The paper presents a simple, elegant formulation for trainable activations with compelling empirical results on image classification, demonstrating both accuracy improvements and training efficiency gains. The MNIST depth-scaling experiments provide particularly convincing evidence that learned linearization in deep layers prevents collapse, though the long-run LLM results show only transient benefits and raise questions about asymptotic utility. The work is statistically rigorous in its comparisons but would benefit from deeper theoretical analysis of why specific $(\alpha, \beta)$ values emerge.
The core mechanism, a linear path controlled by $\beta_i$ added alongside a weighted base activation, is well-motivated and empirically validated across diverse architectures. The MNIST depth-scaling study provides the strongest evidence: "dynAct (Mish base) is the only activation that never drops below 95% across the entire sweep, ranging between 95.3% and 99.3%" while "ReLU collapses below 80% at approximately 25 layers" (Section 6.2). Visualizations of the learned parameters confirm that deep layers naturally converge to near-linear behavior ($\alpha \approx \beta \approx 0.5$), so these layers act much like learned skip connections. The adversarial robustness gains (7.40 pp advantage over ReLU under FGSM $\epsilon=0.08$) and convergence improvements (24% AUC reduction) are consistent across variants.
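The near-linear regime follows directly from the definition. As a short worked check (the derivative is written out here for illustration and is not taken from the paper):

$$
\begin{aligned}
f_i(x) &= (\alpha_i - \beta_i)\,\mathrm{BaseAct}(x) + \beta_i x, \\
f_i'(x) &= (\alpha_i - \beta_i)\,\mathrm{BaseAct}'(x) + \beta_i, \\
\alpha_i = \beta_i = 0.5 &\;\Rightarrow\; f_i(x) = 0.5\,x, \quad f_i'(x) = 0.5 .
\end{aligned}
$$

When $\alpha_i = \beta_i$, the activation has a constant, input-independent slope $\beta_i$ and therefore cannot saturate. This is consistent with, though not a proof of, the interpretation that linearized deep layers help avoid the collapse seen with ReLU.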
The LLM transfer results exhibit a critical limitation: while dynActGLU(Swish) achieves a 10.3% relative perplexity reduction at 5,620 steps, "the gap vanishes at 34,300 steps" (Abstract), suggesting the benefit is convergence acceleration rather than asymptotic improvement. The claim of "up to +54% training efficiency" combines hardware throughput with convergence metrics, which makes it hard to separate per-step compute overhead from wall-clock gains. The paper also lacks theoretical justification for why $\bar{\alpha}$ converges to $\approx 0.75$ and $\bar{\beta}$ to $\approx -0.08$; these empirical observations remain unexplained. Additionally, distribution-shift robustness is mixed, with dynActivation falling behind under contrast corruptions (-4.38 pp vs Swish), indicating that the learned parameters may introduce vulnerability trade-offs.
The evidence robustly supports claims for computer vision tasks. Statistical significance testing (Table 8) shows dynActivation(Mish) significantly outperforms static Mish ($p = 0.0375$) and ReLU ($p = 0.0205$) across 27 activation benchmarks, ranking first at 79.72% accuracy in the extended comparison. The comparison to related work is generally fair, though the authors acknowledge that the advantage over Apa is not statistically significant ($p = 0.2598$). The ablation studies covering initialization, optimizers (27 configurations), and regularization are thorough, though regularization shows only marginal gains (+0.20 pp).
The experimental protocol is well-documented with fixed seeds, explicit hyperparameters (Adam, lr=0.001, batch size 128 for CIFAR), and 5 runs per configuration. However, crucial implementation details are missing: no code repository is provided, and the dynActGLU formulation for LLM experiments lacks implementation specifics beyond the equation. While hardware specifications (RTX 2080 Ti, RTX 5090) are reported, CUDA and PyTorch versions are omitted. The MNIST depth-scaling study uses "two convolutional front-end layers and fully connected layers" but exact layer widths and dropout rates are not specified, which would be needed to reproduce the 75-layer experiments.
This paper proposes $\mathrm{dynActivation}$, a per-layer trainable activation defined as $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, where $\alpha_i$ and $\beta_i$ are lightweight learned scalars that interpolate between the base nonlinearity and a linear path, and $\mathrm{BaseAct}(x)$ is any ReLU-like function. The static and dynamic ReLU-like variants are then compared across multiple vision tasks, language modeling tasks, and ablation studies. The results suggest that dynActivation variants tend to linearize deep layers while maintaining high performance, which can improve training efficiency by up to $+54\%$ over ReLU. On CIFAR-10, dynActivation(Mish) improves over static Mish by up to $+14.02\%$ on AttentionCNN, with an average improvement of $+6.00\%$ and a $24\%$ convergence-AUC reduction relative to Mish (2120 vs. 2785). In a 1-to-75-layer MNIST depth-scaling study, dynActivation never drops below $95\%$ test accuracy ($95.3$--$99.3\%$), while ReLU collapses below $80\%$ at 25 layers. Under FGSM at $\varepsilon{=}0.08$, dynActivation(Mish) incurs a $55.39\%$ accuracy drop versus $62.79\%$ for ReLU (a $7.40$ pp advantage). Transferred to language modeling, a newly proposed dynActGLU variant achieves a $10.3\%$ relative perplexity reduction over SwiGLU at 5620 steps (4.047 vs. 4.514), though the gap vanishes at 34300 steps.
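The headline percentages can be cross-checked against the raw figures quoted in the abstract. The snippet below is only an arithmetic sanity check; the helper name `relative_reduction` is hypothetical, and the inputs are the numbers stated above.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# SwiGLU vs. dynActGLU at 5620 steps (values as quoted in the abstract)
ppl = relative_reduction(4.514, 4.047)
# Convergence AUC: static Mish vs. dynActivation(Mish)
auc = relative_reduction(2785, 2120)
# FGSM accuracy drop: ReLU minus dynActivation(Mish), in percentage points
fgsm_gap_pp = 62.79 - 55.39

print(f"perplexity reduction: {ppl:.1%}")   # 10.3%
print(f"AUC reduction: {auc:.1%}")          # 23.9%
print(f"FGSM gap: {fgsm_gap_pp:.2f} pp")    # 7.40 pp
```

The perplexity and FGSM figures match the stated values exactly, and the AUC figure (23.9%) matches the stated 24% after rounding. The FGSM gap is a difference in percentage points rather than a relative percentage.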