On the Interplay of Priors and Overparametrization in Bayesian Neural Network Posteriors
Bayesian neural networks (BNNs) suffer from fragmented, high-dimensional posteriors due to weight-space symmetries, raising doubts about the practicality of sampling-based inference. This paper demonstrates that overparametrization—using more hidden units than necessary—actually transforms the posterior geometry in beneficial ways. The authors identify three key phenomena induced by redundancy: balancedness (norm equalization across layers), weight reallocation on equal-probability manifolds (following Dirichlet distributions), and prior conformity (marginals aligning with zero-mean Gaussian priors). Through theory for ReLU networks and extensive experiments with up to 10 million posterior samples, the work explains why recent sampling methods succeed and provides a principled foundation for understanding weight priors in overparametrized regimes.
The paper presents a compelling theoretical framework connecting overparametrization to well-structured BNN posteriors, though its scope is deliberately limited to fully-connected ReLU architectures. The core insight, that redundancy induces balancedness and prior conformity via symmetry-induced equal-probability manifolds, is both novel and well-supported. The extensive empirical validation with large sampling budgets (up to 10 million samples) convincingly demonstrates that overparametrized posteriors exhibit smoother, more prior-aligned marginals than their fragmented underparametrized counterparts. However, the reliance on tube-conditioning assumptions and the restriction to specific network types temper the generality of the claims.
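To make the notion of an equal-probability manifold concrete, the following is a minimal sketch of one reallocation that leaves both the network function and the squared weight norm unchanged. It is an illustrative construction, not necessarily the parametrization used in the paper's Theorem 1: a single bias-free ReLU unit is split into k copies, with copy i scaled by sqrt(c_i) on both its incoming and outgoing weights, where the coefficients c lie on the probability simplex. The helper names (`forward`, `split_neuron`) are hypothetical, and the uniform Dirichlet draw is used only as a convenient way to pick a point on the simplex; the concentration parameters derived in the paper are not stated in this review.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(W_in, w_out, X):
    """One-hidden-layer ReLU network without biases: f(x) = w_out . relu(W_in x)."""
    return relu(X @ W_in.T) @ w_out

def split_neuron(w_in, w_out, coeffs):
    """Replace one hidden unit by len(coeffs) redundant copies.

    Copy i receives incoming weights sqrt(c_i) * w_in and outgoing weight
    sqrt(c_i) * w_out, with c_i >= 0 and sum(c_i) == 1 (assumed construction).
    """
    s = np.sqrt(np.asarray(coeffs, dtype=float))
    W_in = s[:, None] * w_in[None, :]  # shape (k, d)
    v = s * w_out                      # shape (k,)
    return W_in, v

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # five toy inputs with three features
w_in = rng.normal(size=3)     # incoming weights of the original unit
w_out = 1.7                   # outgoing weight of the original unit

# Original single-unit network: output and squared weight norm.
base_out = forward(w_in[None, :], np.array([w_out]), X)
base_sq_norm = np.sum(w_in**2) + w_out**2

# Reallocate the unit across four copies using a point on the simplex.
coeffs = rng.dirichlet(np.ones(4))
W_in, v = split_neuron(w_in, w_out, coeffs)
split_out = forward(W_in, v, X)
split_sq_norm = np.sum(W_in**2) + np.sum(v**2)

print(np.allclose(base_out, split_out), np.isclose(base_sq_norm, split_sq_norm))
```

Because ReLU is positively homogeneous, each copy contributes c_i times the original unit's output, so the function is preserved for any c on the simplex; the squared norm is preserved for the same reason. Under an isotropic Gaussian prior, every such reallocation therefore has the same likelihood and the same prior density.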
The mathematical connection between L2 regularization (isotropic Gaussian priors) and balancedness across network layers is rigorously established. Theorem 2 shows that at stationarity, adjacent layers satisfy $\tau_{l}^{-2}\mathbb{E}_{\pi}[\|\mathbf{W}_{l}\|_{F}^{2}] - \tau_{l+1}^{-2}\mathbb{E}_{\pi}[\|\mathbf{W}_{l+1}\|_{F}^{2}] = d_{l} - d_{l+1}$, extending optimization literature results (Du et al., 2018) into the Bayesian sampling regime. The derivation of Dirichlet-distributed reallocation coefficients on minimum-norm manifolds (Theorem 1 and Corollary 1) provides a clean probabilistic characterization of how redundant neurons distribute weight norms. Empirically, the visualization of posterior marginals in Figures 1, 2, and 3 strongly validates the theoretical predictions: underparametrized models show fragmented, multimodal posteriors while overparametrized models exhibit smooth, zero-centered Gaussian-like marginals.
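The balancedness identity quoted above can be checked numerically from posterior draws. The sketch below estimates its left-hand side by Monte Carlo. Two assumptions are made here and are not taken from the paper: the review does not define $d_l$, so it is read as the number of entries of $\mathbf{W}_l$; and the "posterior samples" are simply draws from the isotropic Gaussian prior, standing in for real MCMC output. With that reading, the identity holds exactly for the prior alone, which makes it a convenient sanity check. The function name `balancedness_gap` is hypothetical.

```python
import numpy as np

def balancedness_gap(W_l, W_next, tau_l, tau_next):
    """Monte Carlo estimate of
    tau_l^-2 E[||W_l||_F^2] - tau_{l+1}^-2 E[||W_{l+1}||_F^2].

    W_l, W_next: arrays of sampled weight matrices, sample index first.
    """
    sq_l = np.sum(W_l.reshape(W_l.shape[0], -1) ** 2, axis=1)
    sq_next = np.sum(W_next.reshape(W_next.shape[0], -1) ** 2, axis=1)
    return np.mean(sq_l) / tau_l**2 - np.mean(sq_next) / tau_next**2

rng = np.random.default_rng(0)
n_samples, tau_l, tau_next = 20_000, 0.5, 2.0

# Stand-in samples: prior draws for a 3 -> 8 -> 1 network (no likelihood term).
W_l = tau_l * rng.normal(size=(n_samples, 8, 3))
W_next = tau_next * rng.normal(size=(n_samples, 1, 8))

gap = balancedness_gap(W_l, W_next, tau_l, tau_next)
expected = 8 * 3 - 1 * 8  # d_l - d_{l+1} under the assumed reading of d_l
print(gap, expected)
```

With samples from an actual sampler, the same function could be applied to each pair of adjacent layers; a gap that stays far from the expected value would point to either non-stationary chains or a different definition of $d_l$ than the one assumed here.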
The theoretical analysis relies on several restrictive assumptions that limit practical applicability. The tube-conditioning framework (Definition 2) requires manifold regularity, volume factorization, and weak limit existence (Assumptions 2-4 in Appendix B), which are stated but not verified for specific ReLU architectures beyond the shallow case. The restriction to fully-connected networks ignores the challenges posed by skip connections, batch normalization, or attention mechanisms where symmetries interact differently. While the authors empirically test CNNs and note similar patterns, the theoretical guarantees do not extend to these architectures. Additionally, the analysis focuses on local geometry around minimum-norm manifolds ($\mathcal{M}$) and does not characterize the full marginal posterior shape, potentially overstating the extent of "prior conformity" in practical settings where samples may not concentrate tightly on $\mathcal{M}$.
The comparison to variational inference literature is fair and well-contextualized. The paper correctly contrasts its findings with Trippe and Turner (2017), noting that while variational methods may induce conditional independence between parameters and data in overparametrized regimes, sampling-based inference preserves complex correlation structures (Figure 8). The connection to optimization literature—particularly Du et al. (2018) on algorithmic balancedness and Kim et al. (2025) on loss landscape convexity—is appropriately credited and extended. However, the paper could engage more deeply with infinite-width BNN theory (Neural Tangent Kernel regime), which also studies overparametrization but reaches different conclusions about posterior behavior. The experimental evidence is strong for the studied regimes but limited primarily to UCI tabular datasets; the single CIFAR-10 result in Appendix E is insufficient to validate claims about scalability to modern deep learning settings.
The paper demonstrates strong reproducibility practices. The authors provide detailed experimental protocols in Appendix D, including dataset specifications (UCI repositories), hyperparameter settings, and performance metrics (LPPD). Software dependencies are clearly specified (JAX, BlackJAX, NumPyro), and the code is available via anonymous links during review. The compute infrastructure (GPUs) is described, and sampling budgets (up to 10 million samples, 10,000 independent chains) are explicitly stated. The checklist confirms that all training details, data splits, and random seed handling are documented. The only barrier to full reproduction is computational: replicating the 10-million-sample experiments requires significant GPU resources, though the paper provides trajectory analyses suggesting convergence with fewer chains (10-20) for practical purposes.
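For readers reproducing the reported metric, the following is a minimal sketch of the log pointwise predictive density (LPPD) in its standard form: for each test point, the likelihood is averaged over posterior samples, and the logs of those averages are summed. Whether the paper reports the sum or a per-point mean is not stated in this review, so the summed form is an assumption, as are the function name `lppd` and the toy likelihood values.

```python
import numpy as np

def lppd(log_lik):
    """Summed LPPD from a matrix log_lik[s, i] = log p(y_i | x_i, theta_s).

    Uses a max-shifted log-mean-exp over the sample axis for numerical stability.
    """
    log_lik = np.asarray(log_lik, dtype=float)
    n_samples = log_lik.shape[0]
    m = np.max(log_lik, axis=0)
    log_mean = m + np.log(np.sum(np.exp(log_lik - m), axis=0)) - np.log(n_samples)
    return float(np.sum(log_mean))

# Toy example: two posterior samples (rows), two test points (columns).
log_lik = np.log(np.array([[0.2, 0.5],
                           [0.4, 0.5]]))
value = lppd(log_lik)
print(value)
```

In the toy example the per-point predictive densities are 0.3 and 0.5, so the result equals log(0.3) + log(0.5).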
Bayesian neural network (BNN) posteriors are often considered impractical for inference, as symmetries fragment them, non-identifiabilities inflate dimensionality, and weight-space priors are seen as meaningless. In this work, we study how overparametrization and priors together reshape BNN posteriors and derive implications allowing us to better understand their interplay. We show that redundancy introduces three key phenomena that fundamentally reshape the posterior geometry: balancedness, weight reallocation on equal-probability manifolds, and prior conformity. We validate our findings through extensive experiments with posterior sampling budgets that far exceed those of earlier works, and demonstrate how overparametrization induces structured, prior-aligned weight posterior distributions.