On the Interplay of Priors and Overparametrization in Bayesian Neural Network Posteriors

cs.LG stat.ML Julius Kobialka, Emanuel Sommer, Chris Kolb, Juntae Kwon, Daniel Dold, David Rügamer · Mar 23, 2026
AI Review
Plain-language introduction

Bayesian neural networks (BNNs) suffer from fragmented, high-dimensional posteriors due to weight-space symmetries, raising doubts about the practicality of sampling-based inference. This paper demonstrates that overparametrization—using more hidden units than necessary—actually transforms the posterior geometry in beneficial ways. The authors identify three key phenomena induced by redundancy: balancedness (norm equalization across layers), weight reallocation on equal-probability manifolds (following Dirichlet distributions), and prior conformity (marginals aligning with zero-mean Gaussian priors). Through theory for ReLU networks and extensive experiments with up to 10 million posterior samples, the work explains why recent sampling methods succeed and provides a principled foundation for understanding weight priors in overparametrized regimes.

Critical review
Verdict
Bottom line

The paper presents a compelling theoretical framework connecting overparametrization to well-structured BNN posteriors, though its scope is deliberately limited to fully-connected ReLU architectures. The core insight, that redundancy induces balancedness and prior conformity via symmetry-induced equal-probability manifolds, is both novel and well-supported. The extensive empirical validation with large sampling budgets (up to 10 million posterior samples) convincingly demonstrates that overparametrized posteriors exhibit smoother, more prior-aligned marginals than their fragmented underparametrized counterparts. However, the reliance on tube-conditioning assumptions and the restriction to specific network types temper the generality of the claims.

“redundancy introduces three key phenomena that fundamentally reshape the posterior geometry: balancedness, weight reallocation on equal-probability manifolds, and prior conformity”
Kobialka et al. · Abstract
“A limitation of our work is its theoretical exposition's main focus on fully-connected ReLU networks”
Kobialka et al. · Section 6
What holds up

The mathematical connection between L2 regularization (isotropic Gaussian priors) and balancedness across network layers is rigorously established. Theorem 2 shows that at stationarity, adjacent layers satisfy $\tau_{l}^{-2}\mathbb{E}_{\pi}[\|\mathbf{W}_{l}\|_{F}^{2}] - \tau_{l+1}^{-2}\mathbb{E}_{\pi}[\|\mathbf{W}_{l+1}\|_{F}^{2}] = d_{l} - d_{l+1}$, extending balancedness results from the optimization literature (Du et al., 2018) to the Bayesian sampling regime. The derivation of Dirichlet-distributed reallocation coefficients on minimum-norm manifolds (Theorem 1 and Corollary 1) provides a clean probabilistic characterization of how redundant neurons distribute weight norms. Empirically, the posterior marginals visualized in Figures 1, 2, and 3 strongly support the theoretical predictions: underparametrized models show fragmented, multimodal posteriors, while overparametrized models exhibit smooth, zero-centered, Gaussian-like marginals.
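The balancedness identity can be checked numerically on a toy posterior. The sketch below is not the authors' code. It assumes a two-layer linear model f(x) = b (a · x) with a in R^2 and b in R (linear rather than ReLU so that grid quadrature stays accurate; the positive rescaling symmetry involved is the same), independent zero-mean Gaussian priors with hypothetical scales tau1 and tau2, and made-up data. It also reads d_l as the number of weights in layer l, which this review does not define. Under those assumptions the left-hand side of the identity should come out close to 2 - 1 = 1.

```python
import numpy as np

# Hypothetical toy data and hyperparameters, chosen only for illustration.
X = np.array([[1.0, 0.5], [-0.5, 1.0], [0.8, -0.3]])
y = np.array([0.7, -0.2, 0.9])
sigma, tau1, tau2 = 1.0, 1.0, 1.5  # noise scale and per-layer prior scales


def balancedness_gap(n_a=101, n_b=151):
    """Return tau1^-2 E||a||^2 - tau2^-2 E[b^2] under the toy posterior.

    The posterior over (a1, a2, b) is evaluated on a uniform grid that
    spans five prior standard deviations in every direction, so the
    expectations reduce to weighted sums over grid points.
    """
    a_axis = np.linspace(-5 * tau1, 5 * tau1, n_a)
    b_axis = np.linspace(-5 * tau2, 5 * tau2, n_b)
    a1, a2, b = np.meshgrid(a_axis, a_axis, b_axis, indexing="ij")

    # Log prior: independent zero-mean Gaussians per layer.
    log_post = -(a1**2 + a2**2) / (2 * tau1**2) - b**2 / (2 * tau2**2)

    # Gaussian log likelihood; it depends on (a, b) only through b * a,
    # so it is invariant under (a, b) -> (c * a, b / c) for c > 0.
    for (x1, x2), yi in zip(X, y):
        pred = b * (a1 * x1 + a2 * x2)
        log_post -= (yi - pred) ** 2 / (2 * sigma**2)

    # Normalized grid weights (uniform cell volume cancels out).
    w = np.exp(log_post - log_post.max())
    w /= w.sum()

    first_layer = np.sum(w * (a1**2 + a2**2)) / tau1**2
    second_layer = np.sum(w * b**2) / tau2**2
    return first_layer - second_layer
```

Calling `balancedness_gap()` evaluates the posterior on roughly 1.5 million grid points. Only the scale invariance of the likelihood enters the identity, so the result should stay near 1 when the data or prior scales are changed, as long as the grid still resolves the posterior.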

“$\tau_{l}^{-2}\mathbb{E}_{\pi}[\|\mathbf{W}_{l}\|_{F}^{2}] - \tau_{l+1}^{-2}\mathbb{E}_{\pi}[\|\mathbf{W}_{l+1}\|_{F}^{2}] = d_{l} - d_{l+1}$”
Kobialka et al. · Theorem 2
“the coefficients $\bm{\rho}^{(\varpi)} \in \Delta^{k_{\varpi}-1}$ ... follows a symmetric Dirichlet distribution”
Kobialka et al. · Section 4.1
“Sampled marginal bivariate posterior densities ... for an underparametrized (left) and an overparametrized model (right) ... using 10 million posterior samples”
Kobialka et al. · Figure 1 caption
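The weight-reallocation picture quoted above can be illustrated in a few lines. The following is a sketch of one plausible reading, not the construction used in Theorem 1: a single ReLU neuron is spread over k redundant copies, with copy j receiving a factor sqrt(rho_j) on both its incoming and outgoing weights, where rho lies on the probability simplex. The helper names and the Dirichlet concentration alpha = 1 are assumptions made for illustration; this review does not report the concentration parameter.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def split_neuron(w_in, w_out, rho):
    """Spread one ReLU neuron over len(rho) copies (hypothetical helper).

    Copy j gets sqrt(rho[j]) * w_in as incoming weights and
    sqrt(rho[j]) * w_out as outgoing weights.
    """
    s = np.sqrt(rho)[:, None]
    return s * w_in[None, :], s * w_out[None, :]  # (k, d_in), (k, d_out)


def forward(W_in, W_out, x):
    """Output of a one-hidden-layer ReLU network without biases."""
    return relu(W_in @ x) @ W_out


rng = np.random.default_rng(0)
w_in, w_out = rng.normal(size=3), rng.normal(size=2)
x = rng.normal(size=3)

k, alpha = 4, 1.0  # number of copies; alpha is an assumed concentration
rho = rng.dirichlet(alpha * np.ones(k))  # reallocation coefficients

W_in, W_out = split_neuron(w_in, w_out, rho)

single = relu(w_in @ x) * w_out       # output of the original neuron
split = forward(W_in, W_out, x)       # output of the k redundant copies

sq_norm_single = w_in @ w_in + w_out @ w_out
sq_norm_split = (W_in**2).sum() + (W_out**2).sum()
```

Because sqrt(rho_j) is non-negative, it passes through the ReLU, so `split` equals `single` and the total squared norm is unchanged for every rho. A zero-mean Gaussian prior therefore assigns all such splits the same posterior density, which is what makes the set an equal-probability manifold in this reading.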
Main concerns

The theoretical analysis relies on several restrictive assumptions that limit practical applicability. The tube-conditioning framework (Definition 2) requires manifold regularity, volume factorization, and weak limit existence (Assumptions 2-4 in Appendix B), which are stated but not verified for specific ReLU architectures beyond the shallow case. The restriction to fully-connected networks sets aside the challenges posed by skip connections, batch normalization, or attention mechanisms, where symmetries interact differently. While the authors empirically test CNNs and note similar patterns, the theoretical guarantees do not extend to these architectures. Additionally, the analysis focuses on local geometry around minimum-norm manifolds ($\mathcal{M}$) and does not characterize the full marginal posterior shape, so it may overstate the extent of "prior conformity" in practical settings where samples do not concentrate tightly on $\mathcal{M}$.

“Assumption 2 (Manifold regularity). For a fixed assignment $\varsigma$, the minimum-norm manifold $\mathcal{M}_{\varsigma}$ is a closed embedded $C^{2}$ manifold”
Kobialka et al. · Appendix B
“we emphasize, however, that this analysis is local to the neighborhood of $\mathcal{M}$ and does not by itself determine the shape of the full marginal posterior”
Kobialka et al. · Section 4.1
Evidence and comparison

The comparison to variational inference literature is fair and well-contextualized. The paper correctly contrasts its findings with Trippe and Turner (2017), noting that while variational methods may induce conditional independence between parameters and data in overparametrized regimes, sampling-based inference preserves complex correlation structures (Figure 8). The connection to optimization literature—particularly Du et al. (2018) on algorithmic balancedness and Kim et al. (2025) on loss landscape convexity—is appropriately credited and extended. However, the paper could engage more deeply with infinite-width BNN theory (Neural Tangent Kernel regime), which also studies overparametrization but reaches different conclusions about posterior behavior. The experimental evidence is strong for the studied regimes but limited primarily to UCI tabular datasets; the single CIFAR-10 result in Appendix E is insufficient to validate claims about scalability to modern deep learning settings.

“In sampling-based inference, however, a zero-mean weight distribution does not imply a degenerate model, as it does not concentrate on a single solution as variational methods typically do”
Kobialka et al. · Section 5.3
“work such as Trippe and Turner (2017) expressed concern that a potential pathology for variational approximations of overparametrized BNN posteriors can paradoxically lead to worse performance”
Kobialka et al. · Section 2
Reproducibility

The paper demonstrates strong reproducibility practices. The authors provide detailed experimental protocols in Appendix D, including dataset specifications (UCI repositories), hyperparameter settings, and performance metrics (LPPD). Software dependencies are clearly specified (JAX, BlackJAX, NumPyro), and the code is available via anonymous links during review. The compute infrastructure (NVIDIA A100 GPUs) is described, and sampling budgets (up to 10 million samples, 10,000 independent chains) are explicitly stated. The checklist confirms that all training details, data splits, and random seed handling are documented. The only barrier to full reproduction is computational: replicating the 10-million-sample experiments requires significant GPU resources, though the paper provides trajectory analyses suggesting that 10 to 20 chains are enough to converge for practical purposes.
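For readers reproducing the chain-count analysis, a minimal sketch of the metric follows. It assumes the standard definition of the log pointwise predictive density (the log of the sample-averaged likelihood at each test point), averaged over test points. Whether the paper averages or sums over test points is not stated in this review, and the function names and array layout are hypothetical.

```python
import numpy as np


def lppd(log_lik):
    """Mean log pointwise predictive density.

    log_lik has shape (S, N): log p(y_i | theta_s) for S posterior
    samples and N test points. The average over samples is taken in
    probability space using a numerically stable log-mean-exp.
    """
    n_samples = log_lik.shape[0]
    m = log_lik.max(axis=0)
    log_mean = m + np.log(np.exp(log_lik - m).sum(axis=0)) - np.log(n_samples)
    return log_mean.mean()


def cumulative_lppd(log_lik_by_chain):
    """LPPD obtained by pooling the first c chains, for c = 1..C.

    log_lik_by_chain has shape (C, S, N): C chains, S samples per chain,
    N test points.
    """
    n_chains, _, n_points = log_lik_by_chain.shape
    return np.array([
        lppd(log_lik_by_chain[:c].reshape(-1, n_points))
        for c in range(1, n_chains + 1)
    ])
```

Plotting the output of `cumulative_lppd` against the number of chains gives a curve of the kind described in the Figure 10 caption; a plateau after a modest number of chains would be consistent with the convergence claim above.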

“Software ... JAX ... BlackJAX ... NumPyro. Computing Infrastructure ... NVIDIA A100 GPUs”
Kobialka et al. · Appendix D.1
“Cumulative LPPD increase over the number of chains ... for different architectures”
Kobialka et al. · Figure 10 caption
“The code, data, and instructions needed to reproduce the main experimental results ... Yes, in Appendix D”
Kobialka et al. · Reproducibility Checklist
Abstract

Bayesian neural network (BNN) posteriors are often considered impractical for inference, as symmetries fragment them, non-identifiabilities inflate dimensionality, and weight-space priors are seen as meaningless. In this work, we study how overparametrization and priors together reshape BNN posteriors and derive implications allowing us to better understand their interplay. We show that redundancy introduces three key phenomena that fundamentally reshape the posterior geometry: balancedness, weight reallocation on equal-probability manifolds, and prior conformity. We validate our findings through extensive experiments with posterior sampling budgets that far exceed those of earlier works, and demonstrate how overparametrization induces structured, prior-aligned weight posterior distributions.
