This paper addresses the static nature of Large Language Models that prevents dynamic adaptation to streaming contexts. The authors introduce In-Place Test-Time Training, which repurposes existing MLP down-projection matrices as “fast weights” that update during inference via a Next-Token Prediction (NTP)-aligned objective. Unlike prior TTT methods that require architectural changes, this approach enables “drop-in” enhancement of pretrained models without retraining from scratch.
Neyman–Pearson multiclass classification (NPMC) handles asymmetric error costs by constraining class-specific misclassification rates, yet existing methods fail when training labels are corrupted. This paper proposes an empirical likelihood (EL) framework that recovers true class proportions and posterior probabilities from noisy labels via an exponential tilting density ratio model, enabling valid error control without prior knowledge of the noise transition matrix. The approach combines semiparametric estimation theory with a practical EM algorithm, yielding classifiers that satisfy NP oracle inequalities asymptotically.
This paper addresses the critical challenge of detecting occult hemorrhage (internal bleeding) in intensive care units, where delayed diagnosis leads to preventable physiological shock and death. The authors develop a Bayesian regime switching model (RSM) that tracks five latent physiological states (including stable, hemorrhage, and recovery) using longitudinal vital signs and labs (heart rate, mean arterial pressure (MAP), hemoglobin, lactate) and medication history. Applied to 33,924 Mayo Clinic ICU encounters, the model aims to provide interpretable, probabilistic early warnings that outperform standard vital sign monitoring by accounting for autoregressive trends and pre-admission physiological changes.
Job recommender systems deployed by public employment services are typically optimized for predictive metrics like clicks, applications, or hires rather than job seeker welfare. This paper develops a structural job-search model where vacancy value depends on utility $U$ and hiring probability $p$, deriving a welfare-optimal ranking based on an expected-surplus index $\Gamma(p, U) = p \sigma \log(1 + e^{\Delta(p,U)/\sigma})$. Through two randomized field experiments with the French public employment service, the authors demonstrate that algorithms approximating this theoretical benchmark substantially outperform existing approaches, while formalizing the "inversion problem" where behavior-based rankings diverge from welfare-maximizing ones.
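To make the expected-surplus index above concrete, here is a minimal Python sketch that evaluates $\Gamma = p\,\sigma \log(1 + e^{\Delta/\sigma})$ and sorts vacancies by it. The summary does not define $\Delta(p,U)$, so the sketch takes `delta` as a precomputed input; the function names and the example numbers are hypothetical, and this is not claimed to be the authors' implementation.

```python
import math
from typing import List, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def expected_surplus_index(p: float, delta: float, sigma: float) -> float:
    """Gamma = p * sigma * log(1 + exp(delta / sigma)).

    `delta` stands in for Delta(p, U); its functional form is not given
    in the summary, so the caller must supply it (assumption).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    return p * sigma * softplus(delta / sigma)


def rank_by_index(vacancies: List[Tuple[str, float, float]], sigma: float) -> List[str]:
    """Sort (name, p, delta) triples by decreasing index value."""
    scored = sorted(
        vacancies,
        key=lambda v: expected_surplus_index(v[1], v[2], sigma),
        reverse=True,
    )
    return [name for name, _, _ in scored]


# Hypothetical example: an attractive but unlikely job versus a modest, likely one.
ranking = rank_by_index([("long_shot", 0.1, 5.0), ("safe_bet", 0.9, 1.0)], sigma=1.0)
```

In this toy example the high-probability vacancy ranks first even though its `delta` is smaller, because the index multiplies the surplus term by the hiring probability.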
This paper proposes a fundamental shift in evaluating probabilistic time series forecasting by replacing passive observation of historical trajectories with an interventionist "noise titration" protocol. By injecting calibrated Gaussian noise into known chaotic and stochastic dynamical systems, the authors transform forecasting into an exact distributional inference task where statistical calibration can be verified against ground-truth likelihoods. They extend the Fern architecture to output full covariance structures via SPD cone parameterization, then use the framework to expose severe failures in zero-shot foundation models under non-stationarity.
Traditional concentration indices like the Herfindahl-Hirschman Index ($HHI = \sum_i w_i^2$) measure weight dispersion but ignore network topology, meaning two systems with identical weight distributions can exhibit different effective concentration. This paper introduces the Network Concentration Index (NCI), defined as $\psi(w,A) = \frac{w^{\top}Aw}{1-\sum_i w_i^2}$, which measures the fraction of potential weighted interconnection realized along observed network links. The framework unifies weight distributions with interaction structures, providing a theoretically grounded tool for assessing systemic risk in financial networks, supply chains, and economic production systems.
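The contrast between HHI and NCI can be checked on a three-node example. The sketch below is a direct transcription of the two formulas, under the assumptions that the weights sum to one and that $A$ is a symmetric 0/1 adjacency matrix with zero diagonal (in which case $1-\sum_i w_i^2 = \sum_{i \neq j} w_i w_j$ is the total potential pairwise weight). The function names and the example networks are illustrative only.

```python
from typing import Sequence


def hhi(w: Sequence[float]) -> float:
    """Herfindahl-Hirschman Index: sum of squared weights."""
    return sum(x * x for x in w)


def nci(w: Sequence[float], A: Sequence[Sequence[float]]) -> float:
    """psi(w, A) = (w^T A w) / (1 - sum_i w_i^2)."""
    n = len(w)
    realized = sum(w[i] * A[i][j] * w[j] for i in range(n) for j in range(n))
    potential = 1.0 - hhi(w)
    if potential <= 0.0:
        raise ValueError("NCI is undefined when all weight sits on one node")
    return realized / potential


weights = [0.5, 0.25, 0.25]

# Two hypothetical networks over the same weights.
star_on_heavy_node = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # node 0 linked to both others
chain_through_light = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # node 1 in the middle

same_hhi = hhi(weights)
nci_star = nci(weights, star_on_heavy_node)
nci_chain = nci(weights, chain_through_light)
```

Both networks share the same HHI, yet the star centred on the heaviest node realizes a larger share of the potential interconnection than the chain does, which is the distinction the index is meant to capture.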
This paper develops a neural operator framework for approximating mappings defined on constrained Wasserstein spaces $\mathcal{M}_\lambda$, consisting of probability measures on $I \times \mathbb{R}^d$ with prescribed marginal $\lambda$ on the label space $I$. The core contribution is the DeepONetCyl architecture, which combines cylindrical moment approximations $\Phi_J(\mu) = (\langle \varphi_1, \mu \rangle, \ldots, \langle \varphi_J, \mu \rangle)$ with a DeepONet-type branch–trunk structure to preserve the marginal constraint. This enables learning of heterogeneous (non-exchangeable) mean-field control problems where agent interactions depend on labels, extending prior neural methods beyond the exchangeable case.
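As a small illustration of the cylindrical moment map $\Phi_J$ above, the following sketch evaluates it for an empirical measure, where each pairing $\langle \varphi_j, \mu \rangle$ reduces to a sample average. It covers only this feature map, not the DeepONetCyl network itself; the test functions, the sample data, and the choice $d = 1$ are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]             # (label u in I, state x in R), d = 1 for brevity
TestFunction = Callable[[float, float], float]


def cylindrical_moments(samples: Sequence[Sample],
                        test_functions: Sequence[TestFunction]) -> List[float]:
    """Phi_J(mu) for the empirical measure mu = (1/n) * sum_i delta_{(u_i, x_i)}.

    Each entry is <phi_j, mu>, i.e. the average of phi_j over the samples.
    """
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    return [sum(phi(u, x) for u, x in samples) / n for phi in test_functions]


# Hypothetical test functions that depend jointly on label and state.
phis = [
    lambda u, x: 1.0,      # total mass
    lambda u, x: x,        # mean state
    lambda u, x: u * x,    # label-weighted mean state
]

samples = [(0.0, 1.0), (0.5, 3.0), (1.0, 5.0)]
features = cylindrical_moments(samples, phis)
```

The resulting vector has fixed length $J$ regardless of how many samples represent the measure, so it can serve as a finite-dimensional input to a network.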
Multifidelity surrogate modeling aims to leverage cheap low-fidelity simulations to improve predictions of expensive high-fidelity models when training data is scarce. This paper proposes MAGPI, a Gaussian process regression method that augments the high-fidelity input space with features derived from recursively trained low-fidelity surrogate models. The approach unifies desirable properties from cokriging and autoregressive estimators while allowing non-GP models for low-fidelity levels, achieving superior accuracy and computational efficiency.
Bayesian neural networks (BNNs) suffer from fragmented, high-dimensional posteriors due to weight-space symmetries, raising doubts about the practicality of sampling-based inference. This paper demonstrates that overparametrization—using more hidden units than necessary—actually transforms the posterior geometry in beneficial ways. The authors identify three key phenomena induced by redundancy: balancedness (norm equalization across layers), weight reallocation on equal-probability manifolds (following Dirichlet distributions), and prior conformity (marginals aligning with zero-mean Gaussian priors). Through theory for ReLU networks and extensive experiments with up to 10 million posterior samples, the work explains why recent sampling methods succeed and provides a principled foundation for understanding weight priors in overparametrized regimes.
While most bias mitigation research targets binary classification, multi-class fairness remains under-explored. This paper proposes Generalised Exponentiated Gradient (GEG), an in-processing method that extends the Exponentiated Gradient framework to multi-class settings and enables simultaneous optimization of multiple fairness constraints via positive-label moment conditions. Evaluated on ten datasets against six baselines, GEG achieves fairness improvements up to 92% with moderate accuracy trade-offs, filling a critical gap in fair machine learning toolboxes.
This paper tackles the memory explosion problem in high-rank DoRA fine-tuning. At $d_{in}=8192$ and rank $r=384$, computing the row-wise norm $\|\mathbf{W}+s\mathbf{B}\mathbf{A}\|_{\text{row}}$ by materializing the dense matrix consumes ~512 MB per module, which is prohibitive for large models with hundreds of adapted layers. The authors propose a factored norm decomposition that needs only $\mathcal{O}(d_{out}r+r^2)$ intermediates, together with fused Triton kernels that collapse the composition into a single pass. On 8–32B vision-language models, this yields 1.5–2.0× speedups and up to 77 GB VRAM savings without numerical drift.
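One way to see why intermediates of size $d_{out} \times r$ and $r \times r$ can suffice is to expand the squared norm of row $i$: $\|\mathbf{w}_i + s\,\mathbf{b}_i\mathbf{A}\|^2 = \|\mathbf{w}_i\|^2 + 2s\,\mathbf{b}_i(\mathbf{A}\mathbf{w}_i^{\top}) + s^2\,\mathbf{b}_i(\mathbf{A}\mathbf{A}^{\top})\mathbf{b}_i^{\top}$. The sketch below checks this identity in plain Python on a tiny example. It is an assumption that this expansion matches the paper's decomposition; the summary only states the intermediate sizes, and the fused Triton kernels are not modelled here.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul_nt(X: Matrix, Y: Matrix) -> Matrix:
    """Return X @ Y^T for row-major nested lists."""
    return [[sum(a * b for a, b in zip(xr, yr)) for yr in Y] for xr in X]


def row_norms_materialized(W: Matrix, B: Matrix, A: Matrix, s: float) -> List[float]:
    """Reference: build each full row of W + s*B*A, then take its norm."""
    norms = []
    for w_row, b_row in zip(W, B):
        full = [w + s * sum(b * A[k][j] for k, b in enumerate(b_row))
                for j, w in enumerate(w_row)]
        norms.append(math.sqrt(sum(v * v for v in full)))
    return norms


def row_norms_factored(W: Matrix, B: Matrix, A: Matrix, s: float) -> List[float]:
    """Same norms without forming the d_out x d_in product B*A."""
    C = matmul_nt(W, A)  # d_out x r, entries w_i . a_k
    G = matmul_nt(A, A)  # r x r Gram matrix A A^T
    norms = []
    for w_row, b_row, c_row in zip(W, B, C):
        base = sum(w * w for w in w_row)
        cross = sum(b * c for b, c in zip(b_row, c_row))
        Gb = [sum(g * b for g, b in zip(g_row, b_row)) for g_row in G]
        quad = sum(b * v for b, v in zip(b_row, Gb))
        # Clamp at zero to guard against tiny negative rounding errors.
        norms.append(math.sqrt(max(base + 2.0 * s * cross + s * s * quad, 0.0)))
    return norms


# Tiny hypothetical shapes: d_out = 2, d_in = 3, r = 2.
W = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
B = [[1.0, 0.0], [2.0, -1.0]]
A = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
reference = row_norms_materialized(W, B, A, s=0.5)
factored = row_norms_factored(W, B, A, s=0.5)
```

The factored version still reads every entry of $\mathbf{W}$ once, but the only extra storage is `C` and `G`, which do not scale with $d_{in}$.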
This paper addresses conditional distribution estimation for regression by proposing a non-parametric binning approach. Observations sorted by a one-dimensional covariate are partitioned into contiguous bins via dynamic programming, minimizing a closed-form leave-one-out CRPS cost function. The method produces conformal prediction sets with finite-sample marginal coverage guarantees and connects to Venn predictors, offering substantially narrower intervals than standard split-conformal methods on heteroscedastic and bimodal benchmarks.
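The binning step described above can be sketched as a standard optimal-partition dynamic program. The code below assumes the responses are already sorted by the covariate and uses a naive cubic-time evaluation of the leave-one-out CRPS of each bin's empirical distribution, via $\mathrm{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$, rather than the closed form the paper refers to. The function names and the toy data are hypothetical, and the conformal step is not included.

```python
import math
from typing import Callable, List, Tuple


def loo_crps_cost(ys: List[float]) -> float:
    """Sum over points of the CRPS of the empirical law of the *other* points.

    Naive evaluation for illustration; bins with fewer than two points
    have no leave-one-out forecast and get infinite cost.
    """
    m = len(ys)
    if m < 2:
        return math.inf
    k = m - 1
    total = 0.0
    for i in range(m):
        others = ys[:i] + ys[i + 1:]
        to_obs = sum(abs(v - ys[i]) for v in others) / k
        internal = sum(abs(a - b) for a in others for b in others) / (2.0 * k * k)
        total += to_obs - internal
    return total


def optimal_bins(ys: List[float],
                 cost: Callable[[List[float]], float] = loo_crps_cost
                 ) -> List[Tuple[int, int]]:
    """Partition ys into contiguous bins minimizing the summed bin cost.

    Returns half-open index ranges (start, end).
    """
    n = len(ys)
    best = [0.0] + [math.inf] * n   # best[e] = minimal cost of ys[:e]
    cut = [0] * (n + 1)             # cut[e] = start of the last bin in that optimum
    for end in range(1, n + 1):
        for start in range(end):
            c = best[start] + cost(ys[start:end])
            if c < best[end]:
                best[end], cut[end] = c, start
    bins, end = [], n
    while end > 0:
        bins.append((cut[end], end))
        end = cut[end]
    return bins[::-1]


# Toy responses with an obvious regime change halfway through.
y_sorted_by_x = [0.0, 0.1, 0.2, 0.1, 10.0, 10.1, 10.2, 10.1]
bins = optimal_bins(y_sorted_by_x)
```

On this toy input the program places a single cut at the regime change, because any bin straddling both regimes pays a CRPS on the order of the jump.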
The paper addresses functional Gaussian Process regression on compact Riemannian manifolds, proposing a time-adaptive Empirical Bayes framework that exploits invariance of covariance kernels under isometries and spectral decomposition via Laplace–Beltrami eigenfunctions. The core idea is to work in the time-varying angular spectral domain, truncating the infinite-dimensional expansion at a level that grows with the functional sample size (typically logarithmically) to balance computational cost with approximation accuracy. This matters because it extends GP regression to infinite-dimensional functional settings on non-Euclidean domains while attempting to maintain computational tractability through spectral truncation schemes.
Time-dependent reliability analysis for nonlinear dynamical systems under stochastic loading is computationally prohibitive with Monte Carlo simulation. CoNBONet proposes a surrogate combining DeepONet operator learning with Variable Spiking Neurons (VSNs) for sparse computation, Bayesian variational inference for uncertainty, and split conformal prediction for calibration. The goal is fast, energy-efficient inference with theoretical guarantees on reliability estimates.
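The calibration component mentioned above, split conformal prediction, is simple enough to sketch on its own. The code below is the generic textbook procedure with absolute-residual scores, not CoNBONet's specific pipeline; the function names and the toy calibration data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """The ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Returns +inf when the calibration set is too small for the requested level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def split_conformal_interval(prediction: float,
                             calib_predictions: Sequence[float],
                             calib_targets: Sequence[float],
                             alpha: float) -> Tuple[float, float]:
    """Symmetric interval around a point prediction from held-out residuals."""
    scores = [abs(y - f) for f, y in zip(calib_predictions, calib_targets)]
    q = conformal_quantile(scores, alpha)
    return prediction - q, prediction + q


# Hypothetical calibration set: surrogate outputs versus reference values.
interval = split_conformal_interval(
    prediction=10.0,
    calib_predictions=[0.0, 0.0, 0.0],
    calib_targets=[1.0, -2.0, 3.0],
    alpha=0.5,
)
```

The coverage guarantee of this procedure is marginal and relies on exchangeability between calibration and test points; it holds regardless of how accurate the underlying surrogate is.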
The paper proposes a non-parametric classifier based on the Nadaraya-Watson (NW) estimator that achieves linear $O(n)$ computational complexity while providing frequentist uncertainty bounds on predictions. By reformulating kernel regression for multi-class classification and deriving error bounds under Lipschitz continuity or separability assumptions, the authors bridge the gap between efficient "black box" methods and computationally expensive approaches like Gaussian Processes that offer formal guarantees. The method achieves $>96\%$ accuracy on MIT-BIH ECG data with uncertainty intervals that flag low-confidence predictions, making it suitable for safety-critical applications.
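A minimal sketch of a Nadaraya-Watson classifier shows where the linear cost comes from: each query makes one pass over the $n$ training points. The Gaussian kernel, the function name, and the toy data are assumptions for illustration, and the paper's uncertainty bounds are not reproduced here.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def nw_class_probabilities(x: Sequence[float],
                           train_x: Sequence[Sequence[float]],
                           train_y: Sequence[Hashable],
                           bandwidth: float) -> Dict[Hashable, float]:
    """Kernel-weighted class frequencies at query point x.

    One pass over the training set, so the cost per query is O(n).
    A Gaussian kernel is assumed; the summary does not fix the kernel.
    """
    weights: Dict[Hashable, float] = defaultdict(float)
    for xi, yi in zip(train_x, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights[yi] += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase the bandwidth")
    return {label: w / total for label, w in weights.items()}


# Hypothetical two-class toy data.
probs = nw_class_probabilities(
    x=(0.0, 0.5),
    train_x=[(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)],
    train_y=["normal", "normal", "arrhythmia"],
    bandwidth=1.0,
)
predicted = max(probs, key=probs.get)
```

The returned dictionary is a normalized vote, so a query far from every class or equidistant between classes yields probabilities near uniform, which is the kind of case an uncertainty bound would flag.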
The paper tackles Constrained Online Convex Optimization with Memory (COCO-M), where both losses and constraints depend on a window of past decisions, capturing realistic scenarios like smart-grid budgets and battery health limits. The authors propose the first algorithms achieving sublinear regret and cumulative constraint violation (CCV) under adversarial, time-varying constraints, both with and without unreliable predictions of future gradients. This work bridges the gap between classical constrained OCO and practical memory-dependent control problems.
This paper proposes a training-free conditional diffusion model for Bayesian filtering in data assimilation. Instead of learning the score function via neural networks, the authors leverage kernel density estimation (KDE) to represent the joint distribution of states and measurements, yielding a closed-form expression for the score that enables analytical sampling from the posterior. The method targets nonlinear, non-Gaussian filtering problems where traditional ensemble Kalman filters (EnKF) make restrictive Gaussian approximations and particle filters suffer from weight degeneracy in small-ensemble regimes.
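The closed-form score that a kernel density estimate provides can be illustrated in one dimension. For a Gaussian KDE $\hat{p}(x) = \frac{1}{n}\sum_i \mathcal{N}(x; x_i, h^2)$, the score is $\nabla_x \log \hat{p}(x) = \sum_i w_i(x)\,(x_i - x)/h^2$ with softmax weights $w_i(x) \propto \exp(-(x - x_i)^2 / 2h^2)$. The sketch below implements only this unconditional, scalar case; the paper's joint state-measurement construction and its diffusion sampler are not shown, and the names are hypothetical.

```python
import math
from typing import Sequence


def kde_score(x: float, samples: Sequence[float], bandwidth: float) -> float:
    """Score d/dx log p_hat(x) of a 1-D Gaussian KDE, in closed form.

    Weights are computed with a max-shift so far-away queries do not underflow.
    """
    if not samples:
        raise ValueError("need at least one ensemble member")
    h2 = bandwidth ** 2
    logits = [-((x - xi) ** 2) / (2.0 * h2) for xi in samples]
    shift = max(logits)
    w = [math.exp(l - shift) for l in logits]
    z = sum(w)
    return sum(wi * (xi - x) for wi, xi in zip(w, samples)) / (z * h2)


# Hypothetical small ensemble.
ensemble = [-1.0, 1.0]
score_at_center = kde_score(0.0, ensemble, bandwidth=1.0)
score_single = kde_score(0.0, [2.0], bandwidth=1.0)
```

No network is trained at any point: the score is an explicit function of the ensemble members and the bandwidth, which is what makes such an approach training-free.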
Diffusion language models (DLMs) enable parallel token generation, but their efficiency depends critically on the decoding strategy that determines which tokens to unmask and when. This paper investigates confidence-based decoding—specifically an entropy sum strategy that adaptively batches tokens until cumulative prediction uncertainty exceeds a threshold—and proves it achieves $\varepsilon$-accurate sampling in KL divergence with expected iteration complexity $\widetilde{O}(H(X_0)/\varepsilon)$. When the data distribution has low entropy ($H(X_0) \ll L$), this yields sublinear complexity in sequence length, providing the first theoretical foundation for why confidence-based methods accelerate sampling without sacrificing fidelity.
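A rough sketch of an entropy-sum selection rule follows. It assumes the model exposes a predicted distribution for each masked position, visits positions from lowest to highest entropy, and stops before the running sum would exceed the threshold while always unmasking at least one token. The ordering and the exact stopping convention are assumptions; the summary only says tokens are batched until cumulative uncertainty exceeds a threshold.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_positions(masked_dists: Dict[int, Sequence[float]],
                     threshold: float) -> List[int]:
    """Choose which masked positions to decode in the current iteration.

    Positions are taken in order of increasing entropy (assumed) until adding
    the next one would push the cumulative entropy above `threshold`.
    """
    ranked = sorted(masked_dists, key=lambda pos: entropy(masked_dists[pos]))
    chosen: List[int] = []
    total = 0.0
    for pos in ranked:
        h = entropy(masked_dists[pos])
        if chosen and total + h > threshold:
            break
        chosen.append(pos)
        total += h
    return chosen


# Hypothetical per-position predictions over a tiny vocabulary.
dists = {
    0: [1.0, 0.0],                # certain
    1: [0.5, 0.5],                # coin flip
    2: [0.9, 0.1],                # fairly confident
    3: [0.25, 0.25, 0.25, 0.25],  # uniform over four tokens
}
batch = select_positions(dists, threshold=0.5)
```

With a low-entropy sequence many positions fit under the threshold at once, so fewer iterations are needed, which is the intuition behind a complexity that scales with $H(X_0)$ rather than with the sequence length.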
This paper extends stochastic approximation (SA) theory to non-Markovian driving noise that is also non-ergodic, establishing that the ergodic decomposition of the original process corresponds to a Doeblin decomposition of an equivalent Markov chain. The core insight is that iterates retain memory of the distant past through the tail $\sigma$-field at $-\infty$, offering a theoretical lens on how learning algorithms might encode long-term dependencies. The author proposes this framework as a paradigm for understanding transformer attention mechanisms and continual learning, where the entire history influences current updates.
This paper studies how batch size and sequence length should scale with the total token budget in stochastic conditional gradient methods for LLM training. Under a $\mu$-Kurdyka–Łojasiewicz condition, the authors derive a BST (Batch-Sequence-Token) scaling rule $BS \asymp T^{2/3}$ that predicts three distinct regimes: noise-dominated, batch-independent optimal, and iteration-starved. The theory yields actionable guidelines for adaptive batch size scheduling and is validated on NanoGPT models up to 1B parameters.
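Because $BS \asymp T^{2/3}$ fixes only the exponent, a practical use of the rule is to extrapolate from a reference run. The sketch below assumes $BS$ denotes batch size times sequence length (tokens per step) and $T$ the total token budget, as the name Batch-Sequence-Token suggests; the function name and the reference numbers are hypothetical.

```python
def scaled_tokens_per_step(ref_tokens_per_step: float,
                           ref_budget: float,
                           new_budget: float) -> float:
    """Extrapolate B*S to a new token budget using B*S proportional to T**(2/3).

    The proportionality constant is not given in the summary, so it is
    inferred from a reference configuration (assumption).
    """
    if min(ref_tokens_per_step, ref_budget, new_budget) <= 0.0:
        raise ValueError("all arguments must be positive")
    return ref_tokens_per_step * (new_budget / ref_budget) ** (2.0 / 3.0)


# Hypothetical reference run: 1,000 tokens per step at a budget of 1e9 tokens.
# An 8x larger budget calls for a 4x larger step under the rule (8**(2/3) = 4).
suggested = scaled_tokens_per_step(1_000.0, 1e9, 8e9)
```

Since the number of steps is $T / (BS)$, the same rule implies that the step count grows like $T^{1/3}$, so a larger budget is split between bigger steps and more of them.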