Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
27 papers in stat.ML
Trending mixes fresh papers with community signal.
0
cs.LG cs.AI cs.CL · Guhao Feng, Shengjie Luo, Kai Hua et al. · Apr 7, 2026

This paper addresses the static nature of Large Language Models that prevents dynamic adaptation to streaming contexts. The authors introduce In-Place Test-Time Training, which repurposes existing MLP down-projection matrices as “fast weights” that update during inference via a Next-Token Prediction (NTP)-aligned objective. Unlike prior TTT methods that require architectural changes, this approach enables “drop-in” enhancement of pretrained models without retraining from scratch.

The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with Test-Time Training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically-grounded objective explicitly aligned with the Next-Token-Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
0
stat.ME stat.ML · Qiong Zhang, Qinglong Tian, Pengfei Li · Mar 23, 2026

Neyman–Pearson multiclass classification (NPMC) handles asymmetric error costs by constraining class-specific misclassification rates, yet existing methods fail when training labels are corrupted. This paper proposes an empirical likelihood (EL) framework that recovers true class proportions and posterior probabilities from noisy labels via an exponential tilting density ratio model, enabling valid error control without prior knowledge of the noise transition matrix. The approach combines semiparametric estimation theory with a practical EM algorithm, yielding classifiers that satisfy NP oracle inequalities asymptotically.

In many classification problems, the costs of misclassifying observations from different classes can be highly unequal. The Neyman-Pearson multiclass classification (NPMC) framework addresses this issue by minimizing a weighted misclassification risk while imposing upper bounds on class-specific error probabilities. Existing NPMC methods typically assume that training labels are correctly observed. In practice, however, labels are often corrupted by measurement or annotation errors, and the effect of such label noise on NPMC procedures remains largely unexplored. We study the NPMC problem when only noisy labels are available in the training data. We propose an empirical likelihood (EL)-based method that relates the distributions of noisy and true labels through an exponential tilting density ratio model. The resulting maximum EL estimators recover the class proportions and posterior probabilities of the clean labels required for error control. We establish consistency, asymptotic normality, and optimal convergence rates for these estimators. Under mild conditions, the resulting classifier satisfies NP oracle inequalities with respect to the true labels asymptotically. An expectation-maximization algorithm computes the maximum EL estimators. Simulations show that the proposed method performs comparably to the oracle classifier under clean labels and substantially improves over procedures that ignore label noise.
0
stat.AP stat.ME stat.ML · Emmett B. Kendall, Jonathan P. Williams, Curtis B. Storlie et al. · Mar 23, 2026

This paper addresses the critical challenge of detecting occult hemorrhage (internal bleeding) in intensive care units, where delayed diagnosis leads to preventable physiological shock and death. The authors develop a Bayesian regime switching model (RSM) that tracks five latent physiological states—including stable, hemorrhage, and recovery—using longitudinal vital signs (heart rate, MAP, hemoglobin, lactate) and medication history. Applied to 33,924 Mayo Clinic ICU encounters, the model aims to provide interpretable, probabilistic early warnings that outperform standard vital sign monitoring by accounting for autoregressive trends and pre-admission physiological changes.

Detection of occult hemorrhage (i.e., internal bleeding) in patients in intensive care units (ICUs) can pose significant challenges for critical care workers. Because blood loss may not always be clinically apparent, clinicians rely on monitoring vital signs for specific trends indicative of a hemorrhage event. The inherent difficulties of diagnosing such an event can lead to late intervention by clinicians, which can have catastrophic consequences. Therefore, a methodology for early detection of hemorrhage has wide utility. We develop a Bayesian regime switching model (RSM) that analyzes trends in patients' vitals and labs to provide a probabilistic assessment of the underlying physiological state that a patient is in at any given time. This article is motivated by a comprehensive dataset we curated from Mayo Clinic of 33,924 real ICU patient encounters. Longitudinal response measurements are modeled as a vector autoregressive process conditional on all latent states up to the current time point, and the latent states follow a Markov process. We present a novel Bayesian sampling routine to learn the posterior probability distribution of the latent physiological states, and we develop an approach to account for pre-ICU-admission physiological changes. A simulation and real case study illustrate the effectiveness of our approach.
0
econ.EM stat.ML · Guillaume Bied, Philippe Caillou, Bruno Crépon et al. · Mar 23, 2026

Job recommender systems deployed by public employment services are typically optimized for predictive metrics like clicks, applications, or hires rather than job seeker welfare. This paper develops a structural job-search model where vacancy value depends on utility $U$ and hiring probability $p$, deriving a welfare-optimal ranking based on an expected-surplus index $\Gamma(p, U) = p \sigma \log(1 + e^{\Delta(p,U)/\sigma})$. Through two randomized field experiments with the French public employment service, the authors demonstrate that algorithms approximating this theoretical benchmark substantially outperform existing approaches, while formalizing the "inversion problem" where behavior-based rankings diverge from welfare-maximizing ones.

Recommendation systems (RSs) are increasingly used to guide job seekers on online platforms, yet the algorithms currently deployed are typically optimized for predictive objectives such as clicks, applications, or hires, rather than job seekers' welfare. We develop a job-search model with an application stage in which the value of a vacancy depends on two dimensions: the utility it delivers to the worker and the probability that an application succeeds. The model implies that welfare-optimal RSs rank vacancies by an expected-surplus index combining both, and shows why rankings based solely on utility, hiring probabilities, or observed application behavior are generically suboptimal, an instance of the inversion problem between behavior and welfare. We test these predictions and quantify their practical importance through two randomized field experiments conducted with the French public employment service. The first experiment, comparing existing algorithms and their combinations, provides behavioral evidence that both dimensions shape application decisions. Guided by the model and these results, the second experiment extends the comparison to an RS designed to approximate the welfare-optimal ranking. The experiments generate exogenous variation in the vacancies shown to job seekers, allowing us to estimate the model, validate its behavioral predictions, and construct a welfare metric. Algorithms informed by the model-implied optimal ranking substantially outperform existing approaches and perform close to the welfare-optimal benchmark. Our results show that embedding predictive tools within a simple job-search framework and combining it with experimental evidence yields recommendation rules with substantial welfare gains in practice.
0
cs.LG stat.ML · Qilin Wang · Mar 23, 2026

This paper proposes a fundamental shift in evaluating probabilistic time series forecasting by replacing passive observation of historical trajectories with an interventionist "noise titration" protocol. By injecting calibrated Gaussian noise into known chaotic and stochastic dynamical systems, the author transforms forecasting into an exact distributional inference task where statistical calibration can be verified against ground-truth likelihoods. The Fern architecture is extended to output full covariance structures via SPD cone parameterization, and the framework is then used to expose severe failures in zero-shot foundation models under non-stationarity.

Modern time series forecasting is evaluated almost entirely through passive observation of single historical trajectories, rendering claims about a model's robustness to non-stationarity fundamentally unfalsifiable. We propose a paradigm shift toward interventionist, exact-statistical benchmarking. By systematically titrating calibrated Gaussian observation noise into known chaotic and stochastic dynamical systems, we transform forecasting from a black-box sequence matching game into an exact distributional inference task. Because the underlying data-generating process and noise variance are mathematically explicit, evaluation can rely on exact negative log-likelihoods and calibrated distributional tests rather than heuristic approximations. To fully leverage this framework, we extend the Fern architecture into a probabilistic generative model that natively parameterizes the Symmetric Positive Definite (SPD) cone, outputting calibrated joint covariance structures without the computational bottleneck of generic Jacobian modeling. Under this rigorous evaluation, we find that state-of-the-art zero-shot foundation models behave consistently with the context-parroting mechanism, failing systematically under non-stationary regime shifts and elevated noise. In contrast, Fern explicitly captures the invariant measure and multivariate geometry of the underlying dynamics, maintaining structural fidelity and statistically sharp calibration precisely where massive sequence-matching models collapse.
0
stat.ML cs.LG · L. Riso, M.G. Zoia · Mar 23, 2026

Traditional concentration indices like the Herfindahl-Hirschman Index ($HHI = \sum_i w_i^2$) measure weight dispersion but ignore network topology, meaning two systems with identical weight distributions can exhibit different effective concentration. This paper introduces the Network Concentration Index (NCI), defined as $\psi(w,A) = \frac{w^{\top}Aw}{1-\sum_i w_i^2}$, which measures the fraction of potential weighted interconnection realized along observed network links. The framework unifies weight distributions with interaction structures, providing a theoretically grounded tool for assessing systemic risk in financial networks, supply chains, and economic production systems.

This paper develops a unified framework for measuring concentration in weighted systems embedded in networks of interactions. While traditional indices such as the Herfindahl-Hirschman Index capture dispersion in weights, they neglect the topology of relationships among the elements receiving those weights. To address this limitation, we introduce a family of topology-aware concentration indices that jointly account for weight distributions and network structure. At the core of the framework lies a baseline Network Concentration Index (NCI), defined as a normalized quadratic form that measures the fraction of potential weighted interconnection realized along observed network links. Building on this foundation, we construct a flexible class of extensions that modify either the interaction structure or the normalization benchmark, including weighted, density-adjusted, null-model, degree-constrained, transformed-data, and multi-layer variants. This family of indices preserves key properties such as normalization, invariance, and interpretability, while allowing concentration to be evaluated across different dimensions of dependence, including intensity, higher-order interactions, and extreme events. Theoretical results characterize the indices and establish their relationship with classical concentration and network measures. Empirical and simulation evidence demonstrate that systems with identical weight distributions may exhibit markedly different levels of structural concentration depending on network topology, highlighting the additional information captured by the proposed framework. The approach is broadly applicable to economic, financial, and complex systems in which weighted elements interact through networks.
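As a rough illustration of the baseline index quoted in the summary above, $\psi(w,A) = w^{\top}Aw / (1-\sum_i w_i^2)$, the sketch below compares two toy networks that share the same weights and therefore the same HHI. It assumes that the weights sum to one and that $A$ is a symmetric 0/1 adjacency matrix with a zero diagonal; these conventions, the function names, and the toy data are illustrative assumptions rather than details taken from the paper.

```python
def hhi(w):
    """Herfindahl-Hirschman Index: sum of squared weights."""
    return sum(x * x for x in w)


def nci(w, A):
    """Baseline Network Concentration Index: w^T A w / (1 - sum_i w_i^2).

    Assumes (for this sketch only) that the weights sum to 1 and that A is
    a symmetric 0/1 adjacency matrix with zeros on the diagonal.
    """
    n = len(w)
    quad = sum(w[i] * A[i][j] * w[j] for i in range(n) for j in range(n))
    return quad / (1.0 - hhi(w))


# Same weights, two topologies (toy data).
w = [0.5, 0.3, 0.2]
complete = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # every pair linked
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # nodes 0 and 2 not linked

print("HHI:", hhi(w))
print("NCI, complete graph:", nci(w, complete))
print("NCI, path graph:", nci(w, path))
```

Under these assumptions the complete graph realizes all potential weighted interconnection, so its index is 1 up to rounding, while the path graph scores lower even though the HHI is unchanged.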
0
math.OC math.PR stat.ML · Samy Mekkaoui, Huyên Pham, Xavier Warin · Mar 23, 2026

This paper develops a neural operator framework for approximating mappings defined on constrained Wasserstein spaces $\mathcal{M}_\lambda$, consisting of probability measures on $I \times \mathbb{R}^d$ with prescribed marginal $\lambda$ on the label space $I$. The core contribution is the DeepONetCyl architecture, which combines cylindrical moment approximations $\Phi_J(\mu) = (\langle \varphi_1, \mu \rangle, \ldots, \langle \varphi_J, \mu \rangle)$ with a DeepONet-type branch–trunk structure to preserve the marginal constraint. This enables learning of heterogeneous (non-exchangeable) mean-field control problems where agent interactions depend on labels, extending prior neural methods beyond the exchangeable case.

We study the approximation of operators acting on probability measures on a product space with prescribed marginal. Let $I$ be a label space endowed with a reference measure $\lambda$, and define $\mathcal{M}_\lambda$ as the set of probability measures on $I\times \mathbb{R}^d$ with first marginal $\lambda$. By disintegration, elements of $\mathcal{M}_\lambda$ correspond to families of labeled conditional distributions. Operators defined on this constrained measure space arise naturally in mean-field control problems with heterogeneous, non-exchangeable agents. Our main theoretical result establishes a universal approximation theorem for continuous operators on $\mathcal{M}_\lambda$. The proof combines cylindrical approximations of probability measures with DeepONet-type branch-trunk neural architecture, yielding finite-dimensional representations of such operators. We further introduce a sampling strategy for generating training measures in $\mathcal{M}_\lambda$, enabling practical learning of such conditional mean-field operators. We apply the method to the numerical resolution of mean-field control problems with heterogeneous interactions, thereby extending previous neural approaches developed for homogeneous (exchangeable) systems. Numerical experiments illustrate the accuracy and computational effectiveness of the proposed framework.
0
stat.ML cs.LG · Atticus Rex, Elizabeth Qian, David Peterson · Mar 23, 2026

Multifidelity surrogate modeling aims to leverage cheap low-fidelity simulations to improve predictions of expensive high-fidelity models when training data is scarce. This paper proposes MAGPI, a Gaussian process regression method that augments the high-fidelity input space with features derived from recursively-trained low-fidelity surrogate models. The approach unifies desirable properties from cokriging and autoregressive estimators while allowing non-GP models for low-fidelity levels, achieving superior accuracy and computational efficiency.

Supervised machine learning describes the practice of fitting a parameterized model to labeled input-output data. Supervised machine learning methods have demonstrated promise in learning efficient surrogate models that can (partially) replace expensive high-fidelity models, making many-query analyses, such as optimization, uncertainty quantification, and inference, tractable. However, when training data must be obtained through the evaluation of an expensive model or experiment, the amount of training data that can be obtained is often limited, which can make learned surrogate models unreliable. In many engineering and scientific settings, though, cheaper low-fidelity models may be available, for example arising from simplified physics modeling or coarse grids. These models may be used to generate additional low-fidelity training data. The goal of multifidelity machine learning is to use both high- and low-fidelity training data to learn a surrogate model which is cheaper to evaluate than the high-fidelity model, but more accurate than any available low-fidelity model. This work proposes a new multifidelity training approach for Gaussian process regression which uses low-fidelity data to define additional features that augment the input space of the learned model. The approach unites desirable properties from two separate classes of existing multifidelity GPR approaches, cokriging and autoregressive estimators. Numerical experiments on several test problems demonstrate both increased predictive accuracy and reduced computational cost relative to the state of the art.
0
cs.LG stat.ML · Julius Kobialka, Emanuel Sommer, Chris Kolb et al. · Mar 23, 2026

Bayesian neural networks (BNNs) suffer from fragmented, high-dimensional posteriors due to weight-space symmetries, raising doubts about the practicality of sampling-based inference. This paper demonstrates that overparametrization—using more hidden units than necessary—actually transforms the posterior geometry in beneficial ways. The authors identify three key phenomena induced by redundancy: balancedness (norm equalization across layers), weight reallocation on equal-probability manifolds (following Dirichlet distributions), and prior conformity (marginals aligning with zero-mean Gaussian priors). Through theory for ReLU networks and extensive experiments with up to 10 million posterior samples, the work explains why recent sampling methods succeed and provides a principled foundation for understanding weight priors in overparametrized regimes.

Bayesian neural network (BNN) posteriors are often considered impractical for inference, as symmetries fragment them, non-identifiabilities inflate dimensionality, and weight-space priors are seen as meaningless. In this work, we study how overparametrization and priors together reshape BNN posteriors and derive implications allowing us to better understand their interplay. We show that redundancy introduces three key phenomena that fundamentally reshape the posterior geometry: balancedness, weight reallocation on equal-probability manifolds, and prior conformity. We validate our findings through extensive experiments with posterior sampling budgets that far exceed those of earlier works, and demonstrate how overparametrization induces structured, prior-aligned weight posterior distributions.
0
cs.LG stat.ML · Maryam Boubekraoui, Giordano d'Aloisio, Antinisca Di Marco · Mar 22, 2026

While most bias mitigation research targets binary classification, multi-class fairness remains under-explored. This paper proposes Generalised Exponentiated Gradient (GEG), an in-processing method that extends the Exponentiated Gradient framework to multi-class settings and enables simultaneous optimization of multiple fairness constraints via positive-label moment conditions. Evaluated on ten datasets against six baselines, GEG achieves fairness improvements up to 92% with moderate accuracy trade-offs, filling a critical gap in fair machine learning toolboxes.

The widespread use of AI and ML models in sensitive areas raises significant concerns about fairness. While the research community has introduced various methods for bias mitigation in binary classification tasks, the issue remains under-explored in multi-class classification settings. To address this limitation, we first formulate the problem of fair learning in multi-class classification as a multi-objective problem between effectiveness (i.e., prediction correctness) and multiple linear fairness constraints. Next, we propose a Generalised Exponentiated Gradient (GEG) algorithm to solve this task. GEG is an in-processing algorithm that enhances fairness in binary and multi-class classification settings under multiple fairness definitions. We conduct an extensive empirical evaluation of GEG against six baselines across seven multi-class and three binary datasets, using four widely adopted effectiveness metrics and three fairness definitions. GEG outperforms existing baselines, with fairness improvements of up to 92% and a decrease in accuracy of up to 14%.
0
cs.LG stat.ML · Alexandra Zelenin, Alexandra Zhuravlyova · Mar 23, 2026

This paper tackles the memory explosion problem in high-rank DoRA fine-tuning. At $d_{in}=8192$ and rank $r=384$, computing the row-wise norm $\|\mathbf{W}+s\mathbf{B}\mathbf{A}\|_{\text{row}}$ via standard materialization consumes ~512 MB per module, which is prohibitive for large models with hundreds of adapted layers. The authors propose a factored norm decomposition that reduces the computation to $\mathcal{O}(d_{out}r+r^2)$ intermediates, plus fused Triton kernels that collapse the composition into a single pass. On 8–32B vision-language models, this yields 1.5–2.0× speedups and up to 7 GB lower peak VRAM, with final-logit cosine similarity above 0.9999.

Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.
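The base, cross, and Gram terms mentioned above follow from expanding the squared row norm: for row $i$, $\|W_i + sB_iA\|^2 = \|W_i\|^2 + 2s\,B_i\cdot(WA^{\top})_i + s^2\,B_i(AA^{\top})B_i^{\top}$. The sketch below checks this identity in plain Python on small random matrices. It is only an algebraic illustration: the helper names are hypothetical, and it does not reproduce the paper's Triton kernels or its numerically stable rescaling form.

```python
import math
import random


def matmul(X, Y):
    """Plain-Python product of X [n, k] and Y [k, m]."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def dense_row_norms(W, B, A, s):
    """Reference: materialize the dense [d_out, d_in] product BA."""
    BA = matmul(B, A)
    return [math.sqrt(sum((w + s * d) ** 2 for w, d in zip(Wr, Dr)))
            for Wr, Dr in zip(W, BA)]


def factored_row_norms(W, B, A, s):
    """Same norms via [d_out, r] and [r, r] intermediates only."""
    At = transpose(A)
    WAt = matmul(W, At)   # [d_out, r], used by the cross term
    G = matmul(A, At)     # [r, r] Gram matrix A A^T
    BG = matmul(B, G)     # [d_out, r]
    norms = []
    for Wr, Br, WAr, BGr in zip(W, B, WAt, BG):
        base = sum(w * w for w in Wr)
        cross = 2.0 * s * sum(b * u for b, u in zip(Br, WAr))
        gram = s * s * sum(b * g for b, g in zip(Br, BGr))
        # Clamp tiny negative values caused by rounding before the sqrt.
        norms.append(math.sqrt(max(base + cross + gram, 0.0)))
    return norms


# Toy shapes; the abstract's setting is d_in = 8192, r = 384.
rng = random.Random(0)
d_out, d_in, r, s = 4, 6, 2, 0.5
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
A = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]

dense = dense_row_norms(W, B, A, s)
factored = factored_row_norms(W, B, A, s)
print("max abs difference:", max(abs(a - b) for a, b in zip(dense, factored)))
```

The factored version never forms the $[d_{out}, d_{in}]$ product, which is where the memory saving described in the abstract comes from.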
0
cs.LG stat.ML · Paolo Toccaceli · Mar 23, 2026

This paper addresses conditional distribution estimation for regression by proposing a non-parametric binning approach. Observations sorted by a one-dimensional covariate are partitioned into contiguous bins via dynamic programming, minimizing a closed-form leave-one-out CRPS cost function. The method produces conformal prediction sets with finite-sample marginal coverage guarantees and connects to Venn predictors, offering substantially narrower intervals than standard split-conformal methods on heteroscedastic and bimodal benchmarks.

We propose a method for non-parametric conditional distribution estimation based on partitioning covariate-sorted observations into contiguous bins and using the within-bin empirical CDF as the predictive distribution. Bin boundaries are chosen to minimise the total leave-one-out Continuous Ranked Probability Score (LOO-CRPS), which admits a closed-form cost function with $O(n^2 \log n)$ precomputation and $O(n^2)$ storage; the globally optimal $K$-partition is recovered by a dynamic programme in $O(n^2 K)$ time. Minimising within-sample LOO-CRPS turns out to be inappropriate for selecting $K$, as it results in in-sample optimism, so we instead select $K$ by evaluating test CRPS on an alternating held-out split, which yields a U-shaped criterion with a well-defined minimum. Having selected $K^*$ and fitted the full-data partition, we form two complementary predictive objects: the Venn prediction band and a conformal prediction set based on CRPS as the nonconformity score, which carries a finite-sample marginal coverage guarantee at any prescribed level $\varepsilon$. On real benchmarks against split-conformal competitors (Gaussian split conformal, CQR, and CQR-QRF), the method produces substantially narrower prediction intervals while maintaining near-nominal coverage.
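The partitioning step can be sketched as a standard dynamic programme over contiguous bins. The version below is a simplified illustration, not the paper's algorithm: it evaluates the leave-one-out CRPS of each candidate bin naively from the identity $\mathrm{CRPS}(F, y) = \mathbb{E}|Z - y| - \tfrac12\mathbb{E}|Z - Z'|$ for an empirical CDF $F$, instead of using the closed-form cost with precomputation described in the abstract. The function names, the `min_size` parameter, and the toy data are assumptions made for the example.

```python
def loo_crps_cost(ys):
    """Sum over i of CRPS(empirical CDF of ys without y_i, y_i).

    Requires at least two points so that each leave-one-out set is non-empty.
    """
    m = len(ys)
    total = 0.0
    for i, yi in enumerate(ys):
        rest = ys[:i] + ys[i + 1:]
        spread_to_obs = sum(abs(z - yi) for z in rest) / (m - 1)
        internal = sum(abs(a - b) for a in rest for b in rest)
        total += spread_to_obs - internal / (2.0 * (m - 1) ** 2)
    return total


def optimal_bins(ys, K, min_size=2):
    """Split covariate-sorted responses ys into K contiguous bins that
    minimise the total LOO-CRPS. Returns (list of (start, end), cost)."""
    n = len(ys)
    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(K + 1)]
    back = [[-1] * (n + 1) for _ in range(K + 1)]
    best[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(k * min_size, n + 1):
            for i in range((k - 1) * min_size, j - min_size + 1):
                if best[k - 1][i] == inf:
                    continue
                total = best[k - 1][i] + loo_crps_cost(ys[i:j])
                if total < best[k][j]:
                    best[k][j] = total
                    back[k][j] = i
    if best[K][n] == inf:
        raise ValueError("not enough points for K bins of this minimum size")
    bins, j = [], n
    for k in range(K, 0, -1):
        i = back[k][j]
        bins.append((i, j))
        j = i
    return bins[::-1], best[K][n]


# Toy responses, already sorted by the covariate: two clear regimes.
ys = [0.0, 0.1, 0.2, 0.1, 5.0, 5.1, 5.2, 5.1]
bins, cost = optimal_bins(ys, K=2)
print(bins, cost)
```

Because each bin cost is recomputed from scratch, this sketch is far slower than the $O(n^2 K)$ procedure in the abstract and is meant only to show the recursion.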
0
stat.ML cs.LG · MD Ruiz-Medina, AE Madrid, A Torres-Signes et al. · Mar 22, 2026

The paper addresses functional Gaussian Process regression on compact Riemannian manifolds, proposing a time-adaptive Empirical Bayes framework that exploits invariance of covariance kernels under isometries and spectral decomposition via Laplace–Beltrami eigenfunctions. The core idea is to work in the time-varying angular spectral domain, truncating the infinite-dimensional expansion based on functional sample size (typically logarithmic) to balance computational cost with approximation accuracy. This matters because it extends GP regression to infinite-dimensional functional settings on non-Euclidean domains while attempting to maintain computational tractability through spectral truncation schemes.

This paper proposes a new formulation of functional Gaussian Process regression in manifolds, based on an Empirical Bayes approach, in the spatiotemporal random field context. We apply the machinery of tight Gaussian measures in separable Hilbert spaces, exploiting the invariance property of covariance kernels under the group of isometries of the manifold. The identification of these measures with infinite-product Gaussian measures is then obtained via the eigenfunctions of the Laplace-Beltrami operator on the manifold. The involved time-varying angular spectra constitute the key tool for dimension reduction in the implementation of this regression approach, adopting a suitable truncation scheme depending on the functional sample size. The simulation study and synthetic data application undertaken illustrate the finite sample and asymptotic properties of the proposed functional regression predictor.
0
stat.ML cs.LG · Shailesh Garg, Souvik Chakraborty · Mar 23, 2026

Time-dependent reliability analysis for nonlinear dynamical systems under stochastic loading is computationally prohibitive with Monte Carlo simulation. CoNBONet proposes a surrogate combining DeepONet operator learning with Variable Spiking Neurons (VSNs) for sparse computation, Bayesian variational inference for uncertainty, and split conformal prediction for calibration. The goal is fast, energy-efficient inference with theoretical guarantees on reliability estimates.

Time-dependent reliability analysis of nonlinear dynamical systems under stochastic excitations is a critical yet computationally demanding task. Conventional approaches, such as Monte Carlo simulation, necessitate repeated evaluations of computationally expensive numerical solvers, leading to significant computational bottlenecks. To address this challenge, we propose CoNBONet, a neuroscience-inspired surrogate model that enables fast, energy-efficient, and uncertainty-aware reliability analysis, providing a scalable alternative to techniques such as Monte Carlo simulations. CoNBONet, short for Conformalized Neuroscience-inspired Bayesian Operator Network, leverages the expressive power of deep operator networks while integrating neuroscience-inspired neuron models to achieve fast, low-power inference. Unlike traditional surrogates such as Gaussian processes, polynomial chaos expansions, or support vector regression, that may face scalability challenges for high-dimensional, time-dependent reliability problems, CoNBONet offers fast and energy-efficient inference enabled by a neuroscience-inspired network architecture, calibrated uncertainty quantification with theoretical guarantees via split conformal prediction, and strong generalization capability through an operator-learning paradigm that maps input functions to system response trajectories. Validation of the proposed CoNBONet for various nonlinear dynamical systems demonstrates that CoNBONet preserves predictive fidelity, and achieves reliable coverage of failure probabilities, making it a powerful tool for robust and scalable reliability analysis in engineering design.
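For readers unfamiliar with the calibration step, the following is a minimal sketch of generic split conformal prediction with absolute-residual scores. It is not CoNBONet's pipeline: the surrogate, the spiking neurons, and the Bayesian layers are all omitted, and the function names and toy calibration data are hypothetical. It only shows how a held-out calibration set yields an interval half-width with marginal coverage of at least $1-\alpha$ under exchangeability.

```python
import math


def conformal_halfwidth(cal_true, cal_pred, alpha):
    """Half-width q such that [pred - q, pred + q] has marginal coverage
    of at least 1 - alpha, using absolute residuals on a calibration set."""
    scores = sorted(abs(t - p) for t, p in zip(cal_true, cal_pred))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]


def conformal_interval(pred, halfwidth):
    """Symmetric prediction interval around a point prediction."""
    return (pred - halfwidth, pred + halfwidth)


# Toy calibration set: the (hypothetical) model always predicts 0.
cal_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
cal_pred = [0.0] * 7
q = conformal_halfwidth(cal_true, cal_pred, alpha=0.25)
print(q, conformal_interval(10.0, q))
```

With seven calibration points and $\alpha = 0.25$, the rank is $\lceil 8 \times 0.75 \rceil = 6$, so the half-width is the sixth-smallest residual.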
0
cs.LG stat.ML · Shreeram Murali, Cristian R. Rojas, Dominik Baumann · Mar 23, 2026

The paper proposes a non-parametric classifier based on the Nadaraya-Watson (NW) estimator that achieves linear $O(n)$ computational complexity while providing frequentist uncertainty bounds on predictions. By reformulating kernel regression for multi-class classification and deriving error bounds under Lipschitz continuity or separability assumptions, the authors bridge the gap between efficient "black box" methods and computationally expensive approaches like Gaussian Processes that offer formal guarantees. The method achieves $>96\%$ accuracy on MIT-BIH ECG data with uncertainty intervals that flag low-confidence predictions, making it suitable for safety-critical applications.

While both classical and neural network classifiers can achieve high accuracy, they fall short of offering uncertainty bounds on their predictions, making them unfit for safety-critical applications. Existing kernel-based classifiers that provide such bounds scale roughly as $\mathcal{O}(n^3)$ in time, making them computationally intractable for large datasets. To address this, we propose a novel, computationally efficient classification algorithm based on the Nadaraya-Watson estimator, for whose estimates we derive frequentist uncertainty intervals. We evaluate our classifier on synthetically generated data and on electrocardiographic heartbeat signals from the MIT-BIH Arrhythmia database. We show that the method achieves competitive accuracy $>96\%$ at $\mathcal{O}(n)$ and $\mathcal{O}(\log n)$ operations, while providing actionable uncertainty bounds. These bounds can, e.g., aid in flagging low-confidence predictions, making them suitable for real-time settings with resource constraints, such as diagnostic monitoring or implantable devices.
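A minimal sketch of a Nadaraya-Watson classifier may help make the idea concrete: class scores are kernel-weighted averages of one-hot labels, and each query costs one pass over the $n$ training points. The Gaussian kernel, the bandwidth `h`, the function names, and the toy data are assumptions for illustration; the paper's uncertainty intervals and its $\mathcal{O}(\log n)$ variant are not reproduced here.

```python
import math


def nw_class_probs(x, X, y, classes, h=0.5):
    """Nadaraya-Watson estimate of P(class | x) with a Gaussian kernel.

    Each class score is the kernel-weighted fraction of training points
    carrying that label; a single query is one O(n) pass over the data.
    """
    weights = [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2.0 * h * h))
        for xi in X
    ]
    total = sum(weights)
    if total == 0.0:
        # Query is far from all data: fall back to a uniform guess.
        return {c: 1.0 / len(classes) for c in classes}
    return {
        c: sum(wt for wt, yi in zip(weights, y) if yi == c) / total
        for c in classes
    }


def nw_predict(x, X, y, classes, h=0.5):
    """Predict the class with the largest estimated probability."""
    probs = nw_class_probs(x, X, y, classes, h)
    return max(classes, key=lambda c: probs[c])


# Toy two-class data in the plane.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
y = [0, 0, 1, 1]
print(nw_predict((0.05, 0.0), X, y, classes=[0, 1]))
print(nw_class_probs((0.5, 0.5), X, y, classes=[0, 1]))
```

A query halfway between the two clusters receives class probabilities close to one half, which is the kind of low-confidence case the abstract's uncertainty bounds are meant to flag.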
cs.LG, stat.ML · Mohammed Abdullah, George Iosifidis, Salah Eddine Elayoubi et al. · Mar 22, 2026

The paper tackles Constrained Online Convex Optimization with Memory (COCO-M), where both losses and constraints depend on a window of past decisions, capturing realistic scenarios like smart-grid budgets and battery health limits. The authors propose the first algorithms achieving sublinear regret and cumulative constraint violation (CCV) under adversarial, time-varying constraints, both with and without unreliable predictions of future gradients. This work bridges the gap between classical constrained OCO and practical memory-dependent control problems.

We study Constrained Online Convex Optimization with Memory (COCO-M), where both the loss and the constraints depend on a finite window of past decisions made by the learner. This setting extends the previously studied unconstrained online optimization with memory framework and captures practical problems such as the control of constrained dynamical systems and scheduling with reconfiguration budgets. For this problem, we propose the first algorithms that achieve sublinear regret and sublinear cumulative constraint violation under time-varying constraints, both with and without predictions of future loss and constraint functions. Without predictions, we introduce an adaptive penalty approach that guarantees sublinear regret and constraint violation. When short-horizon and potentially unreliable predictions are available, we reinterpret the problem as online learning with delayed feedback and design an optimistic algorithm whose performance improves as prediction accuracy improves, while remaining robust when predictions are inaccurate. Our results bridge the gap between classical constrained online convex optimization and memory-dependent settings, and provide a versatile learning toolbox with diverse applications.
stat.ML, cs.LG, physics.comp-ph · Brianna Binder, Assad Oberai · Mar 22, 2026

This paper proposes a training-free conditional diffusion model for Bayesian filtering in data assimilation. Instead of learning the score function via neural networks, the authors leverage kernel density estimation (KDE) to represent the joint distribution of states and measurements, yielding a closed-form expression for the score that enables analytical sampling from the posterior. The method targets nonlinear, non-Gaussian filtering problems where traditional ensemble Kalman filters (EnKF) make restrictive Gaussian approximations and particle filters suffer from weight degeneracy in small-ensemble regimes.

We propose closed-form conditional diffusion models for data assimilation. Diffusion models use data to learn the score function (defined as the gradient of the log-probability density of a data distribution), allowing them to generate new samples from the data distribution by reversing a noise injection process. While it is common to train neural networks to approximate the score function, we leverage the analytical tractability of the score function to assimilate the states of a system with measurements. To enable the efficient evaluation of the score function, we use kernel density estimation to model the joint distribution of the states and their corresponding measurements. The proposed approach also inherits the ability of conditional diffusion models to operate in black-box settings, i.e., it can accommodate systems and measurement processes without explicit knowledge of them. The ability to accommodate black-box systems, combined with the superior capabilities of diffusion models in approximating complex, non-Gaussian probability distributions, means that the proposed approach offers advantages over many widely used filtering methods. We evaluate the proposed method on nonlinear data assimilation problems based on the Lorenz-63 and Lorenz-96 systems of moderate dimensionality and nonlinear measurement models. Results show the proposed approach outperforms the widely used ensemble Kalman and particle filters when small to moderate ensemble sizes are used.
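To illustrate why a kernel density estimate gives a score in closed form, the sketch below evaluates the conditional score for a scalar state and scalar measurement. It is not the authors' formulation. It assumes Gaussian kernels with hypothetical bandwidths `h_x` and `h_y`, and a variance-exploding noising of the state only, $x_t = x_0 + \sigma\varepsilon$. Under those assumptions the noised joint density is a Gaussian mixture, and its score with respect to the state is a softmax-weighted average of $(x_i - x)/(h_x^2 + \sigma^2)$.

```python
import math

def kde_conditional_score(x, y, states, measurements, sigma, h_x=0.3, h_y=0.3):
    """Score d/dx log p_sigma(x | y) under a Gaussian-kernel joint KDE.

    states, measurements: paired ensemble samples (x_i, y_i).
    sigma: noise level of the assumed variance-exploding forward process.
    """
    var_x = h_x ** 2 + sigma ** 2  # kernel variance plus diffusion noise
    # Log mixture weights: closeness in noised state and in measurement.
    log_w = [
        -((x - xi) ** 2) / (2.0 * var_x) - ((y - yi) ** 2) / (2.0 * h_y ** 2)
        for xi, yi in zip(states, measurements)
    ]
    shift = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return sum(wi * (xi - x) for wi, xi in zip(w, states)) / (total * var_x)

# Hypothetical ensemble of (state, measurement) pairs.
states = [-1.0, 1.0]
measurements = [0.0, 0.0]
print(kde_conditional_score(0.0, 0.0, states, measurements, sigma=1.0))
```

Because the measurement enters only through the mixture weights, conditioning on a new measurement reweights the ensemble members without any training step. That is the sense in which such an approach can be training-free.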
cs.LG, cs.AI, cs.IT · Changxiao Cai, Gen Li · Mar 23, 2026

Diffusion language models (DLMs) enable parallel token generation, but their efficiency depends critically on the decoding strategy that determines which tokens to unmask and when. This paper investigates confidence-based decoding—specifically an entropy sum strategy that adaptively batches tokens until cumulative prediction uncertainty exceeds a threshold—and proves it achieves $\varepsilon$-accurate sampling in KL divergence with expected iteration complexity $\widetilde{O}(H(X_0)/\varepsilon)$. When the data distribution has low entropy ($H(X_0) \ll L$), this yields sublinear complexity in sequence length, providing the first theoretical foundation for why confidence-based methods accelerate sampling without sacrificing fidelity.

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models for language modeling, allowing flexible generation order and parallel generation of multiple tokens. However, this flexibility introduces a challenge absent in AR models: the decoding strategy, which determines the order and number of tokens generated at each iteration, critically affects sampling efficiency. Among decoding strategies explored in practice, confidence-based methods, which adaptively select which and how many tokens to unmask based on prediction confidence, have shown strong empirical performance. Despite this success, our theoretical understanding of confidence-based decoding remains limited. In this work, we develop the first theoretical analysis framework for confidence-based decoding in DLMs. We focus on an entropy sum-based strategy that continues unmasking tokens within each iteration until the cumulative entropy exceeds a threshold, and show that it achieves $\varepsilon$-accurate sampling in KL divergence with an expected number of iterations $\widetilde{O}(H(X_0)/\varepsilon)$, where $H(X_0)$ denotes the entropy of the target data distribution. Notably, this strategy yields substantial sampling acceleration when the data distribution has low entropy relative to the sequence length, while automatically adapting to the intrinsic complexity of data without requiring prior knowledge or hyperparameter tuning. Overall, our results provide a theoretical foundation for confidence-based decoding and may inform the design of more efficient decoding strategies for DLMs.
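The entropy-sum rule described above can be sketched as a per-iteration selection step. This is a hedged illustration rather than the paper's algorithm. Visiting positions from lowest to highest entropy, stopping just before the running sum would exceed the threshold, and always unmasking at least one token are all assumptions that the abstract leaves open. The function names and toy distributions are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_positions(masked_probs, threshold):
    """Choose which masked positions to unmask in one iteration.

    masked_probs maps each masked position to the model's predictive
    distribution over tokens at that position.
    """
    ranked = sorted(masked_probs, key=lambda pos: entropy(masked_probs[pos]))
    chosen, running = [], 0.0
    for pos in ranked:
        h = entropy(masked_probs[pos])
        # Stop once the budget would be exceeded, but never return nothing.
        if chosen and running + h > threshold:
            break
        chosen.append(pos)
        running += h
    return chosen

# Toy predictive distributions over a two-token vocabulary.
masked_probs = {
    0: [1.0, 0.0],    # certain
    1: [0.5, 0.5],    # maximally uncertain
    2: [0.9, 0.1],
    3: [0.99, 0.01],
}
print(select_positions(masked_probs, threshold=0.5))
```

When most positions are near-deterministic, many of them fit under the entropy budget and are unmasked in parallel. This is the intuition behind the iteration count scaling with $H(X_0)$ rather than with sequence length.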
stat.ML, cs.LG, math.PR · Vivek Shripad Borkar · Mar 22, 2026

This paper extends stochastic approximation (SA) theory to non-Markovian driving noise that is also non-ergodic, establishing that the ergodic decomposition of the original process corresponds to a Doeblin decomposition of an equivalent Markov chain. The core insight is that iterates retain memory of the distant past through the tail $\sigma$-field at $-\infty$, offering a theoretical lens on how learning algorithms might encode long-term dependencies. The author proposes this framework as a paradigm for understanding transformer attention mechanisms and continual learning, where the entire history influences current updates.

Building on recent work of the author on stochastic approximation in non-Markovian environments, we consider the case in which the driving random process is non-ergodic in addition to being non-Markovian. Using this analysis, we propose an analytic framework for understanding transformer-based learning, specifically the 'attention' mechanism, and continual learning, both of which in principle depend on the entire past.
cs.LG, math.OC, stat.ML · Rustem Islamov, Roman Machacek, Aurelien Lucchi et al. · Mar 22, 2026

This paper studies how batch size and sequence length should scale with the total token budget in stochastic conditional gradient methods for LLM training. Under a $\mu$-Kurdyka-Łojasiewicz condition, the authors derive a BST (Batch-Sequence-Token) scaling rule $BS \asymp T^{2/3}$ that predicts three distinct regimes: noise-dominated, batch-independent optimal, and iteration-starved. The theory yields actionable guidelines for adaptive batch size scheduling and is validated on NanoGPT models up to 1B parameters.

We study the role of batch size in stochastic conditional gradient methods under a $\mu$-Kurdyka-Łojasiewicz ($\mu$-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy but, beyond a critical threshold, the benefits saturate and can eventually degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees. Experiments on NanoGPT are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.
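As a quick arithmetic illustration of the $BS \asymp T^{2/3}$ rule quoted in the summary for this entry, the sketch below rescales a reference configuration to a new token budget. It reads $BS$ as tokens per step (batch size times sequence length), which is an interpretation. It also assumes the hidden constant in $\asymp$ stays the same between the two budgets, which the rule itself does not guarantee. The function name and numbers are hypothetical.

```python
def rescale_tokens_per_step(bs_ref, t_ref, t_new):
    """Scale tokens per step (B * S) with the token budget as T ** (2/3).

    bs_ref: tokens per step that worked at the reference budget t_ref.
    t_new:  the new total token budget.
    """
    return bs_ref * (t_new / t_ref) ** (2.0 / 3.0)

# Hypothetical reference run: 2**19 tokens per step at a 1e9-token budget.
bs_ref = 2 ** 19
for budget in (1e9, 8e9, 64e9):
    print(budget, round(rescale_tokens_per_step(bs_ref, 1e9, budget)))
```

Under this reading, an 8x larger token budget calls for roughly 4x more tokens per step, so the number of optimizer steps grows only about 2x.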