Feed - arxlens

Pruned Adaptation Modules: A Simple yet Strong Baseline for Continual Foundation Models

cs.LG Elif Ceren Gok Yildirim, Murat Onur Yildirim, Joaquin Vanschoren · Mar 22, 2026

The paper challenges the rapid shift toward Vision Transformer-based continual learning by demonstrating that lightweight, pruned Convolutional Networks can outperform existing foundation model approaches. The authors propose Pruned Adaptation Modules (PAM), which freeze early ResNet layers and introduce sparsely structured task-specific modules, yielding significant parameter reductions while improving accuracy. This work fills a critical methodological gap by establishing a strong, efficient baseline that questions whether recent advances reflect genuine progress or merely the absence of rigorous ConvNet comparisons.

The continual learning literature has rapidly shifted from traditional class incremental learning (CIL) techniques to foundation model (FM)-based CIL methods without a clear understanding of how these newer approaches compare to strong, lightweight convolutional baselines. This abrupt transition has created a substantial methodological gap, making it difficult to assess whether recent FM-based CIL progress reflects genuine advances or merely the absence of rigorous baselines. To address this gap, we introduce Pruned Adaptation Modules (PAM), a simple yet effective method that freezes the vast majority of the pre-trained ResNet while enabling scalable continual adaptation through sparse task-specific layers. PAM yields up to a ~5x reduction in trainable parameters and a ~6x reduction in total parameters, significantly reducing the cost of continual updates. Across diverse benchmarks, PAM consistently mitigates catastrophic forgetting and outperforms state-of-the-art FM-based CIL approaches. Our findings position PAM as a strong and transparent baseline that helps bridge the gap between traditional and FM-based CIL, guiding future research for a more accurate assessment of true progress in continual adaptation. The code can be found at: https://github.com/ElifCerenGokYildirim/PAM.

Benchmarking Scientific Machine Learning Models for Air Quality Data

cs.LG Khawja Imran Masud, Venkata Sai Rahul Unnam, Sahara Ali · Mar 22, 2026

This paper benchmarks classical statistical models (LR, SARIMAX), deep learning approaches (MLP, LSTM), and physics-guided variants for multi-horizon AQI forecasting in Dallas County, North Texas. The core innovation is incorporating EPA breakpoint-based AQI formulations as consistency constraints via weighted loss functions ($\mathcal{L}_{total} = \lambda_{data}\mathcal{L}_{data} + \lambda_{phys}\mathcal{L}_{phys}$). The work addresses a practical need for standardized regional model comparison to guide public health decision-making.

Accurate air quality index (AQI) forecasting is essential for protecting public health in rapidly growing urban regions, yet practical model evaluation and selection are often hampered by the lack of rigorous, region-specific benchmarking on standardized datasets. Physics-guided machine learning and deep learning models could address this with more accurate and efficient AQI forecasting. This study presents an explainable and comprehensive benchmark of classical time-series, machine-learning, and deep-learning approaches for multi-horizon AQI forecasting in North Texas (Dallas County), yielding a model-selection guideline and a proposed physics-guided best model. Using publicly available U.S. Environmental Protection Agency (EPA) daily air quality observations from 2022 to 2024, we curate city-level time series for PM2.5 and O3 by aggregating station measurements and constructing lag-wise forecasting datasets for LAG in {1,7,14,30} days. To identify the best model, linear regression (LR), SARIMAX, multilayer perceptrons (MLP), and LSTM networks are evaluated alongside the proposed physics-guided variants (MLP+Physics and LSTM+Physics), which incorporate the EPA breakpoint-based AQI formulation as a consistency constraint through a weighted loss. Experiments using chronological train-test splits and the error metrics MAE and RMSE show that deep-learning models outperform simpler baselines, while physics guidance improves stability and yields physically consistent pollutant-AQI relationships, with the largest benefits observed for short-horizon prediction and for PM2.5 and O3. Overall, the results provide a practical reference for selecting AQI forecasting models in North Texas and clarify when lightweight physics constraints meaningfully improve predictive performance across pollutants and forecast horizons.
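
The weighted loss described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the physics term penalises disagreement between a predicted AQI and the AQI implied by the predicted PM2.5 concentration through the EPA piecewise-linear breakpoint formula. The breakpoint table, the function names, and the default weights `lam_data` and `lam_phys` are all assumptions made for the example; the breakpoint values in particular should be checked against the current EPA tables.

```python
# Illustrative PM2.5 breakpoints as (C_lo, C_hi, I_lo, I_hi).
# Assumed values for the sketch; verify against the current EPA table.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
]


def aqi_from_concentration(c, table=PM25_BREAKPOINTS):
    """Piecewise-linear AQI: I = (I_hi - I_lo) / (C_hi - C_lo) * (C - C_lo) + I_lo."""
    # Truncate to one decimal so values cannot fall between two brackets.
    c = int(c * 10 + 1e-9) / 10
    for c_lo, c_hi, i_lo, i_hi in table:
        if c_lo <= c <= c_hi:
            return (i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo
    raise ValueError("concentration outside the illustrative table")


def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(pred_conc, pred_aqi, true_aqi, lam_data=1.0, lam_phys=0.1):
    """L_total = lam_data * L_data + lam_phys * L_phys (assumed form of each term)."""
    data_term = mse(pred_aqi, true_aqi)
    # Physics term: predicted AQI should match the AQI implied by the
    # predicted concentration under the breakpoint formula.
    implied_aqi = [aqi_from_concentration(c) for c in pred_conc]
    phys_term = mse(pred_aqi, implied_aqi)
    return lam_data * data_term + lam_phys * phys_term
```

Setting `lam_phys=0` recovers a purely data-driven loss, which is one way to compare a plain model against its physics-guided variant under the same training code.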

Fuel Consumption Prediction: A Comparative Analysis of Machine Learning Paradigms

cs.LG Ali Akram · Mar 22, 2026

This paper compares classical machine learning methods (Linear Regression, SVM, Logistic Regression) for predicting vehicle fuel consumption using the 1974 Motor Trend dataset (N=398). The author argues that these "interpretable" models outperform "black box" deep learning approaches for static physical datasets—a claim that relies on a false equivalence between 50-year-old tabular data and modern time-series telematics applications.

The automotive industry is under growing pressure to reduce its environmental impact, requiring accurate predictive modeling to support sustainable engineering design. This study examines the factors that determine vehicle fuel consumption from the seminal Motor Trend dataset, identifying the governing physical factors of efficiency through rigorous quantitative analysis. Methodologically, the research uses data sanitization, statistical outlier elimination, and in-depth Exploratory Data Analysis (EDA) to curb the occurrence of multicollinearity between powertrain features. A comparative analysis of machine learning paradigms including Multiple Linear Regression, Support Vector Machines (SVM), and Logistic Regression was carried out to assess predictive efficacy. Findings indicate that SVM Regression is most accurate on continuous prediction (R-squared = 0.889, RMSE = 0.326), and is effective in capturing the non-linear relationships between vehicle mass and engine displacement. In parallel, Logistic Regression proved superior for classification (Accuracy = 90.8%) and showed exceptional recall (0.957) when identifying low-efficiency vehicles. These results challenge the current trend toward black-box deep learning architectures for static physical datasets, providing validation of robust performance by interpretable and well-tuned classical models. The research finds that intrinsic vehicle efficiency is fundamentally determined by physical design parameters, weight and displacement, offering a data-driven framework for how manufacturers should focus on lightweighting and engine downsizing to achieve stringent global sustainability goals.

TabPFN Extensions for Interpretable Geotechnical Modelling

cs.CE cs.LG Taiga Saito, Yu Otake, Daijiro Mizutani et al. · Mar 22, 2026

This exploratory study investigates using TabPFN—a transformer-based tabular foundation model—and its extension library for geotechnical site characterization. The core idea is to leverage in-context learning to perform soil classification and multivariate parameter imputation without model retraining or hyperparameter tuning, while obtaining interpretable insights through embeddings, posterior distributions, and SHAP analysis. This matters because geotechnical engineering requires uncertainty-aware, interpretable predictions for safety-critical decisions, yet faces severe data scarcity.

Geotechnical site characterisation relies on sparse, heterogeneous borehole data where uncertainty quantification and model interpretability are as critical as predictive accuracy for reliable engineering decisions. This paper presents an exploratory investigation into the use of TabPFN, a transformer-based tabular foundation model using in-context learning, and its extension library tabpfn-extensions for two geotechnical inference tasks: (1) soil-type classification using N-value and shear-wave velocity data from a synthetic geotechnical dataset, and (2) iterative imputation of five missing mechanical parameters ($s_\mathrm{u}$, $E_{\mathrm{u}}$, ${\sigma'}_\mathrm{p}$, $C_\mathrm{c}$, $C_\mathrm{v}$) in benchmark problem BM/AirportSoilProperties/2/2025. We apply cosine-similarity analysis to TabPFN-derived embeddings, visualise full posterior distributions from an iterative inference procedure, and compute SHAP-based feature importance, all without model retraining. Learned embeddings clearly separate Clay and Sand samples without explicit soil-type supervision; iterative imputation improves predictions for four of five target parameters, with posterior widths that reflect physically reasonable parameter-specific uncertainty; and SHAP analysis reveals the inter-parameter dependency structure, recovering established geotechnical relationships including the Skempton compression index correlation and the inverse dependence of preconsolidation pressure on water content. These results suggest the potential of foundation-model-based tools to support interpretable, uncertainty-aware parameter inference in data-scarce geotechnical practice.

Deep Attention-based Sequential Ensemble Learning for BLE-Based Indoor Localization in Care Facilities

cs.LG cs.HC Minh Triet Pham, Quynh Chi Dang, Le Nhat Tan · Mar 22, 2026

This paper addresses BLE-based indoor localization in care facilities by shifting from independent-window classification to sequential learning. The proposed DASEL framework combines frequency-based feature engineering, bidirectional GRUs with attention mechanisms, and a two-level hierarchical ensemble to model temporal movement trajectories. Achieving a 53.1% improvement over traditional baselines on the ABC 2026 challenge dataset, the work demonstrates that capturing temporal dependencies is critical for accurate indoor localization in complex real-world environments.

Indoor localization systems in care facilities enable optimization of staff allocation, workload management, and quality of care delivery. Traditional machine learning approaches to Bluetooth Low Energy (BLE)-based localization treat each temporal measurement as an independent observation, fundamentally limiting their performance. To address this limitation, this paper introduces Deep Attention-based Sequential Ensemble Learning (DASEL), a novel framework that reconceptualizes indoor localization as a sequential learning problem. The framework integrates frequency-based feature engineering, bidirectional GRU networks with attention mechanisms, multi-directional sliding windows, and confidence-weighted temporal smoothing to capture human movement trajectories. Evaluated on real-world data from a care facility using 4-fold temporal cross-validation, DASEL achieves a macro F1 score of 0.4438, representing a 53.1% improvement over the best traditional baseline (0.2898).
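
The reported 53.1% figure is a relative gain over the baseline macro F1, not an absolute difference (the absolute gain is 0.154). The short Python check below reproduces the arithmetic from the two scores quoted in the abstract; the helper names are illustrative.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def macro_f1(per_class_f1):
    """Macro F1 is the unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


gain = relative_improvement(0.4438, 0.2898)
print(f"{gain:.1f}%")  # prints 53.1%
```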

Model Evolution Under Zeroth-Order Optimization: A Neural Tangent Kernel Perspective

cs.LG Chen Zhang, Yuxin Cheng, Chenchen Ding et al. · Mar 22, 2026

Zeroth-order (ZO) optimization enables memory-efficient training via forward-only gradient estimation, but its stochastic nature obscures training dynamics compared to well-characterized first-order (FO) methods. This paper introduces the Neural Zeroth-order Kernel (NZK) to describe model evolution in function space under ZO updates, proving that the expected NZK remains time-invariant for linear models and depends explicitly on the moments of random perturbation directions. The work extends to linearized neural networks and proposes using a single shared random vector to accelerate convergence, with experiments on synthetic and real-world datasets (MNIST, CIFAR-10, Tiny ImageNet) validating the theoretical predictions.

Zeroth-order (ZO) optimization enables memory-efficient training of neural networks by estimating gradients via forward passes only, eliminating the need for backpropagation. However, the stochastic nature of gradient estimation significantly obscures the training dynamics, in contrast to the well-characterized behavior of first-order methods under Neural Tangent Kernel (NTK) theory. To address this, we introduce the Neural Zeroth-order Kernel (NZK) to describe model evolution in function space under ZO updates. For linear models, we prove that the expected NZK remains constant throughout training and depends explicitly on the first and second moments of the random perturbation directions. This invariance yields a closed-form expression for model evolution under squared loss. We further extend the analysis to linearized neural networks. Interpreting ZO updates as kernel gradient descent via NZK provides a novel perspective for potentially accelerating convergence. Extensive experiments across synthetic and real-world datasets (including MNIST, CIFAR-10, and Tiny ImageNet) validate our theoretical results and demonstrate acceleration when using a single shared random vector.
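
As a concrete picture of forward-only gradient estimation, here is a minimal Python sketch of a two-point ZO update on a linear model with squared loss. It is an illustration under stated assumptions rather than the paper's estimator: the central-difference form, the unit-sphere perturbation law, and the values of `lr` and `mu` are all choices made for the example, and the NZK analysis itself is not reproduced.

```python
import math
import random


def squared_loss(w, xs, ys):
    """Mean squared error of the linear model f(x) = w . x."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(wi * xi for wi, xi in zip(w, x))
        total += (pred - y) ** 2
    return total / len(xs)


def unit_direction(dim, rng):
    """Random direction on the unit sphere (an assumed perturbation law)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]


def zo_step(w, xs, ys, rng, lr=0.1, mu=1e-3):
    """One two-point ZO update using only forward evaluations of the loss."""
    u = unit_direction(len(w), rng)
    w_plus = [wi + mu * ui for wi, ui in zip(w, u)]
    w_minus = [wi - mu * ui for wi, ui in zip(w, u)]
    # Central difference estimates the directional derivative along u.
    slope = (squared_loss(w_plus, xs, ys) - squared_loss(w_minus, xs, ys)) / (2 * mu)
    # Move against the estimated gradient, which lies along u.
    return [wi - lr * slope * ui for wi, ui in zip(w, u)]


rng = random.Random(0)
true_w = [1.0, -2.0, 0.5]
xs = [[rng.uniform(-1, 1) for _ in true_w] for _ in range(20)]
ys = [sum(wi * xi for wi, xi in zip(true_w, x)) for x in xs]

w = [0.0, 0.0, 0.0]
loss_before = squared_loss(w, xs, ys)
for _ in range(500):
    w = zo_step(w, xs, ys, rng)
loss_after = squared_loss(w, xs, ys)
```

No backpropagation is involved: each update costs two loss evaluations, and the only state kept between steps is the weight vector.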

CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs

cs.LG cs.CL Florent Draye, Abir Harrasse, Vedant Palit et al. · Mar 22, 2026

Cross-Layer Transcoders (CLTs) compress the attribution graphs used in mechanistic interpretability by sharing features across transformer layers, but their quadratic parameter scaling ($N_{\text{CLT}} \propto L^2$) makes training and analysis prohibitively expensive for most researchers. This paper introduces CLT-Forge, an open-source library that combines feature-sharded distributed training, compressed activation caching (int8/int4/int2 with zstd), automated interpretability pipelines, and integration with Circuit-Tracer to provide the first unified workflow for end-to-end CLT analysis at scale.

Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, limiting their interpretability in practice. Cross-Layer Transcoders (CLTs) address this issue by sharing features across layers while preserving layer-specific decoding, yielding more compact representations, but remain difficult to train and analyze at scale. We introduce an open-source library for end-to-end training and interpretability of CLTs. Our framework integrates scalable distributed training with model sharding and compressed activation caching, a unified automated interpretability pipeline for feature analysis and explanation, attribution graph computation using Circuit-Tracer, and a flexible visualization interface. This provides a practical and unified solution for scaling CLT-based mechanistic interpretability. Our code is available at: https://github.com/LLM-Interp/CLT-Forge.
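
To illustrate what compressed activation caching involves, the Python sketch below quantizes a list of activations to int8 and compresses the packed bytes. It is not CLT-Forge's code or API: the function names are hypothetical, the quantization scheme (symmetric, per-tensor) is an assumption, and the standard-library `zlib` stands in for zstd so that the example has no external dependencies.

```python
import struct
import zlib


def quantize_int8(values):
    """Symmetric per-tensor quantization to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def pack(codes, scale):
    """Serialize the scale (float64) followed by the int8 codes, then compress."""
    raw = struct.pack(f"<d{len(codes)}b", scale, *codes)
    return zlib.compress(raw)  # zlib used here as a stand-in for zstd


def unpack(blob):
    """Decompress and dequantize back to approximate float activations."""
    raw = zlib.decompress(blob)
    n = len(raw) - 8  # 8 bytes hold the float64 scale
    scale, *codes = struct.unpack(f"<d{n}b", raw)
    return [c * scale for c in codes]


acts = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(acts)
restored = unpack(pack(codes, scale))
```

With rounding to the nearest code, the reconstruction error of each value is at most half the scale; lower bit widths such as int4 or int2 trade a coarser grid for a smaller cache.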

When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models

cs.LG Abhinaba Basu · Mar 22, 2026

This paper investigates a fundamental paradox in hybrid sequence models: content-based routing requires exactly the pairwise computation it aims to avoid. Through 20+ controlled experiments, the authors demonstrate that one layer of softmax attention creates a latent $\sim$34-dimensional subspace via value aggregation, enabling 98.4% routing precision, while all alternatives (recurrence, linear attention, contrastive pretraining) cluster at 1–29%. These findings reframe attention as a representation constructor rather than merely a computation mechanism, providing a mechanistic explanation for why sub-quadratic models fail at associative recall.

We identify a routing paradox in hybrid recurrent-attention architectures: content-based routing - deciding which tokens deserve expensive attention - requires exactly the pairwise computation that routing is designed to avoid. Through 20+ controlled experiments across three tasks (a synthetic diagnostic, the Zoology MQAR benchmark, and HotpotQA), we map the routing landscape exhaustively. One layer of softmax attention creates a latent ~34-dimensional subspace enabling 98.4% routing precision; zero layers yield 1.2%. This subspace is invisible to cosine similarity, destroyed by random projections (98.4% to 2.6%), and cannot be created by contrastive pretraining - proving attention's role is writing pairwise match results into representations, not merely computing them. Twelve alternative mechanisms all cluster at 15-29%. Non-learned indices (Bloom filter: 90.9%; BM25 on HotpotQA: 82.7%) bypass the bottleneck entirely. The result is a sharp two-regime hierarchy with an empty middle ground. These findings provide the mechanistic explanation for the empirical observation that recurrent models fail at associative recall, and reframe attention as a representation constructor rather than merely a computation mechanism.

Deep Reinforcement Learning and The Tale of Two Temporal Difference Errors

cs.LG cs.AI Juan Sebastian Rojas, Chi-Guhn Lee · Mar 23, 2026

This paper identifies a subtle but important distinction between two interpretations of the TD error in reinforcement learning: the explicit form (bootstrapped target minus prediction) commonly used in deep RL, and the implicit form (difference between temporally successive predictions) from the original Sutton (1988) formulation. While equivalent in tabular settings, the authors demonstrate that increasingly nonlinear architectures cause these to diverge significantly, with profound implications for average-reward and differential RL algorithms.

The temporal difference (TD) error was first formalized in Sutton (1988), where it was first characterized as the difference between temporally successive predictions, and later, in that same work, formulated as the difference between a bootstrapped target and a prediction. Since then, these two interpretations of the TD error have been used interchangeably in the literature, with the latter eventually being adopted as the standard critic loss in deep reinforcement learning (RL) architectures. In this work, we show that these two interpretations of the TD error are not always equivalent. In particular, we show that increasingly-nonlinear deep RL architectures can cause these interpretations of the TD error to yield increasingly different numerical values. Then, building on this insight, we show how choosing one interpretation of the TD error over the other can affect the performance of deep RL algorithms that utilize the TD error to compute other quantities, such as with deep differential (i.e., average-reward) RL methods. All in all, our results show that the default interpretation of the TD error as the difference between a bootstrapped target and a prediction does not always hold in deep RL settings.

Confidence Freeze: Early Success Induces a Metastable Decoupling of Metacognition and Behaviour

cs.LG Zhipeng Zhang, Hongshun He · Mar 22, 2026

This paper investigates why humans persist with failing strategies despite negative feedback, proposing 'confidence freeze'—a metastable state where early success decouples metacognitive confidence from behavior. Using a multi-reversal bandit task (N=332 across 3 experiments), the authors show that brief exposure to 90% success rates (vs. 60%) induces lock-in behavior where participants endure ~6 consecutive losses while reporting plummeting confidence, suggesting a dynamic mechanism rather than stable individual traits.

Humans must flexibly arbitrate between exploring alternatives and exploiting learned strategies, yet they frequently exhibit maladaptive persistence by continuing to execute failing strategies despite accumulating negative evidence. Here we propose a "confidence-freeze" account that reframes such persistence as a dynamic learning state rather than a stable dispositional trait. Using a multi-reversal two-armed bandit task across three experiments (total N = 332; 19,920 trials), we first show that human learners normally make use of the symmetric statistical structure inherent in outcome trajectories: runs of successes provide positive evidence for environmental stability and thus for strategy maintenance, whereas runs of failures provide negative evidence and should raise switching probability. Behaviour in the control group conformed to this normative pattern. However, individuals who experienced a high rate of early success (90% vs. 60%) displayed a robust and selective distortion after the first reversal: they persisted through long stretches of non-reward (mean = 6.2 consecutive losses) while their metacognitive confidence ratings simultaneously dropped from 5 to 2 on a 7-point scale.

Beyond a Single Signal: SPECTRE-G2, A Unified Multi-Expert Anomaly Detector for Unknown Unknowns

cs.LG cs.CV Rahul D Ray · Mar 22, 2026

SPECTRE-G2 tackles epistemic uncertainty in safety-critical systems by detecting 'unknown unknowns'—inputs that violate the structural assumptions of the training distribution. Unlike prior work that relies on single signals (confidence, density, or reconstruction error), this paper proposes a multi-expert architecture combining eight complementary signals from a dual-backbone network. The core idea is that diverse structural anomalies require diverse detection mechanisms. The method achieves strong empirical results across synthetic causal, tabular, image, and RL environments, though some baseline implementations appear problematic.

Epistemic intelligence requires machine learning systems to recognise the limits of their own knowledge and act safely under uncertainty, especially when faced with unknown unknowns. Existing uncertainty quantification methods rely on a single signal such as confidence or density and fail to detect diverse structural anomalies. We introduce SPECTRE-G2, a multi-signal anomaly detector that combines eight complementary signals from a dual-backbone neural network. The architecture includes a spectral normalised Gaussianization encoder, a plain MLP preserving feature geometry, and an ensemble of five models. These produce density, geometry, uncertainty, discriminative, and causal signals. Each signal is normalised using validation statistics and calibrated with synthetic out-of-distribution data. An adaptive top-k fusion selects the most informative signals and averages their scores. Experiments on synthetic, Adult, CIFAR-10, and Gridworld datasets show strong performance across diverse anomaly types, outperforming multiple baselines on AUROC, AUPR, and FPR95. The model is stable across seeds and particularly effective for detecting new variables and confounders. SPECTRE-G2 provides a practical approach for detecting unknown unknowns in open-world settings.
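
The normalise-then-fuse step can be sketched as follows in Python. This is a hedged illustration, not the SPECTRE-G2 implementation: the abstract does not specify how the "most informative" signals are chosen, so the sketch assumes they are the k largest z-scores, and it omits the calibration with synthetic out-of-distribution data. Signal names and the default `k` are illustrative.

```python
import statistics


def normalise(score, val_scores):
    """Z-score a raw signal using validation-set statistics."""
    mean = statistics.fmean(val_scores)
    std = statistics.pstdev(val_scores)
    return (score - mean) / std if std > 0 else 0.0


def top_k_fusion(raw_signals, val_stats, k=3):
    """Average the k largest normalised signals for one input (assumed rule)."""
    z = [normalise(raw_signals[name], val_stats[name]) for name in raw_signals]
    top = sorted(z, reverse=True)[:k]
    return sum(top) / len(top)


# Hypothetical validation scores and one test input with three signals.
val_stats = {"density": [0, 1, 2], "geometry": [0, 2, 4], "uncertainty": [1, 1, 1]}
raw_signals = {"density": 3, "geometry": 2, "uncertainty": 5}
anomaly_score = top_k_fusion(raw_signals, val_stats, k=2)
```

Normalising first puts signals with different units on a common scale, so a single strongly deviating signal is not diluted by several uninformative ones when only the top k are averaged.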

Learning from Label Proportions with Dual-proportion Constraints

cs.LG Tianhao Ma, Ximing Li, Changchun Li et al. · Mar 22, 2026

Learning from Label Proportions (LLP) trains instance-level classifiers using only bag-level class proportions, addressing privacy constraints and annotation costs. This paper introduces LLP-DC, which enforces dual constraints: bag-level mean predictions align with given proportions, while instance-level training uses hard pseudo-labels generated via minimum-cost maximum-flow to strictly satisfy proportion constraints. The method offers a novel formulation of LLP as a candidate label assignment problem, achieving state-of-the-art results across standard vision benchmarks.

Learning from Label Proportions (LLP) is a weakly supervised problem in which the training data comprise bags, that is, groups of instances, each annotated only with bag-level class label proportions, and the objective is to learn a classifier that predicts instance-level labels. This setting is widely applicable when privacy constraints limit access to instance-level annotations or when fine-grained labeling is costly or impractical. In this work, we introduce a method that leverages Dual proportion Constraints (LLP-DC) during training, enforcing them at both the bag and instance levels. Specifically, the bag-level training aligns the mean prediction with the given proportion, and the instance-level training aligns hard pseudo-labels that satisfy the proportion constraint, where a minimum-cost maximum-flow algorithm is used to generate hard pseudo-labels. Extensive experimental results across various benchmark datasets empirically validate that LLP-DC consistently improves over previous LLP methods across datasets and bag sizes. The code is publicly available at https://github.com/TianhaoMa5/CVPR2026_Findings_LLP_DC.
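
The two constraints can be illustrated with a small Python sketch. It is not the LLP-DC code: the bag-level term is assumed here to be a cross-entropy between the given proportions and the mean prediction, and the pseudo-label step covers only the two-class special case, where the minimum-cost assignment under a fixed positive count reduces to taking the highest-scoring instances. The general multi-class min-cost max-flow solver is not reproduced, and all function names are illustrative.

```python
import math


def bag_level_loss(probs, proportions):
    """Cross-entropy between given proportions and the bag's mean prediction."""
    n, k = len(probs), len(proportions)
    mean_pred = [sum(p[c] for p in probs) / n for c in range(k)]
    return -sum(q * math.log(max(m, 1e-12)) for q, m in zip(proportions, mean_pred))


def binary_pseudo_labels(pos_probs, pos_fraction):
    """Hard labels with exactly round(pos_fraction * n) positives per bag."""
    n = len(pos_probs)
    n_pos = round(pos_fraction * n)
    # With two classes, the cheapest assignment labels the n_pos most
    # confident instances as positive and the rest as negative.
    ranked = sorted(range(n), key=lambda i: pos_probs[i], reverse=True)
    labels = [0] * n
    for i in ranked[:n_pos]:
        labels[i] = 1
    return labels


labels = binary_pseudo_labels([0.9, 0.2, 0.6, 0.4], pos_fraction=0.5)
loss = bag_level_loss([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5])
```

Because the pseudo-labels are hard and match the bag's class counts exactly, the instance-level targets satisfy the proportion constraint by construction rather than only in expectation.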

Interpreting the Synchronization Gap: The Hidden Mechanism Inside Diffusion Transformers

cs.LG cond-mat.dis-nn cond-mat.stat-mech Emil Albrychiewicz, Andrés Franco Valiente, Li-Ching Chen et al. · Mar 22, 2026

Recent theoretical models of diffusion as coupled Ornstein-Uhlenbeck processes predict a hierarchy of interaction timescales creating a synchronization gap between global and local committing modes. This work investigates how this gap mechanistically emerges within pretrained Diffusion Transformers by introducing a controlled architectural realization of replica coupling via symmetric cross-attention gates with strength $g$. Through linearized analysis and empirical probing of DiT-XL/2 across all 28 layers, the authors demonstrate that the gap is an intrinsic, depth-localized property that collapses under strong coupling as $\mathcal{O}(\frac{1-g}{1+g})$, providing a bridge between continuous statistical physics and discrete transformer dynamics.

Recent theoretical models of diffusion processes, conceptualized as coupled Ornstein-Uhlenbeck systems, predict a hierarchy of interaction timescales, and consequently, the existence of a synchronization gap between modes that commit at different stages of the reverse process. However, because these predictions rely on continuous time and analytically tractable score functions, it remains unclear how this phenomenology manifests in the deep, discrete architectures deployed in practice. In this work, we investigate how the synchronization gap is mechanistically realized within pretrained Diffusion Transformers (DiTs). We construct an explicit architectural realization of replica coupling by embedding two generative trajectories into a joint token sequence, modulated by a symmetric cross attention gate with variable coupling strength g. Through a linearized analysis of the attention difference, we show that the replica interaction decomposes mechanistically. We empirically validate our theoretical framework on a pretrained DiT-XL/2 model by tracking commitment and per layer internal mode energies. Our results reveal that: (1) the synchronization gap is an intrinsic architectural property of DiTs that persists even when external coupling is turned off; (2) as predicted by our spatial routing bounds, the gap completely collapses under strong coupling; (3) the gap is strictly depth localized, emerging sharply only within the final layers of the Transformer; and (4) global, low frequency structures consistently commit before local, high frequency details. Ultimately, our findings provide a mechanistic interpretation of how Diffusion Transformers resolve generative ambiguity, isolating speciation transitions to the terminal layers of the network.

Joint Surrogate Learning of Objectives, Constraints, and Sensitivities for Efficient Multi-objective Optimization of Neural Dynamical Systems

cs.LG Frithjof Gressmann, Ivan Georgiev Raikov, Seung Hyun Kim et al. · Mar 22, 2026

Multi-objective optimization of expensive biophysical neural simulations is hindered by high-dimensional parameter spaces and binary constraints that partition the search space without gradient signals. This paper introduces dmosopt, a framework that jointly learns objectives, constraints, and parameter sensitivities in a single differentiable surrogate model $f: \mathbb{R}^n \rightarrow \mathbb{R}^{q+k}$. By computing a unified gradient $\mathbf{g}_{\text{sopt}}$ that simultaneously steers toward improved objective values and greater constraint satisfaction, the method navigates feasibility manifolds that defeat standard approaches, achieving substantial speedups on problems ranging from single-cell models to million-neuron networks.

Biophysical neural system simulations are among the most computationally demanding scientific applications, and their optimization requires navigating high-dimensional parameter spaces under numerous constraints that impose a binary feasible/infeasible partition with no gradient signal to guide the search. Here, we introduce DMOSOPT, a scalable optimization framework that leverages a unified, jointly learned surrogate model to capture the interplay between objectives, constraints, and parameter sensitivities. By learning a smooth approximation of both the objective landscape and the feasibility boundary, the joint surrogate provides a unified gradient that simultaneously steers the search toward improved objective values and greater constraint satisfaction, while its partial derivatives yield per-parameter sensitivity estimates that enable more targeted exploration. We validate the framework from single-cell dynamics to population-level network activity, spanning incremental stages of a neural circuit modeling workflow, and demonstrate efficient, effective optimization of highly constrained problems at supercomputing scale with substantially fewer problem evaluations. While motivated by and demonstrated in the context of computational neuroscience, the framework is general and applicable to constrained multi-objective optimization problems across scientific and engineering domains.

Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

cs.CR cs.AI cs.LG Tom Biskupski, Stephan Kleber · Mar 23, 2026

Evaluating LLM outputs at scale remains a bottleneck for deploying safe AI systems. This paper conducts a comprehensive empirical study of 37 conversational LLMs serving as automated judges across eight security and quality assessment tasks. The work identifies viable open-source alternatives to GPT-4o for judgment tasks while demonstrating that popular techniques like second-level judging and specialized evaluator models underperform compared to well-prompted general models.

A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. Automating the analysis scales up the complex evaluation of the victim models' free-form text outputs, giving faster and more consistent judgments than human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Being a comparatively new technique, LLMs as judges lack a thorough investigation of their reliability and agreement with human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As the assessment objective, we curate datasets for eight different categories of judgment tasks and the corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments, when combined with a suitable prompt, in particular for GPT-4o, several open-source models with $\geqslant$ 32B parameters, and a few smaller models like Qwen2.5 14B.


WorldCache: Content-Aware Caching for Accelerated Video World Models

cs.CV cs.AI cs.CL Umair Nawaz, Ahmed Heakl, Ufaq Khan et al. · Mar 23, 2026

WorldCache addresses the prohibitive latency of Diffusion Transformers (DiTs) for video world models by replacing static feature caching with a content-aware dynamical approximation framework. The method introduces motion-adaptive thresholds, saliency-weighted drift estimation, and optimal feature blending to eliminate ghosting artifacts during fast motion. Achieving 2.3× speedup on Cosmos-Predict2.5 with 99.4% quality retention, it offers a training-free path toward interactive world simulation.

Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed at https://umair1221.github.io/World-Cache/.
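To make the Zero-Order Hold baseline mentioned in the abstract concrete, here is a minimal Python sketch of static feature caching: a block's output is reused unchanged while the input drift stays below a fixed threshold. This illustrates the baseline the abstract criticizes, not WorldCache itself. The class name `ZeroOrderHoldCache`, the `relative_drift` metric, and the fixed threshold are all assumptions made for illustration.

```python
from typing import Callable, List, Optional

Vector = List[float]


def relative_drift(current: Vector, reference: Vector) -> float:
    """Summed absolute change, normalised by the reference magnitude (assumed metric)."""
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(r) for r in reference) + 1e-8  # avoid division by zero
    return num / den


class ZeroOrderHoldCache:
    """Reuses a cached block output as a static snapshot while drift is small."""

    def __init__(self, block: Callable[[Vector], Vector], threshold: float) -> None:
        self.block = block
        self.threshold = threshold
        self.ref_input: Optional[Vector] = None
        self.cached_output: Optional[Vector] = None
        self.recomputes = 0

    def __call__(self, x: Vector) -> Vector:
        if (
            self.ref_input is not None
            and self.cached_output is not None
            and relative_drift(x, self.ref_input) < self.threshold
        ):
            # Zero-order hold: return the stale snapshot without any correction.
            return self.cached_output
        # Drift too large (or first call): recompute and refresh the cache.
        self.ref_input = list(x)
        self.cached_output = self.block(x)
        self.recomputes += 1
        return self.cached_output


# Hypothetical stand-in for an expensive transformer block.
def toy_block(x: Vector) -> Vector:
    return [2.0 * v for v in x]


cache = ZeroOrderHoldCache(toy_block, threshold=0.1)
first = cache([1.0, 1.0])    # computed
second = cache([1.01, 1.0])  # small drift: stale snapshot reused
third = cache([2.0, 1.0])    # large drift: recomputed
```

In this toy run the second call returns the first call's output even though the true output has changed slightly. That staleness is the kind of error the abstract associates with ghosting in dynamic scenes.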


Statistical Learning for Latent Embedding Alignment with Application to Brain Encoding and Decoding

stat.ME cs.LG Shuoxun Xu, Zhanhao Yan, Lexin Li · Mar 22, 2026

This paper addresses brain encoding and decoding by focusing on the alignment step between fMRI neural representations and visual stimulus embeddings. The authors propose two lightweight statistical learning methods—Inverse Semi-supervised Learning (ISL) and Meta Transfer Learning (MTL)—that operate with frozen encoders and decoders to improve sample efficiency under limited paired data and subject heterogeneity. The core innovation lies in leveraging abundant unpaired stimuli through inverse mapping with residual debiasing, and borrowing strength across subjects via sparse aggregation, all while maintaining rigorous theoretical guarantees.

Brain encoding and decoding aims to understand the relationship between external stimuli and brain activities, and is a fundamental problem in neuroscience. In this article, we study latent embedding alignment for brain encoding and decoding, with a focus on improving sample efficiency under limited fMRI-stimulus paired data and substantial subject heterogeneity. We propose a lightweight alignment framework equipped with two statistical learning components: inverse semi-supervised learning that leverages abundant unpaired stimulus embeddings through inverse mapping and residual debiasing, and meta transfer learning that borrows strength from pretrained models across subjects via sparse aggregation and residual correction. Both methods operate exclusively at the alignment stage while keeping encoders and decoders frozen, allowing for efficient computation, modular deployment, and rigorous theoretical analysis. We establish finite-sample generalization bounds and safety guarantees, and demonstrate competitive empirical performance on the large-scale fMRI-image reconstruction benchmark data.


λ-GELU: Learning Gating Hardness for Controlled ReLU-ization in Deep Networks

cs.LG cs.AI Cristian Pérez-Corral, Alberto Fernández-Hernández, Jose I. Mestre et al. · Mar 23, 2026

This work attacks the friction between smooth GELU training (ubiquitous in Transformers) and piecewise-linear deployment pipelines (quantization, formal verification). The authors parametrize GELU as $f(x;\lambda) = x\Phi(\lambda x)$ with learnable sharpness $\lambda \geq 1$, deriving a principled annealing target from an $\ell_1$ approximation bound to the Heaviside step. While the hardening protocol reduces the validation drop upon ReLU substitution in vision and tabular tasks, the 25% annealing switch is heuristic and actual downstream benefits in integer-only inference or SMT verification remain unevaluated.

Gaussian Error Linear Unit (GELU) is a widely used smooth alternative to Rectified Linear Unit (ReLU), yet many deployment, compression, and analysis toolchains are most naturally expressed for piecewise-linear (ReLU-type) networks. We study a hardness-parameterized formulation of GELU, $f(x;\lambda) = x\Phi(\lambda x)$, where $\Phi$ is the Gaussian CDF and $\lambda \in [1, \infty)$ controls gate sharpness, with the goal of turning smooth gated training into a controlled path toward ReLU-compatible models. Learning $\lambda$ is non-trivial: naive updates yield unstable dynamics and effective gradient attenuation, so we introduce a constrained reparameterization and an optimizer-aware update scheme. Empirically, across a diverse set of model-dataset pairs spanning MLPs, CNNs, and Transformers, we observe structured layerwise hardness profiles and assess their robustness under different initializations. We further study a deterministic ReLU-ization strategy in which the learned gates are progressively hardened toward a principled target, enabling a post-training substitution of λ-GELU by ReLU with reduced disruption. Overall, λ-GELU provides a minimal and interpretable knob to profile and control gating hardness, bridging smooth training with ReLU-centric downstream pipelines.
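The activation in the abstract, x times the Gaussian CDF evaluated at λx, can be sketched in a few lines of plain Python to show how the gate hardens toward ReLU as λ grows. This is only a scalar illustration of the formula with a fixed λ. The paper's constrained reparameterization, optimizer-aware update, and annealing schedule are not reproduced, and the helper names (`gaussian_cdf`, `lambda_gelu`, `max_gap_to_relu`) are assumptions.

```python
import math
from typing import Iterable


def gaussian_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), expressed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lambda_gelu(x: float, lam: float = 1.0) -> float:
    """f(x; lam) = x * Phi(lam * x); lam = 1 recovers the standard GELU."""
    if lam < 1.0:
        raise ValueError("lam is restricted to [1, inf) in the abstract")
    return x * gaussian_cdf(lam * x)


def relu(x: float) -> float:
    return max(0.0, x)


def max_gap_to_relu(lam: float, grid: Iterable[float]) -> float:
    """Largest absolute difference between lambda-GELU and ReLU on a grid."""
    return max(abs(lambda_gelu(x, lam) - relu(x)) for x in grid)


# Evaluate on a uniform grid over [-5, 5] (grid choice is arbitrary).
grid = [i / 100.0 for i in range(-500, 501)]
for lam in (1.0, 4.0, 16.0):
    print(f"lambda={lam:>4}: max |f - relu| on grid = {max_gap_to_relu(lam, grid):.4f}")
```

Substituting u = λx shows that the worst-case gap to ReLU scales as 1/λ, which is why a larger λ makes a post-training swap to ReLU less disruptive in this simplified scalar view.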


Extending Precipitation Nowcasting Horizons via Spectral Fusion of Radar Observations and Foundation Model Priors

cs.LG cs.AI Yuze Qin, Qingyong Li, Zhiqing Guo et al. · Mar 23, 2026

PW-FouCast addresses the degradation of radar-only precipitation nowcasting at long lead times by proposing a frequency-domain fusion framework that integrates Pangu-Weather foundation model priors with radar observations. The core insight is that meteorological forecasts and radar reflectivity share similar phase structure despite differing amplitudes, enabling spectral alignment through phase-aware modulation and memory-based correction. The approach achieves quantitative improvements on standard benchmarks and offers a novel alternative to spatial fusion methods.

Precipitation nowcasting is critical for disaster mitigation and aviation safety. However, radar-only models frequently suffer from a lack of large-scale atmospheric context, leading to performance degradation at longer lead times. While integrating meteorological variables predicted by weather foundation models offers a potential remedy, existing architectures fail to reconcile the profound representational heterogeneities between radar imagery and meteorological data. To bridge this gap, we propose PW-FouCast, a novel frequency-domain fusion framework that leverages Pangu-Weather forecasts as spectral priors within a Fourier-based backbone. Our architecture introduces three key innovations: (i) Pangu-Weather-guided Frequency Modulation to align spectral magnitudes and phases with meteorological priors; (ii) Frequency Memory to correct phase discrepancies and preserve temporal evolution; and (iii) Inverted Frequency Attention to reconstruct high-frequency details typically lost in spectral filtering. Extensive experiments on the SEVIR and MeteoNet benchmarks demonstrate that PW-FouCast achieves state-of-the-art performance, effectively extending the reliable forecast horizon while maintaining structural fidelity. Our code is available at https://github.com/Onemissed/PW-FouCast.
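The idea of treating spectral magnitude and phase separately can be illustrated with a small, pure-Python sketch: split two 1-D signals into amplitude and phase with a discrete Fourier transform, then rebuild a signal from one source's amplitudes and the other's phases. This is a generic amplitude/phase recombination, not PW-FouCast's modulation, memory, or attention modules. The function names and toy signals are assumptions.

```python
import cmath
import math
from typing import List


def dft(signal: List[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> List[float]:
    """Inverse DFT, returning the real part of the reconstruction."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def fuse_amplitude_with_phase(
    amplitude_source: List[float], phase_source: List[float]
) -> List[float]:
    """Combine |FFT| of one signal with the phase angles of another."""
    spec_a = dft(amplitude_source)
    spec_p = dft(phase_source)
    fused = [cmath.rect(abs(a), cmath.phase(p)) for a, p in zip(spec_a, spec_p)]
    return idft(fused)


# Toy stand-ins: a "radar" row and a "prior" with the same structure but 3x amplitude.
radar = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 2.0, 0.0]
prior = [3.0 * v for v in radar]
fused = fuse_amplitude_with_phase(radar, prior)
```

Because the toy prior is a positive rescaling of the radar row, the two share the same phase spectrum, so `fused` reproduces `radar` up to floating-point error. Real forecasts and radar fields would only share phase structure approximately.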


End-to-End Training for Unified Tokenization and Latent Denoising

cs.CV cs.AI cs.GR Shivam Duggal, Xingjian Bai, Zongze Wu et al. · Mar 23, 2026

Traditional latent diffusion models require staging—first train a VAE tokenizer, freeze it, then train a diffusion model on top. UNITE proposes a single-stage approach where a shared "Generative Encoder" serves as both tokenizer and denoiser via weight sharing, achieving FID 1.73 on ImageNet 256×256 without adversarial losses or pretrained encoders like DINOv2.

Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE, an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a "common latent language". Across image and molecule modalities, UNITE achieves near-state-of-the-art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256×256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single-stage joint training of tokenization and generation from scratch is feasible.
