The paper challenges the rapid shift toward Vision Transformer-based continual learning by demonstrating that lightweight, pruned Convolutional Networks can outperform existing foundation model approaches. The authors propose Pruned Adaptation Modules (PAM), which freeze early ResNet layers and introduce sparsely structured task-specific modules, yielding significant parameter reductions while improving accuracy. This work fills a critical methodological gap by establishing a strong, efficient baseline that questions whether recent advances reflect genuine progress or merely the absence of rigorous ConvNet comparisons.
This paper benchmarks classical statistical models (LR, SARIMAX), deep learning approaches (MLP, LSTM), and physics-guided variants for multi-horizon AQI forecasting in Dallas County, North Texas. The core innovation is incorporating EPA breakpoint-based AQI formulations as consistency constraints via weighted loss functions ($\mathcal{L}_{total} = \lambda_{data}\mathcal{L}_{data} + \lambda_{phys}\mathcal{L}_{phys}$). The work addresses a practical need for standardized regional model comparison to guide public health decision-making.
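As a rough illustration of the weighted loss above, the following sketch assumes the physics term penalizes disagreement between the predicted AQI and the AQI recomputed from a predicted pollutant concentration using the piecewise-linear breakpoint formula. The breakpoint table, the function names, and the default weights are illustrative placeholders, not values taken from the paper or from an official EPA table.

```python
# Illustrative breakpoint rows: (C_lo, C_hi, I_lo, I_hi).
# Placeholder numbers for demonstration only, NOT an official EPA table.
BREAKPOINTS = [
    (0.0, 10.0, 0.0, 50.0),
    (10.0, 30.0, 50.0, 100.0),
    (30.0, 60.0, 100.0, 150.0),
]


def aqi_from_concentration(c):
    """Linear interpolation inside the breakpoint interval containing c."""
    for c_lo, c_hi, i_lo, i_hi in BREAKPOINTS:
        if c_lo <= c <= c_hi:
            return (i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo
    raise ValueError(f"concentration {c} is outside the breakpoint table")


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(aqi_pred, aqi_true, conc_pred, lambda_data=1.0, lambda_phys=0.1):
    """L_total = lambda_data * L_data + lambda_phys * L_phys (assumed form of L_phys)."""
    data_loss = mse(aqi_pred, aqi_true)
    # Assumed consistency term: predicted AQI should match the AQI implied
    # by the predicted concentration under the breakpoint formula.
    aqi_implied = [aqi_from_concentration(c) for c in conc_pred]
    phys_loss = mse(aqi_pred, aqi_implied)
    return lambda_data * data_loss + lambda_phys * phys_loss
```

In this form, $\lambda_{phys}$ controls how strongly predictions are pulled toward breakpoint-consistent values; setting it to zero recovers a purely data-driven loss.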
This paper compares classical machine learning methods (Linear Regression, SVM, Logistic Regression) for predicting vehicle fuel consumption using the 1974 Motor Trend dataset (N=398). The author argues that these "interpretable" models outperform "black box" deep learning approaches for static physical datasets—a claim that relies on a false equivalence between 50-year-old tabular data and modern time-series telematics applications.
This exploratory study investigates using TabPFN—a transformer-based tabular foundation model—and its extension library for geotechnical site characterization. The core idea is to leverage in-context learning to perform soil classification and multivariate parameter imputation without model retraining or hyperparameter tuning, while obtaining interpretable insights through embeddings, posterior distributions, and SHAP analysis. This matters because geotechnical engineering requires uncertainty-aware, interpretable predictions for safety-critical decisions, yet faces severe data scarcity.
This paper addresses BLE-based indoor localization in care facilities by shifting from independent-window classification to sequential learning. The proposed DASEL framework combines frequency-based feature engineering, bidirectional GRUs with attention mechanisms, and a two-level hierarchical ensemble to model temporal movement trajectories. Achieving a 53.1% improvement over traditional baselines on the ABC 2026 challenge dataset, the work demonstrates that capturing temporal dependencies is critical for accurate indoor localization in complex real-world environments.
Zeroth-order (ZO) optimization enables memory-efficient training via forward-only gradient estimation, but its stochastic nature obscures training dynamics compared to well-characterized first-order (FO) methods. This paper introduces the Neural Zeroth-order Kernel (NZK) to describe model evolution in function space under ZO updates, proving that the expected NZK remains time-invariant for linear models and depends explicitly on the moments of random perturbation directions. The work extends to linearized neural networks and proposes using a single shared random vector to accelerate convergence, with experiments on synthetic and real-world datasets (MNIST, CIFAR-10, Tiny ImageNet) validating the theoretical predictions.
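The summary does not say which ZO estimator the paper analyzes. For orientation, a minimal sketch of the common two-point estimator with a Gaussian perturbation direction is shown below; the function names and the defaults for `mu` and `lr` are assumptions made for illustration.

```python
import random


def zo_gradient(f, x, mu=1e-3, u=None, rng=random):
    """Two-point zeroth-order gradient estimate using only forward evaluations.

    g = (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u, with u drawn from N(0, I)
    unless a direction is supplied explicitly.
    """
    if u is None:
        u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [scale * ui for ui in u]


def zo_sgd_step(f, x, lr=0.1, mu=1e-3, u=None):
    """One descent step along the estimated gradient."""
    g = zo_gradient(f, x, mu=mu, u=u)
    return [xi - lr * gi for xi, gi in zip(x, g)]
```

For a linear function $f(x) = a^\top x$ the estimate reduces to $(a^\top u)\,u$, so its expectation depends on the second moment of $u$, which is consistent with the summary's remark that the expected kernel depends on the moments of the perturbation directions.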
Cross-Layer Transcoders (CLTs) compress the attribution graphs used in mechanistic interpretability by sharing features across transformer layers, but their quadratic parameter scaling ($N_{\text{CLT}} \propto L^2$) makes training and analysis prohibitively expensive for most researchers. This paper introduces CLT-Forge, an open-source library that combines feature-sharded distributed training, compressed activation caching (int8/int4/int2 with zstd), automated interpretability pipelines, and integration with Circuit-Tracer to provide the first unified workflow for end-to-end CLT analysis at scale.
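To make the quadratic scaling concrete, the sketch below counts weights under the assumption that features read at layer $\ell$ have one encoder matrix and a separate decoder matrix for every layer from $\ell$ through $L$. This layout, the function name, and the omission of biases are assumptions for illustration, not details confirmed by the summary.

```python
def clt_param_count(n_layers, d_model, n_features):
    """Approximate weight count of a cross-layer transcoder (biases ignored).

    Assumes n_features features per layer, one encoder per layer, and one
    decoder for each (source layer, target layer) pair with target >= source.
    """
    encoder = n_layers * d_model * n_features
    # Number of (source, target) pairs with target >= source: L * (L + 1) / 2.
    decoder_pairs = n_layers * (n_layers + 1) // 2
    decoder = decoder_pairs * n_features * d_model
    return encoder + decoder
```

Under these assumptions the decoder term grows as $L(L+1)/2$, so doubling the depth roughly quadruples the decoder weights once $L$ is large.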
This paper investigates a fundamental paradox in hybrid sequence models: content-based routing requires exactly the pairwise computation it aims to avoid. Through 20+ controlled experiments, the authors demonstrate that one layer of softmax attention creates a latent $\sim$34-dimensional subspace via value aggregation, enabling 98.4% routing precision, while all alternatives (recurrence, linear attention, contrastive pretraining) cluster at 1–29%. These findings reframe attention as a representation constructor rather than merely a computation mechanism, providing a mechanistic explanation for why sub-quadratic models fail at associative recall.
This paper identifies a subtle but important distinction between two interpretations of the TD error in reinforcement learning: the explicit form (bootstrapped target minus prediction) commonly used in deep RL, and the implicit form (difference between temporally successive predictions) from the original Sutton (1988) formulation. While equivalent in tabular settings, the authors demonstrate that increasingly nonlinear architectures cause these to diverge significantly, with profound implications for average-reward and differential RL algorithms.
This paper investigates why humans persist with failing strategies despite negative feedback, proposing 'confidence freeze'—a metastable state where early success decouples metacognitive confidence from behavior. Using a multi-reversal bandit task (N=332 across 3 experiments), the authors show that brief exposure to 90% success rates (vs. 60%) induces lock-in behavior where participants endure ~6 consecutive losses while reporting plummeting confidence, suggesting a dynamic mechanism rather than stable individual traits.
SPECTRE-G2 tackles epistemic uncertainty in safety-critical systems by detecting 'unknown unknowns'—inputs that violate the structural assumptions of the training distribution. Unlike prior work that relies on single signals (confidence, density, or reconstruction error), this paper proposes a multi-expert architecture combining eight complementary signals from a dual-backbone network. The core idea is that diverse structural anomalies require diverse detection mechanisms. The method achieves strong empirical results across synthetic causal, tabular, image, and RL environments, though some baseline implementations appear problematic.
Learning from Label Proportions (LLP) trains instance-level classifiers using only bag-level class proportions, addressing privacy constraints and annotation costs. This paper introduces LLP-DC, which enforces dual constraints: bag-level mean predictions align with given proportions, while instance-level training uses hard pseudo-labels generated via minimum-cost maximum-flow to strictly satisfy proportion constraints. The method offers a novel formulation of LLP as a candidate label assignment problem, achieving state-of-the-art results across standard vision benchmarks.
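The bag-level half of the dual constraint can be sketched as a cross-entropy between the given proportions and the mean of the instance predictions in a bag. This is a generic proportion-matching loss written under that assumption; the exact loss used by LLP-DC and its min-cost-flow pseudo-labeling step are not reproduced here, and the function names are hypothetical.

```python
import math


def bag_mean_prediction(instance_probs):
    """Average the per-instance class probabilities over one bag."""
    n = len(instance_probs)
    n_classes = len(instance_probs[0])
    return [sum(p[k] for p in instance_probs) / n for k in range(n_classes)]


def bag_proportion_loss(instance_probs, proportions, eps=1e-12):
    """Cross-entropy between the given bag proportions and the mean prediction."""
    mean_pred = bag_mean_prediction(instance_probs)
    # Clip from below so log() stays finite when a class has zero mean mass.
    return -sum(q * math.log(max(m, eps)) for q, m in zip(proportions, mean_pred))
```

The loss is minimized when the mean prediction equals the given proportions, but many instance-level labelings achieve that, which is why the summary describes a second, instance-level constraint via hard pseudo-labels.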
Recent theoretical models of diffusion as coupled Ornstein-Uhlenbeck processes predict a hierarchy of interaction timescales creating a synchronization gap between global and local committing modes. This work investigates how this gap mechanistically emerges within pretrained Diffusion Transformers by introducing a controlled architectural realization of replica coupling via symmetric cross-attention gates with strength $g$. Through linearized analysis and empirical probing of DiT-XL/2 across all 28 layers, the authors demonstrate that the gap is an intrinsic, depth-localized property that collapses under strong coupling as $\mathcal{O}(\frac{1-g}{1+g})$, providing a bridge between continuous statistical physics and discrete transformer dynamics.
Multi-objective optimization of expensive biophysical neural simulations is hindered by high-dimensional parameter spaces and binary constraints that partition the search space without gradient signals. This paper introduces dmosopt, a framework that jointly learns objectives, constraints, and parameter sensitivities in a single differentiable surrogate model $f: \mathbb{R}^n \rightarrow \mathbb{R}^{q+k}$. By computing a unified gradient $\mathbf{g}_{\text{sopt}}$ that simultaneously steers toward improved objective values and greater constraint satisfaction, the method navigates feasibility manifolds that defeat standard approaches, achieving substantial speedups on problems ranging from single-cell models to million-neuron networks.
Evaluating LLM outputs at scale remains a bottleneck for deploying safe AI systems. This paper conducts a comprehensive empirical study of 37 conversational LLMs serving as automated judges across eight security and quality assessment tasks. The work identifies viable open-source alternatives to GPT-4o for judgment tasks while demonstrating that popular techniques like second-level judging and specialized evaluator models underperform compared to well-prompted general models.
WorldCache addresses the prohibitive latency of Diffusion Transformers (DiTs) for video world models by replacing static feature caching with a content-aware dynamical approximation framework. The method introduces motion-adaptive thresholds, saliency-weighted drift estimation, and optimal feature blending to eliminate ghosting artifacts during fast motion. Achieving 2.3× speedup on Cosmos-Predict2.5 with 99.4% quality retention, it offers a training-free path toward interactive world simulation.
This paper addresses brain encoding and decoding by focusing on the alignment step between fMRI neural representations and visual stimulus embeddings. The authors propose two lightweight statistical learning methods—Inverse Semi-supervised Learning (ISL) and Meta Transfer Learning (MTL)—that operate with frozen encoders and decoders to improve sample efficiency under limited paired data and subject heterogeneity. The core innovation lies in leveraging abundant unpaired stimuli through inverse mapping with residual debiasing, and borrowing strength across subjects via sparse aggregation, all while maintaining rigorous theoretical guarantees.
This work attacks the friction between smooth GELU training (ubiquitous in Transformers) and piecewise-linear deployment pipelines (quantization, formal verification). The authors parametrize GELU as $f(x;\lambda) = x\Phi(\lambda x)$ with learnable sharpness $\lambda \geq 1$, deriving a principled annealing target from an $\ell_1$ approximation bound to the Heaviside step. While the hardening protocol reduces the validation drop incurred by ReLU substitution in vision and tabular tasks, the 25% annealing switch is heuristic, and actual downstream benefits in integer-only inference or SMT verification remain unevaluated.
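The parametrization $f(x;\lambda) = x\Phi(\lambda x)$ can be evaluated directly with the standard normal CDF. The sketch below uses hypothetical function names and does not implement the paper's annealing schedule; it only shows that $\lambda = 1$ gives the usual GELU and that large $\lambda$ approaches ReLU.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF, Phi(z), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sharp_gelu(x, lam=1.0):
    """f(x; lam) = x * Phi(lam * x); lam = 1 is the standard GELU."""
    if lam < 1.0:
        raise ValueError("sharpness lam is assumed to satisfy lam >= 1")
    return x * std_normal_cdf(lam * x)


def relu(x):
    """Piecewise-linear target that sharp_gelu approaches as lam grows."""
    return max(0.0, x)
```

As $\lambda$ grows, $\Phi(\lambda x)$ tends to the Heaviside step for $x \neq 0$, so the gap between `sharp_gelu` and `relu` shrinks, which is the sense in which the activation is "hardened" before substitution.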
PW-FouCast addresses the degradation of radar-only precipitation nowcasting at long lead times by proposing a frequency-domain fusion framework that integrates Pangu-Weather foundation model priors with radar observations. The core insight is that meteorological forecasts and radar reflectivity share similar phase structure despite differing amplitudes, enabling spectral alignment through phase-aware modulation and memory-based correction. The approach achieves quantitative improvements on standard benchmarks and offers a novel alternative to spatial fusion methods.
Traditional latent diffusion models require staging—first train a VAE tokenizer, freeze it, then train a diffusion model on top. UNITE proposes a single-stage approach where a shared "Generative Encoder" serves as both tokenizer and denoiser via weight sharing, achieving FID 1.73 on ImageNet 256×256 without adversarial losses or pretrained encoders like DINOv2.