Feed - arxlens

0

Model selection in hybrid quantum neural networks with applications to quantum transformer architectures

quant-ph cs.LG Harsh Wadhwa, Rahul Bhowmick, Naipunnya Raj et al. · Mar 23, 2026

Quantum machine learning model selection currently lacks principled guidelines, forcing practitioners to train numerous expensive configurations. This paper introduces QBET (Quantum Bias-Expressivity Toolbox), an unsupervised pre-screening framework that evaluates hybrid quantum-classical transformers using LZ-complexity-based Simplicity Bias (AUC) and Expressivity metrics without gradient descent. The core idea is that architectures with higher AUC (stronger bias toward simple Boolean functions) correlate with better downstream task performance, offering a filter to identify promising quantum attention variants before committing to full training on NISQ devices.

Quantum machine learning models generally lack principled design guidelines, often requiring full resource-intensive training across numerous choices of encodings, quantum circuit designs and initialization strategies to find effective configuration. To address this challenge, we develope the Quantum Bias-Expressivity Toolbox ($\texttt{QBET}$), a framework for evaluating quantum, classical, and hybrid transformer architectures. In this toolbox, we introduce lean metrics for Simplicity Bias ($\texttt{SB}$) and Expressivity ($\texttt{EXP}$), for comparing across various models, and extend the analysis of $\texttt{SB}$ to generative and multiclass-classification tasks. We show that $\texttt{QBET}$ enables efficient pre-screening of promising model variants obviating the need to execute complete training pipelines. In evaluations on transformer-based classification and generative tasks we employ a total of $18$ qubits for embeddings ($6$ qubits each for query, key, and value). We identify scenarios in which quantum self-attention variants surpass their classical counterparts by ranking the respective models according to the $\texttt{SB}$ metric and comparing their relative performance.

Read abstractHide abstract

0

AdditiveLLM2: A Multi-modal Large Language Model for Additive Manufacturing

cs.LG Peter Pak, Amir Barati Farimani · Mar 23, 2026

AdditiveLLM2 is a domain-adapted multi-modal LLM for additive manufacturing built by fine-tuning Gemma 3 12B on ~50 million tokens from open-access AM journal articles. The work addresses the challenge of specializing general LLMs for technical domains without consuming context window space (as with RAG) or requiring massive datasets. Using domain adaptive pretraining (DAPT) for both text and vision plus visual instruction tuning (VIT), the authors demonstrate that even relatively small curated datasets can yield domain expertise exceeding 90% accuracy on AM knowledge tasks.

This work presents AdditiveLLM2 a multi-modal, domain adapted large language model built upon the instruction tuned variant of the Gemma 3 model using a relatively small dataset of around 50 million tokens. The dataset (AdditiveLLM2-OA) consists of open-access additive manufacturing journal articles with data extracted for the domain adaptive pretraining and visual instruction tuning processes. Various stages of the developed model are evaluated with the Additive-Manufacturing-Benchmark which consists of additive manufacturing domain specific tasks compiled published resources. AdditiveLLM2 exhibits proficiency in both language and vision based tasks, achieving accuracies upwards of 90% in general additive manufacturing knowledge. This domain adaptive pretraining and instruction tuning strategy outline an accessible specialization method for large language models to a domain such as additive manufacturing.

Read abstractHide abstract

0

A Generalised Exponentiated Gradient Approach to Enhance Fairness in Binary and Multi-class Classification Tasks

cs.LG stat.ML Maryam Boubekraoui, Giordano d'Aloisio, Antinisca Di Marco · Mar 22, 2026

While most bias mitigation research targets binary classification, multi-class fairness remains under-explored. This paper proposes Generalised Exponentiated Gradient (GEG), an in-processing method that extends the Exponentiated Gradient framework to multi-class settings and enables simultaneous optimization of multiple fairness constraints via positive-label moment conditions. Evaluated on ten datasets against six baselines, GEG achieves fairness improvements up to 92% with moderate accuracy trade-offs, filling a critical gap in fair machine learning toolboxes.

The widespread use of AI and ML models in sensitive areas raises significant concerns about fairness. While the research community has introduced various methods for bias mitigation in binary classification tasks, the issue remains under-explored in multi-class classification settings. To address this limitation, in this paper, we first formulate the problem of fair learning in multi-class classification as a multi-objective problem between effectiveness (i.e., prediction correctness) and multiple linear fairness constraints. Next, we propose a Generalised Exponentiated Gradient (GEG) algorithm to solve this task. GEG is an in-processing algorithm that enhances fairness in binary and multi-class classification settings under multiple fairness definitions. We conduct an extensive empirical evaluation of GEG against six baselines across seven multi-class and three binary datasets, using four widely adopted effectiveness metrics and three fairness definitions. GEG overcomes existing baselines, with fairness improvements up to 92% and a decrease in accuracy up to 14%.

Read abstractHide abstract

0

RAMPAGE: RAndomized Mid-Point for debiAsed Gradient Extrapolation

cs.LG math.OC Abolfazl Hashemi · Mar 23, 2026

RAMPAGE addresses discretization bias in Extragradient (EG) methods for variational inequalities by replacing the deterministic midpoint with randomized sampling. The core idea uses uniform sampling to construct an unbiased estimator of the continuous-time flow integral, while RAMPAGE+ leverages antithetic variates to eliminate first-order variance terms. This matters for training GANs and other non-conservative games where EG's $\mathcal{O}(\eta^2)$ bias causes divergence in highly nonlinear regimes.

A celebrated method for Variational Inequalities (VIs) is Extragradient (EG), which can be viewed as a standard discrete-time integration scheme. With this view in mind, in this paper we show that EG may suffer from discretization bias when applied to non-linear vector fields, conservative or otherwise. To resolve this discretization shortcoming, we introduce RAndomized Mid-Point for debiAsed Gradient Extrapolation (RAMPAGE) and its variance-reduced counterpart, RAMPAGE+ which leverages antithetic sampling. In contrast with EG, both methods are unbiased. Furthermore, leveraging negative correlation, RAMPAGE+ acts as an unbiased, geometric path-integrator that completely removes internal first-order terms from the variance, provably improving upon RAMPAGE. We further demonstrate that both methods enjoy provable $\mathcal{O}(1/k)$ convergence guarantees for a range of problems including root finding under co-coercive, co-hypomonotone, and generalized Lipschitzness regimes. Furthermore, we introduce symmetrically scaled variants to extend our results to constrained VIs. Finally, we provide convergence guarantees of both methods for stochastic and deterministic smooth convex-concave games. Somewhat interestingly, despite being a randomized method, RAMPAGE+ attains purely deterministic bounds for a number of the studied settings.

Read abstractHide abstract

0

Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models

cs.CL cs.LG Jinghan Cao, Yu Ma, Xinjin Li et al. · Mar 22, 2026

This paper addresses the high computational cost of deploying Large Language Models (LLMs) in resource-constrained environments by introducing the Performance-Efficiency Ratio (PER), a novel metric that integrates accuracy, throughput, memory, and latency via geometric mean normalization. The authors evaluate 16 open-source language models ranging from 0.5B to 72B parameters across five NLP tasks (IMDB, HellaSwag, ARC-Easy, SQuAD 2.0, and GSM8K), concluding that small models (0.5–3B parameters) consistently achieve superior PER scores compared to their larger counterparts.

Large Language Models achieve remarkable performance but incur substantial computational costs unsuitable for resource-constrained deployments. This paper presents the first comprehensive task-specific efficiency analysis comparing 16 language models across five diverse NLP tasks. We introduce the Performance-Efficiency Ratio (PER), a novel metric integrating accuracy, throughput, memory, and latency through geometric mean normalization. Our systematic evaluation reveals that small models (0.5--3B parameters) achieve superior PER scores across all given tasks. These findings establish quantitative foundations for deploying small models in production environments prioritizing inference efficiency over marginal accuracy gains.

Read abstractHide abstract

0

CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

cs.LG q-bio.QM Dongxia Wu, Shiye Su, Yuhui Zhang et al. · Mar 23, 2026

Virtual cell modeling aims to simulate cellular responses to drug perturbations in silico, but existing flow-matching models optimize only pixel-level reconstruction and can produce biologically implausible outputs like nuclei outside cytoplasm. CellFluxRL addresses this by post-training the state-of-the-art CellFlux model with reinforcement learning, using seven manually designed reward functions spanning biological function (mode of action), structural validity (nuclear containment), and morphological statistics (size/count). The approach reveals a systematic framework for enforcing physical constraints through differentiable optimization, achieving consistent improvements across all biological metrics while maintaining image quality.

Building virtual cells with generative models to simulate cellular behavior in silico is emerging as a promising paradigm for accelerating drug discovery. However, prior image-based generative approaches can produce implausible cell images that violate basic physical and biological constraints. To address this, we propose to post-train virtual cell models with reinforcement learning (RL), leveraging biologically meaningful evaluators as reward functions. We design seven rewards spanning three categories-biological function, structural validity, and morphological correctness-and optimize the state-of-the-art CellFlux model to yield CellFluxRL. CellFluxRL consistently improves over CellFlux across all rewards, with further performance boosts from test-time scaling. Overall, our results present a virtual cell modeling framework that enforces physically-based constraints through RL, advancing beyond "visually realistic" generations towards "biologically meaningful" ones.

Read abstractHide abstract

0

The Dual Mechanisms of Spatial Reasoning in Vision-Language Models

cs.CV cs.LG Kelly Cui, Nikhil Prakash, Ayush Raina et al. · Mar 23, 2026

This paper investigates how vision-language models (VLMs) perform spatial reasoning—the binding of objects to spatial relations. It reveals that VLMs rely on two concurrent mechanisms: a dominant one where the vision encoder encodes object layout globally across visual tokens (extending into background regions), and a secondary one where the language model backbone forms ordering representations over object tokens. The finding that enhancing these vision-derived spatial representations improves performance without fine-tuning challenges the prevailing focus on LM backbones and highlights the critical role of vision encoders in multimodal reasoning.

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.

Read abstractHide abstract

0

CayleyPy-4: AI-Holography. Towards analogs of holographic string dualities for AI tasks

hep-th cs.AI cs.LG A. Chervov, F. Levkovich-Maslyuk, A. Smolensky et al. · Mar 23, 2026

This paper proposes a bold interdisciplinary bridge between holographic string dualities and artificial intelligence, hypothesizing that AI tasks such as language modeling can be viewed as particle trajectory prediction on graphs admitting a holographically dual "string" description. Drawing on the AdS/CFT correspondence, the authors conjecture that word metrics on $S_n$ Cayley graphs correspond to areas under lattice paths in dual planar polygons, verified computationally via their CayleyPy library.

This is the fourth paper in the CayleyPy project, which applies AI methods to the exploration of large graphs. In this work, we suggest the existence of a new discrete version of holographic string dualities for this setup, and discuss their relevance to AI systems and mathematics. Many modern AI tasks -- such as those addressed by GPT-style language models or RL systems -- can be viewed as direct analogues of predicting particle trajectories on graphs. We investigate this problem for a large family of Cayley graphs, for which we show that surprisingly it admits a dual description in terms of discrete strings. We hypothesize that such dualities may extend to a range of AI systems where they can lead to more efficient computational approaches. In particular, string holographic images of states are proposed as natural candidates for data embeddings, motivated by the "complexity = volume" principle in AdS/CFT. For Cayley graphs of the symmetric group S_n, our results indicate that the corresponding dual objects are flat, planar polygons. The diameter of the graph is equal to the number of integer points inside the polygon scaled by n. Vertices of the graph can be mapped holographically to paths inside the polygon, and the usual graph distances correspond to the area under the paths, thus directly realising the "complexity = volume" paradigm. We also find evidence for continuous CFTs and dual strings in the large n limit. We confirm this picture and other aspects of the duality in a large initial set of examples. We also present new datasets (obtained by a combination of ML and conventional tools) which should be instrumental in establishing the duality for more general cases.

Read abstractHide abstract

0

Multinoulli Extension: A Lossless Continuous Relaxation for Partition-Constrained Subset Selection

cs.LG math.OC Qixin Zhang, Wei Huang, Yan Sun et al. · Mar 23, 2026

The paper tackles partition-constrained subset selection for 'close-to-submodular' objectives—specifically α-weakly DR-submodular and (γ,β)-weakly submodular functions—where existing distorted local-search methods suffer from prohibitive query complexity (˜O(1/ϵ^6)) and require prior knowledge of structural parameters. The authors propose the Multinoulli Extension (ME), a continuous relaxation that learns multinoulli priors for each partition block, enabling lossless rounding without submodularity assumptions. They develop offline (Multinoulli-SCG) and online (Multinoulli-OSCG/OSGA) algorithms achieving tight approximation guarantees with O(1/ϵ^2) query complexity and O(√T) regret, respectively.

Identifying the most representative subset for a close-to-submodular objective while satisfying the predefined partition constraint is a fundamental task with numerous applications in machine learning. However, the existing distorted local-search methods are often hindered by their prohibitive query complexities and the rigid requirement for prior knowledge of difficult-to-obtain structural parameters. To overcome these limitations, we introduce a novel algorithm titled Multinoulli-SCG, which not only is parameter-free, but also can achieve the same approximation guarantees as the distorted local-search methods with significantly fewer function evaluations. More specifically, when the objective function is monotone $\alpha$-weakly DR-submodular or $(\gamma,\beta)$-weakly submodular, our Multinoulli-SCG algorithm can attain a value of $(1-e^{-\alpha})\text{OPT}-\epsilon$ or $(\frac{\gamma^{2}(1-e^{-(\beta(1-\gamma)+\gamma^2)})}{\beta(1-\gamma)+\gamma^2})\text{OPT}-\epsilon$ with only $O(1/\epsilon^{2})$ function evaluations, where OPT denotes the optimal value. The cornerstone of our Multinoulli-SCG algorithm is an innovative continuous-relaxation framework named Multinoulli Extension(ME), which can effectively convert the discrete subset selection problem subject to partition constraints into a solvable continuous maximization focused on learning the optimal multinoulli priors across the concerned partition. In sharp contrast with the well-established multi-linear extension for submodular subset selection, a notable advantage of our proposed ME is its intrinsic capacity to provide a lossless rounding scheme for any set function. Furthermore, based on our proposed ME, we also present two novel online algorithms, namely, Multinoulli-OSCG and Multinoulli-OSGA, for the unexplored online subset selection problems over partition constraints.

Read abstractHide abstract

0

FluidWorld: Reaction-Diffusion Dynamics as a Predictive Substrate for World Models

cs.LG Fabien Polly · Mar 22, 2026

FluidWorld tackles the quadratic cost and lack of spatial inductive bias in Transformer-based world models by replacing self-attention with reaction-diffusion PDEs. The core innovation is using PDE integration itself—governed by a discretized Laplacian and learned reaction terms—as the predictive engine, rather than as a physical simulator. This proof-of-concept demonstrates that at $\sim$800K parameters, such physics-inspired dynamics match or exceed attention and convolutional recurrence on spatial coherence metrics while offering $O(N)$ complexity, though at slower training speeds.

World models learn to predict future states of an environment, enabling planning and mental simulation. Current approaches default to Transformer-based predictors operating in learned latent spaces. This comes at a cost: O(N^2) computation and no explicit spatial inductive bias. This paper asks a foundational question: is self-attention necessary for predictive world modeling, or can alternative computational substrates achieve comparable or superior results? I introduce FluidWorld, a proof-of-concept world model whose predictive dynamics are governed by partial differential equations (PDEs) of reaction-diffusion type. Instead of using a separate neural network predictor, the PDE integration itself produces the future state prediction. In a strictly parameter-matched three-way ablation on unconditional UCF-101 video prediction (64x64, ~800K parameters, identical encoder, decoder, losses, and data), FluidWorld is compared against both a Transformer baseline (self-attention) and a ConvLSTM baseline (convolutional recurrence). While all three models converge to comparable single-step prediction loss, FluidWorld achieves 2x lower reconstruction error, produces representations with 10-15% higher spatial structure preservation and 18-25% more effective dimensionality, and critically maintains coherent multi-step rollouts where both baselines degrade rapidly. All experiments were conducted on a single consumer-grade PC (Intel Core i5, NVIDIA RTX 4070 Ti), without any large-scale compute. These results establish that PDE-based dynamics, which natively provide O(N) spatial complexity, adaptive computation, and global spatial coherence through diffusion, are a viable and parameter-efficient alternative to both attention and convolutional recurrence for world modeling.

Read abstractHide abstract

0

HamVision: Hamiltonian Dynamics as Inductive Bias for Medical Image Analysis

cs.CV cs.LG Mohamed A Mabrok · Mar 22, 2026

HamVision proposes using damped harmonic oscillator dynamics as a structured inductive bias for medical image analysis. The core idea is that phase-space decomposition yields three representations—position $q$ (features), momentum $p$ (gradients), and energy $H = rac{1}{2}|z|^2$ (saliency)—that serve both segmentation and classification tasks without modifying the shared bottleneck. This physics-constrained approach aims to replace generic learned transformations with interpretable, dynamics-based feature extraction across diverse medical imaging modalities.

We present HamVision, a framework for medical image analysis that uses the damped harmonic oscillator, a fundamental building block of signal processing, as a structured inductive bias for both segmentation and classification tasks. The oscillator's phase-space decomposition yields three functionally distinct representations: position~$q$ (feature content), momentum~$p$ (spatial gradients that encode boundary and texture information), and energy $H = \tfrac{1}{2}|z|^2$ (a parameter-free saliency map). These representations emerge from the dynamics, not from supervision, and can be exploited by different task-specific heads without any modification to the oscillator itself. For segmentation, energy gates the skip connections while momentum injects boundary information at every decoder level (HamSeg). For classification, the three representations are globally pooled and concatenated into a phase-space feature vector (HamCls). We evaluate HamVision across ten medical imaging benchmarks spanning five imaging modalities. On segmentation, HamSeg achieves state-of-the-art Dice scores on ISIC\,2018 (89.38\%), ISIC\,2017 (88.40\%), TN3K (87.05\%), and ACDC (92.40\%), outperforming most baselines with only 8.57M parameters. On classification, HamCls achieves state-of-the-art accuracy on BloodMNIST (98.85\%) and PathMNIST (96.65\%), and competitive results on the remaining MedMNIST datasets against MedMamba and MedViT. Diagnostic analysis confirms that the oscillator's momentum consistently encodes an interior$\,{>}\,$boundary$\,{>}\,$exterior gradient for segmentation and that the energy map correlates with discriminative regions for classification, properties that emerge entirely from the Hamiltonian dynamics. Code is available at https://github.com/Minds-R-Lab/hamvision.

Read abstractHide abstract

0

Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

cs.LG stat.ML Alexandra Zelenin, Alexandra Zhuravlyova · Mar 23, 2026

This paper tackles the memory explosion problem in high-rank DoRA fine-tuning. At $d_{in}=8192$ and rank $r=384$, computing the row-wise norm $\|\mathbf{W}+s\mathbf{B}\mathbf{A}\|_{\text{row}}$ via standard materialization consumes ~512 MB per module—prohibitive for large models with hundreds of adapted layers. The authors propose a factored norm decomposition that reduces the computation to $\mathcal{O}(d_{out}r+r^2)$ intermediates plus fused Triton kernels that collapse the composition into a single pass. On 8–32B vision-language models, this yields 1.5–2.0× speedups and up to 77 GB VRAM savings without numerical drift.

Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.

Read abstractHide abstract

0

Uncertainty Quantification for Distribution-to-Distribution Flow Matching in Scientific Imaging

cs.LG Dongxia Wu, Yuhui Zhang, Serena Yeung-Levy et al. · Mar 23, 2026

This paper addresses uncertainty quantification (UQ) for distribution-to-distribution flow matching, a setting where models map between well-defined source and target distributions (e.g., unperturbed to drug-treated cell images) rather than noise-to-data. The authors propose Bayesian Stochastic Flow Matching (BSFM), which combines Stochastic Flow Matching (SFM) for capturing aleatoric uncertainty via learnable diffusion terms, with MCD-Antithetic—a scalable Bayesian method using Monte Carlo Dropout and antithetic sampling—to decompose total uncertainty into aleatoric and epistemic components for reliable out-of-distribution (OOD) detection in scientific imaging.

Distribution-to-distribution generative models support scientific imaging tasks ranging from modeling cellular perturbation responses to translating medical images across conditions. Trustworthy generation requires both reliability (generalization across labs, devices, and experimental conditions) and accountability (detecting out-of-distribution cases where predictions may be unreliable). Uncertainty quantification (UQ) based approaches serve as promising candidates for these tasks, yet UQ for distribution-to-distribution generative models remains underexplored. We present a unified UQ framework, Bayesian Stochastic Flow Matching (BSFM), that disentangles aleatoric and epistemic uncertainty. The Stochastic Flow Matching (SFM) component augments deterministic flows with a diffusion term to improve model generalization to unseen scenarios. For UQ, we develop a scalable Bayesian approach -- MCD-Antithetic -- that combines Monte Carlo Dropout with sample-efficient antithetic sampling to produce effective anomaly scores for out-of-distribution detection. Experiments on cellular imaging (BBBC021, JUMP) and brain fMRI (Theory of Mind) across diverse scenarios show that SFM improves reliability while MCD-Antithetic enhances accountability.

Read abstractHide abstract

0

Data-Free Layer-Adaptive Merging via Fisher Information for Long-to-Short Reasoning LLMs

cs.LG Tian Xia · Mar 23, 2026

This paper tackles the Long-to-Short (L2S) model merging problem: combining a base LLM with a long-chain-of-thought reasoning model to preserve accuracy while drastically reducing output length. The core contribution is a theoretical framework proving that merging error is bounded by the per-layer Hessian norm (Proposition 1), which motivates using the diagonal Fisher Information Matrix (FIM) as a data-free proxy for assigning layer-adaptive merging coefficients. The resulting FIM-TIES method achieves state-of-the-art results on 5 of 6 benchmarks without requiring any domain-specific calibration data.

Model merging has emerged as a practical approach to combine capabilities of specialized large language models (LLMs) without additional training. In the Long-to-Short (L2S) scenario, merging a base model with a long-chain-of-thought reasoning model aims to preserve reasoning accuracy while reducing output length. Existing methods rely on Task Arithmetic and its variants, which implicitly assume that model outputs vary linearly with the merging coefficient -- an assumption we show is systematically violated in L2S settings. We provide the first theoretical justification for layer-adaptive merging: we prove that merging error is bounded by a term proportional to the per-layer Hessian norm (Proposition~1), and establish that the Fisher Information Matrix (FIM) is a principled, computable proxy for this bound via the Fisher-Hessian equivalence at local optima. Building on this theory, we propose \textbf{FIM-Merging}, which computes diagonal FIM using only random token inputs (no domain-specific calibration data required) and uses it to assign per-layer merging coefficients. On the 7B L2S benchmark, FIM-TIES achieves state-of-the-art performance on five out of six evaluation benchmarks, including a \textbf{+6.2} point gain on MATH500 over ACM-TIES (90.2 vs.\ 84.0), while requiring no calibration data. On the 1.5B benchmark, FIM-TIES achieves an average accuracy of \textbf{47.3}, surpassing the previous best ACM-TIES (43.3) by \textbf{+3.9} points, while reducing average response length by \textbf{91.9\%} relative to the long-CoT model. Our framework also provides a unified theoretical explanation for why existing layer-adaptive methods such as ACM empirically outperform uniform merging.

Read abstractHide abstract

0

Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

cs.LG Zakaria Mhammedi, James Cohan · Mar 23, 2026

Hard-exploration problems in RL—such as Montezuma’s Revenge and sparse-reward robotic control—require finding rare trajectories where standard RL fails. This paper argues that using policy optimization to maximize intrinsic rewards is unnecessarily inefficient for mere state coverage. Instead, it proposes Go-With-Uncertainty (GowU), a tree-search method that decouples exploration from exploitation: it uses epistemic uncertainty to drive a Go-With-The-Winner particle population search, then distills discovered trajectories via supervised backward learning. The approach achieves state-of-the-art scores on hard Atari benchmarks with an order of magnitude fewer environment interactions than intrinsic-motivation baselines, and solves high-dimensional continuous-control tasks (Adroit, AntMaze) from pixels without demonstrations.

The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before.

Read abstractHide abstract

0

A plug-and-play approach with fast uncertainty quantification for weak lensing mass mapping

astro-ph.CO astro-ph.IM cs.LG Hubert Leterme, Andreas Tersenov, Jalal Fadili et al. · Mar 23, 2026

Paper introduces PnPMass, a plug-and-play framework for weak lensing mass mapping that reconciles reconstruction accuracy with practical deployment constraints of upcoming Stage-IV surveys. The key innovation is a carefully chosen data-fidelity operator that decouples denoiser training from observation-specific noise statistics, enabling a single trained model to handle varying survey conditions without retraining. Coupled with moment-network-based uncertainty quantification and conformal calibration, the method offers fast inference with coverage guarantees, addressing limitations of both end-to-end deep learning and costly MCMC sampling approaches.

Upcoming stage-IV surveys such as Euclid and Rubin will deliver vast amounts of high-precision data, opening new opportunities to constrain cosmological models with unprecedented accuracy. A key step in this process is the reconstruction of the dark matter distribution from noisy weak lensing shear measurements. Current deep learning-based mass mapping methods achieve high reconstruction accuracy, but either require retraining a model for each new observed sky region (limiting practicality) or rely on slow MCMC sampling. Efficient exploitation of future survey data therefore calls for a new method that is accurate, flexible, and fast at inference. In addition, uncertainty quantification with coverage guarantees is essential for reliable cosmological parameter estimation. We introduce PnPMass, a plug-and-play approach for weak lensing mass mapping. The algorithm produces point estimates by alternating between a gradient descent step with a carefully chosen data fidelity term, and a denoising step implemented with a single deep learning model trained on simulated data corrupted by Gaussian white noise. We also propose a fast, sampling-free uncertainty quantification scheme based on moment networks, with calibrated error bars obtained through conformal prediction to ensure coverage guarantees. Finally, we benchmark PnPMass against both model-driven and data-driven mass mapping techniques. PnPMass achieves performance close to that of state-of-the-art deep-learning methods while offering fast inference (converging in just a few iterations) and requiring only a single training phase, independently of the noise covariance of the observations. It therefore combines flexibility, efficiency, and reconstruction accuracy, while delivering tighter error bars than existing approaches, making it well suited for upcoming weak lensing surveys.

Read abstractHide abstract

0

Towards Multimodal Time Series Anomaly Detection with Semantic Alignment and Condensed Interaction

cs.LG Shiyan Hu, Jianxin Jin, Yang Shu et al. · Mar 23, 2026

MindTS tackles multimodal time series anomaly detection by fusing numerical time series with text from two sources: endogenous text (LLM-generated descriptions of patch statistics) and exogenous text (external reports). The core idea is to align these heterogeneous modalities via contrastive learning and filter textual redundancy using an Information Bottleneck-inspired content condenser before cross-modal reconstruction. This matters because real-world anomalies often manifest in contextual text (e.g., policy changes affecting stock prices) that pure numerical models miss.

Time series anomaly detection plays a critical role in many dynamic systems. Despite its importance, previous approaches have primarily relied on unimodal numerical data, overlooking the importance of complementary information from other modalities. In this paper, we propose a novel multimodal time series anomaly detection model (MindTS) that focuses on addressing two key challenges: (1) how to achieve semantically consistent alignment across heterogeneous multimodal data, and (2) how to filter out redundant modality information to enhance cross-modal interaction effectively. To address the first challenge, we propose Fine-grained Time-text Semantic Alignment. It integrates exogenous and endogenous text information through cross-view text fusion and a multimodal alignment mechanism, achieving semantically consistent alignment between time and text modalities. For the second challenge, we introduce Content Condenser Reconstruction, which filters redundant information within the aligned text modality and performs cross-modal reconstruction to enable interaction. Extensive experiments on six real-world multimodal datasets demonstrate that the proposed MindTS achieves competitive or superior results compared to existing methods. The code is available at: https://github.com/decisionintelligence/MindTS.

Read abstractHide abstract

0

dynActivation: A Trainable Activation Family for Adaptive Nonlinearity

cs.LG cs.CV Alois Bachmann · Mar 23, 2026

dynActivation addresses the rigidity of fixed activation functions by introducing per-layer trainable scalars that interpolate between a base nonlinearity and a linear path. The method adds only two parameters per layer ($\alpha_i$ and $\beta_i$) via $f_i(x) = \text{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, allowing adaptive nonlinearity allocation across depth. Results show strong vision benchmarks (+14% on CIFAR-10), robustness to extreme depth scaling (95%+ accuracy on 75-layer MNIST), and faster convergence (24% AUC reduction), though LLM perplexity gains vanish in long-run training.

This paper proposes $\mathrm{dynActivation}$, a per-layer trainable activation defined as $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, where $\alpha_i$ and $\beta_i$ are lightweight learned scalars that interpolate between the base nonlinearity and a linear path and $\mathrm{BaseAct}(x)$ resembles any ReLU-like function. The static and dynamic ReLU-like variants are then compared across multiple vision tasks, language modeling tasks, and ablation studies. The results suggest that dynActivation variants tend to linearize deep layers while maintaining high performance, which can improve training efficiency by up to $+54\%$ over ReLU. On CIFAR-10, dynActivation(Mish) improves over static Mish by up to $+14.02\%$ on AttentionCNN with an average improvment by $+6.00\%$, with a $24\%$ convergence-AUC reduction relative to Mish (2120 vs. 2785). In a 1-to-75-layer MNIST depth-scaling study, dynActivation never drops below $95\%$ test accuracy ($95.3$--$99.3\%$), while ReLU collapses below $80\%$ at 25 layers. Under FGSM at $\varepsilon{=}0.08$, dynActivation(Mish) incurs a $55.39\%$ accuracy drop versus $62.79\%$ for ReLU ($7.40\%$ advantage). Transferred to language modeling, a new proposed dynActGLU-variant achieves a $10.3\%$ relative perplexity reduction over SwiGLU at 5620 steps (4.047 vs. 4.514), though the gap vanishes at 34300 steps.

Read abstractHide abstract

0

Learning Can Converge Stably to the Wrong Belief under Latent Reliability

cs.LG Zhipeng Zhang, Zhenjie Yao, Kai Li et al. · Mar 23, 2026

This paper investigates a fundamental failure mode in learning systems: when feedback reliability is unobservable (latent), standard algorithms can converge stably to systematically incorrect solutions while exhibiting normal optimization behavior (decreasing loss, vanishing gradients). The authors formalize this as a scale-dependent identifiability problem—single-step feedback is insufficient to distinguish reliable from biased experience, yet trajectory-level statistics carry separable signals. They propose the Monitor–Trust–Regulator (MTR) framework, which maintains a slow-timescale trust variable inferred from learning dynamics to modulate updates, enabling recovery from persistent bias.

Learning systems are typically optimized by minimizing loss or maximizing reward, assuming that improvements in these signals reflect progress toward the true objective. However, when feedback reliability is unobservable, this assumption can fail, and learning algorithms may converge stably to incorrect solutions. This failure arises because single-step feedback does not reveal whether an experience is informative or persistently biased. When information is aggregated over learning trajectories, however, systematic differences between reliable and unreliable regimes can emerge. We propose a Monitor-Trust-Regulator (MTR) framework that infers reliability from learning dynamics and modulates updates through a slow-timescale trust variable. Across reinforcement learning and supervised learning settings, standard algorithms exhibit stable optimization behavior while learning incorrect solutions under latent unreliability, whereas trust-modulated systems reduce bias accumulation and improve recovery. These results suggest that learning dynamics are not only optimization traces but also a source of information about feedback reliability.

Read abstractHide abstract

0

In-network Attack Detection with Federated Deep Learning in IoT Networks: Real Implementation and Analysis

cs.LG cs.CR Devashish Chaudhary, Sutharshan Rajasegarar, Shiva Raj Pokhrel et al. · Mar 23, 2026

This paper addresses the challenge of detecting network attacks in IoT environments while preserving data privacy and minimizing communication overhead. The authors propose a federated learning framework using lightweight autoencoders deployed directly on Raspberry Pi edge devices to detect anomalies in real-time through reconstruction error $\mathcal{E}(t)=\|x_{t}-\hat{x}_{t}\|^{2}$. A real-world testbed with ZigBee-enabled sensor nodes was constructed to evaluate the approach against redirection attacks, demonstrating that federated training can match centralized performance while significantly reducing data transmission from 4.5 MB to 378 KB.

The rapid expansion of the Internet of Things (IoT) and its integration with backbone networks have heightened the risk of security breaches. Traditional centralized approaches to anomaly detection, which require transferring large volumes of data to central servers, suffer from privacy, scalability, and latency limitations. This paper proposes a lightweight autoencoder-based anomaly detection framework designed for deployment on resource-constrained edge devices, enabling real-time detection while minimizing data transfer and preserving privacy. Federated learning is employed to train models collaboratively across distributed devices, where local training occurs on edge nodes and only model weights are aggregated at a central server. A real-world IoT testbed using Raspberry Pi sensor nodes was developed to collect normal and attack traffic data. The proposed federated anomaly detection system, implemented and evaluated on the testbed, demonstrates its effectiveness in accurately identifying network attacks. The communication overhead was reduced significantly while achieving comparable performance to the centralized method.

Read abstractHide abstract

Nothing here yet