Feed - arxlens

Generalized Discrete Diffusion from Snapshots

stat.ML cs.AI cs.CL Oussama Zekri, Théo Uscidda, Nicolas Boullé et al. · Mar 22, 2026

Discrete diffusion models have been limited to simplistic noising schemes like uniform corruption or masking, restricting their ability to leverage semantic structure in large vocabularies. This paper introduces GDDS (Generalized Discrete Diffusion from Snapshots), a framework supporting arbitrary continuous-time Markov chain noising processes via exact uniformization-based sampling and a tractable snapshot-level ELBO. The work achieves state-of-the-art results on large-scale language modeling tasks, claiming to surpass autoregressive baselines for the first time at this scale.

We introduce Generalized Discrete Diffusion from Snapshots (GDDS), a unified framework for discrete diffusion modeling that supports arbitrary noising processes over large discrete state spaces. Our formulation encompasses all existing discrete diffusion approaches, while allowing significantly greater flexibility in the choice of corruption dynamics. The forward noising process relies on uniformization and enables fast arbitrary corruption. For the reverse process, we derive a simple evidence lower bound (ELBO) based on snapshot latents, instead of the entire noising path, that allows efficient training of standard generative modeling architectures with clear probabilistic interpretation. Our experiments on large-vocabulary discrete generation tasks suggest that the proposed framework outperforms existing discrete diffusion methods in terms of training efficiency and generation quality, and beats autoregressive models for the first time at this scale. We provide the code along with a blog post on the project page: https://oussamazekri.fr/gdds.
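
The abstract names uniformization as the forward sampler without giving details. The following is a minimal sketch of textbook uniformization for a continuous-time Markov chain, not the GDDS implementation: pick a rate `lam` at least as large as every exit rate, form the jump kernel P = I + Q/lam, draw a Poisson(lam * t) number of jumps, and apply P that many times. The function names, the toy rate matrix in the usage note, and the simple Poisson sampler (adequate only for small `lam * t`) are assumptions made for illustration.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate only for small rates."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def uniformized_kernel(Q):
    """Return (lam, P) with P = I + Q / lam for a rate matrix Q (rows sum to 0)."""
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n))
    if lam <= 0:
        raise ValueError("rate matrix has no transitions")
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    return lam, P


def sample_snapshot(x0, Q, t, rng):
    """Draw X_t given X_0 = x0 without simulating individual holding times."""
    lam, P = uniformized_kernel(Q)
    n_jumps = sample_poisson(lam * t, rng)
    x = x0
    for _ in range(n_jumps):
        # Each step is one transition of the discrete-time chain P.
        x = rng.choices(range(len(Q)), weights=P[x])[0]
    return x
```

For example, with the hypothetical three-state matrix `Q = [[-1, 1, 0], [0.5, -1, 0.5], [0, 2, -2]]`, `sample_snapshot(0, Q, 1.5, random.Random(0))` returns a single noised state at time 1.5, which is the kind of snapshot latent the abstract refers to.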

Quotient Geometry, Effective Curvature, and Implicit Bias in Simple Shallow Neural Networks

cs.LG cs.AI Hang-Cheng Dong, Pengcheng Cheng · Mar 23, 2026

This paper develops a differential-geometric framework for shallow neural networks that treats predictor classes rather than raw parameters as the fundamental objects. By quotienting out permutation and scaling symmetries on a regular set $\Theta_{\mathrm{reg}}$, the authors define a function-induced metric $g_\theta$ and an effective Hessian that removes spurious curvature degeneracies along symmetry orbits. The work connects implicit bias to quotient-level geometry, with concrete analysis for quadratic-activation models where parameters map explicitly to symmetric matrices $Q(\theta)=\sum_{i=1}^m a_i w_i w_i^\top$.

Overparameterized shallow neural networks admit substantial parameter redundancy: distinct parameter vectors may represent the same predictor due to hidden-unit permutations, rescalings, and related symmetries. As a result, geometric quantities computed directly in the ambient Euclidean parameter space can reflect artifacts of representation rather than intrinsic properties of the predictor. In this paper, we develop a differential-geometric framework for analyzing simple shallow networks through the quotient space obtained by modding out parameter symmetries on a regular set. We first characterize the symmetry and quotient structure of regular shallow-network parameters and show that the finite-sample realization map induces a natural metric on the quotient manifold. This leads to an effective notion of curvature that removes degeneracy along symmetry orbits and yields a symmetry-reduced Hessian capturing intrinsic local geometry. We then study gradient flows on the quotient and show that only the horizontal component of parameter motion contributes to first-order predictor evolution, while the vertical component corresponds purely to gauge variation. Finally, we formulate an implicit-bias viewpoint at the quotient level, arguing that meaningful complexity should be assigned to predictor classes rather than to individual parameter representatives. Our experiments confirm that ambient flatness is representation-dependent, that local dynamics are better organized by quotient-level curvature summaries, and that in underdetermined regimes, implicit bias is most naturally described in quotient coordinates.
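
The redundancy described above can be checked directly on the quadratic-activation map $Q(\theta)=\sum_i a_i w_i w_i^\top$ from the summary. The sketch below assumes the usual symmetries for this model (permuting hidden units, and rescaling $w_i \to c\,w_i$ together with $a_i \to a_i/c^2$); the function name and the numbers in the usage note are illustrative, not taken from the paper.

```python
def quadratic_predictor_matrix(a, W):
    """Compute Q(theta) = sum_i a_i * w_i w_i^T for output weights a and rows W."""
    d = len(W[0])
    Q = [[0.0] * d for _ in range(d)]
    for a_i, w_i in zip(a, W):
        for r in range(d):
            for c in range(d):
                Q[r][c] += a_i * w_i[r] * w_i[c]
    return Q
```

With `a = [1.0, -2.0]` and `W = [[1.0, 2.0], [3.0, -1.0]]`, swapping the two hidden units, or replacing the first unit by `a_1 / 4` and `2 * w_1`, leaves the returned matrix unchanged. Distinct parameter vectors on the same symmetry orbit therefore map to the same predictor, which is why the paper works on the quotient.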

What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators

cs.LG cs.AI Xinyu Zhang · Mar 23, 2026

World models for reinforcement learning learn to simulate environment dynamics, yet what they represent internally remains unclear. This paper probes two architecturally distinct models—IRIS (a discrete token transformer) and DIAMOND (a continuous diffusion UNet)—on Atari Breakout and Pong using linear and MLP probes, causal interventions, and attention analysis to test whether they develop structured, interpretable representations of game state. The core finding is that world models develop approximately linear representations of salient state variables (ball position, score) that are not merely correlated but functionally used during prediction.

World models learn to simulate environment dynamics from experience, enabling sample-efficient reinforcement learning. But what do these models actually represent internally? We apply interpretability techniques--including linear and nonlinear probing, causal interventions, and attention analysis--to two architecturally distinct world models: IRIS (discrete token transformer) and DIAMOND (continuous diffusion UNet), trained on Atari Breakout and Pong. Using linear probes, we find that both models develop linearly decodable representations of game state variables (object positions, scores), with MLP probes yielding only marginally higher R^2, confirming that these representations are approximately linear. Causal interventions--shifting hidden states along probe-derived directions--produce correlated changes in model predictions, providing evidence that representations are functionally used rather than merely correlated. Analysis of IRIS attention heads reveals spatial specialization: specific heads attend preferentially to tokens overlapping with game objects. Multi-baseline token ablation experiments consistently identify object-containing tokens as disproportionately important. Our findings provide interpretability evidence that learned world models develop structured, approximately linear internal representations of environment state across two games and two architectures.

DMMRL: Disentangled Multi-Modal Representation Learning via Variational Autoencoders for Molecular Property Prediction

cs.LG cs.AI Long Xu, Junping Guo, Jianbo Zhao et al. · Mar 22, 2026

DMMRL tackles molecular property prediction by addressing two key challenges: entangled representations that obscure structure-property relationships and naïve multi-modal fusion that ignores inter-modal dependencies. The method uses variational autoencoders to decompose graph, sequence, and geometry features into shared (structure-relevant) and private (modality-specific) latent subspaces, enforcing orthogonality between them. A gated attention mechanism then fuses only the shared representations for downstream prediction.

Molecular property prediction constitutes a cornerstone of drug discovery and materials science, necessitating models capable of disentangling complex structure-property relationships across diverse molecular modalities. Existing approaches frequently exhibit entangled representations--conflating structural, chemical, and functional factors--thereby limiting interpretability and transferability. Furthermore, conventional methods inadequately exploit complementary information from graphs, sequences, and geometries, often relying on naive concatenation that neglects inter-modal dependencies. In this work, we propose DMMRL, which employs variational autoencoders to disentangle molecular representations into shared (structure-relevant) and private (modality-specific) latent spaces, enhancing both interpretability and predictive performance. The proposed variational disentanglement mechanism effectively isolates the most informative features for property prediction, while orthogonality and alignment regularizations promote statistical independence and cross-modal consistency. Additionally, a gated attention fusion module adaptively integrates shared representations, capturing complex inter-modal relationships. Experimental validation across seven benchmark datasets demonstrates DMMRL's superior performance relative to state-of-the-art approaches. The code and data underlying this article are freely available at https://github.com/xulong0826/DMMRL.

Prompt Replay: Speeding Up GRPO with On-Policy Reuse of High-Signal Prompts

cs.LG cs.AI Andrei Baroian, Rutger Berger · Mar 22, 2026

GRPO training for LLM reasoning suffers from expensive rollouts and wasted compute on zero-variance prompts where all answers are correct or wrong. This paper proposes Prompt Replay, an overhead-free online method that buffers and reuses medium-difficulty prompts (pass rate near 0.5) to maximize gradient signal while staying on-policy by regenerating responses. By mixing replayed prompts with fresh samples and controlling reuse via cooldown steps and caps, the method aims to accelerate early training, though it eventually plateaus to baseline performance.

Reinforcement learning with verifiable rewards (RLVR) plays a crucial role in expanding the capacities of LLM reasoning, but GRPO-style training is dominated by expensive rollouts and wastes compute on unusable prompts. We propose Prompt Replay, an overhead-free online data selection method for GRPO that reuses prompts only (not trajectories), to preserve on-policy optimization. After each step, we insert prompts with medium difficulty into a buffer, and prioritize prompts closer to a pass rate of 0.5 (half answers correct, half wrong) to maximize the advantage and thus the learning signal. Training batches are formed by mixing reused prompts with fresh samples, with cooldown steps and max reuse times controlling aggressiveness vs risk of overfitting. Across multiple model families (Llama-3.2-3B, Qwen3-8B) and training datasets (Dolci, Polaris), evaluated using average accuracy on six standard math benchmarks, Prompt Replay reduces zero-variance prompts, increases mean absolute advantage, and shows faster initial accuracy gains. Yet it plateaus and converges with the baseline, because the configuration used was too aggressive. The method is most efficient when the rollouts are the primary bottleneck and the dataset is difficult for the model. We additionally observe that Qwen2.5-Math can exhibit spurious-reward effects that invalidate ablations, raising a warning signal for using it as a sole testbed for GRPO method research.
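
The buffer logic described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the class and field names, the admission band `[low, high]`, and the default cooldown and reuse cap are all assumptions. The prioritization rule follows the abstract: prompts whose pass rate is closest to 0.5 come first, since for binary rewards the within-group reward variance p(1 - p) peaks there.

```python
from dataclasses import dataclass


@dataclass
class BufferedPrompt:
    prompt: str
    pass_rate: float
    last_used_step: int
    reuse_count: int = 0


class PromptReplayBuffer:
    """Stores prompts only; responses are regenerated, so training stays on-policy."""

    def __init__(self, cooldown_steps=2, max_reuse=3, low=0.2, high=0.8):
        self.cooldown_steps = cooldown_steps  # hypothetical default
        self.max_reuse = max_reuse            # hypothetical default
        self.low, self.high = low, high       # hypothetical "medium difficulty" band
        self.entries = {}

    def add(self, prompt, pass_rate, step):
        """Admit a prompt if its pass rate is in the medium-difficulty band."""
        if not (self.low <= pass_rate <= self.high):
            return False
        entry = self.entries.get(prompt)
        if entry is None:
            self.entries[prompt] = BufferedPrompt(prompt, pass_rate, step)
        else:
            entry.pass_rate = pass_rate
        return True

    def sample(self, k, step):
        """Return up to k prompts, closest to pass rate 0.5 first."""
        eligible = [
            e for e in self.entries.values()
            if step - e.last_used_step > self.cooldown_steps
            and e.reuse_count < self.max_reuse
        ]
        eligible.sort(key=lambda e: abs(e.pass_rate - 0.5))
        chosen = eligible[:k]
        for e in chosen:
            e.reuse_count += 1
            e.last_used_step = step
        return [e.prompt for e in chosen]
```

A training step would then mix `buffer.sample(k, step)` with fresh prompts and generate new rollouts for both. Lower cooldowns and higher reuse caps make the replay more aggressive, which is the trade-off the abstract links to the eventual plateau.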

Mixture of Chapters: Scaling Learnt Memory in Transformers

cs.LG cs.AI cs.CL Tasmay Pankaj Tibrewal, Pritish Saha, Ankit Meda et al. · Mar 22, 2026

This paper tackles the lack of explicit memory mechanisms in transformers by introducing Mixture of Chapters (MoC)—a learned bank of 262K latent memory tokens accessed via cross-attention. To scale memory without prohibitive costs, the authors partition the bank into chapters and route each input sequence to a sparse subset (top-64), reducing complexity from $O(L \cdot N_m)$ to $O(L \cdot k \cdot T)$. The work demonstrates that explicit associative memory can serve as a new axis of scaling, showing improved knowledge retention when transitioning from pretraining to instruction fine-tuning.

Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso-FLOP settings) on pre-training and instruction fine-tuning across relevant benchmarks. Our models surpass iso-FLOP baselines suggesting scope for a new axis of scaling, demonstrating that explicit associative memory provides complementary capacity to what is captured implicitly in model parameters. Additionally, we observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., pretraining to instruction fine-tuning).
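
To make the cost reduction from $O(L \cdot N_m)$ to $O(L \cdot k \cdot T)$ concrete, the short sketch below counts query-memory pairs in both settings. It assumes $T$ denotes the number of memory tokens per chapter, reads "262K" as 262,144, and uses a hypothetical sequence length and chapter size; none of these values are confirmed by the abstract.

```python
def dense_pairs(seq_len, memory_tokens):
    """Query-memory pairs when every position attends to the whole bank."""
    return seq_len * memory_tokens


def routed_pairs(seq_len, top_k, tokens_per_chapter):
    """Query-memory pairs when each sequence is routed to top_k chapters."""
    return seq_len * top_k * tokens_per_chapter


# Hypothetical configuration, for illustration only.
SEQ_LEN = 2048
MEMORY_TOKENS = 262_144
TOP_K = 64
TOKENS_PER_CHAPTER = 16

reduction = dense_pairs(SEQ_LEN, MEMORY_TOKENS) / routed_pairs(
    SEQ_LEN, TOP_K, TOKENS_PER_CHAPTER
)
```

Under these assumed numbers each sequence attends to 1,024 memory tokens instead of 262,144, so `reduction` is 256. The actual saving depends on the chapter size the authors use.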

Reward Sharpness-Aware Fine-Tuning for Diffusion Models

cs.LG cs.AI Kwanyoung Kim, Byeongsu Sim · Mar 22, 2026

This paper addresses reward hacking in reward-centric diffusion reinforcement learning (RDRL), where diffusion models exploit non-robust reward models to achieve high scores without actual perceptual quality improvements. The authors propose RSA-FT (Reward Sharpness-Aware Fine-Tuning), which mitigates hacking by flattening the reward landscape through joint perturbations in both image space (adversarial training) and parameter space (Sharpness-Aware Minimization). The method is plug-and-play, compatible with existing RDRL frameworks like ReFL and DRaFT, and shows consistent gains across SD1.5, SDXL, SD3, and Flux backbones.

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model and perturbations of its generated samples. Empirically, each method independently alleviates reward hacking and improves robustness, while their joint use amplifies these benefits. Our resulting framework, RSA-FT (Reward Sharpness-Aware Fine-Tuning), is simple, broadly compatible, and consistently enhances the reliability of RDRL.

Rethinking Plasticity in Deep Reinforcement Learning

cs.LG cs.AI Zhiqiang He · Mar 22, 2026

This paper reframes plasticity loss in deep reinforcement learning as an optimization pathology rather than capacity degradation. The core claim—dubbed the Optimization-Centric Plasticity (OCP) hypothesis—is that parameters become trapped in local optima from previous tasks, which then become poor optima for new tasks. The authors prove that neuron dormancy is mathematically equivalent to zero-gradient states and show that plasticity recovers when tasks differ sufficiently, suggesting networks retain capacity but lose it to task-specific optimization landscapes.

This paper investigates the fundamental mechanisms driving plasticity loss in deep reinforcement learning (RL), a critical challenge where neural networks lose their ability to adapt to non-stationary environments. While existing research often relies on descriptive metrics like dormant neurons or effective rank, these summaries fail to explain the underlying optimization dynamics. We propose the Optimization-Centric Plasticity (OCP) hypothesis, which posits that plasticity loss arises because optimal points from previous tasks become poor local optima for new tasks, trapping parameters during task transitions and hindering subsequent learning. We theoretically establish the equivalence between neuron dormancy and zero-gradient states, demonstrating that the absence of gradient signals is the primary driver of dormancy. Our experiments reveal that plasticity loss is highly task-specific; notably, networks with high dormancy rates in one task can achieve performance parity with randomly initialized networks when switched to a significantly different task, suggesting that the network's capacity remains intact but is inhibited by the specific optimization landscape. Furthermore, our hypothesis elucidates why parameter constraints mitigate plasticity loss by preventing deep entrenchment in local optima. Validated across diverse non-stationary scenarios, our findings provide a rigorous optimization-based framework for understanding and restoring network plasticity in complex RL domains.

Revisiting Tree Search for LLMs: Gumbel and Sequential Halving for Budget-Scalable Reasoning

cs.AI cs.LG Leonid Ugadiarov, Yuri Kuratov, Aleksandr Panov et al. · Mar 22, 2026

Standard AlphaZero-style tree search for LLM reasoning suffers from a scaling failure: on GSM8K and Game24, accuracy actually drops as the search budget increases beyond moderate levels. This paper introduces ReSCALE, which replaces PUCT selection and Dirichlet noise with Gumbel sampling and Sequential Halving—a best-arm identification technique from multi-armed bandits. The key insight is that root action-selection design is critical for budget-scalable reasoning without any changes to the model or its training.

Neural tree search is a powerful decision-making algorithm widely used in complex domains such as game playing and model-based reinforcement learning. Recent work has applied AlphaZero-style tree search to enhance the reasoning capabilities of Large Language Models (LLMs) during inference, but we find that this approach suffers from a scaling failure: on GSM8K and Game24, accuracy drops as the search budget increases. In this paper, we present ReSCALE, an adaptation of Gumbel AlphaZero MCTS that replaces Dirichlet noise and PUCT selection with Gumbel sampling and Sequential Halving, restoring monotonic scaling without changes to the model or its training. ReSCALE reaches 58.4% on GSM8K and 85.3% on Game24 at budgets where the baseline degrades. Ablations confirm that Sequential Halving is the primary driver of the improvement.
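
The two ingredients named above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the ReSCALE code: `gumbel_top_k` samples candidate root actions without replacement by perturbing prior logits with Gumbel noise, and `sequential_halving` splits a fixed budget over rounds, dropping the worse half of the candidates each time. Here `pull(arm)` is a hypothetical callback returning one noisy value estimate, and survivors are ranked by empirical mean only, whereas Gumbel AlphaZero also folds the perturbed logits into the ranking.

```python
import math
import random


def gumbel_top_k(logits, k, rng):
    """Sample k distinct indices by adding Gumbel noise to the logits."""
    perturbed = []
    for idx, logit in enumerate(logits):
        u = max(rng.random(), 1e-300)  # guard against log(0)
        perturbed.append((logit - math.log(-math.log(u)), idx))
    perturbed.sort(reverse=True)
    return [idx for _, idx in perturbed[:k]]


def sequential_halving(arms, pull, budget):
    """Return the arm that survives repeated halving under a fixed budget."""
    survivors = list(arms)
    totals = {arm: 0.0 for arm in survivors}
    counts = {arm: 0 for arm in survivors}
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    while len(survivors) > 1:
        # Spread the per-round share of the budget evenly over survivors.
        pulls_each = max(1, budget // (rounds * len(survivors)))
        for arm in survivors:
            for _ in range(pulls_each):
                totals[arm] += pull(arm)
                counts[arm] += 1
        survivors.sort(key=lambda arm: totals[arm] / counts[arm], reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]
    return survivors[0]
```

Because each round concentrates simulations on fewer candidates, a larger budget translates into more evidence for the strongest actions rather than being spread thinly. That is one plausible reading of why this selection rule scales better than PUCT with Dirichlet noise.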

ViCLSR: A Supervised Contrastive Learning Framework with Natural Language Inference for Natural Language Understanding Tasks

cs.CL cs.AI cs.LG Tin Van Huynh, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen · Mar 22, 2026

ViCLSR adapts supervised contrastive learning (SimCSE-style) to Vietnamese NLU by converting NLI entailment and contradiction pairs into positive and negative training signals. Built on XLM-R Large (550M), the framework improves sentence embeddings for low-resource Vietnamese, reporting gains of +6.97% F1 over PhoBERT on ViNLI and state-of-the-art results across five downstream tasks including fact-checking and machine reading comprehension.

High-quality text representations are crucial for natural language understanding (NLU), but low-resource languages like Vietnamese face challenges due to limited annotated data. While pre-trained models like PhoBERT and CafeBERT perform well, their effectiveness is constrained by data scarcity. Contrastive learning (CL) has recently emerged as a promising approach for improving sentence representations, enabling models to effectively distinguish between semantically similar and dissimilar sentences. We propose ViCLSR (Vietnamese Contrastive Learning for Sentence Representations), a novel supervised contrastive learning framework specifically designed to optimize sentence embeddings for Vietnamese, leveraging existing natural language inference (NLI) datasets. Additionally, we propose a process to adapt existing Vietnamese datasets for supervised learning, ensuring compatibility with CL methods. Our experiments demonstrate that ViCLSR significantly outperforms the powerful monolingual pre-trained model PhoBERT on five benchmark NLU datasets such as ViNLI (+6.97% F1), ViWikiFC (+4.97% F1), ViFactCheck (+9.02% F1), UIT-ViCTSD (+5.36% F1), and ViMMRC2.0 (+4.33% Accuracy). ViCLSR shows that supervised contrastive learning can effectively address resource limitations in Vietnamese NLU tasks and improve sentence representation learning for low-resource languages. Furthermore, we conduct an in-depth analysis of the experimental results to uncover the factors contributing to the superior performance of contrastive learning models. ViCLSR is released for research purposes in advancing natural language processing tasks.
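
The summary describes the objective as SimCSE-style, with entailment pairs as positives and contradiction pairs as negatives. A minimal sketch of that kind of supervised contrastive loss follows, assuming cosine similarity and a temperature of 0.05; both are conventional choices, not values confirmed for ViCLSR. Embeddings are plain Python lists so the example stays self-contained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_contrastive_loss(anchors, positives, negatives, temperature=0.05):
    """Mean InfoNCE loss: each premise should match its own entailment.

    All other entailments and every contradiction in the batch act as
    distractors in the denominator.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        logits += [cosine(anchor, n) / temperature for n in negatives]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

The loss is near zero when each premise embedding aligns with its entailment and points away from its contradiction, and grows when the roles are reversed.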

Sonny: Breaking the Compute Wall in Medium-Range Weather Forecasting

cs.LG cs.AI cs.CV Minjong Cheon · Mar 22, 2026

Sonny tackles the compute barrier in medium-range weather forecasting by proposing a hierarchical transformer that trains on a single A40 GPU in 5.5 days. The core idea is a two-stage StepsNet pipeline: a narrow 'slow path' processes large-scale dynamics (U,V,Z,P) first, then a full-width 'fast path' integrates thermodynamics (T,Q). Combined with EMA during training, randomized dynamics forecasting, and pressure-weighted losses, Sonny aims to deliver competitive forecast skill without the TPU/GPU cluster requirements of models like Pangu-Weather or GraphCast.

Weather forecasting is a fundamental problem for protecting lives and infrastructure from high-impact atmospheric events. Recently, data-driven weather forecasting methods based on deep learning have demonstrated strong performance, often reaching accuracy levels competitive with operational numerical systems. However, many existing models rely on large-scale training regimes and compute-intensive architectures, which raises the practical barrier for academic groups with limited compute resources. Here we introduce Sonny, an efficient hierarchical transformer that achieves competitive medium-range forecasting performance while remaining feasible within reasonable compute budgets. At the core of Sonny is a two-stage StepsNet design: a narrow slow path first models large-scale atmospheric dynamics, and a subsequent full-width fast path integrates thermodynamic interactions. To stabilize medium-range rollout without an additional fine-tuning stage, we apply exponential moving average (EMA) during training. On WeatherBench2, Sonny yields robust medium-range forecast skill, remains competitive with operational baselines, and demonstrates clear advantages over FastNet, particularly at extended tropical lead times. In practice, Sonny can be trained to convergence on a single NVIDIA A40 GPU in approximately 5.5 days.

Fusing Memory and Attention: A study on LSTM, Transformer and Hybrid Architectures for Symbolic Music Generation

cs.LG cs.AI cs.SD Soudeep Ghoshal, Sandipan Chakraborty, Pradipto Chowdhury et al. · Mar 22, 2026

This paper tackles the tension between local melodic continuity and global structural coherence in symbolic music generation. It proposes a hybrid architecture fusing a Transformer encoder (for global patterns) with an LSTM decoder (for temporal precision), evaluating it against pure LSTM and Transformer baselines using 17 musical quality metrics on 1,000 generated melodies per model. The work matters because it provides systematic evidence that architectural hybridization can reconcile the complementary strengths of memory-based and attention-based models.

Machine learning techniques, such as Transformers and Long Short-Term Memory (LSTM) networks, play a crucial role in Symbolic Music Generation (SMG). Existing literature indicates a difference between LSTMs and Transformers regarding their ability to model local melodic continuity versus maintaining global structural coherence. However, their specific properties within the context of SMG have not been systematically studied. This paper addresses this gap by providing a fine-grained comparative analysis of LSTMs versus Transformers for SMG, examining local and global properties in detail using 17 musical quality metrics on the Deutschl dataset. We find that LSTM networks excel at capturing local patterns but fail to preserve long-range dependencies, while Transformers model global structure effectively but tend to produce irregular phrasing. Based on this analysis and leveraging their respective strengths, we propose a Hybrid architecture combining a Transformer Encoder with an LSTM Decoder and evaluate it against both baselines. We evaluated 1,000 generated melodies from each of the three architectures on the Deutschl dataset. The results show that the hybrid method achieves better local and global continuity and coherence compared to the baselines. Our work highlights the key characteristics of these models and demonstrates how their properties can be leveraged to design superior models. We also supported the experiments with ablation studies and human perceptual evaluations, which statistically support the findings and provide robust validation for this work.

Aggregation Alignment for Federated Learning with Mixture-of-Experts under Data Heterogeneity

cs.LG cs.AI Zihan Fang, Qianru Wang, Haonan An et al. · Mar 22, 2026

The paper addresses federated fine-tuning of Mixture-of-Experts (MoE) based large language models under non-IID data distributions, where direct parameter aggregation causes gating preference misalignment and expert semantic blurring. The proposed FedAlign-MoE framework introduces consistency-based gating distribution alignment using routing consistency weighting ($\omega_i(e) = s_i(e)/\sum_j s_j(e)$) and semantic-aware expert aggregation via region-conditioned gated weights ($\gamma_{i,j}(e)$). This matters because MoE architectures are increasingly vital for scaling LLMs efficiently, yet data heterogeneity across federated clients undermines their specialization benefits.

Large language models (LLMs) increasingly adopt Mixture-of-Experts (MoE) architectures to scale model capacity while reducing computation. Fine-tuning these MoE-based LLMs often requires access to distributed and privacy-sensitive data, making centralized fine-tuning impractical. Federated learning (FL) therefore provides a paradigm to collaboratively fine-tune MoE-based LLMs, enabling each client to integrate diverse knowledge without compromising data privacy. However, the integration of MoE-based LLM fine-tuning into FL encounters two critical aggregation challenges due to inherent data heterogeneity across clients: (i) divergent local data distributions drive clients to develop distinct gating preference for localized expert selection, causing direct parameter aggregation to produce a ``one-size-fits-none'' global gating network, and (ii) same-indexed experts develop disparate semantic roles across clients, leading to expert semantic blurring and the degradation of expert specialization. To address these challenges, we propose FedAlign-MoE, a federated aggregation alignment framework that jointly enforces routing consistency and expert semantic alignment. Specifically, FedAlign-MoE aggregates gating behaviors by aligning routing distributions through consistency weighting and optimizes local gating networks through distribution regularization, maintaining cross-client stability without overriding discriminative local preferences. Meanwhile, FedAlign-MoE explicitly quantifies semantic consistency among same-indexed experts across clients and selectively aggregates updates from semantically aligned clients, ensuring stable and specialized functional roles for global experts. Extensive experiments demonstrate that FedAlign-MoE outperforms state-of-the-art benchmarks, achieving faster convergence and superior accuracy in non-IID federated environments.
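
The routing consistency weighting from the summary, $\omega_i(e) = s_i(e)/\sum_j s_j(e)$, normalizes each client's score for expert $e$ across clients. The sketch below computes those weights and uses them to average per-client routing probabilities. How the scores $s_i(e)$ are obtained is not stated in the text above, so they are treated as given inputs; the function names and the uniform fallback for an all-zero column are assumptions.

```python
def consistency_weights(scores):
    """scores[i][e] is client i's score for expert e; returns omega[i][e]."""
    n_clients = len(scores)
    n_experts = len(scores[0])
    totals = [sum(client[e] for client in scores) for e in range(n_experts)]
    return [
        [
            client[e] / totals[e] if totals[e] > 0 else 1.0 / n_clients
            for e in range(n_experts)
        ]
        for client in scores
    ]


def aggregate_routing(routing, weights):
    """Per-expert weighted average of client routing probabilities."""
    n_experts = len(routing[0])
    return [
        sum(w[e] * r[e] for r, w in zip(routing, weights))
        for e in range(n_experts)
    ]
```

With `scores = [[1, 3], [3, 1]]`, the first client receives weight 0.25 on expert 0 and 0.75 on expert 1, so each expert's aggregate leans toward the client with the higher score for that expert, rather than averaging all clients equally.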

The Library Theorem: How External Organization Governs Agentic Reasoning Capacity

cs.AI cs.CL cs.DS Zachary F. Mainen · Mar 22, 2026

This paper formalizes the transformer context window as an I/O page and proves that tool-augmented agents with indexed external memory achieve exponential retrieval cost savings over sequential scanning: $\mathcal{O}(\log_b N)$ versus $\Omega(N)$ page reads. The authors validate these predictions experimentally across three content types and identify "parametric memory competition" as a failure mode where models bypass retrieval protocols for familiar content.

Externalized reasoning is already exploited by transformer-based agents through chain-of-thought, but structured retrieval -- indexing over one's own reasoning state -- remains underexplored. We formalize the transformer context window as an I/O page and prove that tool-augmented agents with indexed external memory achieve exponentially lower retrieval cost than agents restricted to sequential scanning: $O(\log_b N)$ versus $\Omega(N)$ page reads per query, and $O(T \log_b T)$ versus $\Theta(T^2)$ cumulative cost over $T$ reasoning steps -- a gap that widens as deliberation deepens. We test these predictions on a controlled lookup benchmark across three content types -- random hashes, ordered integers, and encyclopedia entries -- varying store size from 50 to 5,000 items, and replicate key conditions across two model generations (GPT-4o-mini and GPT-5.4). On abstract content, the indexed agent achieves median 1 page read regardless of store size, confirming the $O(1)$ prediction. Sorted pages without an index fail to close the gap: the weaker model cannot sustain binary search at scale, and the stronger model achieves near-optimal $\log_2 N$ search but still loses to the index by $5\times$. On familiar content (encyclopedia entries), a competing failure mode emerges: the model recognizes the domain, bypasses the retrieval protocol, and generates answers from parametric memory, producing catastrophic token expenditure even when the index is sound. This parametric memory competition dissociates the two cognitive operations that indexing combines: understanding content (where language models excel) and following navigational protocols (where they fail when understanding tempts them to shortcut). The result argues for a separation of concerns: use language models for index construction, where semantic understanding helps, and deterministic algorithms for index traversal, where it hurts.
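
The per-query gap between $\Omega(N)$ and $O(\log_b N)$ page reads can be illustrated with a small counting sketch. It assumes a simple cost model that is not taken from the paper: a page holds `page_size` items, a sequential scan may have to read every page, and an index is a balanced tree with `branching` entries per page. The parameter values in the usage note are hypothetical.

```python
def scan_page_reads(n_items, page_size):
    """Worst-case pages read when scanning the store front to back."""
    return -(-n_items // page_size)  # ceiling division


def index_page_reads(n_items, branching):
    """Pages read when descending a balanced index with the given fan-out."""
    depth, capacity = 0, 1
    while capacity < n_items:
        capacity *= branching
        depth += 1
    return max(1, depth)
```

For a store of 5,000 items with 50 entries per page, the scan reads up to 100 pages while the index descent reads 3. Growing the store tenfold multiplies the scan cost by ten but adds at most one level to the index.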

NeSy-Edge: Neuro-Symbolic Trustworthy Self-Healing in the Computing Continuum

cs.DC cs.AI cs.LG Peihan Ye, Alfreds Lapkovskis, Alaa Saleh et al. · Mar 22, 2026

Modern AI services increasingly run across the computing continuum—from cloud to edge devices—yet fault management remains challenging due to resource constraints, noisy telemetry, and cascading failures. This paper proposes NeSy-Edge, a three-layer neuro-symbolic framework that performs local log parsing, causal graph construction, and root-cause analysis on edge nodes, invoking cloud LLMs only when local evidence is insufficient. The core idea is to combine lightweight symbolic caching and prior-constrained causal discovery with selective neural inference, trading off autonomy against accuracy under strict memory budgets ($\sim$1500 MB).

The computational demands of modern AI services are increasingly shifting execution beyond centralized clouds toward a computing continuum spanning edge and end devices. However, the scale, heterogeneity, and cross-layer dependencies of these environments make resilience difficult to maintain. Existing fault-management methods are often too static, fragmented, or heavy to support timely self-healing, especially under noisy logs and edge resource constraints. To address these limitations, this paper presents NeSy-Edge, a neuro-symbolic framework for trustworthy self-healing in the computing continuum. The framework follows an edge-first design, where a resource-constrained edge node performs local perception and reasoning, while a cloud model is invoked only at the final diagnosis stage. Specifically, NeSy-Edge converts raw runtime logs into structured event representations, builds a prior-constrained sparse symbolic causal graph, and integrates causal evidence with historical troubleshooting knowledge for root-cause analysis and recovery recommendation. We evaluate our work on representative Loghub datasets under multiple levels of semantic noise, considering parsing quality, causal reasoning, end-to-end diagnosis, and edge-side resource usage. The results show that NeSy-Edge remains robust even at the highest noise level, achieving up to 75% root-cause analysis accuracy and 65% end-to-end accuracy while operating within about 1500 MB of local memory.



Harmful Visual Content Manipulation Matters in Misinformation Detection Under Multimedia Scenarios

cs.LG cs.AI cs.MM Bing Wang, Ximing Li, Changchun Li et al. · Mar 22, 2026

This paper tackles multimodal misinformation detection by distinguishing between harmful and harmless visual content manipulation, a nuance often overlooked by existing methods. The authors propose HAVC-M4D, a framework that extracts manipulation and intention features using weakly-supervised positive-unlabeled (PU) learning to overcome the lack of ground-truth manipulation labels. By treating real articles with manipulated visuals as likely harmless and fake articles as potentially harmful, the method introduces intention-aware cues that consistently improve detection across four benchmark datasets.

Nowadays, the widespread dissemination of misinformation across numerous social media platforms has led to severe negative effects on society. To address this challenge, the automatic detection of misinformation, particularly under multimedia scenarios, has gained significant attention from both academic and industrial communities, leading to the emergence of a research task known as Multimodal Misinformation Detection (MMD). Typically, current MMD approaches focus on capturing the semantic relationships and inconsistency between various modalities but often overlook certain critical indicators within multimodal content. Recent research has shown that manipulated features within visual content in social media articles serve as valuable clues for MMD. Meanwhile, we argue that the potential intentions behind the manipulation, e.g., harmful and harmless, also matter in MMD. Therefore, in this study, we aim to identify such multimodal misinformation by capturing two types of features: manipulation features, which represent whether visual content has been manipulated, and intention features, which assess the nature of these manipulations, distinguishing between harmful and harmless intentions. Unfortunately, the manipulation and intention labels that supervise these features to be discriminative are unknown. To address this, we introduce two weakly supervised indicators as substitutes by incorporating supplementary datasets focused on image manipulation detection and framing two different classification tasks as positive and unlabeled learning problems. With this framework, we introduce an innovative MMD approach, titled Harmful Visual Content Manipulation Matters in MMD (HAVC-M4D). Comprehensive experiments conducted on four prevalent MMD datasets indicate that HAVC-M4D significantly and consistently enhances the performance of existing MMD methods.



Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

cs.CL cs.AI cs.LG Jinquan Zheng, Jia Yuan, Jiacheng Yao et al. · Mar 22, 2026

This paper addresses selection bias (position and label bias) in large language models during discrete-choice tasks like multiple-choice questions and pairwise evaluation. The authors propose Permutation-Aware GRPO (PA-GRPO), which extends Group Relative Policy Optimization by treating different permutations of the same question as a single training group rather than independent instances. The method enforces semantic consistency across permutations through two mechanisms: a cross-permutation advantage that computes rewards relative to the group mean, and a consistency-aware reward that penalizes disagreement across permutations. Experiments across seven benchmarks and three models (Llama-3.1-8B, Qwen3-8B, Qwen3-32B) demonstrate that PA-GRPO reduces selection bias while maintaining accuracy.

Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on GitHub (https://github.com/ECNU-Text-Computing/PA-GRPO).
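The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the majority-agreement form of the consistency bonus, the `weight` parameter, and the choice to only mean-center the rewards (without dividing by a standard deviation) are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def consistency_bonus(choices):
    """Hypothetical consistency term: for each permutation, the fraction of
    permutations in the group that selected the same underlying option.
    `choices` holds canonical option ids (labels already mapped back to content)."""
    counts = Counter(choices)
    n = len(choices)
    return [counts[c] / n for c in choices]


def permutation_group_advantages(correct, choices, weight=0.5):
    """Assumed reward = correctness + weight * consistency bonus.
    The advantage is the reward minus the mean reward over the permutation group."""
    bonus = consistency_bonus(choices)
    rewards = [float(ok) + weight * b for ok, b in zip(correct, bonus)]
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Four permutations of one question; the model picks canonical option "B"
# (the correct one) three times and "A" once.
adv = permutation_group_advantages(
    correct=[True, True, True, False],
    choices=["B", "B", "B", "A"],
)
```

Because the baseline is the group mean, the advantages sum to zero within a permutation group, so the one inconsistent, incorrect permutation receives a negative advantage relative to its siblings.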



The Intelligent Disobedience Game: Formulating Disobedience in Stackelberg Games and Markov Decision Processes

cs.AI cs.GT cs.LG Benedikt Hornig, Reuth Mirsky · Mar 22, 2026

This paper addresses the challenge of "intelligent disobedience" in shared autonomy — when assistive AI must override human commands to prevent harm but remain helpful. The authors formalize this as the Intelligent Disobedience Game (IDG), a sequential Stackelberg game where a human leader proposes actions and an assistive follower with superior environmental awareness decides whether to obey or intervene. The framework aims to provide the mathematical foundations for training safety-critical assistive systems.

In shared autonomy, a critical tension arises when an automated assistant must choose between obeying a human's instruction and deliberately overriding it to prevent harm. This safety-critical behavior is known as intelligent disobedience. To formalize this dynamic, this paper introduces the Intelligent Disobedience Game (IDG), a sequential game-theoretic framework based on Stackelberg games that models the interaction between a human leader and an assistive follower operating under asymmetric information. It characterizes optimal strategies for both agents across multi-step scenarios, identifying strategic phenomena such as "safety traps," where the system indefinitely avoids harm but fails to achieve the human's goal. The IDG provides a needed mathematical foundation that enables both the algorithmic development of agents that can learn safe non-compliance and the empirical study of how humans perceive and trust disobedient AI. The paper further translates the IDG into a shared control Multi-Agent Markov Decision Process representation, forming a compact computational testbed for training reinforcement learning agents.



Long-Term Outlier Prediction Through Outlier Score Modeling

cs.LG cs.AI Yuma Aoki, Joon Park, Koh Takeuchi et al. · Mar 22, 2026

This paper addresses the problem of forecasting outlier events far in advance in time series data, rather than merely detecting immediate anomalies. The authors propose a two-layer framework that first computes outlier scores using standard detection methods, then models the temporal structure of these scores to predict future anomalies. By assuming that outlier occurrences exhibit temporal patterns (e.g., periodicity or delayed dependencies), the method aims to forecast outlier likelihoods without requiring future observations.

This study addresses an important gap in time series outlier detection by proposing a novel problem setting: long-term outlier prediction. Conventional methods primarily focus on immediate detection by identifying deviations from normal patterns. As a result, their applicability is limited when forecasting outlier events far into the future. To overcome this limitation, we propose a simple and unsupervised two-layer method that is independent of specific models. The first layer performs standard outlier detection, and the second layer predicts future outlier scores based on the temporal structure of previously observed outliers. This framework enables not only pointwise detection but also long-term forecasting of outlier likelihoods. Experiments on synthetic datasets show that the proposed method performs well in both detection and prediction tasks. These findings suggest that the method can serve as a strong baseline for future work in outlier detection and forecasting.
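A rough sketch of the two-layer idea described above, under stated assumptions: the first layer here is a plain global z-score detector, and the second layer assumes a known period and forecasts each future score as the average of past scores at the same phase. Both choices, and the names `outlier_scores` and `forecast_scores`, are illustrative stand-ins rather than the method used in the paper.

```python
from statistics import mean, pstdev


def outlier_scores(series):
    """Layer 1 (assumed detector): absolute z-score of each observation."""
    mu, sigma = mean(series), pstdev(series)
    return [abs(x - mu) / sigma if sigma > 0 else 0.0 for x in series]


def forecast_scores(scores, period, horizon):
    """Layer 2 (assumed model): predict future outlier scores from the
    periodic structure of past scores, with no future observations needed."""
    n = len(scores)
    phase_avg = [mean(scores[p::period]) for p in range(period)]
    return [phase_avg[(n + h) % period] for h in range(horizon)]


# Synthetic series with a spike at every fourth step.
series = [0.0, 0.1, -0.1, 5.0] * 6
scores = outlier_scores(series)
future = forecast_scores(scores, period=4, horizon=8)
```

In this toy setting the forecast assigns the highest outlier likelihood to the steps that fall on the spike phase, which is the kind of long-horizon signal the second layer is meant to provide.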



Structural Sensitivity in Compressed Transformers: Error Propagation, Lyapunov Stability, and Formally Verified Bounds

cs.LG cs.AI cs.CL Abhinaba Basu · Mar 22, 2026

This paper investigates why compressing different weight matrices in transformers leads to wildly different outcomes—from negligible impact to 20,000× perplexity increases. The authors map this structural sensitivity across five architectures, revealing that early-layer MLP up-projections are catastrophically fragile while value projections are nearly free to compress. Using Lyapunov stability theory, they explain how residual connections contract errors, and they provide machine-checked formal bounds in Lean 4 to guarantee per-matrix approximation quality.

A single matrix out of 468 in GPT-2 Small can increase perplexity by 20,000x when compressed, revealing that transformer compression sensitivity spans five orders of magnitude. We map this sensitivity landscape across five architectures (117M-8B parameters), finding a consistent hierarchy: early-layer MLP up-projections are catastrophically sensitive while value projections compress nearly for free. This hierarchy is stable across compression levels, evaluation scales (2K-51K tokens), and datasets (WikiText-103, C4). Using Lyapunov stability theory, we show that residual connections contract compression errors by growing the hidden state faster than the error. Error contraction is necessary but not sufficient for compression tolerance: architecture-specific redundancy plays an equally important role, as demonstrated by the hybrid LFM2-2.6B degrading only 7x despite higher amplification than the fully-contracting GPT-2 Small (120x). Ten machine-checked Lean 4 theorems formalize per-matrix error bounds with no sorry markers; all bounds produce zero violations across 14,040+ configurations. We validate with downstream task evaluation (HellaSwag, ARC-Easy, Winogrande), activation-aware pruning on two architectures, and a Compression Fragility Index that rank-orders model robustness.
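The per-matrix sensitivity mapping can be pictured as a scan that compresses one matrix at a time and measures how far the output moves. The sketch below is a toy under heavy assumptions: a small linear residual stack stands in for a transformer, uniform rounding stands in for the compression method, and output distance stands in for perplexity. None of these choices come from the paper.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(mats, x):
    """Toy residual stack: h <- h + W h for each matrix in order."""
    h = list(x)
    for W in mats:
        h = [a + b for a, b in zip(h, matvec(W, h))]
    return h


def quantize(W, step):
    """Assumed compression: round every weight to a multiple of `step`."""
    return [[round(w / step) * step for w in row] for row in W]


def sensitivity_scan(mats, x, step=0.25):
    """Compress one matrix at a time and record the L2 change in the output."""
    ref = forward(mats, x)
    errors = []
    for i in range(len(mats)):
        perturbed = mats[:i] + [quantize(mats[i], step)] + mats[i + 1:]
        out = forward(perturbed, x)
        errors.append(sum((a - b) ** 2 for a, b in zip(ref, out)) ** 0.5)
    return errors


mats = [
    [[0.3, -0.1], [0.2, 0.4]],
    [[0.05, 0.6], [-0.7, 0.1]],
    [[0.5, 0.5], [0.0, 0.25]],  # already on the 0.25 grid
]
errs = sensitivity_scan(mats, x=[1.0, -2.0])
```

Ranking `errs` gives a per-matrix sensitivity ordering for the toy model; the last matrix is unaffected because its weights already lie on the quantization grid.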

