Feed - arxlens

INTRYGUE: Induction-Aware Entropy Gating for Reliable RAG Uncertainty Estimation

cs.AI Alexandra Bazarova, Andrei Volodichev, Daria Kotova et al. · Mar 23, 2026

RAG improves factual reliability but doesn't eliminate hallucinations. The paper reveals a mechanistic paradox: induction heads that copy correct answers from context simultaneously trigger entropy neurons that suppress confidence, causing entropy-based uncertainty signals to fail. INTRYGUE gates predictive entropy using induction head activation (SinkRate) to correct this inflation, offering a training-free method for reliable RAG hallucination detection.

While retrieval-augmented generation (RAG) significantly improves the factual reliability of LLMs, it does not eliminate hallucinations, so robust uncertainty quantification (UQ) remains essential. In this paper, we reveal that standard entropy-based UQ methods often fail in RAG settings due to a mechanistic paradox. An internal "tug-of-war" inherent to context utilization appears: while induction heads promote grounded responses by copying the correct answer, they collaterally trigger the previously established "entropy neurons". This interaction inflates predictive entropy, causing the model to signal false uncertainty on accurate outputs. To address this, we propose INTRYGUE (Induction-Aware Entropy Gating for Uncertainty Estimation), a mechanistically grounded method that gates predictive entropy based on the activation patterns of induction heads. Evaluated across four RAG benchmarks and six open-source LLMs (4B to 13B parameters), INTRYGUE consistently matches or outperforms a wide range of UQ baselines. Our findings demonstrate that hallucination detection in RAG benefits from combining predictive uncertainty with interpretable, internal signals of context utilization.

Is the future of AI green? What can innovation diffusion models say about generative AI's environmental impact?

cs.AI Robert Viseur, Nicolas Jullien · Mar 22, 2026

This paper applies the classic Abernathy-Utterback (A-U) innovation diffusion model to generative AI's environmental impact. The authors argue that alarmist predictions about GAI's carbon footprint often ignore how innovation diffusion drives process optimization and efficiency gains. They forecast that the GAI industry is transitioning from the 'fluid' A-U:1 phase to the 'transitional' A-U:2 phase, where dominant designs will emerge. The paper predicts two main business models: large generalist platforms serving mass audiences, and smaller specialized models targeting specific use cases. Their core argument is that GAI 'will never be green, but its impact may not be as problematic as is sometimes claimed' depending on which business model dominates.

The rise of generative artificial intelligence (GAI) has led to alarming predictions about its environmental impact. However, these predictions often overlook the fact that the diffusion of innovation is accompanied by the evolution of products and the optimization of their performance, primarily for economic reasons. This can also reduce their environmental impact. By analyzing the GAI ecosystem using the classic A-U innovation diffusion model, we can forecast this industry's structure and how its environmental impact will evolve. While GAI will never be green, its impact may not be as problematic as is sometimes claimed. However, this depends on which business model becomes dominant.

BOxCrete: A Bayesian Optimization Open-Source AI Model for Concrete Strength Forecasting and Mix Optimization

cs.LG cs.AI Bayezid Baten, M. Ayyan Iqbal, Sebastian Ament et al. · Mar 23, 2026

Concrete mix design requires balancing competing objectives of mechanical strength and sustainability. BOxCrete introduces a Gaussian Process regression framework trained on 533 strength measurements from 123 unique mixtures to predict compressive strength evolution over curing time and optimize mixes for embodied carbon using multi-objective Bayesian Optimization. The work addresses a critical gap in the literature by providing an open-source alternative to proprietary industrial datasets and models.

Modern concrete must simultaneously satisfy evolving demands for mechanical performance, workability, durability, and sustainability, making mix designs increasingly complex. Recent studies leveraging Artificial Intelligence (AI) and Machine Learning (ML) models show promise for predicting compressive strength and guiding mix optimization, but most existing efforts are based on proprietary industrial datasets and closed-source implementations. Here we introduce BOxCrete, an open-source probabilistic modeling and optimization framework trained on a new open-access dataset of over 500 strength measurements (1-15 ksi) from 123 mixtures - 69 mortar and 54 concrete mixes tested at five curing ages (1, 3, 5, 14, and 28 days). BOxCrete leverages Gaussian Process (GP) regression to predict strength development, achieving average R$^2$ = 0.94 and RMSE = 0.69 ksi, quantify uncertainty, and carry out multi-objective optimization of compressive strength and embodied carbon. The dataset and model establish a reproducible open-source foundation for data-driven development of AI-based optimized mix designs.

Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLMs

cs.CL cs.AI cs.LG Mariela M. Nina, Caio Veloso Costa, Lilian Berton et al. · Mar 22, 2026

This paper addresses computational barriers for Brazilian Portuguese question answering by systematically evaluating Parameter-Efficient Fine-Tuning (PEFT) methods on BERTimbau models using the SQuAD-BR dataset. The authors test LoRA, DoRA, QLoRA, and QDoRA across Base (110M) and Large (335M) variants, demonstrating that LoRA achieves 95.8% of full fine-tuning performance while reducing training time by 73.5%. A key finding is that PEFT methods require substantially higher learning rates ($2\times 10^{-4}$) than standard BERT fine-tuning to achieve optimal results, with quantization resilience favoring larger models.

Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M, Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5% (F1=81.32 vs 84.86); (2) higher learning rates (2e-4) substantially improve PEFT performance, with F1 gains of up to +19.71 points over standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese QA with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with Green AI principles. An exploratory evaluation of Tucano and Sabiá on the same extractive QA benchmark shows that while generative models can reach competitive F1 scores with LoRA fine-tuning, they require up to 4.2× more GPU memory and 3× more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.
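
The headline ratios in this abstract can be checked directly from the F1 figures it quotes. The short Python sketch below does only that arithmetic; the variable names are illustrative, and reading 4.83 as the Large model's quantization loss and 9.56 as the Base model's is an assumption based on the sentence order.

```python
# F1 scores quoted in the abstract for BERTimbau-Large on SQuAD-BR.
lora_f1 = 81.32
full_ft_f1 = 84.86

# Fraction of full fine-tuning F1 that LoRA retains.
retained = lora_f1 / full_ft_f1

# F1 points lost under quantization (assumed: 4.83 for Large, 9.56 for Base).
large_quant_loss = 4.83
base_quant_loss = 9.56
resilience_ratio = base_quant_loss / large_quant_loss

print(f"LoRA retains {retained:.1%} of full fine-tuning F1")
print(f"Base loses {resilience_ratio:.2f}x as many F1 points as Large")
```

The first ratio rounds to the 95.8% stated in the abstract, and the second is just under 2, which matches the "twice the quantization resilience" wording.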

Benchmarking Bengali Dialectal Bias: A Multi-Stage Framework Integrating RAG-Based Translation and Human-Augmented RLAIF

cs.CL cs.AI cs.CY K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain et al. · Mar 22, 2026

This paper tackles the problem of measuring dialectal bias in LLMs for Bengali, a low-resource language with nine major regional variants. The authors propose a two-phase framework combining RAG-based translation to create dialectal benchmarks with an RLAIF-inspired evaluation protocol that uses CoT-first reasoning and multi-judge validation. They expose the catastrophic failure of traditional metrics like BLEU and WER for agglutinative dialectal Bengali, showing that LLM-as-judge better predicts human quality assessments.

Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages. However, frameworks to quantify these disparities remain scarce. We propose a two-phase framework to evaluate dialectal bias in LLM question-answering across nine Bengali dialects. First, we translate and gold-label standard Bengali questions into dialectal variants adopting a retrieval-augmented generation (RAG) pipeline to prepare 4,000 question sets. Since traditional translation quality evaluation metrics fail on unstandardized dialects, we evaluate fidelity using an LLM-as-a-judge, which human correlation confirms outperforms legacy metrics. Second, we benchmark 19 LLMs across these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic divergence. For instance, responses to the highly divergent Chittagong dialect score 5.44/10, compared to 7.68/10 for Tangail. Furthermore, increased model scale does not consistently mitigate this bias. We contribute a validated translation quality evaluation method, a rigorous benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.

CataractSAM-2: A Domain-Adapted Model for Anterior Segment Surgery Segmentation and Scalable Ground-Truth Annotation

cs.CV cs.AI cs.DB Mohammad Eslami, Dhanvinkumar Ganeshkumar, Saber Kazeminasab et al. · Mar 23, 2026

CataractSAM-2 adapts Meta's Segment Anything Model 2 (SAM-2) for real-time semantic segmentation in cataract surgery videos. The core idea is to fine-tune only the prompt encoder and mask decoder while freezing the image encoder, enabling precise segmentation of anatomical structures and surgical instruments under challenging conditions like glare and occlusion. The paper also introduces an interactive annotation framework that propagates sparse user prompts across video frames to accelerate ground-truth generation.

We present CataractSAM-2, a domain-adapted extension of Meta's Segment Anything Model 2, designed for real-time semantic segmentation of cataract ophthalmic surgery videos with high accuracy. Positioned at the intersection of computer vision and medical robotics, CataractSAM-2 enables precise intraoperative perception crucial for robotic-assisted and computer-guided surgical systems. Furthermore, to alleviate the burden of manual labeling, we introduce an interactive annotation framework that combines sparse prompts with video-based mask propagation. This tool significantly reduces annotation time and facilitates the scalable creation of high-quality ground-truth masks, accelerating dataset development for ocular anterior segment surgeries. We also demonstrate the model's strong zero-shot generalization to glaucoma trabeculectomy procedures, confirming its cross-procedural utility and potential for broader surgical applications. The trained model and annotation toolkit are released as open-source resources, establishing CataractSAM-2 as a foundation for expanding anterior ophthalmic surgical datasets and advancing real-time AI-driven solutions in medical robotics, as well as surgical video understanding.

CatRAG: Functor-Guided Structural Debiasing with Retrieval Augmentation for Fair LLMs

cs.CL cs.AI Ravi Ranjan, Utkarsh Grover, Mayur Akewar et al. · Mar 23, 2026

Large Language Models often inherit societal biases that manifest as stereotyped associations across demographic groups. This paper proposes CatRAG, a dual-mechanism debiasing framework that combines a category-theoretic functor-guided projection—collapsing protected-attribute directions in embedding space via spectral decomposition—with diversity-aware Retrieval-Augmented Generation to ground inference in balanced evidence. Evaluated on the BBQ benchmark across Llama-3, GPT-OSS, and Gemma-3, the method claims to reduce bias scores from ~60% to near zero while improving accuracy by up to 40% over base models.

Large Language Models (LLMs) are deployed in high-stakes settings but can show demographic, gender, and geographic biases that undermine fairness and trust. Prior debiasing methods, including embedding-space projections, prompt-based steering, and causal interventions, often act at a single stage of the pipeline, resulting in incomplete mitigation and brittle utility trade-offs under distribution shifts. We propose CatRAG Debiasing, a dual-pronged framework that integrates functor with Retrieval-Augmented Generation (RAG) guided structural debiasing. The functor component leverages category-theoretic structure to induce a principled, structure-preserving projection that suppresses bias-associated directions in the embedding space while retaining task-relevant semantics. On the Bias Benchmark for Question Answering (BBQ) across three open-source LLMs (Meta Llama-3, OpenAI GPT-OSS, and Google Gemma-3), CatRAG achieves state-of-the-art results, improving accuracy by up to 40% over the corresponding base models and by more than 10% over prior debiasing methods, while reducing bias scores to near zero (from 60% for the base models) across gender, nationality, race, and intersectional subgroups.

mSFT: Addressing Dataset Mixtures Overfitting Heterogeneously in Multi-task SFT

cs.LG cs.AI Woosung Koh, Jeyoung Jeon, Youngjin Song et al. · Mar 23, 2026

The paper tackles the inefficiency of homogeneous compute allocation in multi-task supervised fine-tuning (SFT), where fast-learning tasks overfit while slow ones remain under-trained. The authors propose mSFT, an iterative algorithm that dynamically excludes overfitting sub-datasets and reverts to optimal checkpoints. Their approach consistently outperforms baselines across 6 models and 10 benchmarks, sometimes reducing compute while improving accuracy.

Current language model training commonly applies multi-task Supervised Fine-Tuning (SFT) using a homogeneous compute budget across all sub-datasets. This approach is fundamentally sub-optimal: heterogeneous learning dynamics cause faster-learning tasks to overfit early while slower ones remain under-fitted. To address this, we introduce mSFT, an iterative, overfitting-aware search algorithm for multi-task data mixtures. mSFT trains the model on an active mixture, identifies and excludes the earliest overfitting sub-dataset, and reverts to that specific optimal checkpoint before continuing. Extensive evaluations demonstrate that mSFT consistently outperforms 4 baselines across 10 benchmarks and 6 base models. Further analysis confirms mSFT maintains robust gains across diverse dataset sizes, task granularities, and is insensitive to its single new hyperparameter (compute budget). Notably, at low compute budget, mSFT can improve performance while lowering training FLOPs. Ultimately, mSFT establishes a practical overfitting-aware algorithm for multi-task SFT that maximizes the potential of models across diverse data mixtures.
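
The train, exclude, revert loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "model" here is just a count of steps trained per sub-dataset, and `train_step`, `msft_sketch`, `toy_loss`, and the sub-dataset names are hypothetical stand-ins. The `budget` argument is assumed to play the role of the compute-budget hyperparameter.

```python
from typing import Callable, Dict, List, Set, Tuple

State = Dict[str, int]  # toy "model": steps trained on each sub-dataset


def train_step(state: State, active: Set[str]) -> State:
    """Hypothetical stand-in for one SFT step on the active mixture."""
    return {d: n + (1 if d in active else 0) for d, n in state.items()}


def msft_sketch(
    state: State,
    val_loss: Callable[[State, str], float],
    budget: int,
) -> Tuple[State, List[str]]:
    """Exclude sub-datasets one at a time, in the order they overfit."""
    active = set(state)
    excluded_order: List[str] = []
    while active:
        # Train on the active mixture, keeping every intermediate checkpoint.
        checkpoints = [state]
        for _ in range(budget):
            checkpoints.append(train_step(checkpoints[-1], active))
        # Step with the lowest validation loss for each active sub-dataset.
        best_step = {
            d: min(range(len(checkpoints)),
                   key=lambda t: val_loss(checkpoints[t], d))
            for d in active
        }
        # The sub-dataset whose loss bottoms out first overfits earliest.
        first = min(sorted(active), key=lambda d: best_step[d])
        state = checkpoints[best_step[first]]  # revert to that checkpoint
        active.remove(first)
        excluded_order.append(first)
    return state, excluded_order


# Toy usage: each sub-dataset has a different ideal number of steps.
optimum = {"math": 2, "code": 5, "chat": 9}
toy_loss = lambda s, d: (s[d] - optimum[d]) ** 2
final_state, order = msft_sketch({d: 0 for d in optimum}, toy_loss, budget=10)
print(order, final_state)
```

In this toy setting every sub-dataset ends at its own optimum, whereas a single shared step count would have to compromise between them. Real training dynamics are not independent across sub-datasets, so the sketch only conveys the control flow.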

Unified-MAS: Universally Generating Domain-Specific Nodes for Empowering Automatic Multi-Agent Systems

cs.AI Hehai Lin, Yu Yan, Zixuan Wang et al. · Mar 23, 2026

Unified-MAS tackles a critical failure mode in automatic Multi-Agent Systems: their severe performance degradation in knowledge-intensive domains like healthcare and law, where general-purpose reasoning nodes fall short. The core innovation decouples granular node implementation from topological orchestration through an offline two-stage pipeline that synthesizes domain-specific agent nodes via external knowledge retrieval and refines them using a perplexity-guided reward signal. This paradigm matters because it promises to catapult general-purpose Auto-MAS to expert-level performance without costly manual engineering of domain-specific agents.

Automatic Multi-Agent Systems (MAS) generation has emerged as a promising paradigm for solving complex reasoning tasks. However, existing frameworks are fundamentally bottlenecked when applied to knowledge-intensive domains (e.g., healthcare and law). They either rely on a static library of general nodes like Chain-of-Thought, which lack specialized expertise, or attempt to generate nodes on the fly. In the latter case, the orchestrator is not only bound by its internal knowledge limits but must also simultaneously generate domain-specific logic and optimize high-level topology, leading to a severe architectural coupling that degrades overall system efficacy. To bridge this gap, we propose Unified-MAS that decouples granular node implementation from topological orchestration via offline node synthesis. Unified-MAS operates in two stages: (1) Search-Based Node Generation retrieves external open-world knowledge to synthesize specialized node blueprints, overcoming the internal knowledge limits of LLMs; and (2) Reward-Based Node Optimization utilizes a perplexity-guided reward to iteratively enhance the internal logic of bottleneck nodes. Extensive experiments across four specialized domains demonstrate that integrating Unified-MAS into four Automatic-MAS baselines yields a better performance-cost trade-off, achieving up to a 14.2% gain while significantly reducing costs. Further analysis reveals its robustness across different designer LLMs and its effectiveness on conventional tasks such as mathematical reasoning.

AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

cs.AI cs.CL Liang Ding · Mar 22, 2026

AgentHER tackles the data waste problem in LLM agent training by adapting Hindsight Experience Replay (HER) from RL to natural-language trajectories. The core insight is that failed trajectories—typically 60–75% of collected data—often represent valid demonstrations for achievable alternative goals. The paper proposes a four-stage pipeline with multi-judge verification that converts discarded failures into SFT and DPO training data, yielding +7.1–11.7 pp gains over success-only fine-tuning across four model families on WebArena and ToolBench.

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience. We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation. The key insight is simple: a trajectory that fails goal A is often a correct demonstration for some achievable alternative goal B. AgentHER realises this idea through a four-stage pipeline -- failure classification, outcome extraction, LLM-guided prompt relabeling with confidence gating, and data packaging -- that converts discarded failures into high-quality SFT, DPO, and ShareGPT training data, with both zero-cost rule-based and LLM-judge implementations. On WebArena (Zhou et al., 2024) and ToolBench (Qin et al., 2024), AgentHER improves over success-only SFT by +7.1-11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), while achieving 2x data efficiency -- matching baseline performance with only 50% of successful demonstrations. Gains are consistent from 1.5B to 72B parameters (+5.8-9.2 pp) and compound under iterative redeployment (+2.1 pp over additional rounds). Human evaluation confirms 97.7% relabeling precision under multi-judge verification.

The AI Scientific Community: Agentic Virtual Lab Swarms

cs.AI Ulisses Braga-Neto · Mar 22, 2026

This paper proposes a conceptual framework for AI-driven scientific discovery by treating swarms of autonomous virtual laboratories as particles in a particle swarm optimization (PSO) system. Each virtual lab—comprising LLM-based agents for planning, experimentation, and review—operates as an independent research unit that interacts with others through citation-analogous voting mechanisms. The central idea is to simulate the emergent dynamics of real scientific communities (exploration-exploitation balance, paradigm formation, natural selection of ideas) without a central coordinator. The work matters because current single-agent systems like The AI Scientist may lack the diversity and error-correction mechanisms that make human science robust.

In this short note we propose using agentic swarms of virtual labs as a model of an AI Science Community. In this paradigm, each particle in the swarm represents a complete virtual laboratory instance, enabling collective scientific exploration that mirrors real-world research communities. The framework leverages the inherent properties of swarm intelligence - decentralized coordination, balanced exploration-exploitation trade-offs, and emergent collective behavior - to simulate the behavior of a scientific community and potentially accelerate scientific discovery. We discuss architectural considerations, inter-laboratory communication and influence mechanisms including citation-analogous voting systems, fitness function design for quantifying scientific success, anticipated emergent behaviors, mechanisms for preventing lab dominance and preserving diversity, and computational efficiency strategies to enable large swarms exhibiting complex emergent behavior analogous to real-world scientific communities. A working instance of the AI Science Community is currently under development.

Beyond Correlation: Refutation-Validated Aspect-Based Sentiment Analysis for Explainable Energy Market Returns

cs.AI cs.CL cs.LG Wihan van der Heever, Keane Ong, Ranjan Satapathy et al. · Mar 23, 2026

This paper addresses the fundamental problem that correlational sentiment analysis cannot distinguish genuine economic associations from spurious statistical artifacts in financial markets. The core contribution is a refutation-validated framework for aspect-based sentiment analysis that combines net-ratio sentiment scoring with four robustness tests—placebo, random common cause, subset stability, and bootstrap validation—to filter false discoveries in high-dimensional sentiment-return analysis. This matters because investment strategies built on spurious correlations can lead to systematic losses, and regulators increasingly demand explainable AI systems with auditable validation.

This paper proposes a refutation-validated framework for aspect-based sentiment analysis in financial markets, addressing the limitations of correlational studies that cannot distinguish genuine associations from spurious ones. Using X data for the energy sector, we test whether aspect-level sentiment signals show robust, refutation-validated relationships with equity returns. Our pipeline combines net-ratio scoring with z-normalization, OLS with Newey West HAC errors, and refutation tests including placebo, random common cause, subset stability, and bootstrap. Across six energy tickers, only a few associations survive all checks, while renewables show aspect and horizon specific responses. While not establishing causality, the framework provides statistically robust, directionally interpretable signals, with limited sample size (six stocks, one quarter) constraining generalizability and framing this work as a methodological proof of concept.
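
One of the refutation checks named above, the placebo test, can be sketched on synthetic data. The version below uses a common form of the test, replacing the sentiment series with a permuted copy and checking that the estimated effect disappears. It is an assumption that the paper's placebo test takes this form, and the sketch uses a plain OLS slope rather than the Newey-West HAC errors the authors describe. All data and names are synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
sentiment = rng.normal(size=n)                  # synthetic z-scored sentiment
returns = 0.5 * sentiment + rng.normal(size=n)  # synthetic returns with a real link


def ols_slope(x: np.ndarray, y: np.ndarray) -> float:
    """Slope of a one-regressor OLS fit of y on x (with intercept)."""
    return float(np.cov(x, y)[0, 1] / np.var(x, ddof=1))


observed = ols_slope(sentiment, returns)

# Placebo: shuffle the sentiment series so any true link to returns is broken.
placebo = np.array(
    [ols_slope(rng.permutation(sentiment), returns) for _ in range(200)]
)

# Share of placebo slopes at least as extreme as the observed one.
p_value = float(np.mean(np.abs(placebo) >= abs(observed)))
print(f"observed slope={observed:.3f}, placebo p-value={p_value:.3f}")
```

An association that survives this check has an observed slope well outside the placebo distribution. A spurious one would sit inside it and be filtered out.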

Silent Commitment Failure in Instruction-Tuned Language Models: Evidence of Governability Divergence Across Architectures

cs.AI cs.CR cs.LG Gregory M. Ruddell · Mar 22, 2026

This paper introduces "silent commitment failure" — a phenomenon where instruction-tuned language models produce confident, incorrect outputs with no detectable pre-commitment warning signal — and proposes "governability" as a measurable property for AI agent safety. The core claim is that 2 of 3 instruction-following models evaluated exhibit zero-warning failure modes, with profound implications for autonomous agent deployment. The work distinguishes itself from hallucination studies by focusing on detectability before commitment rather than correctness of output, and presents empirical evidence that conflict-detection signals (the "authority band") are geometric properties fixed at pretraining rather than injectable through fine-tuning.

As large language models are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime. We present empirical evidence that this assumption fails for two of three instruction-following models evaluable for conflict detection. We introduce governability -- the degree to which a model's errors are detectable before output commitment and correctable once detected -- and demonstrate it varies dramatically across models. In six models across twelve reasoning domains, two of three instruction-following models exhibited silent commitment failure: confident, fluent, incorrect output with zero warning signal. The remaining model produced a detectable conflict signal 57 tokens before commitment under greedy decoding. We show benchmark accuracy does not predict governability, correction capacity varies independently of detection, and identical governance scaffolds produce opposite effects across models. A 2x2 experiment shows a 52x difference in spike ratio between architectures but only +/-0.32x variation from fine-tuning, suggesting governability is fixed at pretraining. We propose a Detection and Correction Matrix classifying model-task combinations into four regimes: Governable, Monitor Only, Steer Blind, and Ungovernable.

Rethinking SAR ATR: A Target-Aware Frequency-Spatial Enhancement Framework with Noise-Resilient Knowledge Guidance

cs.CV cs.AI Yansong Lin, Zihan Cheng, Jielei Wang et al. · Mar 23, 2026

This paper tackles SAR (Synthetic Aperture Radar) automatic target recognition under coherent speckle noise. It proposes FSCE, a framework combining frequency-domain wavelet decomposition with spatial multi-scale convolutions in a shallow feature enhancement module (DSAF), guided by online knowledge distillation from a ResNet101 teacher. The work matters because SAR imagery suffers from unique multiplicative noise that obscures target features, yet the claimed improvements appear marginal on saturated benchmarks.

Synthetic aperture radar automatic target recognition (SAR ATR) is of considerable importance in marine navigation and disaster monitoring. However, the coherent speckle noise inherent in SAR imagery often obscures salient target features, leading to degraded recognition accuracy and limited model generalization. To address this issue, this paper proposes a target-aware frequency-spatial enhancement framework with noise-resilient knowledge guidance (FSCE) for SAR target recognition. The proposed framework incorporates a frequency-spatial shallow feature adaptive enhancement (DSAF) module, which processes shallow features through spatial multi-scale convolution and frequency-domain wavelet convolution. In addition, a teacher-student learning paradigm combined with an online knowledge distillation method (KD) is employed to guide the student network to focus more effectively on target regions, thereby enhancing its robustness to high-noise backgrounds. Through the collaborative optimization of attention transfer and noise-resilient representation learning, the proposed approach significantly improves the stability of target recognition under noisy conditions. Based on the FSCE framework, two network architectures with different performance emphases are developed: lightweight DSAFNet-M and high-precision DSAFNet-L. Extensive experiments are conducted on the MSTAR, FUSARShip and OpenSARShip datasets. The results show that DSAFNet-L achieves competitive or superior performance compared with various methods on three datasets; DSAFNet-M significantly reduces the model complexity while maintaining comparable accuracy. These results indicate that the proposed FSCE framework exhibits strong cross-model generalization.

SafePilot: A Framework for Assuring LLM-enabled Cyber-Physical Systems

cs.RO cs.AI Weizhe Xu, Mengyu Liu, Fanxin Kong · Mar 23, 2026

SafePilot addresses a critical gap in deploying Large Language Models (LLMs) for cyber-physical systems (CPS): LLM "hallucinations" can generate plausible-sounding but unsafe plans that violate safety constraints or temporal requirements. The authors propose a hierarchical neuro-symbolic framework that combines LLM planning with formal verification—using First-Order Logic (FOL) for attribute-based constraints and Linear Temporal Logic (LTL) for temporal constraints—to ensure plans satisfy specifications before execution.

Large Language Models (LLMs), deep learning architectures with typically over 10 billion parameters, have recently begun to be integrated into various cyber-physical systems (CPS) such as robotics, industrial automation, and autopilot systems. The abstract knowledge and reasoning capabilities of LLMs are employed for tasks like planning and navigation. However, a significant challenge arises from the tendency of LLMs to produce "hallucinations" - outputs that are coherent yet factually incorrect or contextually unsuitable. This characteristic can lead to undesirable or unsafe actions in the CPS. Therefore, our research focuses on assuring the LLM-enabled CPS by enhancing their critical properties. We propose SafePilot, a novel hierarchical neuro-symbolic framework that provides end-to-end assurance for LLM-enabled CPS according to attribute-based and temporal specifications. Given a task and its specification, SafePilot first invokes a hierarchical planner with a discriminator that assesses task complexity. If the task is deemed manageable, it is passed directly to an LLM-based task planner with built-in verification. Otherwise, the hierarchical planner applies a divide-and-conquer strategy, decomposing the task into sub-tasks, each of which is individually planned and later merged into a final solution. The LLM-based task planner translates natural language constraints into formal specifications and verifies the LLM's output against them. If violations are detected, it identifies the flaw, adjusts the prompt accordingly, and re-invokes the LLM. This iterative process continues until a valid plan is produced or a predefined limit is reached. Our framework supports LLM-enabled CPS with both attribute-based and temporal constraints. Its effectiveness and adaptability are demonstrated through two illustrative case studies.

Riemannian Geometry Speaks Louder Than Words: From Graph Foundation Model to Next-Generation Graph Intelligence

cs.LG cs.AI Philip S. Yu, Li Sun · Mar 23, 2026

This paper proposes Riemannian Foundation Model (RFM), a vision for unifying graph learning through Riemannian geometry rather than GNN message-passing or LLM serialization. The authors argue that graphs are discrete analogs of manifolds, and that concepts like vector bundles, curvature, and parallel transport provide the proper toolkit for universal graph modeling—enabling both structural inference and generation in a way that current Euclidean GNNs and tokenized LLMs cannot achieve.

Graphs provide a natural description of the complex relationships among objects, and play a pivotal role in communications, transportation, social computing, the life sciences, etc. Currently, there is strong agreement that Graph Foundation Models (GFMs) are essential for advancing graph learning, yet considerable disagreement persists on how to build a powerful, general-purpose GFM analogous to Large Language Models (LLMs). Graph Neural Networks (GNNs) exhibit limitations in memory retention and principled interpretability when confronted with multi-domain pretraining and adaptation. The challenge of graph serialization hinders the direct application of LLMs, as words struggle to capture the structural complexity and diversity inherent in graphs. In contrast, Riemannian geometry offers an elegant mathematical framework for modeling structures, while remaining compatible with graph semantic learning, even with LLMs. In this paper, we argue that, for graphs, Riemannian geometry speaks louder than words, and lay out the foundational principles for GFM. Reimagining with Riemannian geometry, we introduce a blue sky idea, the Riemannian Foundation Model (RFM), that opens a new pathway for capturing complex structural patterns and uncovering cross-domain generalities. RFM emphasizes intrinsic graph geometry and embodies endogenous capacities for structural inference and generation, moving beyond mere representation-space switching. Accordingly, we outline a progressive agenda that begins with universal structural understanding through intrinsic geometry, and then rebuilds the LLM with a Riemannian engine for general-purpose graph modeling and beyond. Thus, RFM enables a paradigm shift from designing graph models to solving graph-structured applications with RFM agents, unlocking the next-generation graph intelligence.


Fingerprinting Deep Neural Networks for Ownership Protection: An Analytical Approach

cs.CR cs.AI Guang Yang, Ziye Geng, Yihang Chen et al. · Mar 22, 2026

Existing adversarial-example-based fingerprinting schemes rely on empirical heuristics to set the fingerprint-to-boundary distance, risking violations of either robustness or uniqueness. This paper proposes AnaFP, an analytical approach that derives theoretical lower and upper bounds $\tau_{\text{lower}} < \tau < \tau_{\text{upper}}$ on a stretch factor controlling this distance. By formalizing robustness and uniqueness constraints and employing surrogate model pools with quantile-based relaxation, AnaFP generates fingerprints with guaranteed properties, validated across CNNs, MLPs, and GNNs.

Adversarial-example-based fingerprinting approaches, which leverage the decision boundary characteristics of deep neural networks (DNNs) to craft fingerprints, have proven effective for model ownership protection. However, a fundamental challenge remains unresolved: how far a fingerprint should be placed from the decision boundary to simultaneously satisfy two essential properties, i.e., robustness and uniqueness, for effective and reliable ownership protection. Despite the importance of the fingerprint-to-boundary distance, existing works lack a theoretical solution and instead rely on empirical heuristics, which may violate either robustness or uniqueness properties. We propose AnaFP, an analytical fingerprinting scheme that constructs fingerprints under theoretical guidance. Specifically, we formulate fingerprint generation as controlling the fingerprint-to-boundary distance through a tunable stretch factor. To ensure both robustness and uniqueness, we mathematically formalize these properties that determine the lower and upper bounds of the stretch factor. These bounds jointly define an admissible interval within which the stretch factor must lie, thereby establishing a theoretical connection between the two constraints and the fingerprint-to-boundary distance. To enable practical fingerprint generation, we approximate the original (infinite) sets of pirated and independently trained models using two finite surrogate model pools and employ a quantile-based relaxation strategy to relax the derived bounds. Due to the circular dependency between the lower bound and the stretch factor, we apply grid search over the admissible interval to determine the most feasible stretch factor. Extensive experimental results show that AnaFP consistently outperforms prior methods, achieving effective ownership verification across diverse model architectures and model modification attacks.


Toward a Theory of Hierarchical Memory for Language Agents

cs.IR cs.AI cs.IT Yashar Talebirad, Ali Parsaee, Csongor Y. Szepesvari et al. · Mar 23, 2026

This paper tackles the lack of shared formalism for comparing hierarchical memory systems in language agents. It proposes a unifying theory based on three operators: extraction (α) that maps raw data to atomic units, coarsening (C = (π, ρ)) that partitions and summarizes units, and traversal (τ) that selects content under a token budget. The core insight is the self-sufficiency spectrum of representatives ρ, which constrains viable retrieval strategies—an observation the authors call the coarsening-traversal (C–T) coupling.

Many recent long-context and agentic systems address context-length limitations by adding hierarchical memory: they extract atomic units from raw data, build multi-level representatives by grouping and compression, and traverse this structure to retrieve content under a token budget. Despite recurring implementations, there is no shared formalism for comparing design choices. We propose a unifying theory in terms of three operators. Extraction ($\alpha$) maps raw data to atomic information units; coarsening ($C = (\pi, \rho)$) partitions units and assigns a representative to each group; and traversal ($\tau$) selects which units to include in context given a query and budget. We identify a self-sufficiency spectrum for the representative function $\rho$ and show how it constrains viable retrieval strategies (a coarsening-traversal coupling). Finally, we instantiate the decomposition on eleven existing systems spanning document hierarchies, conversational memory, and agent execution traces, showcasing its generality.
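As a rough illustration of the three-operator decomposition, the toy sketch below instantiates extraction, coarsening, and traversal on plain text. Every concrete choice is an assumption made for the example and not taken from the paper: units are sentences, the partition is fixed-size chunking, the representative is simply the first unit of each group, relevance is word overlap, and the budget is counted in words, not tokens.

```python
from dataclasses import dataclass


@dataclass
class Group:
    representative: str  # rho: the text standing in for the group
    members: list[str]   # atomic units assigned to this group by pi


def extract(raw: str) -> list[str]:
    """alpha: map raw text to atomic units (here, non-empty sentences)."""
    return [s.strip() for s in raw.split(".") if s.strip()]


def coarsen(units: list[str], group_size: int = 2) -> list[Group]:
    """C = (pi, rho): chunk units into groups (pi) and pick the
    first unit of each chunk as its representative (rho)."""
    groups = []
    for i in range(0, len(units), group_size):
        chunk = units[i:i + group_size]
        groups.append(Group(representative=chunk[0], members=chunk))
    return groups


def traverse(groups: list[Group], query: str, budget: int) -> list[str]:
    """tau: rank groups by query/representative word overlap, then
    collect member units until the word budget would be exceeded."""
    q = set(query.lower().split())
    ranked = sorted(
        groups,
        key=lambda g: len(q & set(g.representative.lower().split())),
        reverse=True,
    )
    selected, used = [], 0
    for group in ranked:
        for unit in group.members:
            cost = len(unit.split())
            if used + cost > budget:
                return selected
            selected.append(unit)
            used += cost
    return selected
```

Because the representative here is only a routing key and not a self-sufficient summary, `traverse` has to descend to the member units, which is one simple way to picture the coarsening-traversal coupling the abstract mentions.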


Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation

cs.SE cs.AI Lingzhe Zhang, Tong Jia, Mingyu Wang et al. · Mar 23, 2026

This paper addresses the challenge of efficient failure management in LLM-based Multi-Agent Systems (MASs). Existing approaches rely on expensive per-trace reasoning with large judge LLMs, which is slow and unstable. The core contribution is EAGER, a framework that uses unsupervised reasoning-scoped contrastive learning to encode intra-agent and inter-agent dynamics into embeddings, enabling real-time step-wise failure detection and reflexive mitigation guided by historical patterns rather than costly LLM inference.

Large Language Model (LLM)-based Multi-Agent Systems (MASs) have emerged as a new paradigm in software system design, increasingly demonstrating strong reasoning and collaboration capabilities. As these systems become more complex and autonomous, effective failure management is essential to ensure reliability and availability. However, existing approaches often rely on per-trace reasoning, which leads to low efficiency, and neglect historical failure patterns, limiting diagnostic accuracy. In this paper, we conduct a preliminary empirical study to demonstrate the necessity, potential, and challenges of leveraging historical failure patterns to enhance failure management in MASs. Building on this insight, we propose EAGER, an efficient failure management framework for multi-agent systems based on reasoning trace representation. EAGER employs unsupervised reasoning-scoped contrastive learning to encode both intra-agent reasoning and inter-agent coordination, enabling real-time step-wise failure detection, diagnosis, and reflexive mitigation guided by historical failure knowledge. Preliminary evaluations on three open-source MASs demonstrate the effectiveness of EAGER and highlight promising directions for future research in reliable multi-agent system operations.


Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications

cs.IT cs.AI cs.SY Che Chen, Lanhua Li, Shimin Gong et al. · Mar 23, 2026

The paper addresses multi-UAV coordination under intermittent communications by proposing a Spatio-Temporal Attention enhanced MADRL (STA-MADRL) framework. It combines delay-penalized rewards to incentivize information exchange with a prediction module that recovers missing state data using temporal and spatial attention mechanisms. The authors claim 75% throughput improvements over communication-limited baselines while achieving near-ideal performance without requiring real-time global state sharing.

In this paper, we employ multiple UAVs to accelerate data transmissions from ground users (GUs) to a remote base station (BS) via the UAVs' relay communications. The UAVs' intermittent information exchanges typically result in delays in acquiring the complete system state and hinder their effective collaboration. To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing the UAVs' trajectory planning, network formation, and transmission control strategies. Additionally, considering information loss due to unreliable channel conditions, we further propose a spatio-temporal attention based prediction approach to recover the lost information and enhance each UAV's awareness of the network state. These two designs are envisioned to enhance the network capacity in UAV-assisted wireless networks with limited communications. The simulation results reveal that our new approach achieves over 50% reduction in information delay and 75% throughput gain compared to the conventional MADRL. Interestingly, it is shown that improving the UAVs' information sharing will not sacrifice the network capacity. Instead, it significantly improves the learning performance and throughput simultaneously. It is also effective in reducing the need for UAVs' information exchange and thus fostering practical deployment of MADRL in UAV-assisted wireless networks.
