Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
199 papers in cs.AI
Trending mixes fresh papers with community signal.
cs.DC cs.AI cs.LG Peihan Ye, Alfreds Lapkovskis, Alaa Saleh et al. · Mar 22, 2026

Modern AI services increasingly run across the computing continuum, from cloud to edge devices, yet fault management remains challenging due to resource constraints, noisy telemetry, and cascading failures. This paper proposes NeSy-Edge, a three-layer neuro-symbolic framework that performs local log parsing, causal graph construction, and root-cause analysis on edge nodes, invoking cloud LLMs only when local evidence is insufficient. The core idea is to combine lightweight symbolic caching and prior-constrained causal discovery with selective neural inference, trading off autonomy against accuracy under strict memory budgets (about 1500 MB).

The computational demands of modern AI services are increasingly shifting execution beyond centralized clouds toward a computing continuum spanning edge and end devices. However, the scale, heterogeneity, and cross-layer dependencies of these environments make resilience difficult to maintain. Existing fault-management methods are often too static, fragmented, or heavy to support timely self-healing, especially under noisy logs and edge resource constraints. To address these limitations, this paper presents NeSy-Edge, a neuro-symbolic framework for trustworthy self-healing in the computing continuum. The framework follows an edge-first design, where a resource-constrained edge node performs local perception and reasoning, while a cloud model is invoked only at the final diagnosis stage. Specifically, NeSy-Edge converts raw runtime logs into structured event representations, builds a prior-constrained sparse symbolic causal graph, and integrates causal evidence with historical troubleshooting knowledge for root-cause analysis and recovery recommendation. We evaluate our work on representative Loghub datasets under multiple levels of semantic noise, considering parsing quality, causal reasoning, end-to-end diagnosis, and edge-side resource usage. The results show that NeSy-Edge remains robust even at the highest noise level, achieving up to 75% root-cause analysis accuracy and 65% end-to-end accuracy while operating within about 1500 MB of local memory.
cs.AI Zhuojie Yang, Wentao Wan, Keze Wang · Mar 22, 2026

ORACLE addresses the problem of verifying intermediate reasoning steps in synthetic LLM training data, where filtering by final answer correctness often preserves spurious reasoning paths. The method combines a structured syllogistic template (<QUERY>, <FACTS>, <RULE>, <REVISION>) with a symbolic reasoning engine (Pyke) to validate steps during beam search, generating preference data for DPO. This hybrid approach matters because it attempts to bring formal verification to natural language reasoning tasks where code execution and pure LLM evaluation fall short.

Training large language models (LLMs) with synthetic reasoning data has become a popular approach to enhancing their reasoning capabilities, while a key factor influencing the effectiveness of this paradigm is the quality of the generated multi-step reasoning data. To generate high-quality reasoning data, many recent methods generate synthetic reasoning paths and filter them based on final answer correctness, often overlooking flaws in intermediate reasoning steps. To enhance the verification of intermediate reasoning steps, prior work primarily resorts to code execution or symbolic reasoning engines. However, code-based validation is restricted to code or mathematical tasks, and reasoning engines require a well-structured and complete context. As a result, existing methods fail to function effectively in natural language reasoning tasks that involve ambiguous or incomplete contexts. In these tasks, synthetic data still lack reliable checks for verifying each reasoning step. To address this challenge, we introduce ORACLE, a structured data generation framework inspired by syllogistic reasoning. ORACLE integrates the generative strengths of LLMs with symbolic supervision: the LLM produces step-wise reasoning contexts, while a symbolic reasoning engine verifies the validity of each intermediate step. By employing a unified prompting template to elicit modular reasoning chains, ORACLE enables fine-grained, step-level validation, facilitating the construction of high-quality multi-step reasoning data. Across six logical, factual, and commonsense reasoning benchmarks, our ORACLE consistently outperforms strong baselines on multiple models.
cs.CV cs.AI Yu-Wen Tseng, Xingyi Zheng, Ya-Chen Wu et al. · Mar 22, 2026

This paper tackles Practical Test-Time Adaptation (PTTA), where models must adapt to temporally correlated, non-i.i.d. test streams without source data. Unlike prior work that stores samples in a single pool, the authors propose Multi-Cluster Memory (MCM)—organizing memory into multiple clusters based on pixel-level descriptors. The core insight, validated via Gaussian Mixture Model analysis, is that PTTA streams are inherently multi-modal (optimal K* ≈ 6–10), making single-cluster memory structurally mismatched. MCM introduces descriptor-based assignment, Adjacent Cluster Consolidation (ACC), and Uniform Cluster Retrieval (UCR), achieving consistent gains up to 12.13% on DomainNet.

Test-time adaptation (TTA) adapts pre-trained models to distribution shifts at inference using only unlabeled test data. Under the Practical TTA (PTTA) setting, where test streams are temporally correlated and non-i.i.d., memory has become an indispensable component for stable adaptation, yet existing methods universally store samples in a single unstructured pool. We show that this single-cluster design is fundamentally mismatched to PTTA: a stream clusterability analysis reveals that test streams are inherently multi-modal, with the optimal number of mixture components consistently far exceeding one. To close this structural gap, we propose Multi-Cluster Memory (MCM), a plug-and-play framework that organizes stored samples into multiple clusters using lightweight pixel-level statistical descriptors. MCM introduces three complementary mechanisms: descriptor-based cluster assignment to capture distinct distributional modes, Adjacent Cluster Consolidation (ACC) to bound memory usage by merging the most similar temporally adjacent clusters, and Uniform Cluster Retrieval (UCR) to ensure balanced supervision across all modes during adaptation. Integrated with three contemporary TTA methods on CIFAR-10-C, CIFAR-100-C, ImageNet-C, and DomainNet, MCM achieves consistent improvements across all 12 configurations, with gains up to 5.00% on ImageNet-C and 12.13% on DomainNet. Notably, these gains scale with distributional complexity: larger label spaces with greater multi-modality benefit most from multi-cluster organization. GMM-based memory diagnostics further confirm that MCM maintains near-optimal distributional balance, entropy, and mode coverage, whereas single-cluster memory exhibits persistent imbalance and progressive mode loss. These results establish memory organization as a key design axis for practical test-time adaptation.
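The three mechanisms named in this abstract (descriptor-based assignment, consolidation of adjacent clusters, and uniform retrieval) can be illustrated with a small sketch. This is not the authors' implementation: the class name, the (mean, standard deviation) descriptor, the distance threshold `new_cluster_dist`, and the choice to cap the number of clusters rather than the number of stored samples are all assumptions made for the example.

```python
import math
import random


def descriptor(pixels):
    """Hypothetical pixel-level descriptor: (mean, standard deviation) of the pixel values."""
    mu = sum(pixels) / len(pixels)
    var = sum((p - mu) ** 2 for p in pixels) / len(pixels)
    return (mu, math.sqrt(var))


class MultiClusterMemory:
    """Toy multi-cluster memory; clusters are kept in creation (temporal) order."""

    def __init__(self, max_clusters=4, new_cluster_dist=0.2):
        self.max_clusters = max_clusters          # assumed cap on the number of clusters
        self.new_cluster_dist = new_cluster_dist  # assumed threshold for opening a new cluster
        self.clusters = []                        # each: {"centroid": tuple, "samples": list}

    def add(self, pixels):
        d = descriptor(pixels)
        if self.clusters:
            # Descriptor-based assignment: pick the nearest centroid.
            best = min(self.clusters, key=lambda c: math.dist(c["centroid"], d))
            if math.dist(best["centroid"], d) <= self.new_cluster_dist:
                best["samples"].append(pixels)
                n = len(best["samples"])
                # Incremental mean update of the centroid.
                best["centroid"] = tuple(c + (x - c) / n for c, x in zip(best["centroid"], d))
                return
        self.clusters.append({"centroid": d, "samples": [pixels]})
        if len(self.clusters) > self.max_clusters:
            self._consolidate()

    def _consolidate(self):
        """Merge the most similar pair of temporally adjacent clusters."""
        i = min(
            range(len(self.clusters) - 1),
            key=lambda k: math.dist(self.clusters[k]["centroid"], self.clusters[k + 1]["centroid"]),
        )
        a, b = self.clusters[i], self.clusters[i + 1]
        na, nb = len(a["samples"]), len(b["samples"])
        centroid = tuple((na * x + nb * y) / (na + nb) for x, y in zip(a["centroid"], b["centroid"]))
        self.clusters[i:i + 2] = [{"centroid": centroid, "samples": a["samples"] + b["samples"]}]

    def retrieve(self, per_cluster, rng=random):
        """Uniform retrieval: draw up to the same number of samples from every cluster."""
        batch = []
        for c in self.clusters:
            k = min(per_cluster, len(c["samples"]))
            batch.extend(rng.sample(c["samples"], k))
        return batch
```

Drawing the same number of samples from every cluster is what keeps a rare mode from being drowned out by a dominant one; a single pool sampled uniformly over samples would not give that guarantee.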
cs.AI cs.CL Jianing Wang, Jianfei Zhang, Qi Guo et al. · Mar 22, 2026

LongCat-Flash-Prover is a 560B-parameter MoE open-source model targeting native formal reasoning in Lean4. The core innovation is decomposing formal theorem proving into three agentic capabilities—auto-formalization, sketching, and proving—trained via a Hybrid-Experts Iteration Framework and a novel RL algorithm called HisPO. The work claims state-of-the-art results on MiniF2F-Test (97.1%), ProverBench (70.8%), and PutnamBench (41.5%) with remarkably low inference budgets compared to prior open-source provers.

We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole proof directly from the statement, or producing a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. We also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test with an inference budget of only 72 per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.
cs.LG cs.AI cs.MM Bing Wang, Ximing Li, Changchun Li et al. · Mar 22, 2026

This paper tackles multimodal misinformation detection by distinguishing between harmful and harmless visual content manipulation, a nuance often overlooked by existing methods. The authors propose HAVC-M4D, a framework that extracts manipulation and intention features using weakly-supervised positive-unlabeled (PU) learning to overcome the lack of ground-truth manipulation labels. By treating real articles with manipulated visuals as likely harmless and fake articles as potentially harmful, the method introduces intention-aware cues that consistently improve detection across four benchmark datasets.

Nowadays, the widespread dissemination of misinformation across numerous social media platforms has led to severe negative effects on society. To address this challenge, the automatic detection of misinformation, particularly under multimedia scenarios, has gained significant attention from both academic and industrial communities, leading to the emergence of a research task known as Multimodal Misinformation Detection (MMD). Typically, current MMD approaches focus on capturing the semantic relationships and inconsistency between various modalities but often overlook certain critical indicators within multimodal content. Recent research has shown that manipulated features within visual content in social media articles serve as valuable clues for MMD. Meanwhile, we argue that the potential intentions behind the manipulation, e.g., harmful and harmless, also matter in MMD. Therefore, in this study, we aim to identify such multimodal misinformation by capturing two types of features: manipulation features, which represent whether visual content has been manipulated, and intention features, which assess the nature of these manipulations, distinguishing between harmful and harmless intentions. Unfortunately, the manipulation and intention labels needed to supervise these features are unknown. To address this, we introduce two weakly supervised indicators as substitutes by incorporating supplementary datasets focused on image manipulation detection and framing two different classification tasks as positive and unlabeled learning problems. With this framework, we introduce an innovative MMD approach, titled Harmful Visual Content Manipulation Matters in MMD (HAVC-M4D). Comprehensive experiments conducted on four prevalent MMD datasets indicate that HAVC-M4D significantly and consistently enhances the performance of existing MMD methods.
cs.CV cs.AI Gia-Bao Doan, Nam-Khoa Huynh, Minh-Nhat-Huy Ho et al. · Mar 22, 2026

This paper addresses temporal action localization (TAL) for distracted driver behaviors in untrimmed in-cabin videos, a critical task for intelligent transportation systems. The authors propose a two-stage framework combining VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module for multi-scale temporal modeling. The work targets deployment scenarios such as fleet management and transportation safety checkpoints, aiming to balance accuracy against computational constraints.

The identification of hazardous driving behaviors from in-cabin video streams is essential for enhancing road safety and supporting the detection of traffic violations and unsafe driver actions. However, current temporal action localization techniques often struggle to balance accuracy with computational efficiency. In this work, we develop and evaluate a temporal action localization framework tailored for driver monitoring scenarios, particularly suitable for periodic inspection settings such as transportation safety checkpoints or fleet management assessment systems. Our approach follows a two-stage pipeline that combines VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector, enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module to capture multi-scale temporal features. Experimental results reveal a distinct trade-off between model capacity and efficiency. At the feature extraction stage, the ViT-Giant backbone delivers stronger representations with 88.09% Top-1 test accuracy, while the ViT-based variant proves to be a practical alternative, achieving 82.55% accuracy with significantly lower computational fine-tuning costs (101.85 GFLOPs/segment compared to 1584.06 GFLOPs/segment for Giant). In the downstream localization task, the integration of SPPF consistently improves performance across all configurations. Notably, the ViT-Giant + SPPF model achieves a peak mAP of 92.67%, while the lightweight ViT-based configuration maintains robust results.
cs.CV cs.AI Wen Jiang, Kangyao Huang, Li Wang et al. · Mar 22, 2026

UAV vision-and-language navigation suffers from a structural mismatch between 2D visual perception and 3D trajectory decision-making. SpatialFly bridges this gap via a geometry-guided 2D representation alignment mechanism (G2RA) that injects implicit 3D geometric priors from a pretrained geometry encoder into 2D semantic tokens without explicit 3D reconstruction. Operating on RGB-only observations, the method outperforms state-of-the-art baselines on the OpenUAV benchmark, reducing navigation error by over 4 meters in unseen environments.

UAVs play an important role in applications such as autonomous exploration, disaster response, and infrastructure inspection. However, UAV vision-and-language navigation (VLN) in complex 3D environments remains challenging. A key difficulty is the structural representation mismatch between 2D visual perception and the 3D trajectory decision space, which limits spatial reasoning. To this end, we propose SpatialFly, a geometry-guided spatial representation framework for UAV VLN. Operating on RGB observations without explicit 3D reconstruction, SpatialFly introduces a geometry-guided 2D representation alignment mechanism. Specifically, the geometric prior injection module injects global structural cues into 2D semantic tokens to provide scene-level geometric guidance. The geometry-aware reparameterization module then aligns 2D semantic tokens with 3D geometric tokens through cross-modal attention, followed by gated residual fusion to preserve semantic discrimination. Experimental results show that SpatialFly consistently outperforms state-of-the-art UAV VLN baselines across both seen and unseen environments, reducing navigation error (NE) by 4.03 m and improving success rate (SR) by 1.27% over the strongest baseline on the unseen Full split. Additional trajectory-level analysis shows that SpatialFly produces trajectories with better path alignment and smoother, more stable motion.
cs.CV cs.AI Shuwei Huang, Shizhuo Liu, Zijun Wei · Mar 22, 2026

LPNSR tackles the efficiency-quality trade-off in diffusion-based image super-resolution, specifically improving upon the 4-step ResShift framework. The core idea is to replace random Gaussian noise in intermediate diffusion steps with an LR-guided noise predictor that approximates a theoretically derived optimal noise, while also replacing bicubic upsampling with a pretrained regression network for better initialization. The method achieves strong perceptual results without relying on large-scale text-to-image priors.

Diffusion-based image super-resolution (SR), which aims to reconstruct high-resolution (HR) images from corresponding low-resolution (LR) observations, faces a fundamental trade-off between inference efficiency and reconstruction quality. The state-of-the-art residual-shifting diffusion framework achieves efficient 4-step inference, yet suffers from severe performance degradation in compact sampling trajectories. This is mainly attributed to two core limitations: the inherent suboptimality of unconstrained random Gaussian noise in intermediate steps, which leads to error accumulation and insufficient LR prior guidance, and the initialization bias caused by naive bicubic upsampling. In this paper, we propose LPNSR, a prior-enhanced efficient diffusion framework to address these issues. We first mathematically derive the closed-form analytical solution of the optimal intermediate noise for the residual-shifting diffusion paradigm, and accordingly design an LR-guided multi-input-aware noise predictor to replace random Gaussian noise, embedding LR structural priors into the reverse process while fully preserving the framework's core efficient residual-shifting mechanism. We further mitigate initial bias with a high-quality pre-upsampling network to optimize the diffusion starting point. With a compact 4-step trajectory, LPNSR can be optimized in an end-to-end manner. Extensive experiments demonstrate that LPNSR achieves state-of-the-art perceptual performance on both synthetic and real-world datasets, without relying on any large-scale text-to-image priors. The source code of our method can be found at https://github.com/Faze-Hsw/LPNSR.
cs.AI Ye Tian, Jingyi Zhang, Zihao Wang et al. · Mar 22, 2026

KLDrive addresses fine-grained 3D scene question answering for autonomous driving by coupling an energy-based model for reliable scene knowledge graph construction with a frozen LLM agent that reasons over a constrained symbolic action space. The core insight is that decoupling noisy perception (handled by an EBM that refines multi-source camera and LiDAR detections) from interpretable reasoning (handled by a tool-using LLM with explicit Plan-Execute-Observe loops) substantially reduces hallucinations. The system achieves 65.04% accuracy on NuScenes-QA and a 46.01 percentage point improvement on counting tasks over prior state-of-the-art, without task-specific fine-tuning of the LLM backbone.

Autonomous driving requires reliable reasoning over fine-grained 3D scene facts. Fine-grained question answering over multi-modal driving observations provides a natural way to evaluate this capability, yet existing perception pipelines and driving-oriented large language model (LLM) methods still suffer from unreliable scene facts, hallucinations, opaque reasoning, and heavy reliance on task-specific training. We present KLDrive, the first knowledge-graph-augmented LLM reasoning framework for fine-grained question answering in autonomous driving. KLDrive addresses this problem through designing two tightly coupled components: an energy-based scene fact construction module that consolidates multi-source evidence into a reliable scene knowledge graph, and an LLM agent that performs fact-grounded reasoning over a constrained action space under explicit structural constraints. By combining structured prompting with few-shot in-context exemplars, the framework adapts to diverse reasoning tasks without heavy task-specific fine-tuning. Experiments on two large-scale autonomous-driving QA benchmarks show that KLDrive outperforms prior state-of-the-art methods, achieving the best overall accuracy of 65.04% on NuScenes-QA and the best SPICE score of 42.45 on GVQA. On counting, the most challenging factual reasoning task, it improves over the strongest baseline by 46.01 percentage points, demonstrating substantially reduced hallucinations and the benefit of coupling reliable scene fact construction with explicit reasoning.
cs.CL cs.AI cs.LG Jinquan Zheng, Jia Yuan, Jiacheng Yao et al. · Mar 22, 2026

This paper addresses selection bias (position and label bias) in large language models during discrete-choice tasks like multiple-choice questions and pairwise evaluation. The authors propose Permutation-Aware GRPO (PA-GRPO), which extends Group Relative Policy Optimization by treating different permutations of the same question as a single training group rather than independent instances. The method enforces semantic consistency across permutations through two mechanisms: a cross-permutation advantage that computes rewards relative to the group mean, and a consistency-aware reward that penalizes disagreement across permutations. Experiments across seven benchmarks and three models (Llama-3.1-8B, Qwen3-8B, Qwen3-32B) demonstrate that PA-GRPO reduces selection bias while maintaining accuracy.

Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on Github (https://github.com/ECNU-Text-Computing/PA-GRPO).
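The two mechanisms in this abstract can be sketched for a single permutation group. This is an illustration only, not the PA-GRPO code: the function name, the majority-agreement form of the consistency reward, its weight `lam`, and the division by the group standard deviation (borrowed from standard GRPO) are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def cross_permutation_advantages(rewards, decisions, lam=0.5, eps=1e-8):
    """Advantages for all permutations of ONE question, treated as a single group.

    rewards   -- task reward per permutation (e.g. 1.0 if the answer is correct, else 0.0)
    decisions -- option content chosen under each permutation, mapped back from the
                 shuffled label so that "B" in one ordering and "D" in another compare equal
    lam       -- assumed weight of the consistency-aware term
    """
    # Consistency-aware reward (assumed form): bonus for agreeing with the majority decision.
    majority = Counter(decisions).most_common(1)[0][0]
    shaped = [r + lam * (1.0 if d == majority else 0.0) for r, d in zip(rewards, decisions)]

    # Cross-permutation advantage: centre on the mean reward over the permutation group.
    mu = mean(shaped)
    sigma = pstdev(shaped)
    return [(s - mu) / (sigma + eps) for s in shaped]


# Four orderings of the same question; the model flips its answer in the third one.
advantages = cross_permutation_advantages(
    rewards=[1.0, 1.0, 0.0, 1.0],
    decisions=["Paris", "Paris", "Lyon", "Paris"],
)
```

Because the baseline is the group mean, a permutation is only rewarded for doing better than the same question in other orderings, which is what ties the update to position and label sensitivity rather than to question difficulty.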
cs.AI cs.GT cs.LG Benedikt Hornig, Reuth Mirsky · Mar 22, 2026

This paper addresses the challenge of "intelligent disobedience" in shared autonomy — when assistive AI must override human commands to prevent harm but remain helpful. The authors formalize this as the Intelligent Disobedience Game (IDG), a sequential Stackelberg game where a human leader proposes actions and an assistive follower with superior environmental awareness decides whether to obey or intervene. The framework aims to provide the mathematical foundations for training safety-critical assistive systems.

In shared autonomy, a critical tension arises when an automated assistant must choose between obeying a human's instruction and deliberately overriding it to prevent harm. This safety-critical behavior is known as intelligent disobedience. To formalize this dynamic, this paper introduces the Intelligent Disobedience Game (IDG), a sequential game-theoretic framework based on Stackelberg games that models the interaction between a human leader and an assistive follower operating under asymmetric information. It characterizes optimal strategies for both agents across multi-step scenarios, identifying strategic phenomena such as "safety traps," where the system indefinitely avoids harm but fails to achieve the human's goal. The IDG provides a needed mathematical foundation that enables both the algorithmic development of agents that can learn safe non-compliance and the empirical study of how humans perceive and trust disobedient AI. The paper further translates the IDG into a shared control Multi-Agent Markov Decision Process representation, forming a compact computational testbed for training reinforcement learning agents.
cs.LG cs.AI Yuma Aoki, Joon Park, Koh Takeuchi et al. · Mar 22, 2026

This paper addresses the problem of forecasting outlier events far in advance in time series data, rather than merely detecting immediate anomalies. The authors propose a two-layer framework that first computes outlier scores using standard detection methods, then models the temporal structure of these scores to predict future anomalies. By assuming that outlier occurrences exhibit temporal patterns (e.g., periodicity or delayed dependencies), the method aims to forecast outlier likelihoods without requiring future observations.

This study addresses an important gap in time series outlier detection by proposing a novel problem setting: long-term outlier prediction. Conventional methods primarily focus on immediate detection by identifying deviations from normal patterns. As a result, their applicability is limited when forecasting outlier events far into the future. To overcome this limitation, we propose a simple and unsupervised two-layer method that is independent of specific models. The first layer performs standard outlier detection, and the second layer predicts future outlier scores based on the temporal structure of previously observed outliers. This framework enables not only pointwise detection but also long-term forecasting of outlier likelihoods. Experiments on synthetic datasets show that the proposed method performs well in both detection and prediction tasks. These findings suggest that the method can serve as a strong baseline for future work in outlier detection and forecasting.
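The two-layer idea can be sketched as follows. The abstract says the method is independent of specific models, so both layers here are stand-ins chosen for brevity: an absolute z-score for the detection layer and a per-phase average with a known `period` for the prediction layer. Neither choice, nor the function names, comes from the paper.

```python
from statistics import mean, pstdev


def outlier_scores(series):
    """Layer 1 (stand-in detector): absolute z-score of each observation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [abs(x - mu) / sigma for x in series]


def forecast_scores(scores, period, horizon):
    """Layer 2 (stand-in predictor): future score = average past score at the same phase.

    Assumes outliers recur with a known period; a learned sequence model could
    replace this function without touching layer 1.
    """
    phase_means = []
    for phase in range(period):
        values = scores[phase::period]
        phase_means.append(mean(values) if values else 0.0)
    n = len(scores)
    return [phase_means[(n + h) % period] for h in range(horizon)]


# Synthetic series with a spike every fifth step.
series = [0.0, 0.0, 0.0, 0.0, 10.0] * 4
scores = outlier_scores(series)
future = forecast_scores(scores, period=5, horizon=10)
```

The second layer never sees raw observations, only the score sequence, which is why the forecast can extend beyond the last observed time step.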
cs.LG cs.AI cs.CL Abhinaba Basu · Mar 22, 2026

This paper investigates why compressing different weight matrices in transformers leads to wildly different outcomes—from negligible impact to 20,000× perplexity increases. The authors map this structural sensitivity across five architectures, revealing that early-layer MLP up-projections are catastrophically fragile while value projections are nearly free to compress. Using Lyapunov stability theory, they explain how residual connections contract errors, and they provide machine-checked formal bounds in Lean 4 to guarantee per-matrix approximation quality.

A single matrix out of 468 in GPT-2 Small can increase perplexity by 20,000x when compressed, revealing that transformer compression sensitivity spans five orders of magnitude. We map this sensitivity landscape across five architectures (117M-8B parameters), finding a consistent hierarchy: early-layer MLP up-projections are catastrophically sensitive while value projections compress nearly for free. This hierarchy is stable across compression levels, evaluation scales (2K-51K tokens), and datasets (WikiText-103, C4). Using Lyapunov stability theory, we show that residual connections contract compression errors by growing the hidden state faster than the error. Error contraction is necessary but not sufficient for compression tolerance: architecture-specific redundancy plays an equally important role, as demonstrated by the hybrid LFM2-2.6B degrading only 7x despite higher amplification than the fully-contracting GPT-2 Small (120x). Ten machine-checked Lean 4 theorems formalize per-matrix error bounds with no sorry markers; all bounds produce zero violations across 14,040+ configurations. We validate with downstream task evaluation (HellaSwag, ARC-Easy, Winogrande), activation-aware pruning on two architectures, and a Compression Fragility Index that rank-orders model robustness.
cs.IR cs.AI Aarush Sinha, Rahul Seetharaman, Aman Bansal · Mar 22, 2026

This paper introduces ECI (Effective Contrastive Information), a training-free metric for evaluating hard-negative mining strategies in dense retrieval. The core idea is to leverage the logarithmic InfoNCE bound on mutual information combined with a harmonic mean of signal (hardness) and safety (margin) to predict downstream retrieval quality without expensive fine-tuning. The proposed metric addresses a real pain point in retrieval research: practitioners currently must run end-to-end ablation studies to evaluate negative sampling strategies, which is computationally wasteful.

Hard negatives play a critical role in training and fine-tuning dense retrieval models, as they are semantically similar to positive documents yet non-relevant, and correctly distinguishing them is essential for improving retrieval accuracy. However, identifying effective hard negatives typically requires extensive ablation studies involving repeated fine-tuning with different negative sampling strategies and hyperparameters, resulting in substantial computational cost. In this paper, we introduce ECI (Effective Contrastive Information), a metric grounded in Information Theory and Information Retrieval principles that enables practitioners to assess the quality of hard negatives prior to model fine-tuning. ECI evaluates negatives by optimizing the trade-off between Information Capacity, the logarithmic bound on mutual information determined by set size, and Discriminative Efficiency, a harmonic balance of Signal Magnitude (Hardness) and Safety (Max-Margin). Unlike heuristic approaches, ECI strictly penalizes unsafe, false-positive negatives prevalent in generative methods. We evaluate ECI across hard-negative sets mined or generated using BM25, cross-encoders, and large language models. Our results demonstrate that ECI accurately predicts downstream retrieval performance, identifying that hybrid strategies (BM25+Cross-Encoder) offer the optimal balance of volume and reliability, significantly reducing the need for costly end-to-end ablation studies.
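The abstract names the ingredients of ECI but not its exact formula, so the sketch below only shows how a log-capacity term and a harmonic mean of hardness and safety could combine. Every definition in it is an assumption: the function names, the use of log(1 + N) as the capacity term, mean negative similarity as hardness, and the clipped positive-to-hardest-negative margin as safety.

```python
import math


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; zero if either is zero."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def eci_sketch(pos_score, neg_scores):
    """Illustrative score for one query; similarities are assumed to lie in [0, 1].

    capacity -- log(1 + N), mirroring the log-of-set-size ceiling of the InfoNCE bound
    hardness -- mean similarity of the negatives to the query (assumed definition)
    safety   -- margin between the positive and the hardest negative, clipped at zero
    """
    if not neg_scores:
        return 0.0
    capacity = math.log(1 + len(neg_scores))
    hardness = sum(neg_scores) / len(neg_scores)
    safety = max(0.0, pos_score - max(neg_scores))
    return capacity * harmonic_mean(hardness, safety)


hard_and_safe = eci_sketch(0.9, [0.60, 0.70, 0.65])
too_easy = eci_sketch(0.9, [0.10, 0.10, 0.10])
unsafe = eci_sketch(0.9, [0.60, 0.95, 0.65])  # one negative outscores the positive
```

The harmonic mean is the relevant design choice: it collapses to zero when either component does, so a set containing a likely false negative cannot compensate with extra hardness or extra volume.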
cs.AI q-bio.NC Akshay K. Jagadish, Milena Rmus, Kristin Witte et al. · Mar 22, 2026

This paper proposes automating the entire cognitive science discovery pipeline—experiment design, behavioral data simulation via foundation models, model synthesis through LLM program generation, and iterative refinement via an "interestingness" critic—to overcome the slow pace and bias of manual research. The vision is a high-throughput in-silico engine that searches vast algorithmic and experimental spaces to surface theoretically informative mechanisms for human validation.

The cognitive sciences aim to understand intelligence by formalizing underlying operations as computational models. Traditionally, this follows a cycle of discovery where researchers develop paradigms, collect data, and test predefined model classes. However, this manual pipeline is fundamentally constrained by the slow pace of human intervention and a search space limited by researchers' background and intuition. Here, we propose a paradigm shift toward a fully automated, in silico science of the mind that implements every stage of the discovery cycle using Large Language Models (LLMs). In this framework, experimental paradigms exploring conceptually meaningful task structures are directly sampled from an LLM. High-fidelity behavioral data are then simulated using foundation models of cognition. The tedious step of handcrafting cognitive models is replaced by LLM-based program synthesis, which performs a high-throughput search over a vast landscape of algorithmic hypotheses. Finally, the discovery loop is closed by optimizing for "interestingness", a metric of conceptual yield evaluated by an LLM-critic. By enabling a fast and scalable approach to theory development, this automated loop functions as a high-throughput in-silico discovery engine, surfacing informative experiments and mechanisms for subsequent validation in real human populations.
0
cs.AI cond-mat.mes-hall Sukriti Manna, Henry Chan, Subramanian K.R.S. Sankaranarayanan · Mar 22, 2026

AutoMOOSE introduces a multi-agent AI framework to automate the full lifecycle of phase-field simulations in MOOSE, from natural-language prompts to quantitative kinetics analysis. The system orchestrates five specialized agents that generate syntactically valid input files, execute parallel parameter sweeps, autonomously recover from convergence failures, and verify physical consistency through Arrhenius analysis. Validated on copper grain growth, it demonstrates that LLM-driven orchestration can bridge the gap between scientific intent and executable multiphysics simulations, yielding results statistically comparable to expert-authored workflows.

Multiphysics simulation frameworks such as MOOSE provide rigorous engines for phase-field materials modeling, yet adoption is constrained by the expertise required to construct valid input files, coordinate parameter sweeps, diagnose failures, and extract quantitative results. We introduce AutoMOOSE, an open-source agentic framework that orchestrates the full simulation lifecycle from a single natural-language prompt. AutoMOOSE deploys a five-agent pipeline in which the Input Writer coordinates six sub-agents and the Reviewer autonomously corrects runtime failures without user intervention. A modular plugin architecture enables new phase-field formulations without modifying the core framework, and a Model Context Protocol (MCP) server exposes the workflow as ten structured tools for interoperability with any MCP-compatible client. Validated on a four-temperature copper grain growth benchmark, AutoMOOSE generates MOOSE input files with 6 of 12 structural blocks matching a human expert reference exactly and 4 functionally equivalent, executes all runs in parallel with a 1.8x speedup, and performs an end-to-end physical consistency check spanning intent, finite-element execution, and Arrhenius kinetics with no human verification. Grain coarsening kinetics are recovered with R^2 = 0.90-0.95 at T >= 600 K; the recovered activation energy Q_fit = 0.296 eV is consistent with a human-written reference (Q_fit = 0.267 eV) under identical parameters. Three runtime failure classes were diagnosed and resolved autonomously within a single correction cycle, and every run produces a provenance record satisfying FAIR data principles. These results show that the gap between knowing the physics and executing a validated simulation campaign can be bridged by a lightweight multi-agent orchestration layer, providing a pathway toward AI-driven materials discovery and self-driving laboratories.
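The Arrhenius step mentioned above can be illustrated with a short Python sketch. It assumes the standard form k = k0 · exp(−Q / (k_B · T)) and recovers the activation energy Q from a straight-line fit of ln k against 1 / (k_B · T). The function name `fit_arrhenius`, the temperatures, and the synthetic rate constants are hypothetical and are not taken from AutoMOOSE or its benchmark.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def fit_arrhenius(temps_k, rates):
    """Least-squares fit of ln(k) = ln(k0) - Q / (k_B * T).

    Returns (Q in eV, prefactor k0).
    """
    xs = [1.0 / (K_B_EV * t) for t in temps_k]
    ys = [math.log(k) for k in rates]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx                 # slope equals -Q
    intercept = y_mean - slope * x_mean  # intercept equals ln(k0)
    return -slope, math.exp(intercept)


# Synthetic, noise-free data generated from known parameters (hypothetical).
temps = [600.0, 700.0, 800.0, 900.0]
q_true, k0_true = 0.3, 1.0e3
rates = [k0_true * math.exp(-q_true / (K_B_EV * t)) for t in temps]

q_fit, k0_fit = fit_arrhenius(temps, rates)
```

With real simulation output the rate constants would come from fitted grain-growth curves, and the residual of this line fit gives a quick check on whether the kinetics are physically consistent across temperatures.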
0
cs.CL cs.AI cs.LG NVIDIA: Aaron Blakeman et al. · Dec 24, 2025

NVIDIA introduces Nemotron 3, a family of open language models (Nano, Super, Ultra) built on a hybrid Mamba-Transformer MoE architecture. The core innovation is using selective attention layers combined with Mamba-2 state space layers to achieve high throughput while maintaining accuracy. Key technical contributions include LatentMoE (dimensionality-reduced expert routing), NVFP4 training for efficiency, and multi-environment RL post-training. The paper positions these models as optimized for agentic AI with up to 1M token contexts and granular inference-time reasoning budget control.

We introduce the Nemotron 3 family of models: Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning, which enables reasoning, multi-step tool use, and granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.
0
cs.RO cs.AI Junhyeok Rui Cha, Woohyun Cha, Jaeyong Shin et al. · Mar 23, 2026
This paper proposes a novel alternative to existing sim-to-real methods for training control policies with simulated experiences. Unlike prior methods that typically rely on domain randomization over a fixed finite set of parameters, the proposed approach injects state-dependent perturbations into the input joint torque during forward simulation. These perturbations are designed to simulate a broader spectrum of reality gaps than standard parameter randomization without requiring additional training. By using neural networks as flexible perturbation generators, the proposed method can represent complex, state-dependent uncertainties, such as nonlinear actuator dynamics and contact compliance, that parametric randomization cannot capture. Experimental results demonstrate that the proposed approach enables humanoid locomotion policies to achieve superior robustness against complex, unseen reality gaps in both simulation and real-world deployment.
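As a rough illustration of the idea, the Python sketch below uses a small, randomly initialized and untrained network to map the robot state to a bounded torque offset, which is then added to the commanded joint torque before the simulator step. The names `make_perturbation` and `apply_perturbation`, the network size, the `tanh` bounding, and the `scale` parameter are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def make_perturbation(state_dim, n_joints, hidden=8, scale=0.1, seed=0):
    """Build a random, untrained two-layer network: state -> torque offset.

    Each output is bounded in [-scale, scale] by the final tanh.
    """
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(state_dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(n_joints)]

    def perturb(state):
        # Hidden layer: nonlinear features of the current state.
        h = [math.tanh(sum(w * s for w, s in zip(row, state))) for row in w1]
        # Output layer: one bounded torque offset per joint.
        return [scale * math.tanh(sum(w * x for w, x in zip(row, h))) for row in w2]

    return perturb


def apply_perturbation(torque, state, perturb):
    """Add the state-dependent offset to the commanded joint torques."""
    return [t + d for t, d in zip(torque, perturb(state))]


perturb = make_perturbation(state_dim=4, n_joints=2)
state = [0.1, -0.2, 0.3, 0.0]          # hypothetical robot state
commanded = [1.0, -0.5]                # hypothetical policy torques
applied = apply_perturbation(commanded, state, perturb)
```

Resampling the seed per episode would give a different perturbation function each time, which plays a role analogous to drawing new parameters in standard domain randomization.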
0
cs.AI cs.CV Sheng Liu, Long Chen, Zeyun Zhao et al. · Mar 23, 2026
Modern clinical practice increasingly depends on reasoning over heterogeneous, evolving, and incomplete patient data. Although recent advances in multimodal foundation models have improved performance on various clinical tasks, most existing models remain static, opaque, and poorly aligned with real-world clinical workflows. We present Cerebra, an interactive multi-agent AI team that coordinates specialized agents for EHR, clinical notes, and medical imaging analysis. These outputs are synthesized into a clinician-facing dashboard that combines visual analytics with a conversational interface, enabling clinicians to interrogate predictions and contextualize risk at the point of care. Cerebra supports privacy-preserving deployment by operating on structured representations and remains robust when modalities are incomplete. We evaluated Cerebra using a massive multi-institutional dataset spanning 3 million patients from four independent healthcare systems. Cerebra consistently outperformed both state-of-the-art single-modality models and large multimodal language model baselines. In dementia risk prediction, it achieved AUROCs up to 0.80, compared with 0.74 for the strongest single-modality model and 0.68 for language model baselines. For dementia diagnosis, it achieved an AUROC of 0.86, and for survival prediction, a C-index of 0.81. In a reader study with experienced physicians, Cerebra significantly improved expert performance, increasing accuracy by 17.5 percentage points in prospective dementia risk estimation. These results demonstrate Cerebra's potential for interpretable, robust decision support in clinical care.