Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
199 papers in cs.AI
Trending mixes fresh papers with community signal.
cs.AI Qihui Zhu, Shouwei Ruan, Xiao Yang et al. · Mar 23, 2026

This paper addresses a critical gap in embodied AI: while Multimodal Large Language Models (MLLMs) excel at reactive, short-horizon planning, they fail at biological-like "mental navigation" — the ability to construct global cognitive maps from long egocentric videos and simulate paths before acting. To tackle this, the authors introduce Video2Mental, a benchmark requiring models to generate structured hierarchical cognitive maps from videos exceeding five minutes, then produce landmark-grounded navigation plans validated in the Habitat simulator. They also propose NavMind, a Qwen3-VL-based model trained via difficulty-stratified progressive supervised fine-tuning to internalize these structured representations.

Despite the widespread adoption of MLLMs in embodied agents, their capabilities remain largely confined to reactive planning from immediate observations, consistently failing in spatial reasoning across extensive spatiotemporal scales. Cognitive science reveals that Biological Intelligence (BI) thrives on "mental navigation": the strategic construction of spatial representations from experience and the subsequent mental simulation of paths prior to action. To bridge the gap between AI and BI, we introduce Video2Mental, a pioneering benchmark for evaluating the mental navigation capabilities of MLLMs. The task requires constructing hierarchical cognitive maps from long egocentric videos and generating landmark-based path plans step by step, with planning accuracy verified through simulator-based physical interaction. Our benchmarking results reveal that mental navigation capability does not naturally emerge from standard pre-training. Frontier MLLMs struggle profoundly with zero-shot structured spatial representation, and their planning accuracy decays precipitously over extended horizons. To overcome this, we propose NavMind, a reasoning model that internalizes mental navigation using explicit, fine-grained cognitive maps as learnable intermediate representations. Through a difficulty-stratified progressive supervised fine-tuning paradigm, NavMind effectively bridges the gap between raw perception and structured planning. Experiments demonstrate that NavMind achieves superior mental navigation capabilities, significantly outperforming frontier commercial and spatial MLLMs.
cs.CV cs.AI Xu Liu, Yongheng Zhang, Qiguang Chen et al. · Mar 23, 2026

This paper tackles the inefficiency of Interleaved-Modal Chain-of-Thought (ICoT) reasoning, where current methods statically insert visual tokens after every reasoning step, wasting compute on redundant image embeddings and using semantically broken patches. DaP-ICoT introduces a confidence-aware gating mechanism that only pulls visual context when model certainty drops below a threshold, combined with SAM2-based object segmentation to provide coherent visual thoughts instead of fragmented patches.

Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration, which adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency; and (2) Precise Visual Thought Guidance, which ensures semantically coherent and contextually aligned visual representations. Experiments across multiple benchmarks and models demonstrate that DaP-ICoT achieves state-of-the-art performance. In addition, DaP-ICoT significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption, enabling more efficient ICoT reasoning.
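The confidence-aware gating mentioned in the summary can be illustrated with a minimal sketch. The confidence measure below (one minus normalized entropy of the next-token distribution), the function names, and the 0.5 threshold are all assumptions made for illustration; the abstract does not specify how DaP-ICoT measures certainty.

```python
import math

def token_confidence(probs):
    """Confidence as 1 minus the normalized entropy of a probability distribution."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(probs))
    # A single-outcome distribution has no uncertainty by construction.
    return 1.0 - entropy / max_entropy if max_entropy > 0 else 1.0

def should_insert_visual(probs, threshold=0.5):
    """Request a visual thought only when confidence drops below the threshold."""
    return token_confidence(probs) < threshold

def count_insertions(step_distributions, threshold=0.5):
    """Count how many reasoning steps would trigger a visual insertion."""
    return sum(should_insert_visual(p, threshold) for p in step_distributions)
```

With a gate of this kind, a static scheme that inserts an image after every step is replaced by insertions only at uncertain steps, which is the kind of saving the reported reduction in token consumption points to.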
cs.LG cs.AI cs.CL Hung-Hsuan Chen · Mar 23, 2026

Standard Transformers apply fixed-depth computation regardless of problem difficulty, limiting their ability to solve tasks requiring variable-depth reasoning like multi-hop traversal or nested logic. This paper proposes a depth-recurrent Transformer that iteratively applies a shared-weight block in latent space—enabling 'vertical Chain-of-Thought' where models trade recurrence steps for deeper reasoning without consuming context window. The work demonstrates strong compositional generalization on three synthetic tasks and offers a mechanistic alternative to horizontal token-generation paradigms.

Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent Transformer that decouples computational depth from parameter count by iteratively applying a shared-weight Transformer block in latent space, enabling the model to trade recurrence steps for deeper reasoning at inference time. Our architecture incorporates three mechanisms to make deep recurrence (20+ steps) stable: (1) a silent thinking objective that supervises only the final output, forcing genuine multi-step reasoning rather than intermediate heuristic shortcuts; (2) LayerScale initialization to protect fragile reasoning states from untrained layer noise; and (3) an identity-biased recurrence that creates a gradient highway across many steps. We evaluate on three compositional reasoning domains with decreasing inductive biases: graph reachability (strict adjacency masking), nested boolean logic (relative positioning), and unstructured relational text (where sequence position provides no structural hints). Across all tasks, we observe a clear computational frontier: a boundary where performance transitions from chance to near-perfect as thinking steps scale with task complexity. Moreover, these tasks reveal qualitatively different generalization behaviors: precise but brittle (graph), approximate but robust (logic), and autonomous latent routing without structural hints (text). This progression illuminates how the interplay between a task-invariant recurrent reasoning core and task-specific perceptual interfaces shapes out-of-distribution (OOD) generalization, offering a mechanistic perspective on vertical chain-of-thought that complements the prevailing horizontal token-generation paradigm.
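A toy sketch of depth recurrence with a shared-weight block may help make the idea concrete. Everything here is an illustrative assumption: `shared_block` is a single linear map with tanh standing in for a full Transformer block, and the small `scale` factor loosely mimics a LayerScale-style, identity-biased update. It is not the paper's architecture.

```python
import math

def shared_block(h, weights):
    """Toy stand-in for a Transformer block: one linear map followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in weights]

def recur(h, weights, steps, scale=0.1):
    """Apply the same block `steps` times with an identity-biased residual update.

    The parameter count is fixed by `weights`; only `steps` changes the depth.
    """
    for _ in range(steps):
        update = shared_block(h, weights)
        # Small scale keeps each step close to the identity map.
        h = [x + scale * u for x, u in zip(h, update)]
    return h
```

Calling `recur(h, weights, steps=5)` and `recur(h, weights, steps=20)` uses the same parameters but different computational depth, which is the trade-off the abstract describes.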
cs.CV cs.AI Ryosuke Sonoda, Ramya Srinivasan · Mar 23, 2026

This work addresses zero-shot detection of AI-generated images by measuring how Vision Foundation Model (VFM) representations respond to structured high-frequency perturbations. The core idea is that synthetic images contain characteristic frequency biases, causing their embeddings to shift differently than real images when high-frequency noise is applied to local patches. The method achieves strong detection accuracy while requiring only a single Fourier transform and one forward pass, making it one to two orders of magnitude faster than comparable training-free approaches.

The rapid progress of text-to-image models has made AI-generated images increasingly realistic, posing significant challenges for accurate detection of generated content. While training-based detectors often suffer from limited generalization to unseen images, training-free approaches offer better robustness, yet struggle to capture subtle discrepancies between real and synthetic images. In this work, we propose a training-free AI-generated image detection method that measures representation sensitivity to structured frequency perturbations, enabling detection of minute manipulations. The proposed method is computationally lightweight, as perturbation generation requires only a single Fourier transform for an input image. As a result, it achieves one to two orders of magnitude faster inference than most training-free detectors. Extensive experiments on challenging benchmarks demonstrate the efficacy of our method over state-of-the-art (SoTA). In particular, on the OpenFake benchmark, our method improves AUC by nearly 10% compared to SoTA, while maintaining substantially lower computational cost.
physics.optics cs.AI cs.AR Hyoseok Park, Yeonsang Park · Mar 23, 2026

Long-context LLM inference hits a memory wall: each decode step requires scanning the entire KV cache, incurring $O(n)$ memory bandwidth that cannot be solved by faster arithmetic. PRISM proposes a thin-film lithium niobate photonic accelerator that performs the block-selection similarity search in $O(1)$ optical latency using a broadcast-and-weight architecture, eliminating the $O(n)$ scan entirely. The work claims $16\times$–$32\times$ traffic reduction at 64K–128K tokens and a four-order-of-magnitude energy advantage over GPU baselines by matching photonic hardware capabilities—passive query broadcast, quasi-static microring weights, and low-precision rank output—to the selection task.

Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step -- a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm -- the query fans out to all candidates via passive splitting, signatures are quasi-static (matching electro-optic MRR programming), and only rank order matters (relaxing precision to 4-6 bits). Crucially, the photonic advantage grows with context length: as N increases, the electronic scan cost rises linearly while the photonic evaluation remains O(1). We instantiate this insight in PRISM (Photonic Ranking via Inner-product Similarity with Microring weights), a thin-film lithium niobate (TFLN) similarity engine. Hardware-impaired needle-in-a-haystack evaluation on Qwen2.5-7B confirms 100% accuracy from 4K through 64K tokens at k=32, with 16x traffic reduction at 64K context. PRISM achieves a four-order-of-magnitude energy advantage over GPU baselines at practical context lengths (n >= 4K).
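The block-selection step can be sketched in software to show what is being ranked and where the traffic reduction comes from. This electronic version is exactly the O(n) scan that PRISM aims to replace with an optical evaluation, so it illustrates the task, not the hardware. The function names and the 128-token block size are assumptions; the abstract only states k=32 and a 16x reduction at 64K.

```python
import heapq

def select_blocks(query, signatures, k):
    """Return indices of the k block signatures with the largest inner product."""
    scores = [sum(q * s for q, s in zip(query, sig)) for sig in signatures]
    # Only rank order matters, so low-precision scores would suffice here.
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

def traffic_reduction(context_tokens, block_tokens, k):
    """Ratio of blocks read by a full scan to blocks fetched after selection."""
    num_blocks = context_tokens // block_tokens
    return num_blocks / k
```

Under the assumed block size, `traffic_reduction(64 * 1024, 128, 32)` gives 16.0, consistent with the figure quoted in the abstract, though the paper's actual block size may differ.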
cs.CV cs.AI Yi Wang, Haofei Zhang, Qihan Huang et al. · Mar 23, 2026

MetaCompress addresses token reduction for multi-turn VQA in Large Vision-Language Models, where future questions are unpredictable and may target any image region. The paper proposes a learning-based prompt-agnostic compression module trained via KL divergence minimization between original and compressed outputs, demonstrating that heuristic attention-based pruning is suboptimal for this scenario. The method achieves strong efficiency-accuracy trade-offs across five LVLM architectures while training on only ~20k samples.

Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but their excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns, and prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at https://github.com/MArSha1147/MetaCompress.
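The summary describes training by minimizing KL divergence between the outputs produced with the original and the compressed visual tokens. A minimal sketch of such a loss on a single output distribution follows. The function names and the per-token formulation are assumptions for illustration and are not taken from the MetaCompress code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def compression_loss(full_logits, compressed_logits):
    """Penalty for output drift when the model sees compressed visual tokens."""
    return kl_divergence(softmax(full_logits), softmax(compressed_logits))
```

Because the target is the model's own output on the uncompressed input, a loss of this form needs no question-specific labels, which fits the prompt-agnostic setting.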
cs.CV cs.AI cs.GR Kangbo Zhao, Miaoxin Guan, Xiang Chen et al. · Mar 23, 2026

This paper tackles the domain generalization problem in image deraining, where models trained on synthetic data fail catastrophically on out-of-distribution (OOD) real-world scenarios. The authors propose a three-stage pipeline—Superpixel Generation, Resolution-adaptive Fusion, and Pseudo-label Re-Synthesis—that adapts source-domain models to target domains using only unpaired rain-free images, eliminating the need for costly paired rainy data collection.

Image deraining plays a pivotal role in low-level computer vision, serving as a prerequisite for robust outdoor surveillance and autonomous driving systems. While deep learning paradigms have achieved remarkable success in firmly aligned settings, they often suffer from severe performance degradation when generalized to unseen Out-of-Distribution (OOD) scenarios. This failure stems primarily from the significant domain discrepancy between synthetic training datasets and the complex physical dynamics of real-world rain. To address these challenges, this paper proposes a pioneering cross-scenario deraining adaptation framework. Diverging from conventional approaches, our method obviates the requirements for paired rainy observations in the target domain, leveraging exclusively rain-free background images. We design a Superpixel Generation (Sup-Gen) module to extract stable structural priors from the source domain using Simple Linear Iterative Clustering. Subsequently, a Resolution-adaptive Fusion strategy is introduced to align these source structures with target backgrounds through texture similarity, ensuring the synthesis of diverse and realistic pseudo-data. Finally, we implement a pseudo-label re-Synthesize mechanism that employs multi-stage noise generation to simulate realistic rain streaks. This framework functions as a versatile plug-and-play module capable of seamless integration into arbitrary deraining architectures. Extensive experiments on state-of-the-art models demonstrate that our approach yields remarkable PSNR gains of up to 32% to 59% in OOD domains while significantly accelerating training convergence.
cs.AI cs.CL Yiling Wu · Mar 23, 2026

This paper examines representation genesis—the transition from non-representational physical systems to those with content-manipulable states. It argues that major frameworks in philosophy of mind (Language of Thought, teleosemantics, predictive processing, enactivism, and genetic phenomenology) share a 'Representation Presupposition' structure that prevents them from explaining this first acquisition without circularity. With large language models now achieving high cognitive performance without clear genesis events, the absence of a satisfactory theory becomes urgent.

Large language models are the first systems to achieve high cognitive performance without clearly undergoing representation genesis: the transition from a non-representing physical system to one whose states guide behavior in a content-sensitive way. Prior cognitive systems had already made this transition before we could examine it, and philosophy of mind treated genesis as a background condition rather than an explanatory target. LLMs provide a case that does not clearly involve this transition, making the genesis question newly urgent: if genesis did not occur, which cognitive capacities are affected, and why? We currently lack the conceptual resources to answer this. The reason, this paper argues, is structural. Major frameworks in philosophy of mind, including the Language of Thought hypothesis, teleosemantics, predictive processing, enactivism, and genetic phenomenology, share a common feature when applied to the genesis question: at some explanatory step, each deploys concepts whose explanatory purchase depends on the system already being organized as a representer. This pattern, which we call the Representation Presupposition structure, generates systematic explanatory deferral. Attempts to explain the first acquisition of content-manipulable representation within the existing categorical vocabulary import resources from the representational side of the transition itself. We call this the Representation Regress. The paper offers a conceptual diagnosis rather than a new theory, establishing the structure of the problem and deriving two minimum adequacy conditions for any account that avoids this pattern. LLMs make the absence of such a theory consequential rather than merely theoretical.
cs.CR cs.AI Yanming Mu, Hao Hu, Feiyang Li et al. · Mar 23, 2026

Retrieval-Augmented Generation (RAG) systems mitigate large language model hallucinations by integrating external knowledge bases, yet this multi-module architecture introduces complex security vulnerabilities spanning data poisoning, membership inference, and adversarial manipulation. This survey systematically maps threats across the RAG pipeline—vector database construction, retrieval, and generation—and categorizes corresponding defenses from input-side access control to output-side privacy preservation. As a comprehensive review of 152 papers, it aims to unify the analysis of threat models, defense mechanisms, and evaluation benchmarks to foster trustworthy RAG deployments in sensitive domains.

Retrieval-Augmented Generation (RAG) significantly mitigates the hallucinations and domain knowledge deficiency in large language models by incorporating external knowledge bases. However, the multi-module architecture of RAG introduces complex system-level security vulnerabilities. Guided by the RAG workflow, this paper analyzes the underlying vulnerability mechanisms and systematically categorizes core threat vectors such as data poisoning, adversarial attacks, and membership inference attacks. Based on this threat assessment, we construct a taxonomy of RAG defense technologies from a dual perspective encompassing both input and output stages. The input-side analysis reviews data protection mechanisms including dynamic access control, homomorphic encryption retrieval, and adversarial pre-filtering. The output-side examination summarizes advanced leakage prevention techniques such as federated learning isolation, differential privacy perturbation, and lightweight data sanitization. To establish a unified benchmark for future experimental design, we consolidate authoritative test datasets, security standards, and evaluation frameworks. To the best of our knowledge, this paper presents the first end-to-end survey dedicated to the security of RAG systems. Distinct from existing literature that isolates specific vulnerabilities, we systematically map the entire pipeline, providing a unified analysis of threat models, defense mechanisms, and evaluation benchmarks. By enabling deep insights into potential risks, this work seeks to foster the development of highly robust and trustworthy next-generation RAG systems.
cs.IR cs.AI Tianyi Li, Zixuan Wang, Guidong Lei et al. · Mar 23, 2026

AgenticRec attacks a key gap in LLM-based recommenders: existing agents rely on frozen reasoning chains and cannot learn from ranking feedback to refine tool use. The paper proposes a two-stage training framework that combines ReAct-style tool invocation with list-wise Group Relative Policy Optimization (GRPO) and Progressive Preference Refinement (PPR) for hard-negative mining. The work matters because it demonstrates that end-to-end reinforcement learning can align multi-step tool use with ranking objectives, moving beyond prompt-engineered agent workflows.

Recommender agents built on Large Language Models offer a promising paradigm for recommendation. However, existing recommender agents typically suffer from a disconnect between intermediate reasoning and final ranking feedback, and are unable to capture fine-grained preferences. To address this, we present AgenticRec, a ranking-oriented agentic recommendation framework that optimizes the entire decision-making trajectory (including intermediate reasoning, tool invocation, and final ranking list generation) under sparse implicit feedback. Our approach makes three key contributions. First, we design a suite of recommendation-specific tools integrated into a ReAct loop to support evidence-grounded reasoning. Second, we propose theoretically unbiased List-Wise Group Relative Policy Optimization (list-wise GRPO) to maximize ranking utility, ensuring accurate credit assignment for complex tool-use trajectories. Third, we introduce Progressive Preference Refinement (PPR) to resolve fine-grained preference ambiguities. By mining hard negatives from ranking violations and applying bidirectional preference alignment, PPR minimizes the convex upper bound of pairwise ranking errors. Experiments on benchmarks confirm that AgenticRec significantly outperforms baselines, validating the necessity of unifying reasoning, tool use, and ranking optimization.
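To show how a ranking reward can feed a group-relative policy update, here is a sketch of the generic GRPO-style advantage computation, where each sampled trajectory's reward is normalized against its group. The single-target NDCG reward and the function names are assumptions; the paper's list-wise variant and its unbiasedness argument are not reproduced here.

```python
import math
from statistics import mean, pstdev

def ndcg_single(ranked_items, target):
    """NDCG for a list with one relevant item: 1 / log2(rank + 1), or 0 if absent."""
    if target not in ranked_items:
        return 0.0
    rank = ranked_items.index(target) + 1
    return 1.0 / math.log2(rank + 1)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this setup, several ranking lists sampled for the same user would each be scored with `ndcg_single`, and trajectories that rank the held-out item higher than their group average receive positive advantages.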
cs.LG cs.AI Yawen Li, Tao Hu, Zhouhui Lian et al. · Mar 23, 2026

This paper studies generalization error bounds for Transformer models using offset Rademacher complexity. The core idea is to derive sharp excess risk bounds that achieve optimal $O(1/n)$ convergence rates (improving upon the standard $O(1/\sqrt{n})$) for single-head, multi-head, and multi-layer architectures, with explicit dependence on matrix ranks and parameter norms. The authors further extend these results to unbounded sub-Gaussian and heavy-tailed input distributions, broadening the applicability beyond standard boundedness assumptions.

This paper studies generalization error bounds for Transformer models. Based on the offset Rademacher complexity, we derive sharper generalization bounds for different Transformer architectures, including single-layer single-head, single-layer multi-head, and multi-layer Transformers. We first express the excess risk of Transformers in terms of the offset Rademacher complexity. By exploiting its connection with the empirical covering numbers of the corresponding hypothesis spaces, we obtain excess risk bounds that achieve optimal convergence rates up to constant factors. We then derive refined excess risk bounds by upper bounding the covering numbers of Transformer hypothesis spaces using matrix ranks and matrix norms, leading to precise, architecture-dependent generalization bounds. Finally, we relax the boundedness assumption on feature mappings and extend our theoretical results to settings with unbounded (sub-Gaussian) features and heavy-tailed distributions.
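For readers unfamiliar with the term, the offset Rademacher complexity is commonly defined as below. This is the standard form from the learning-theory literature, given here as background under the assumption that the paper uses the same or a closely related definition; the symbols $\mathcal{F}$, $c$, and $\epsilon_i$ are not taken from the abstract.

```latex
% Offset Rademacher complexity of a function class F on samples x_1, ..., x_n,
% with i.i.d. Rademacher signs \epsilon_i and offset parameter c > 0.
\mathcal{R}^{\mathrm{off}}_n(\mathcal{F}, c)
  = \mathbb{E}_{\epsilon} \sup_{f \in \mathcal{F}}
    \frac{1}{n} \sum_{i=1}^{n}
    \left[ \epsilon_i f(x_i) - c\, f(x_i)^2 \right]
```

The negative quadratic term penalizes functions with large values, which is what typically allows bounds based on this quantity to reach fast rates instead of the usual square-root rate.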
cs.AI Jinhui Ren, Huaiming Li, Yabin Liu et al. · Mar 23, 2026

High-fidelity CFD for vehicle aerodynamic drag is bottlenecked not by solver wall time but by workflow friction—CAD cleanup, meshing retries, and queue contention. This paper proposes a contract-centric blueprint where self-evolving coding agents search over executable surrogate programs (not static models) to predict drag coefficient $C_d$ under industrial constraints. The system combines Famou-Agent-style evaluator feedback with population-based island evolution and hard evaluation contracts that enforce leakage prevention, deterministic replay, and resource budgets, aiming for a screen-and-escalate deployment where uncertain cases trigger automatic fallback to high-fidelity CFD.

High-fidelity vehicle drag evaluation is constrained less by solver runtime than by workflow friction: geometry cleanup, meshing retries, queue contention, and reproducibility failures across teams. We present a contract-centric blueprint for self-evolving coding agents that discover executable surrogate pipelines for predicting drag coefficient $C_d$ under industrial constraints. The method formulates surrogate discovery as constrained optimization over programs, not static model instances, and combines Famou-Agent-style evaluator feedback with population-based island evolution, structured mutations (data, model, loss, and split policies), and multi-objective selection balancing ranking quality, stability, and cost. A hard evaluation contract enforces leakage prevention, deterministic replay, multi-seed robustness, and resource budgets before any candidate is admitted. Across eight anonymized evolutionary operators, the best system reaches a Combined Score of 0.9335 with sign-accuracy 0.9180, while trajectory and ablation analyses show that adaptive sampling and island migration are primary drivers of convergence quality. The deployment model is explicitly "screen-and-escalate": surrogates provide high-throughput ranking for design exploration, but low-confidence or out-of-distribution cases are automatically escalated to high-fidelity CFD. The resulting contribution is an auditable, reusable workflow for accelerating aerodynamic design iteration while preserving decision-grade reliability, governance traceability, and safety boundaries.
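The screen-and-escalate policy can be pictured as a simple gate on surrogate confidence. The sketch below uses the spread of predictions across random seeds as the confidence signal. That choice, the `max_spread` threshold, and the function name are assumptions made for illustration, since the abstract does not say how confidence is measured.

```python
from statistics import mean, pstdev

def screen_or_escalate(seed_predictions, max_spread=0.005):
    """Accept the surrogate's mean C_d, or escalate when seeds disagree too much.

    seed_predictions: C_d predictions for one design from several seeds.
    Returns (decision, mean_cd, spread).
    """
    cd = mean(seed_predictions)
    spread = pstdev(seed_predictions)
    if spread > max_spread:
        # Low-confidence case: hand the design to high-fidelity CFD.
        return ("escalate_to_cfd", cd, spread)
    return ("accept_surrogate", cd, spread)
```

A real deployment would presumably also check for out-of-distribution geometry, which this sketch leaves out.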
cs.LG cs.AI stat.ML Abdou-Raouf Atarmla · Mar 23, 2026

Rule-State Inference (RSI) addresses compliance monitoring in domains like taxation where authoritative rules are known a priori but observations are partial, noisy, or strategically distorted. The paper proposes a Bayesian framework that inverts the standard ML paradigm: instead of learning rules from data, RSI encodes regulatory rules as structured priors and infers latent rule states (activation, compliance rate, parametric drift) via posterior inference. This enables zero-shot compliance assessment without labeled training data—a critical capability for low-resource environments where non-compliance labels are scarce or legally sensitive.

Existing machine learning frameworks for compliance monitoring -- Markov Logic Networks, Probabilistic Soft Logic, supervised models -- share a fundamental paradigm: they treat observed data as ground truth and attempt to approximate rules from it. This assumption breaks down in rule-governed domains such as taxation or regulatory compliance, where authoritative rules are known a priori and the true challenge is to infer the latent state of rule activation, compliance, and parametric drift from partial and noisy observations. We propose Rule-State Inference (RSI), a Bayesian framework that inverts this paradigm by encoding regulatory rules as structured priors and casting compliance monitoring as posterior inference over a latent rule-state space S = {(a_i, c_i, delta_i)}, where a_i captures rule activation, c_i models the compliance rate, and delta_i quantifies parametric drift. We prove three theoretical guarantees: (T1) RSI absorbs regulatory changes in O(1) time via a prior ratio correction, independently of dataset size; (T2) the posterior is Bernstein-von Mises consistent, converging to the true rule state as observations accumulate; (T3) mean-field variational inference monotonically maximizes the Evidence Lower BOund (ELBO). We instantiate RSI on the Togolese fiscal system and introduce RSI-Togo-Fiscal-Synthetic v1.0, a benchmark of 2,000 synthetic enterprises grounded in real OTR regulatory rules (2022-2025). Without any labeled training data, RSI achieves F1=0.519 and AUC=0.599, while absorbing regulatory changes in under 1ms versus 683-1082ms for full model retraining -- at least a 600x speedup.
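The claim that a regulatory change can be absorbed without touching the data is easiest to see in a conjugate toy model. The sketch below swaps in a Beta-Binomial model of a single compliance rate, which is far simpler than the paper's rule-state space; the function names and numbers are illustrative assumptions only.

```python
def beta_posterior(prior_a, prior_b, compliant, non_compliant):
    """Posterior Beta parameters after observing compliance counts."""
    return prior_a + compliant, prior_b + non_compliant

def swap_prior(post_a, post_b, old_prior, new_prior):
    """Re-base a posterior on a new prior without revisiting the observations."""
    return (post_a - old_prior[0] + new_prior[0],
            post_b - old_prior[1] + new_prior[1])

def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)
```

Here `swap_prior` does a constant amount of work regardless of how many observations were absorbed, which mirrors the spirit of the O(1) prior ratio correction in (T1). The quoted speedup also checks out arithmetically: 683 ms divided by an update of at most 1 ms is at least 683x, which exceeds the stated 600x.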
cs.SE cs.AI Yujia Chen, Yingli Zhou, Fangyuan Zhang et al. · Mar 23, 2026

MIST addresses the challenge of generating high-quality SQL test cases for Database Management Systems using lightweight Large Language Models. The framework combines a feature-guided synthesis stage that leverages hierarchical documentation structures with error feedback, and a Monte Carlo Tree Search-based mutation stage to overcome coverage plateaus. This two-pronged approach aims to achieve high code coverage in resource-constrained industrial environments where only small LLMs can be deployed locally.

Database Management Systems (DBMSs) are fundamental infrastructure for modern data-driven applications, where thorough testing with high-quality SQL test cases is essential for ensuring system reliability. Traditional approaches such as fuzzing can be effective for specific DBMSs, but adapting them to different proprietary dialects requires substantial manual effort. Large Language Models (LLMs) present promising opportunities for automated SQL test generation, but face critical challenges in industrial environments. First, lightweight models are widely used in organizations due to security and privacy constraints, but they struggle to generate syntactically valid queries for proprietary SQL dialects. Second, LLM-generated queries are often semantically similar and exercise only shallow execution paths, thereby quickly reaching a coverage plateau. To address these challenges, we propose MIST, an LLM-based test case generatIon framework for DBMS through Monte Carlo Tree search. MIST consists of two stages: Feature-Guided Error-Driven Test Case Synthetization, which constructs a hierarchical feature tree and uses error feedback to guide LLM generation, aiming to produce syntactically valid and semantically diverse queries for different DBMS dialects; and Monte Carlo Tree Search-Based Test Case Mutation, which jointly optimizes seed query selection and mutation rule application guided by coverage feedback, aiming to boost code coverage by exploring deeper execution paths. Experiments on three widely used DBMSs with four lightweight LLMs show that MIST achieves average improvements of 43.3% in line coverage, 32.3% in function coverage, and 46.4% in branch coverage compared to the baseline approach, with the highest line coverage of 69.3% in the Optimizer module.
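As a rough illustration of how coverage feedback can steer tree search, the sketch below uses the common UCB1 rule to choose which seed query or mutation branch to explore next. The abstract does not state which selection rule MIST uses, so UCB1, the tuple layout, and the names are assumptions. Reward here would be something like newly covered lines.

```python
import math

def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average reward plus an exploration bonus for rarely tried nodes."""
    if visits == 0:
        # Unvisited children are always tried first.
        return math.inf
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits):
    """Pick the child with the highest UCB1 score.

    children: list of (name, total_reward, visits) tuples.
    """
    return max(children, key=lambda ch: ucb1(ch[1], ch[2], parent_visits))[0]
```

The exploration bonus is what lets a search of this kind move past a coverage plateau: branches that have paid off less so far still get revisited occasionally.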
cs.RO cs.AI Shivani Kamtikar, Kendall Koe, Justin Wasserman et al. · Mar 22, 2026

This paper addresses robotic reaching in cluttered, unseen environments using a hybrid rigid-soft continuum manipulator. The core idea is a real-time pipeline that combines multi-view RGB reconstruction (Mast3r), open-world object detection (YOLO-World), shape-aware RRT* planning with asymmetric collision constraints, and a learned controller trained on pose-to-actuation data. If validated at scale, this could enable robots to navigate dense foliage or disaster debris where rigid arms fail and pure soft arms lack reach.

As robotic systems increasingly operate in unstructured, cluttered, and previously unseen environments, there is a growing need for manipulators that combine compliance, adaptability, and precise control. This work presents a real-time hybrid rigid-soft continuum manipulator system designed for robust open-world object reaching in such challenging environments. The system integrates vision-based perception and 3D scene reconstruction with shape-aware motion planning to generate safe trajectories. A learning-based controller drives the hybrid arm to arbitrary target poses, leveraging the flexibility of the soft segment while maintaining the precision of the rigid segment. The system operates without environment-specific retraining, enabling direct generalization to new scenes. Extensive real-world experiments demonstrate consistent reaching performance with errors below 2 cm across diverse cluttered setups, highlighting the potential of hybrid manipulators for adaptive and reliable operation in unstructured environments.
cs.AI cs.CL Yiliang Song, Hongjun An, Jiangan Chen et al. · Mar 23, 2026

The paper critiques the institutionalization of LLM benchmarks as "Silicon Bureaucracy" and "AI Test-Oriented Education", arguing high scores often conflate exam-oriented competence with genuine generalization due to data contamination. It proposes an audit framework using a router-worker setup: clean-control routers transmit full questions while noisy routers delete, rewrite, and perturb before aggregation. For clean benchmarks, noisy aggregation should not systematically exceed the baseline; persistent above-baseline gains suggest contamination-related memory activation. The core finding—that 10 of 12 models exceed clean baselines under multi-router noisy conditions—challenges the interpretability of raw benchmark scores.

Public benchmarks increasingly govern how large language models (LLMs) are ranked, selected, and deployed. We frame this benchmark-centered regime as Silicon Bureaucracy and AI Test-Oriented Education, and argue that it rests on a fragile assumption: that benchmark scores directly reflect genuine generalization. In practice, however, such scores may conflate exam-oriented competence with principled capability, especially when contamination and semantic leakage are difficult to exclude from modern training pipelines. We therefore propose an audit framework for analyzing contamination sensitivity and score confidence in LLM benchmarks. Using a router-worker setup, we compare a clean-control condition with noisy conditions in which benchmark problems are systematically deleted, rewritten, and perturbed before being passed downstream. For a genuinely clean benchmark, noisy conditions should not consistently outperform the clean-control baseline. Yet across multiple models, we find widespread but heterogeneous above-baseline gains under noisy conditions, indicating that benchmark-related cues may be reassembled and can reactivate contamination-related memory. These results suggest that similar benchmark scores may carry substantially different levels of confidence. Rather than rejecting benchmarks altogether, we argue that benchmark-based evaluation should be supplemented with explicit audits of contamination sensitivity and score confidence.
eess.AS cs.AI cs.SD Tianyu Cao, Helin Wang, Ari Frummer et al. · Mar 23, 2026

DiT-Flow tackles multi-condition speech enhancement (noise, reverberation, codec compression) by combining flow matching with a latent Diffusion Transformer (DiT) backbone. The paper proposes operating flow matching in a VAE-compressed latent space for efficiency, introduces StillSonicSet (a synthetic dataset with realistic room acoustics for stationary sources), and applies Mixture-of-LoRA-Experts (MoELoRA) for parameter-efficient adaptation to unseen distortions. The work matters because most SE models fail when deployed on real-world audio with compound distortions unseen during training.

Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact latent features derived from variational autoencoders (VAEs). We validated our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to expand synthetic data realism, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating LoRA with the MoE framework, we achieve parameter-efficient, high-performance training for DiT-Flow that is robust to multiple distortions, using only 4.9% of the total parameters while obtaining better performance on five unseen distortions.
cs.CL cs.AI Jiayi Geng, Graham Neubig · Mar 23, 2026

CAID tackles long-horizon software engineering tasks where single agents struggle with accuracy and wall-clock time. The core idea is Centralized Asynchronous Isolated Delegation: a manager decomposes tasks into dependency graphs and delegates to multiple engineer agents working in isolated git worktrees, integrating progress via branch-and-merge. The system improves accuracy by 26.7% absolute on PaperBench and 14.3% on Commit0, demonstrating that structured coordination grounded in SWE primitives outperforms simply scaling single-agent iteration budgets.

AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on GitHub. Yet long-horizon tasks involving multiple interdependent subtasks still pose challenges with respect to both accuracy and timely completion. A natural approach to solving these long-horizon tasks in a timely manner is asynchronous multi-agent collaboration, where multiple agents work on different parts of the task at the same time. But effective application of multi-agent systems has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are difficult to synchronize, and combining partial progress into a coherent whole is challenging. On the other hand, human developers have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency-aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test-based verification. In empirical evaluation, we find that CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is a central coordination mechanism for multi-agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge enable it to be realized in a reliable and executable manner.
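To make the idea of a dependency-aware task plan concrete, here is a minimal sketch using Python's standard `graphlib` module (Python 3.9+). It groups subtasks into waves whose members have no unmet prerequisites and could therefore be delegated concurrently. The subtask names and the wave-based scheduling are assumptions for illustration; the abstract does not describe how CAID's manager actually schedules work.

```python
from graphlib import TopologicalSorter


def concurrent_waves(dependencies):
    """Group subtasks into waves; each task in a wave has all prerequisites done.

    dependencies maps a subtask name to the set of subtasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks with no unmet prerequisites
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


# Hypothetical plan: "api" needs both "parser" and "storage"; "cli" needs "api".
plan = {
    "parser": set(),
    "storage": set(),
    "api": {"parser", "storage"},
    "cli": {"api"},
}
waves = concurrent_waves(plan)
```

In this hypothetical plan, `parser` and `storage` land in the first wave, so two agents could work on them at once in separate workspaces before `api` is started.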
cs.AI cs.CL Liang Ding · Mar 22, 2026

AdaRubric solves the static-rubric bottleneck in LLM-as-Judge evaluation by dynamically generating task-specific evaluation dimensions from task descriptions. It scores agent trajectories step-by-step with confidence-weighted per-dimension feedback and filters preference pairs using the DimensionAwareFilter—a provably necessary mechanism to prevent high-scoring dimensions from masking failures. The approach achieves Pearson $r=0.79$ correlation with human judgments and yields substantial downstream gains: +6.8–8.5 percentage points in DPO task success and +6.6 pp faster PPO convergence at 5K steps.

LLM-as-Judge evaluation fails on agent tasks because a fixed rubric cannot capture what matters for a given task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency. We present ADARUBRIC, which closes this gap by generating task-specific evaluation rubrics on the fly from task descriptions, scoring trajectories step-by-step with confidence-weighted per-dimension feedback, and filtering preference pairs with the novel DimensionAwareFilter, a provably necessary condition for preventing high-scoring dimensions from masking dimension-level failures. On WebArena and ToolBench, ADARUBRIC achieves Pearson r=0.79 human correlation (+0.16 over the best static baseline) with deployment-grade reliability (Krippendorff's $\alpha$=0.83). DPO agents trained on ADARUBRIC preference pairs gain +6.8 to +8.5 pp task success over Prometheus across three benchmarks; gains transfer to SWE-bench code repair (+4.9 pp) and accelerate PPO convergence by +6.6 pp at 5K steps, both without any rubric engineering. Code: https://github.com/alphadl/AdaRubrics.
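The masking problem can be illustrated with a small sketch. The rule below is a guess at what a dimension-aware filter might look like, not the paper's DimensionAwareFilter: it keeps a preference pair only if the chosen trajectory beats the rejected one on average and has no single dimension below a floor. The 1-to-5 scale, the `floor` and `margin` values, and the function name are assumptions.

```python
def keep_pair(chosen, rejected, floor=3.0, margin=0.5):
    """Decide whether a (chosen, rejected) preference pair is kept.

    chosen and rejected map dimension names to scores (assumed 1-5 scale).
    Hypothetical rule: the chosen trajectory must lead on the mean score by
    at least `margin` AND must not fall below `floor` on any one dimension.
    """
    if chosen.keys() != rejected.keys():
        raise ValueError("both trajectories must be scored on the same dimensions")
    mean_chosen = sum(chosen.values()) / len(chosen)
    mean_rejected = sum(rejected.values()) / len(rejected)
    if mean_chosen - mean_rejected < margin:
        return False  # aggregate gap too small to be a reliable preference
    # A high average must not hide a failing dimension.
    return all(score >= floor for score in chosen.values())


# The chosen trajectory wins on average but fails Error Handling outright.
masked = keep_pair(
    {"Correctness": 5.0, "Error Handling": 1.0, "Efficiency": 5.0},
    {"Correctness": 3.0, "Error Handling": 3.0, "Efficiency": 3.0},
)
clean = keep_pair(
    {"Correctness": 4.0, "Error Handling": 4.0, "Efficiency": 4.0},
    {"Correctness": 3.0, "Error Handling": 3.0, "Efficiency": 3.0},
)
```

An aggregate-only filter would accept the first pair, since its mean score is higher. The per-dimension check rejects it because the Error Handling failure would otherwise be rewarded during preference training.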
cs.AI Zhongyi Li, Wan Tian, Jingyu Chen et al. · Mar 23, 2026

The paper tackles instability in multi-agent reinforcement learning for LLM reasoning, where noisy, heavy-tailed rewards break standard GRPO batch-mean normalization. It proposes DACR, a structured Answer-Critique-Rewrite protocol with cross-improvement rewards, and ARE, a robust estimator that replaces empirical means with a Median-of-Means variant using adaptive losses. Experiments on mathematical reasoning and aerial vision-language navigation demonstrate improved accuracy and training stability under synthetic noise contamination.

Multi-agent collaboration has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models, yet it suffers from interaction-level ambiguity that blurs generation, critique, and revision, making credit assignment across agents difficult. Moreover, policy optimization in this setting is vulnerable to heavy-tailed and noisy rewards, which can bias advantage estimation and trigger unstable or even divergent training. To address both issues, we propose a robust multi-agent reinforcement learning framework for collaborative reasoning, consisting of two components: Dual-Agent Answer-Critique-Rewrite (DACR) and an Adaptive Robust Estimator (ARE). DACR decomposes reasoning into a structured three-stage pipeline: answer, critique, and rewrite, while enabling explicit attribution of each agent's marginal contribution to its partner's performance. ARE provides robust estimation of batch experience means during multi-agent policy optimization. Across mathematical reasoning and embodied intelligence benchmarks, even under noisy rewards, our method consistently outperforms the baseline in both homogeneous and heterogeneous settings. These results indicate stronger robustness to reward noise and more stable training dynamics, effectively preventing optimization failures caused by noisy reward signals.
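To show why a robust mean matters under heavy-tailed rewards, the sketch below implements the plain Median-of-Means estimator: shuffle the batch, split it into blocks, average each block, and take the median of the block averages. This is the textbook estimator, not ARE itself, whose adaptive components are not specified in the abstract. The block count and the example rewards are assumptions.

```python
import random
import statistics


def median_of_means(values, num_blocks, seed=0):
    """Plain Median-of-Means: median of the means of disjoint blocks."""
    if not values:
        raise ValueError("values must be non-empty")
    k = max(1, min(num_blocks, len(values)))  # never more blocks than samples
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    blocks = [shuffled[i::k] for i in range(k)]  # k disjoint, near-equal blocks
    return statistics.median(statistics.mean(block) for block in blocks)


# Hypothetical reward batch: twenty ordinary rewards and one corrupted outlier.
rewards = [1.0] * 20 + [1000.0]
plain_mean = statistics.mean(rewards)
robust_mean = median_of_means(rewards, num_blocks=5)
```

The single outlier can contaminate only one of the five blocks, so the median of the block means stays at 1.0 while the ordinary mean is pulled above 48. Using such a center in place of the batch mean keeps one corrupted reward from shifting every advantage in the batch.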