Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
199 papers in cs.AI
Trending mixes fresh papers with community signal.
cs.AI Aryan Kasat, Smriti Singh, Aman Chadha et al. · Mar 23, 2026

This paper investigates whether LLMs exhibit genuine moral reasoning or merely produce convincing moral rhetoric through a large-scale empirical study of 13 models across 6 classical moral dilemmas. Using Kohlberg's stages of moral development as a diagnostic framework, the authors evaluate whether model outputs track human developmental patterns or reflect alignment training artifacts. The core finding is "moral ventriloquism" — the hypothesis that models acquire post-conventional moral language through RLHF without the underlying cognitive architecture, evidenced by distributional inversions (86% Stages 5-6 vs. human Stage 4 dominance), near-robotic cross-dilemma consistency (ICC > 0.90), and "moral decoupling" where stated justifications misalign with action choices.

Do large language models reason morally, or do they merely sound like they do? We investigate whether LLM responses to moral dilemmas exhibit genuine developmental progression through Kohlberg's stages of moral development, or whether alignment training instead produces reasoning-like outputs that superficially resemble mature moral judgment without the underlying developmental trajectory. Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs spanning a range of architectures, parameter scales, and training regimes across six classical moral dilemmas, and conduct ten complementary analyses to characterize the nature and internal coherence of the resulting patterns. Our results reveal a striking inversion: responses overwhelmingly correspond to post-conventional reasoning (Stages 5-6) regardless of model size, architecture, or prompting strategy, the effective inverse of human developmental norms, where Stage 4 dominates. Most strikingly, a subset of models exhibit moral decoupling: systematic inconsistency between stated moral justification and action choice, a form of logical incoherence that persists across scale and prompting strategy and represents a direct reasoning consistency failure independent of rhetorical sophistication. Model scale carries a statistically significant but practically small effect; training type has no significant independent main effect; and models exhibit near-robotic cross-dilemma consistency producing logically indistinguishable responses across semantically distinct moral problems. We posit that these patterns constitute evidence for moral ventriloquism: the acquisition, through alignment training, of the rhetorical conventions of mature moral reasoning without the underlying developmental trajectory those conventions are meant to represent.
cs.CR cs.AI cs.LG Tom Biskupski, Stephan Kleber · Mar 23, 2026

Evaluating LLM outputs at scale remains a bottleneck for deploying safe AI systems. This paper conducts a comprehensive empirical study of 37 conversational LLMs serving as automated judges across eight security and quality assessment tasks. The work identifies viable open-source alternatives to GPT-4o for judgment tasks while demonstrating that popular techniques like second-level judging and specialized evaluator models underperform compared to well-prompted general models.

A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. The resulting automation of the analysis scales up the complex evaluation of the victim models' free-form text outputs by faster and more consistent judgments compared to human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Being a comparably new technique, LLMs as judges lack a thorough investigation for their reliability and agreement to human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As assessment objective, we curate datasets for eight different categories of judgment tasks and the corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments, when combined with a suitable prompt, in particular for GPT-4o, several open-source models with $\geqslant$ 32B parameters, and a few smaller models like Qwen2.5 14B.
cs.AI cs.DC cs.SE Neelmani Vispute · Mar 23, 2026

As AI agents move from human-supervised copilots to fully autonomous infrastructure, organizations face a critical observability gap: existing systems capture computational state and execution traces but lack structured records of the agent's reasoning. This paper introduces the Agent Execution Record (AER), a schema-level primitive that captures intent, observation, and inference as first-class queryable fields at execution time. The core claim is that reasoning provenance cannot be faithfully reconstructed from state checkpoints due to fundamental non-identifiability (intent multiplicity, observation ambiguity, inference volatility). If validated, AERs would enable population-level behavioral analytics—systematic comparison of reasoning patterns across thousands of investigations, confidence calibration against expert judgments, and counterfactual regression testing via mock replay—that existing tooling achieves only through fragile post-hoc extraction.

As AI agents transition from human-supervised copilots to autonomous platform infrastructure, the ability to analyze their reasoning behavior across populations of investigations becomes a pressing infrastructure requirement. Existing operational tooling addresses adjacent needs effectively: state checkpoint systems enable fault tolerance; observability platforms provide execution traces for debugging; telemetry standards ensure interoperability. What current systems do not natively provide as a first-class, schema-level primitive is structured reasoning provenance -- normalized, queryable records of why the agent chose each action, what it concluded from each observation, how each conclusion shaped its strategy, and which evidence supports its final verdict. This paper introduces the Agent Execution Record (AER), a structured reasoning provenance primitive that captures intent, observation, and inference as first-class queryable fields on every step, alongside versioned plans with revision rationale, evidence chains, structured verdicts with confidence scores, and delegation authority chains. We formalize the distinction between computational state persistence and reasoning provenance, argue that the latter cannot in general be faithfully reconstructed from the former, and show how AERs enable population-level behavioral analytics: reasoning pattern mining, confidence calibration, cross-agent comparison, and counterfactual regression testing via mock replay. We present a domain-agnostic model with extensible domain profiles, a reference implementation and SDK, and outline an evaluation methodology informed by preliminary deployment on a production platformized root cause analysis agent.
cs.CV cs.AI cs.CL Umair Nawaz, Ahmed Heakl, Ufaq Khan et al. · Mar 23, 2026

WorldCache addresses the prohibitive latency of Diffusion Transformers (DiTs) for video world models by replacing static feature caching with a content-aware dynamical approximation framework. The method introduces motion-adaptive thresholds, saliency-weighted drift estimation, and optimal feature blending to eliminate ghosting artifacts during fast motion. Achieving 2.3× speedup on Cosmos-Predict2.5 with 99.4% quality retention, it offers a training-free path toward interactive world simulation.

Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existing methods largely rely on a Zero-Order Hold assumption, i.e., reusing cached features as static snapshots when global drift is small. This often leads to ghosting artifacts, blur, and motion inconsistencies in dynamic scenes. We propose WorldCache, a Perception-Constrained Dynamical Caching framework that improves both when and how to reuse features. WorldCache introduces motion-adaptive thresholds, saliency-weighted drift estimation, optimal approximation via blending and warping, and phase-aware threshold scheduling across diffusion steps. Our cohesive approach enables adaptive, motion-consistent feature reuse without retraining. On Cosmos-Predict2.5-2B evaluated on PAI-Bench, WorldCache achieves 2.3× inference speedup while preserving 99.4% of baseline quality, substantially outperforming prior training-free caching approaches. Our code can be accessed at https://umair1221.github.io/World-Cache/.
cs.AI econ.GN q-fin.EC Yicai Xing · Mar 23, 2026

This paper proposes that AI inference tokens are evolving into a standardized commodity like electricity, and designs a complete futures market framework including the "Standard Inference Token" (SIT) contract, settlement mechanisms, and margin systems. The core motivation is hedging compute cost risk for application-layer enterprises as inference displaces training as the dominant AI cost.

As large language models (LLMs) and vision-language-action models (VLAs) become widely deployed, the tokens consumed by AI inference are evolving into a new type of commodity. This paper systematically analyzes the commodity attributes of tokens, arguing for their transition from intelligent service outputs to compute infrastructure raw materials, and draws comparisons with established commodities such as electricity, carbon emission allowances, and bandwidth. Building on the historical experience of electricity futures markets and the theory of commodity financialization, we propose a complete design for standardized token futures contracts, including the definition of a Standard Inference Token (SIT), contract specifications, settlement mechanisms, margin systems, and market-maker regimes. By constructing a mean-reverting jump-diffusion stochastic process model and conducting Monte Carlo simulations, we evaluate the hedging efficiency of the proposed futures contracts for application-layer enterprises. Simulation results show that, under an application-layer demand explosion scenario, token futures can reduce enterprise compute cost volatility by 62%-78%. We also explore the feasibility of GPU compute futures and discuss the regulatory framework for token futures markets, providing a theoretical foundation and practical roadmap for the financialization of compute resources.
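The abstract names a mean-reverting jump-diffusion process but does not give its parameterization. As a rough illustration only, the sketch below simulates one path of a log-price using an Euler step with Ornstein-Uhlenbeck mean reversion plus Poisson-triggered Gaussian jumps. The function name `simulate_log_price`, every parameter, and the discretization are assumptions for illustration, not the paper's model.

```python
import math
import random


def simulate_log_price(n_steps, dt, kappa, mu, sigma,
                       jump_rate, jump_mean, jump_std, x0, rng):
    """Simulate one path of a mean-reverting jump-diffusion (hypothetical form).

    Euler update per step:
        x <- x + kappa * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1) + jump
    where a Gaussian jump occurs with probability jump_rate * dt.
    """
    x = x0
    path = [x]
    for _ in range(n_steps):
        diffusion = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Bernoulli approximation of a Poisson arrival within one small step.
        has_jump = rng.random() < jump_rate * dt
        jump = rng.gauss(jump_mean, jump_std) if has_jump else 0.0
        x = x + kappa * (mu - x) * dt + diffusion + jump
        path.append(x)
    return path


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs match
    path = simulate_log_price(n_steps=250, dt=1 / 250, kappa=2.0, mu=0.0,
                              sigma=0.3, jump_rate=5.0, jump_mean=0.1,
                              jump_std=0.05, x0=0.0, rng=rng)
    print(len(path), path[-1])
```

Repeating the simulation over many seeds and comparing the variance of hedged and unhedged costs would give a Monte Carlo estimate of hedging efficiency. The contract payoff needed for that step is not specified in the abstract and is left out here.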
cs.LG cs.AI Cristian Pérez-Corral, Alberto Fernández-Hernández, Jose I. Mestre et al. · Mar 23, 2026

This work attacks the friction between smooth GELU training (ubiquitous in Transformers) and piecewise-linear deployment pipelines (quantization, formal verification). The authors parametrize GELU as $f(x;\lambda) = x\Phi(\lambda x)$ with learnable sharpness $\lambda \geq 1$, deriving a principled annealing target from an $\ell_1$ approximation bound to the Heaviside step. While the hardening protocol reduces validation-drop upon ReLU substitution in vision and tabular tasks, the 25% annealing switch is heuristic and actual downstream benefits in integer-only inference or SMT verification remain unevaluated.

Gaussian Error Linear Unit (GELU) is a widely used smooth alternative to Rectified Linear Unit (ReLU), yet many deployment, compression, and analysis toolchains are most naturally expressed for piecewise-linear (ReLU-type) networks. We study a hardness-parameterized formulation of GELU, $f(x;\lambda) = x\Phi(\lambda x)$, where $\Phi$ is the Gaussian CDF and $\lambda \in [1, \infty)$ controls gate sharpness, with the goal of turning smooth gated training into a controlled path toward ReLU-compatible models. Learning $\lambda$ is non-trivial: naive updates yield unstable dynamics and effective gradient attenuation, so we introduce a constrained reparameterization and an optimizer-aware update scheme. Empirically, across a diverse set of model--dataset pairs spanning MLPs, CNNs, and Transformers, we observe structured layerwise hardness profiles and assess their robustness under different initializations. We further study a deterministic ReLU-ization strategy in which the learned gates are progressively hardened toward a principled target, enabling a post-training substitution of $\lambda$-GELU by ReLU with reduced disruption. Overall, $\lambda$-GELU provides a minimal and interpretable knob to profile and control gating hardness, bridging smooth training with ReLU-centric downstream pipelines.
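To make the formula $f(x;\lambda) = x\Phi(\lambda x)$ concrete, here is a minimal scalar sketch using only the Python standard library. It evaluates the activation and measures how far it sits from ReLU on a grid as $\lambda$ grows. The helper names (`lambda_gelu`, `max_gap_to_relu`) and the grid are illustrative assumptions; the paper's constrained reparameterization and update scheme are not reproduced.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard Gaussian CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lambda_gelu(x: float, lam: float = 1.0) -> float:
    """Hardness-parameterized GELU: f(x; lam) = x * Phi(lam * x).

    lam = 1 gives the standard (exact, erf-based) GELU.
    """
    return x * std_normal_cdf(lam * x)


def relu(x: float) -> float:
    return max(0.0, x)


def max_gap_to_relu(lam: float, xs) -> float:
    """Largest absolute difference between lambda-GELU and ReLU on a grid."""
    return max(abs(lambda_gelu(x, lam) - relu(x)) for x in xs)


if __name__ == "__main__":
    grid = [i / 100.0 for i in range(-500, 501)]  # illustrative grid on [-5, 5]
    for lam in (1.0, 4.0, 16.0):
        print(lam, max_gap_to_relu(lam, grid))
```

Substituting $u = \lambda x$ shows the gap to ReLU equals $|u|\,\Phi(-|u|)/\lambda$, so its maximum shrinks in proportion to $1/\lambda$. This matches the intuition that a larger $\lambda$ hardens the gate toward a step function.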
cs.CV cs.AI Linkuan Zhou, Yinghao Xia, Yufei Shen et al. · Mar 23, 2026

SHAPE addresses unsupervised domain adaptation for medical image segmentation, where models trained on one imaging modality (e.g., MRI) degrade sharply when applied to another (e.g., CT). The core innovation shifts the paradigm from pixel-level correctness to global anatomical plausibility through a DINOv3 foundation model, a Hierarchical Feature Modulation (HFM) module for class-aware alignment, and a Hypergraph Plausibility Estimation (HPE) pipeline that validates pseudo-labels using higher-order anatomical relationships. This matters for deploying robust clinical segmentation models across diverse imaging environments without costly manual re-annotation.

Unsupervised Domain Adaptation (UDA) is essential for deploying medical segmentation models across diverse clinical environments. Existing methods are fundamentally limited, suffering from semantically unaware feature alignment that results in poor distributional fidelity and from pseudo-label validation that disregards global anatomical constraints, thus failing to prevent the formation of globally implausible structures. To address these issues, we propose SHAPE (Structure-aware Hierarchical Unsupervised Domain Adaptation with Plausibility Evaluation), a framework that reframes adaptation towards global anatomical plausibility. Built on a DINOv3 foundation, its Hierarchical Feature Modulation (HFM) module first generates features with both high fidelity and class-awareness. This shifts the core challenge to robustly validating pseudo-labels. To augment conventional pixel-level validation, we introduce Hypergraph Plausibility Estimation (HPE), which leverages hypergraphs to assess the global anatomical plausibility that standard graphs cannot capture. This is complemented by Structural Anomaly Pruning (SAP) to purge remaining artifacts via cross-view stability. SHAPE significantly outperforms prior methods on cardiac and abdominal cross-modality benchmarks, achieving state-of-the-art average Dice scores of 90.08% (MRI->CT) and 78.51% (CT->MRI) on cardiac data, and 87.48% (MRI->CT) and 86.89% (CT->MRI) on abdominal data. The code is available at https://github.com/BioMedIA-repo/SHAPE.
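The reported results are Dice scores. For readers unfamiliar with the metric, the sketch below computes the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), for two flattened binary masks. Returning 1.0 when both masks are empty is a common convention assumed here, not something stated in the abstract.

```python
def dice_score(pred, target) -> float:
    """Dice coefficient between two equal-length flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count voxels labeled foreground in both masks, and in each mask.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


if __name__ == "__main__":
    print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
```

Multi-class benchmarks typically compute this per structure and average the results. The averaging scheme behind the reported numbers is not described in the abstract.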
cs.LG cs.AI Yuze Qin, Qingyong Li, Zhiqing Guo et al. · Mar 23, 2026

PW-FouCast addresses the degradation of radar-only precipitation nowcasting at long lead times by proposing a frequency-domain fusion framework that integrates Pangu-Weather foundation model priors with radar observations. The core insight is that meteorological forecasts and radar reflectivity share similar phase structure despite differing amplitudes, enabling spectral alignment through phase-aware modulation and memory-based correction. The approach achieves quantitative improvements on standard benchmarks and offers a novel alternative to spatial fusion methods.

Precipitation nowcasting is critical for disaster mitigation and aviation safety. However, radar-only models frequently suffer from a lack of large-scale atmospheric context, leading to performance degradation at longer lead times. While integrating meteorological variables predicted by weather foundation models offers a potential remedy, existing architectures fail to reconcile the profound representational heterogeneities between radar imagery and meteorological data. To bridge this gap, we propose PW-FouCast, a novel frequency-domain fusion framework that leverages Pangu-Weather forecasts as spectral priors within a Fourier-based backbone. Our architecture introduces three key innovations: (i) Pangu-Weather-guided Frequency Modulation to align spectral magnitudes and phases with meteorological priors; (ii) Frequency Memory to correct phase discrepancies and preserve temporal evolution; and (iii) Inverted Frequency Attention to reconstruct high-frequency details typically lost in spectral filtering. Extensive experiments on the SEVIR and MeteoNet benchmarks demonstrate that PW-FouCast achieves state-of-the-art performance, effectively extending the reliable forecast horizon while maintaining structural fidelity. Our code is available at https://github.com/Onemissed/PW-FouCast.
cs.CV cs.AI cs.GR Shivam Duggal, Xingjian Bai, Zongze Wu et al. · Mar 23, 2026

Traditional latent diffusion models require staging—first train a VAE tokenizer, freeze it, then train a diffusion model on top. UNITE proposes a single-stage approach where a shared "Generative Encoder" serves as both tokenizer and denoiser via weight sharing, achieving FID 1.73 on ImageNet 256×256 without adversarial losses or pretrained encoders like DINOv2.

Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE - an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is that tokenization and generation can be viewed as the same latent inference problem under different conditioning regimes: tokenization infers latents from fully observed images, whereas generation infers them from noise together with text or class conditioning. Motivated by this, we introduce a single-stage training procedure that jointly optimizes both tasks via two forward passes through the same Generative Encoder. The shared parameters enable gradients to jointly shape the latent space, encouraging a "common latent language". Across image and molecule modalities, UNITE achieves near state of the art performance without adversarial losses or pretrained encoders (e.g., DINO), reaching FID 2.12 and 1.73 for Base and Large models on ImageNet 256 x 256. We further analyze the Generative Encoder through the lenses of representation alignment and compression. These results show that single stage joint training of tokenization & generation from scratch is feasible.
cs.LG cs.AI Kexin Huang, Haoming Meng, Junkang Wu et al. · Mar 23, 2026

This paper investigates how Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning by focusing on the *direction* of policy updates rather than their magnitude. The authors introduce $\Delta \log p$, the signed log-probability difference between base and RLVR models, and argue it better captures reasoning-critical tokens than magnitude-based metrics like entropy or KL divergence. They validate this through token-replacement interventions and propose two practical applications: a test-time extrapolation method that amplifies the learned direction without additional training, and a training-time reweighting scheme that focuses learning on low-probability tokens.

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning capabilities of large language models. While existing analyses identify that RLVR-induced changes are sparse, they primarily focus on the magnitude of these updates, largely overlooking their direction. In this work, we argue that the direction of updates is a more critical lens for understanding RLVR's effects, which can be captured by the signed, token-level log probability difference $\Delta\log p$ between the base and final RLVR models. Through statistical analysis and token-replacement interventions, we demonstrate that $\Delta\log p$ more effectively identifies sparse, yet reasoning-critical updates than magnitude-based metrics (e.g., divergence or entropy). Building on this insight, we propose two practical applications: (1) a test-time extrapolation method that amplifies the policy along the learned $\Delta\log p$ direction to improve reasoning accuracy without further training; (2) a training-time reweighting method that focuses learning on low-probability (corresponding to higher $\Delta\log p$) tokens, which improves reasoning performance across models and benchmarks. Our work establishes the direction of change as a key principle for analyzing and improving RLVR.
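A small numeric sketch may help fix ideas. It assumes the sign convention $\Delta\log p = \log p_{\text{RLVR}} - \log p_{\text{base}}$ and one plausible form of extrapolation: adding $\alpha \cdot \Delta\log p$ to the RLVR log-probabilities and renormalizing. The abstract confirms neither choice, and the function names and toy distributions are hypothetical.

```python
import math


def delta_log_p(p_base, p_rlvr):
    """Signed per-token log-probability change (assumed sign: RLVR minus base)."""
    return [math.log(q) - math.log(p) for p, q in zip(p_base, p_rlvr)]


def extrapolate(p_base, p_rlvr, alpha):
    """Amplify the base-to-RLVR direction at test time (hypothetical form).

    scores = log p_rlvr + alpha * delta_log_p, then softmax.
    alpha = 0 recovers the RLVR distribution unchanged.
    """
    deltas = delta_log_p(p_base, p_rlvr)
    scores = [math.log(q) + alpha * d for q, d in zip(p_rlvr, deltas)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


if __name__ == "__main__":
    p_base = [0.5, 0.3, 0.2]  # toy next-token distribution under the base model
    p_rlvr = [0.3, 0.3, 0.4]  # toy distribution after RLVR
    print(delta_log_p(p_base, p_rlvr))
    print(extrapolate(p_base, p_rlvr, alpha=1.0))
```

In this toy case the third token, whose probability RLVR increased, gains further mass under extrapolation, while the token RLVR suppressed loses mass.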
cs.LG cs.AI cs.CL Kexian Tang, Jiani Wang, Shaowen Wang et al. · Mar 23, 2026

Large language models often lack coverage in specialized, data-scarce domains where web text is limited. This paper proposes SPA (Scaling Prompt-engineered Augmentation), a baseline that generates large-scale synthetic corpora using just seven carefully designed prompt templates grounded in cognitive learning strategies (Concept Learning, Critical Thinking, and Generative Learning). The core finding is that this simple approach consistently outperforms complex RL-based methods like SEAL and multi-stage pipelines like EntiGraph across Wikipedia QA, long-document comprehension, and multi-hop reasoning benchmarks, suggesting that careful prompt design combined with straightforward scaling is surprisingly effective for knowledge injection.

While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.
cs.AI cs.HC Susana Nunes, Tiago Guerreiro, Catia Pesquita · Mar 23, 2026

This paper tackles the limitation that XAI systems assume static user models, ignoring diverse epistemic stances among domain experts. The authors propose agentic personas—structured representations of expert reasoning strategies derived from clustered feedback and instantiated via LLMs—to condition reinforcement learning-based explanation generation on knowledge graphs. This enables adaptive explanations that align with specific interpretive preferences (mechanistic rigor vs. focused clarity) without requiring extensive individual-level human feedback, demonstrated in drug discovery with 22 expert participants.

AI explanation methods often assume a static user model, producing non-adaptive explanations regardless of expert goals, reasoning strategies, or decision contexts. Knowledge graph-based explanations, despite their capacity for grounded, path-based reasoning, inherit this limitation. In complex domains such as scientific discovery, this assumption fails to capture the diversity of cognitive strategies and epistemic stances among experts, preventing explanations that foster deeper understanding and informed decision-making. However, the scarcity of human experts limits the use of direct human feedback to produce adaptive explanations. We present a reinforcement learning approach for scientific explanation generation that incorporates agentic personas, structured representations of expert reasoning strategies, that guide the explanation agent towards specific epistemic preferences. In an evaluation of knowledge graph-based explanations for drug discovery, we tested two personas that capture distinct epistemic stances derived from expert feedback. Results show that persona-driven explanations match state-of-the-art predictive performance while persona preferences closely align with those of their corresponding experts. Adaptive explanations were consistently preferred over non-adaptive baselines (n = 22), and persona-based training reduces feedback requirements by two orders of magnitude. These findings demonstrate how agentic personas enable scalable adaptive explainability for AI systems in complex and high-stakes domains.
cs.CV cs.AI cs.LG Donald Shenaj, Federico Errica, Antonio Carta · Mar 23, 2026

Personalized image generation with diffusion models relies on Low-Rank Adaptation (LoRA) to fine-tune models efficiently, but current practice uses a fixed rank across all layers regardless of subject complexity. This paper proposes LoRA$^2$, which learns adaptive ranks per LoRA component via a variational framework that imposes an importance ordering over rank indices using a discretized exponential distribution. The method achieves better subject fidelity and prompt alignment while using significantly less memory than high-rank baselines, addressing the combinatorial explosion of searching $S K^L$ architectural configurations.

Low Rank Adaptation (LoRA) is the de facto fine-tuning strategy to generate personalized images from pre-trained diffusion models. Choosing a good rank is extremely critical, since it trades off performance and memory consumption, but today the decision is often left to the community's consensus, regardless of the personalized subject's complexity. The reason is evident: the cost of selecting a good rank for each LoRA component is combinatorial, so we opt for practical shortcuts such as fixing the same rank for all components. In this paper, we take a first step to overcome this challenge. Inspired by variational methods that learn an adaptive width of neural networks, we let the ranks of each layer freely adapt during fine-tuning on a subject. We achieve it by imposing an ordering of importance on the rank's positions, effectively encouraging the creation of higher ranks when strictly needed. Qualitatively and quantitatively, our approach, LoRA$^2$, achieves a competitive trade-off between DINO, CLIP-I, and CLIP-T across 29 subjects while requiring much less memory and lower rank than high rank LoRA versions. Code: https://github.com/donaldssh/NotAllLayersAreCreatedEqual.
cs.LG cs.AI Dilina Rajapakse, Juan C. Rosero, Ivana Dusparic · Mar 23, 2026

Multi-Objective Reinforcement Learning (MORL) agents must balance competing objectives like speed versus energy consumption, yet existing Explainable RL methods fail to clarify how specific behavioral choices drive Pareto trade-offs. This paper proposes TREX, a post-hoc trajectory attribution framework that clusters agent behaviors into semantically meaningful segments and quantifies each cluster's influence on objective trade-offs by training complementary policies that exclude specific trajectory groups. The work addresses a genuine gap in explainability by moving beyond policy selection to reveal which behavioral patterns (such as "long leaps" versus "short strides") justify the agent's learned trade-off logic.

Reinforcement Learning (RL) has demonstrated its ability to solve complex decision-making problems in a variety of domains, by optimizing reward signals obtained through interaction with an environment. However, many real-world scenarios involve multiple, potentially conflicting objectives that cannot be easily represented by a single scalar reward. Multi-Objective Reinforcement Learning (MORL) addresses this limitation by enabling agents to optimize several objectives simultaneously, explicitly reasoning about trade-offs between them. However, the "black box" nature of the RL models makes the decision process behind chosen objective trade-offs unclear. Current Explainable Reinforcement Learning (XRL) methods are typically designed for single scalar rewards and do not account for explanations with respect to distinct objectives or user preferences. To address this gap, in this paper we propose TREX, a Trajectory based Explainability framework to explain Multi-objective Reinforcement Learning policies, based on trajectory attribution. TREX generates trajectories directly from the learned expert policy across different user preferences and clusters them into semantically meaningful temporal segments. We quantify the influence of these behavioural segments on the Pareto trade-off by training complementary policies that exclude specific clusters, measuring the resulting relative deviation on the observed rewards and actions compared to the original expert policy. Experiments on multi-objective MuJoCo environments (HalfCheetah, Ant, and Swimmer) demonstrate the framework's ability to isolate and quantify the specific behavioural patterns.
cs.LG cs.AI cs.CV Bahar Dibaei Nia, Farzan Farnia · Mar 23, 2026

The paper addresses sample-efficient selection among multiple pretrained generative models, formulated as a diversity-aware multi-armed bandit problem where the optimal solution may be a mixture rather than a single model. The authors challenge the necessity of explicit UCB exploration bonuses, proposing that Mixture-Greedy—which directly optimizes empirical diversity objectives without optimism bonuses—can achieve sublinear regret through implicit exploration induced by the objective geometry. This matters because sampling from suboptimal generative models is computationally expensive, and their results suggest that structural properties of diversity metrics (FID, Vendi, RKE) naturally enforce sufficient exploration without costly confidence bound computations.

Efficient selection among multiple generative models is increasingly important in modern generative AI, where sampling from suboptimal models is costly. This problem can be formulated as a multi-armed bandit task. Under diversity-aware evaluation metrics, a non-degenerate mixture of generators can outperform any individual model, distinguishing this setting from classical best-arm identification. Prior approaches therefore incorporate an Upper Confidence Bound (UCB) exploration bonus into the mixture objective. However, across multiple datasets and evaluation metrics, we observe that the UCB term consistently slows convergence and often reduces sample efficiency. In contrast, a simple Mixture-Greedy strategy without explicit UCB-type optimism converges faster and achieves even better performance, particularly for widely used metrics such as FID and Vendi where tight confidence bounds are difficult to construct. We provide theoretical insight explaining this behavior: under transparent structural conditions, diversity-aware objectives induce implicit exploration by favoring interior mixtures, leading to linear sampling of all arms and sublinear regret guarantees for entropy-based, kernel-based, and FID-type objectives. These results suggest that in diversity-aware multi-armed bandits for generative model selection, exploration can arise intrinsically from the objective geometry, questioning the necessity of explicit confidence bonuses.
eess.IV cs.AI cs.CV Jiaqi Shang, Haojin Wu, Yinyi Lai et al. · Mar 23, 2026

CICTM addresses deformable brain MRI registration by combining transformer-based global context modeling with cycle inverse-consistency constraints. The core idea uses a Swin-UNet to jointly estimate forward and backward deformation fields, penalizing inconsistencies at both image and flow levels while enforcing topology preservation via Jacobian regularization. The work matters for large-scale neuroimaging studies where deformation stability and physical plausibility are as important as alignment accuracy.

Deformable image registration plays a fundamental role in medical image analysis by enabling spatial alignment of anatomical structures across subjects. While recent deep learning-based approaches have significantly improved computational efficiency, many existing methods remain limited in capturing long-range anatomical correspondence and maintaining deformation consistency. In this work, we present a cycle inverse-consistent transformer-based framework for deformable brain MRI registration. The model integrates a Swin-UNet architecture with bidirectional consistency constraints, enabling the joint estimation of forward and backward deformation fields. This design allows the framework to capture both local anatomical details and global spatial relationships while improving deformation stability. We conduct a comprehensive evaluation of the proposed framework on a large multi-center dataset consisting of 2851 T1-weighted brain MRI scans aggregated from 13 public datasets. Experimental results demonstrate that the proposed framework, CICTM, achieves strong and balanced performance across multiple quantitative evaluation metrics while maintaining stable and physically plausible deformation fields. Detailed quantitative comparisons with baseline methods, including ANTs, ICNet, and VoxelMorph, are provided in the appendix. These properties make the proposed framework suitable for large-scale neuroimaging datasets where both accuracy and deformation stability are critical.
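The two constraints named above, inverse consistency between forward and backward fields and a Jacobian check for topology preservation, can be sketched on a 1D toy grid. This is a hedged illustration only: the function names, the 1D setting, and the linear interpolation are assumptions, and the actual CICTM losses operate on 3D fields and may be defined differently.

```python
def interp(field, pos):
    """Linearly interpolate a 1D displacement field at a fractional grid position."""
    pos = min(max(pos, 0.0), len(field) - 1.0)  # clamp to the grid
    lo = min(int(pos), len(field) - 2)
    frac = pos - lo
    return (1 - frac) * field[lo] + frac * field[lo + 1]

def inverse_consistency_error(forward, backward):
    """Mean squared residual of mapping x forward and then back.

    For exact inverses, u(x) + v(x + u(x)) = 0 at every grid point x.
    """
    residuals = []
    for x, u in enumerate(forward):
        y = x + u  # where x lands under the forward field
        residuals.append((u + interp(backward, y)) ** 2)
    return sum(residuals) / len(residuals)

def folding_fraction(forward):
    """Fraction of grid intervals where the 1D Jacobian 1 + du/dx is non-positive."""
    jac = [1 + (forward[i + 1] - forward[i]) for i in range(len(forward) - 1)]
    return sum(j <= 0 for j in jac) / len(jac)
```

A constant shift paired with its opposite shift gives zero inverse-consistency error and no folding, while a sharp spike in the forward field produces a non-positive Jacobian, which is the kind of implausible deformation a Jacobian regularizer penalizes.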
cs.AI Mohammad Asadi, Jack W. O'Sullivan, Fang Cao et al. · Mar 23, 2026

This paper identifies a critical failure mode in multimodal AI evaluation called the 'mirage effect,' where vision-language models generate confident descriptions and reasoning about images that were never provided. The authors demonstrate that frontier models (GPT-5, Gemini-3-Pro, Claude Opus 4.5) retain 70–80% of their benchmark accuracy when evaluated without any visual input, with medical benchmarks showing 60–99% susceptibility to such non-visual inference. A text-only 3B-parameter model fine-tuned on chest X-ray questions outperforms both frontier multimodal systems and human radiologists, exposing how current benchmarks fail to distinguish genuine visual understanding from sophisticated textual pattern matching. The findings challenge the validity of accuracy metrics for multimodal systems and propose B-Clean, a method to filter benchmark questions that can be answered without images.

Multimodal AI systems have achieved remarkable performance across a broad range of real-world tasks, yet the mechanisms underlying visual-language reasoning remain surprisingly poorly understood. We report three findings that challenge prevailing assumptions about how these systems process and integrate visual information. First, frontier models readily generate detailed image descriptions and elaborate reasoning traces, including pathology-biased clinical findings, for images never provided; we term this phenomenon mirage reasoning. Second, without any image input, models also attain strikingly high scores across general and medical multimodal benchmarks, bringing into question their utility and design. In the most extreme case, our model achieved the top rank on a standard chest X-ray question-answering benchmark without access to any images. Third, when models were explicitly instructed to guess answers without image access, rather than being implicitly prompted to assume images were present, performance declined markedly. Explicit guessing appears to engage a more conservative response regime, in contrast to the mirage regime in which models behave as though images have been provided. These findings expose fundamental vulnerabilities in how visual-language models reason and are evaluated, pointing to an urgent need for private benchmarks that eliminate textual cues enabling non-visual inference, particularly in medical contexts where miscalibrated AI carries the greatest consequence. We introduce B-Clean as a principled solution for fair, vision-grounded evaluation of multimodal AI systems.
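The filtering idea can be sketched roughly as follows. This is an assumption-laden illustration, not the published B-Clean procedure: the function names, the data layout, and the rule "drop a question if any model answers it correctly without the image" are all hypothetical.

```python
def blind_retention(full_accuracy, blind_accuracy):
    """Share of with-image accuracy that survives when images are withheld."""
    return blind_accuracy / full_accuracy

def filter_vision_dependent(questions, blind_correct):
    """Keep only questions that every evaluated model got wrong without the image.

    blind_correct maps question id -> {model name: answered correctly without image}.
    """
    return [q for q in questions if not any(blind_correct[q].values())]
```

For example, a question that even one text-only run answers correctly is treated as leaking textual cues and is excluded; a stricter or looser threshold is an equally plausible design choice.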
cs.LG cs.AI stat.ME Marc Franquesa Monés, Jiaqi Zhang, Caroline Uhler · Mar 23, 2026

Constraint-based causal discovery algorithms like PC require exponentially many conditional independence (CI) tests in the worst case, specifically $p^{\mathcal{O}(d)}$, where $d$ is the maximum degree. This paper establishes that the fundamental complexity parameter is actually $s$, the maximum undirected clique size in the essential graph, which can be much smaller than $d$ (e.g., $s=2$ vs $d=p-2$ in Figure 1). The authors propose Greedy Ancestral Search (GAS), which achieves $p^{\mathcal{O}(s)}$ CI tests, and prove a matching lower bound of $2^{\Omega(s)}$, establishing exponent-optimality up to a logarithmic factor.

Learning causal relations from observational data is a fundamental problem with wide-ranging applications across many fields. Constraint-based methods infer the underlying causal structure by performing conditional independence tests. However, existing algorithms such as the prominent PC algorithm need to perform a large number of independence tests, which in the worst case is exponential in the maximum degree of the causal graph. Despite extensive research, it remains unclear if there exist algorithms with better complexity without additional assumptions. Here, we establish an algorithm that achieves a better complexity of $p^{\mathcal{O}(s)}$ tests, where $p$ is the number of nodes in the graph and $s$ denotes the maximum undirected clique size of the underlying essential graph. Complementing this result, we prove that any constraint-based algorithm must perform at least $2^{\Omega(s)}$ conditional independence tests, establishing that our proposed algorithm achieves exponent-optimality up to a logarithmic factor in terms of the number of conditional independence tests needed. Finally, we validate our theoretical findings through simulations, on semi-synthetic gene-expression data, and real-world data, demonstrating the efficiency of our algorithm compared to existing methods in terms of number of conditional independence tests needed.
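The "up to a logarithmic factor" qualifier can be read off by writing both bounds in base 2. This is a short worked step under the stated bounds only, not a reproduction of the paper's proof:

```latex
\begin{align*}
\text{upper bound:}\quad & p^{\mathcal{O}(s)} = 2^{\mathcal{O}(s \log_2 p)} \\
\text{lower bound:}\quad & 2^{\Omega(s)} \\
\text{gap between exponents:}\quad & \text{at most a factor of order } \log_2 p \\
\text{example with } s = 2,\ d = p - 2\text{:}\quad & p^{\mathcal{O}(d)} = p^{\mathcal{O}(p-2)} \quad\text{versus}\quad p^{\mathcal{O}(s)} = p^{\mathcal{O}(2)}
\end{align*}
```

In the example case the degree-based bound is exponential in $p$, while the clique-based bound is polynomial in $p$.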
cs.AI cs.CV Xi Wang, Xu Yang, Donghao Sun et al. · Mar 23, 2026

Long-tail class incremental learning (LT-CIL) suffers from scarce tail-class data and catastrophic forgetting. This paper tackles both issues by using large language models to generate a stratified language tree (SL-Tree) that hierarchically organizes semantic information from coarse to fine granularity. Two parallel guidance mechanisms—adaptive language guidance with learnable per-class weights and alignment language guidance using semantic space stability—dynamically supervise tail classes and constrain optimization. The approach achieves reported state-of-the-art results on ImageNet-R, CIFAR100, and CUB200 benchmarks.

Long-tail class incremental learning (LT-CIL) remains highly challenging because the scarcity of samples in tail classes not only hampers their learning but also exacerbates catastrophic forgetting under continuously evolving and imbalanced data distributions. To tackle these issues, we exploit the informativeness and scalability of language knowledge. Specifically, we analyze the LT-CIL data distribution to guide large language models (LLMs) in generating a stratified language tree that hierarchically organizes semantic information from coarse- to fine-grained granularity. Building upon this structure, we introduce stratified adaptive language guidance, which leverages learnable weights to merge multi-scale semantic representations, thereby enabling dynamic supervisory adjustment for tail classes and alleviating the impact of data imbalance. Furthermore, we introduce stratified alignment language guidance, which exploits the structural stability of the language tree to constrain optimization and reinforce semantic-visual alignment, thereby alleviating catastrophic forgetting. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance.
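One way to picture "learnable weights to merge multi-scale semantic representations" is a softmax-weighted sum of the text embeddings at each tree level for a given class. This is a minimal sketch under assumptions: the softmax normalization, the function names, and the plain-list embeddings are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def merge_levels(level_embeddings, level_logits):
    """Weighted sum of per-level text embeddings (coarse to fine) for one class.

    level_logits stands in for the learnable per-class weights; in training
    they would be updated by gradient descent rather than set by hand.
    """
    weights = softmax(level_logits)
    dim = len(level_embeddings[0])
    return [
        sum(w * emb[i] for w, emb in zip(weights, level_embeddings))
        for i in range(dim)
    ]
```

With equal logits the levels contribute equally; pushing one logit up lets a class lean on whichever granularity is most useful for it.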
cs.AI Ankush Agarwal, Harsh Vishwakarma, Suraj Nagaje et al. · Mar 23, 2026

EnterpriseLab tackles the challenge of deploying AI agents in enterprise settings where data sovereignty and cost constraints make frontier models impractical. The paper introduces a full-stack platform that unifies tool integration via Model Context Protocol (MCP), automated trajectory synthesis from environment schemas, and integrated training pipelines including a novel Agentic GRPO method. The core value proposition is that small 8B models can match GPT-4o on enterprise tasks while cutting inference costs by 8–10×, enabling on-premise deployment without sacrificing operational capability.

Deploying AI agents in enterprise environments requires balancing capability with data sovereignty and cost constraints. While small language models offer privacy-preserving alternatives to frontier models, their specialization is hindered by fragmented development pipelines that separate tool integration, data generation, and training. We introduce EnterpriseLab, a full-stack platform that unifies these stages into a closed-loop framework. EnterpriseLab provides (1) a modular environment exposing enterprise applications via Model Context Protocol, enabling seamless integration of proprietary and open-source tools; (2) automated trajectory synthesis that programmatically generates training data from environment schemas; and (3) integrated training pipelines with continuous evaluation. We validate the platform through EnterpriseArena, an instantiation with 15 applications and 140+ tools across IT, HR, sales, and engineering domains. Our results demonstrate that 8B-parameter models trained within EnterpriseLab match GPT-4o's performance on complex enterprise workflows while reducing inference costs by 8-10x, and remain robust across diverse enterprise benchmarks, including EnterpriseBench (+10%) and CRMArena (+10%). EnterpriseLab provides enterprises a practical path to deploying capable, privacy-preserving agents without compromising operational capability.