Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
199 papers in cs.AI
Trending mixes fresh papers with community signal.
cs.CL, cs.AI · Vinay Sharma, Manish Jain · Mar 22, 2026

This paper evaluates three inference-time strategies—self-consistency with temperature/top-p sampling, dual-model cross-verification, and iterative self-reflection—to improve multi-step reasoning in LLMs without parameter updates. The core premise is that aggregating diverse reasoning traces or validating across models yields more reliable outputs than single-pass decoding. The work addresses a practical need for deployment scenarios where retraining is infeasible, though the experimental scope is limited by unclear model specifications and dataset choices.

Large Language Models (LLMs) often exhibit strong linguistic abilities while remaining unreliable on multi-step reasoning tasks, particularly when deployed without additional training or fine-tuning. In this work, we study inference-time techniques to improve the reasoning accuracy of LLMs. We systematically evaluate three classes of inference-time strategies: (i) self-consistency via stochastic decoding, where the model is sampled multiple times using controlled temperature and nucleus sampling and the most frequent final answer is selected; (ii) dual-model reasoning agreement, where outputs from two independent models are compared and only consistent reasoning traces are trusted; and (iii) self-reflection, where the model critiques and revises its own reasoning. Across all evaluated methods, we employ Chain-of-Thought (CoT) [1] prompting to elicit explicit intermediate reasoning steps before generating final answers. We provide a controlled comparative evaluation of the three strategies under identical prompting and verification settings. Our experiments on LLM [2] show that self-consistency with nucleus sampling and a controlled temperature yields substantial gains, achieving a 9% to 15% absolute improvement in accuracy over greedy single-pass decoding; it is well-suited for low-risk domains, offering meaningful gains with minimal overhead. The dual-model approach provides additional confirmation of the model's reasoning steps and is thus more appropriate for moderate-risk domains, where higher reliability justifies additional compute. Self-reflection offers only marginal improvements, suggesting limited effectiveness for smaller non-reasoning models at inference time.
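The self-consistency strategy described in this abstract (sample several times, keep the most frequent final answer) can be sketched in a few lines. This is a minimal illustration, not the authors' code: `sample_answer` and `make_noisy_solver` are hypothetical stand-ins for a temperature/top-p sampled LLM call, and the sample count is an arbitrary choice.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple


def self_consistency(sample_answer: Callable[[], str],
                     num_samples: int = 5) -> Tuple[str, List[str]]:
    """Draw several final answers and return the most frequent one plus all samples."""
    answers = [sample_answer() for _ in range(num_samples)]
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner, answers


def make_noisy_solver(seed: int) -> Callable[[], str]:
    """Hypothetical stand-in for a stochastic LLM call: usually right, sometimes off."""
    rng = random.Random(seed)

    def solver() -> str:
        return "42" if rng.random() < 0.7 else rng.choice(["41", "43"])

    return solver


if __name__ == "__main__":
    answer, samples = self_consistency(make_noisy_solver(0), num_samples=9)
    print(samples, "->", answer)
```

In a real pipeline the final answer would first be parsed out of each chain-of-thought trace, so that votes are counted over answers rather than over full reasoning strings.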
stat.ML, cs.AI, cs.CV · Osamu Hirose, Emanuele Rodola · Mar 22, 2026

Domain Elastic Transform (DET) addresses the registration of high-dimensional vector-valued functions on irregular, sparse manifolds—a critical bottleneck in spatial transcriptomics where gene expression data resides on scattered cell positions rather than regular grids. The core idea is a Bayesian framework that treats registration as elastic domain deformation guided by a joint spatial-functional likelihood, bypassing the lossy voxelization required by image-based methods while exploiting functional signals that pure geometric point-set registration ignores. This matters because it enables training-free analysis of massive atlases (e.g., MERFISH, Stereo-seq) without sacrificing single-cell resolution.

Nonrigid registration is conventionally divided into point set registration, which aligns sparse geometries, and image registration, which aligns continuous intensity fields on regular grids. However, this dichotomy creates a critical bottleneck for emerging scientific data, such as spatial transcriptomics, where high-dimensional vector-valued functions, e.g., gene expression, are defined on irregular, sparse manifolds. Consequently, researchers currently face a forced choice: either sacrifice single-cell resolution via voxelization to utilize image-based tools, or ignore the critical functional signal to utilize geometric tools. To resolve this dilemma, we propose Domain Elastic Transform (DET), a grid-free probabilistic framework that unifies geometric and functional alignment. By treating data as functions on irregular domains, DET registers high-dimensional signals directly without binning. We formulate the problem within a rigorous Bayesian framework, modeling domain deformation as an elastic motion guided by a joint spatial-functional likelihood. The method is fully unsupervised and scalable, utilizing feature-sensitive downsampling to handle massive atlases. We demonstrate that DET achieves 92% topological preservation on MERFISH data where state-of-the-art optimal transport methods struggle (<5%), and successfully registers whole-embryo Stereo-seq atlases across developmental stages -- a task involving massive scale and complex nonrigid growth. The implementation of DET is available on https://github.com/ohirose/bcpd (since Mar, 2025).
cs.CL, cs.AI · Runze Sun, Yu Zheng, Zexuan Xiong et al. · Mar 22, 2026

This paper tackles multimodal hate speech detection where hateful intent emerges from complex interactions between text and images—what the authors call "more than the sum of its parts." The core innovation is the Stratified Multimodal Interaction (SMI) paradigm, which categorizes eight distinct cross-modal interaction patterns into three difficulty levels (Easy, Normal, Hard), coupled with the ARCADE framework that simulates an asymmetric courtroom debate between Prosecutor, Defender, and Judge agents to decipher subtle intent shifts. This matters because current detection systems fail when hateful content is constructed implicitly through benign-seeming modalities that only become toxic in combination.

Combating hate speech on social media is critical for securing cyberspace, yet relies heavily on the efficacy of automated detection systems. As content formats evolve, hate speech is transitioning from solely plain text to complex multimodal expressions, making implicit attacks harder to spot. Current systems, however, often falter on these subtle cases, as they struggle with multimodal content where the emergent meaning transcends the aggregation of individual modalities. To bridge this gap, we move beyond binary classification to characterize semantic intent shifts where modalities interact to construct implicit hate from benign cues or neutralize toxicity through semantic inversion. Guided by this fine-grained formulation, we curate the Hate via Vision-Language Interplay (H-VLI) benchmark where the true intent hinges on the intricate interplay of modalities rather than overt visual or textual slurs. To effectively decipher these complex cues, we further propose the Asymmetric Reasoning via Courtroom Agent DEbate (ARCADE) framework. By simulating a judicial process where agents actively argue for accusation and defense, ARCADE forces the model to scrutinize deep semantic cues before reaching a verdict. Extensive experiments demonstrate that ARCADE significantly outperforms state-of-the-art baselines on H-VLI, particularly for challenging implicit cases, while maintaining competitive performance on established benchmarks. Our code and data are available at: https://github.com/Sayur1n/H-VLI
cs.CV, cs.AI · Zhongyang Li, Yaqian Li, Faming Fang et al. · Mar 22, 2026

QMoP tackles the computational bottleneck in multimodal LLMs caused by excessive visual tokens, which dwarf text tokens in memory and compute costs. The paper proposes a Query Guided Mixture-of-Projector that dynamically combines three compression strategies—pooling for global semantics, resampling for high-level features, and pruning for fine-grained details—via a learned router. This adaptive approach matters because fixed compression rules inherently sacrifice different information types (global context vs. local details) depending on the task.

Multimodal large language models suffer from severe computational and memory bottlenecks, as the number of visual tokens far exceeds that of textual tokens. While recent methods employ projector modules to align and compress visual tokens into text-aligned features, they typically depend on fixed heuristics that limit adaptability across diverse scenarios. In this paper, we first propose Query Guided Mixture-of-Projector (QMoP), a novel and flexible framework that adaptively compresses visual tokens via three collaborative branches: (1) a pooling-based branch for coarse-grained global semantics, (2) a resampler branch for extracting high-level semantic representations, and (3) a pruning-based branch for fine-grained token selection to preserve critical visual detail. To adaptively coordinate these branches, we introduce the Query Guided Router (QGR), which dynamically selects and weights the outputs from different branches based on both visual input and textual queries. A Mixture-of-Experts-style fusion mechanism is designed to aggregate the outputs, harnessing the strengths of each strategy while suppressing noise. To systematically evaluate the effects of Visual Token Compression, we also develop VTCBench, a dedicated benchmark for evaluating the information loss induced by visual token compression. Extensive experiments demonstrate that despite relying on fundamental compression modules, QMoP outperforms strong baselines and delivers significant savings in memory, computation, and inference time.
cs.CR, cs.AI · Trung V. Phan, Thomas Bauschert · Mar 22, 2026

DeepXplain tackles the opacity of autonomous APT defense by integrating explainability signals directly into reinforcement learning rather than treating explanation as a post-hoc add-on. The framework augments provenance-graph-based DRL with an alignment loss that ties policy decisions to GNN-derived structural explanations and temporal attributions, coupled with a confidence-aware reward shaping term. The core claim is that this tight coupling improves both task performance (F1-score from 0.887 to 0.915) and explanation quality (confidence 0.86, fidelity 0.79) compared to black-box alternatives.

Advanced Persistent Threats (APTs) are stealthy, multi-stage attacks that require adaptive and timely defense. While deep reinforcement learning (DRL) enables autonomous cyber defense, its decisions are often opaque and difficult to trust in operational environments. This paper presents DeepXplain, an explainable DRL framework for stage-aware APT defense. Building on our prior DeepStage model, DeepXplain integrates provenance-based graph learning, temporal stage estimation, and a unified XAI pipeline that provides structural, temporal, and policy-level explanations. Unlike post-hoc methods, explanation signals are incorporated directly into policy optimization through evidence alignment and confidence-aware reward shaping. To the best of our knowledge, DeepXplain is the first framework to integrate explanation signals into reinforcement learning for APT defense. Experiments in a realistic enterprise testbed show improvements in stage-weighted F1-score (0.887 to 0.915) and success rate (84.7% to 89.6%), along with higher explanation confidence (0.86), improved fidelity (0.79), and more compact explanations (0.31). These results demonstrate enhanced effectiveness and trustworthiness of autonomous cyber defense.
cs.CR, cs.AI · Di Lu, Yongzhi Liao, Xutong Mu et al. · Mar 22, 2026

Host-acting agents let users state goals while the system figures out how to achieve them. This paper argues this convenience creates a novel attack surface: semantic under-specification. When users specify outcomes but not safety boundaries, agents must fill in missing semantics—and may choose security-divergent plans even when no attacker is present and the goal is benign.

Host-acting agents promise a convenient interaction model in which users specify goals and the system determines how to realize them. We argue that this convenience introduces a distinct security problem: semantic under-specification in goal specification. User instructions are typically goal-oriented, yet they often leave process constraints, safety boundaries, persistence, and exposure insufficiently specified. As a result, the agent must complete missing execution semantics before acting, and this completion can produce risky host-side plans even when the user-stated goal is benign. In this paper, we develop a semantic threat model, present a taxonomy of semantic-induced risky completion patterns, and study the phenomenon through an OpenClaw-centered case study and execution-trace analysis. We further derive defense design principles for making execution boundaries explicit and constraining risky completion. These findings suggest that securing host-acting agents requires governing not only which actions are allowed at execution time, but also how goal-only instructions are translated into executable plans.
cs.CV, cs.AI · Zhengxian Wu, Kai Shi, Chuanrui Zhang et al. · Mar 22, 2026

Current multimodal large language models rely on expensive annotated data or teacher distillation for reasoning improvements. This paper proposes an unsupervised self-evolution framework that trains without ground-truth labels or external reward models by instantiating dual roles—an Actor that generates multiple reasoning trajectories and a frozen Judge that modulates consistency-based rewards. The method employs group-wise distributional modeling using Group Relative Policy Optimization (GRPO) to convert absolute scores into relative advantages, achieving up to +5.9 absolute accuracy gains on MathVision while maintaining healthier training entropy than majority-voting baselines.

Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor's self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal models. The code is available at https://dingwu1021.github.io/SelfJudge/.
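The step of converting absolute scores into relative advantages within each group can be illustrated with the mean/standard-deviation normalization commonly used with GRPO. This sketch assumes that standard form; the paper's self-consistency prior and bounded Judge modulation are not reproduced, and the example scores are made up.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(scores: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize one group's scores to zero mean and (roughly) unit spread.

    Assumption: advantage = (score - group mean) / (group std + eps), using the
    population standard deviation; some implementations use the sample one.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


if __name__ == "__main__":
    # Hypothetical modulated scores for four trajectories sampled from one input.
    group_scores = [0.9, 0.2, 0.7, 0.2]
    print(group_relative_advantages(group_scores))
```

Because every trajectory is compared only with others from the same input, a group in which all trajectories score alike contributes near-zero advantages and therefore almost no policy gradient.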
cs.AI · Keito Inoshita, Michiaki Omura, Tsukasa Yamanaka et al. · Mar 22, 2026

This paper investigates whether AI-assisted writing improves essay quality at the cost of homogenizing student thinking. Analyzing 6,875 essays across five conditions (Human-only, AI-only, and three Human+AI prompt strategies), the authors identify a Quality-Homogenization Tradeoff whereby substantial quality gains co-occur with structural convergence. The effect is dimension-specific: cohesion architecture loses 70–78% of its variance while perspective plurality diversifies, and prompt specificity can reverse homogenization into diversification on argument depth.

While AI-assisted writing has been widely reported to improve essay quality, its impact on the structural diversity of student thinking remains unexplored. Analyzing 6,875 essays across five conditions (Human-only, AI-only, and three Human+AI prompt strategies), we provide the first empirical evidence of a Quality-Homogenization Tradeoff, in which substantial quality gains co-occur with significant homogenization. The effect is dimension-specific: cohesion architecture lost 70-78% of its variance, whereas perspective plurality was diversified. Convergence target analysis further revealed that AI-augmented essays were pulled toward AI structural patterns yet deviated significantly from the Human-AI axis, indicating simultaneous partial replacement and partial emergence. Crucially, prompt specificity reversed homogenization into diversification on argument depth, demonstrating that homogenization is not an intrinsic property of AI but a function of interaction design.
cs.LG, cs.AI, cs.CV · Minjong Cheon · Mar 22, 2026

Sonny tackles the compute barrier in medium-range weather forecasting by proposing a hierarchical transformer that trains on a single A40 GPU in 5.5 days. The core idea is a two-stage StepsNet pipeline: a narrow 'slow path' processes large-scale dynamics (U,V,Z,P) first, then a full-width 'fast path' integrates thermodynamics (T,Q). Combined with EMA during training, randomized dynamics forecasting, and pressure-weighted losses, Sonny aims to deliver competitive forecast skill without the TPU/GPU cluster requirements of models like Pangu-Weather or GraphCast.

Weather forecasting is a fundamental problem for protecting lives and infrastructure from high-impact atmospheric events. Recently, data-driven weather forecasting methods based on deep learning have demonstrated strong performance, often reaching accuracy levels competitive with operational numerical systems. However, many existing models rely on large-scale training regimes and compute-intensive architectures, which raises the practical barrier for academic groups with limited compute resources. Here we introduce Sonny, an efficient hierarchical transformer that achieves competitive medium-range forecasting performance while remaining feasible within reasonable compute budgets. At the core of Sonny is a two-stage StepsNet design: a narrow slow path first models large-scale atmospheric dynamics, and a subsequent full-width fast path integrates thermodynamic interactions. To stabilize medium-range rollout without an additional fine-tuning stage, we apply exponential moving average (EMA) during training. On WeatherBench2, Sonny yields robust medium-range forecast skill, remains competitive with operational baselines, and demonstrates clear advantages over FastNet, particularly at extended tropical lead times. In practice, Sonny can be trained to convergence on a single NVIDIA A40 GPU in approximately 5.5 days.
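The exponential moving average (EMA) mentioned in this abstract keeps a slowly updated shadow copy of the model weights. The sketch below shows the usual update rule on a plain dictionary of scalars; the decay value, the parameter name, and the toy training loop are assumptions for illustration, not details taken from Sonny.

```python
from typing import Dict


def ema_update(ema: Dict[str, float], weights: Dict[str, float],
               decay: float = 0.999) -> Dict[str, float]:
    """Return new shadow weights: decay * ema + (1 - decay) * current weights."""
    return {name: decay * ema[name] + (1.0 - decay) * weights[name] for name in ema}


if __name__ == "__main__":
    weights = {"w": 0.0}      # hypothetical single-parameter model
    shadow = dict(weights)    # shadow copy starts at the initial weights
    for step in range(1, 6):
        weights["w"] = float(step)                    # stand-in for an optimizer step
        shadow = ema_update(shadow, weights, decay=0.9)
        print(step, weights["w"], round(shadow["w"], 4))
```

The shadow weights lag behind the raw weights and average out step-to-step noise, which is why they are often the ones used for evaluation or rollout.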
cs.CV, cs.AI · Tian Xia, Matthew Sinclair, Andreas Schuh et al. · Mar 22, 2026

Existing counterfactual image generation methods produce either global changes or require tedious user-defined masks. This paper proposes Positional Seg-CFT, which subdivides anatomical structures into regional segments (e.g., proximal, mid, distal) and derives independent measurements per region from pretrained segmentors. The extension enables spatially localized interventions for modeling regional disease progression, demonstrated on coronary CT angiography.

Counterfactual image generation enables controlled data augmentation, bias mitigation, and disease modeling. However, existing methods guided by external classifiers or regressors are limited to subject-level factors (e.g., age) and fail to produce localized structural changes, often resulting in global artifacts. Pixel-level guidance using segmentation masks has been explored, but requires user-defined counterfactual masks, which are tedious and impractical. Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT) addressed this by using segmentation-derived measurements to supervise structure-specific variables, yet it remains restricted to global interventions. We propose Positional Seg-CFT, which subdivides each structure into regional segments and derives independent measurements per region, enabling spatially localized and anatomically coherent counterfactuals. Experiments on coronary CT angiography show that Pos-Seg-CFT generates realistic, region-specific modifications, providing finer spatial control for modeling disease progression.
cs.LG, cs.AI, cs.SD · Soudeep Ghoshal, Sandipan Chakraborty, Pradipto Chowdhury et al. · Mar 22, 2026

This paper tackles the tension between local melodic continuity and global structural coherence in symbolic music generation. It proposes a hybrid architecture fusing a Transformer encoder (for global patterns) with an LSTM decoder (for temporal precision), evaluating it against pure LSTM and Transformer baselines using 17 musical quality metrics on 1,000 generated melodies per model. The work matters because it provides systematic evidence that architectural hybridization can reconcile the complementary strengths of memory-based and attention-based models.

Machine learning techniques, such as Transformers and Long Short-Term Memory (LSTM) networks, play a crucial role in Symbolic Music Generation (SMG). Existing literature indicates a difference between LSTMs and Transformers regarding their ability to model local melodic continuity versus maintaining global structural coherence. However, their specific properties within the context of SMG have not been systematically studied. This paper addresses this gap by providing a fine-grained comparative analysis of LSTMs versus Transformers for SMG, examining local and global properties in detail using 17 musical quality metrics on the Deutschl dataset. We find that LSTM networks excel at capturing local patterns but fail to preserve long-range dependencies, while Transformers model global structure effectively but tend to produce irregular phrasing. Based on this analysis and leveraging their respective strengths, we propose a Hybrid architecture combining a Transformer Encoder with an LSTM Decoder and evaluate it against both baselines. We evaluated 1,000 generated melodies from each of the three architectures on the Deutschl dataset. The results show that the hybrid method achieves better local and global continuity and coherence compared to the baselines. Our work highlights the key characteristics of these models and demonstrates how their properties can be leveraged to design superior models. We also supported the experiments with ablation studies and human perceptual evaluations, which statistically support the findings and provide robust validation for this work.
cs.CY, cs.AI · Zongjie Li, Chaozheng Wang, Yuchong Xie et al. · Mar 22, 2026

WARBENCH is a benchmark for evaluating LLMs in military decision-making, addressing critical gaps in current frameworks by testing International Humanitarian Law (IHL) compliance, edge deployment constraints, fog-of-war robustness, and explicit reasoning. Using 136 high-fidelity scenarios derived from real post-WWII conflicts, the authors expose severe structural flaws: state-of-the-art models collapse under complex terrain and asymmetric force distributions, while edge-optimized models exhibit legal violation rates approaching 70%.

Large Language Models are increasingly being considered for deployment in safety-critical military applications. However, current benchmarks suffer from structural blindspots that systematically overestimate model capabilities in real-world tactical scenarios. Existing frameworks typically ignore strict legal constraints based on International Humanitarian Law (IHL), omit edge computing limitations, lack robustness testing for fog of war, and inadequately evaluate explicit reasoning. To address these vulnerabilities, we present WARBENCH, a comprehensive evaluation framework establishing a foundational tactical baseline alongside four distinct stress-testing dimensions. Through a large-scale empirical evaluation of nine leading models on 136 high-fidelity historical scenarios, we reveal severe structural flaws. First, baseline tactical reasoning systematically collapses under complex terrain and high force asymmetry. Second, while state-of-the-art closed-source models maintain functional compliance, edge-optimized small models expose extreme operational risks with legal violation rates approaching 70 percent. Furthermore, models experience catastrophic performance degradation under 4-bit quantization and systematic information loss. Conversely, explicit reasoning mechanisms serve as highly effective structural safeguards against inadvertent violations. Ultimately, these findings demonstrate that current models remain fundamentally unready for autonomous deployment in high-stakes tactical environments.
cs.CR, cs.AI · Qiuchi Xiang, Haoxuan Qu, Hossein Rahmani et al. · Mar 22, 2026

This paper investigates the security of multi-agent LLM discussions under continuous monitoring, where anomaly detectors block suspicious inter-agent messages. The authors identify that existing attacks either exhibit detectable patterns (>93% detection rates) or become ineffective when adapted for stealth (<8% success). To address this, they develop a novel attack strategy using an adversarial-aware Friedkin-Johnsen opinion dynamics model to strategically select which agents to hijack and which targets to influence. Their findings demonstrate that even under continuous monitoring, attacks can achieve over 40% success rates, revealing that monitoring alone is insufficient to secure multi-agent systems.

Multi-agent discussions have been widely adopted, motivating growing efforts to develop attacks that expose their vulnerabilities. In this work, we study a practical yet largely unexplored attack scenario, the discussion-monitored scenario, where anomaly detectors continuously monitor inter-agent communications and block detected adversarial messages. Although existing attacks are effective without discussion monitoring, we show that they exhibit detectable patterns and largely fail under such monitoring constraints. But does this imply that monitoring alone is sufficient to secure multi-agent discussions? To answer this question, we develop a novel attack method explicitly tailored to the discussion-monitored scenario. Extensive experiments demonstrate that effective attacks remain possible even under continuous monitoring, indicating that monitoring alone does not eliminate adversarial risks.
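The summary of this paper mentions a Friedkin-Johnsen opinion dynamics model. For orientation, here is a sketch of the standard Friedkin-Johnsen update, x(t+1) = Λ W x(t) + (I − Λ) x(0), with one fully stubborn agent standing in for a hijacked one. The paper's adversarial-aware variant and its agent-selection strategy are not reproduced; the influence matrix and susceptibilities are made-up values.

```python
from typing import List


def fj_step(x: List[float], x0: List[float],
            W: List[List[float]], lam: List[float]) -> List[float]:
    """One standard Friedkin-Johnsen update.

    W is a row-stochastic influence matrix; lam[i] in [0, 1] is agent i's
    susceptibility to others (lam[i] = 0 means fully anchored to x0[i]).
    """
    n = len(x)
    return [
        lam[i] * sum(W[i][j] * x[j] for j in range(n)) + (1.0 - lam[i]) * x0[i]
        for i in range(n)
    ]


def fj_run(x0: List[float], W: List[List[float]], lam: List[float],
           steps: int = 100) -> List[float]:
    """Iterate the update from the initial opinions x0."""
    x = list(x0)
    for _ in range(steps):
        x = fj_step(x, x0, W, lam)
    return x


if __name__ == "__main__":
    # Agent 0 is a stubborn (hypothetically hijacked) agent holding opinion 1.0.
    x0 = [1.0, 0.0, 0.0]
    W = [[1.0, 0.0, 0.0],
         [0.5, 0.0, 0.5],
         [0.5, 0.5, 0.0]]
    lam = [0.0, 0.8, 0.8]
    print(fj_run(x0, W, lam))
```

In this toy setup the two susceptible agents settle at about 2/3, showing how a single anchored agent can pull a group's opinions without ever changing its own.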
cs.CL, cs.AI, cs.HC · Pranav Hemanth, Sampriti Saha · Mar 22, 2026

This paper tackles logical context poisoning—the degradation of LLM responses when flat, linear conversation structures force topically distinct threads to accumulate in a single unbounded context window. The core idea is the Conversation Tree Architecture (CTA), which models conversations as a directed rooted tree $\mathcal{T}=(V,E,r,W)$ where each node $v \in V$ maintains an isolated local context window $w_v$. Structured flow operations—downstream passing $\phi_{\downarrow}$, upstream merging $\psi_{\uparrow}$, and volatile nodes—govern how context moves between branches. This matters because current interfaces offer no middle ground between discarding context (new chat) and accumulating noise (linear threads).

Large language models (LLMs) are increasingly deployed for extended, multi-topic conversations, yet the flat, append-only structure of current conversation interfaces introduces a fundamental limitation: all context accumulates in a single unbounded window, causing topically distinct threads to bleed into one another and progressively degrade response quality. We term this failure mode logical context poisoning. In this paper, we introduce the Conversation Tree Architecture (CTA), a hierarchical framework that organizes LLM conversations as trees of discrete, context-isolated nodes. Each node maintains its own local context window; structured mechanisms govern how context flows between parent and child nodes, downstream on branch creation and upstream on branch deletion. We additionally introduce volatile nodes, transient branches whose local context must be selectively merged upward or permanently discarded before purging. We formalize the architecture's primitives, characterize the open design problems in context flow, relate our framework to prior work in LLM memory management, and describe a working prototype implementation. The CTA provides a principled foundation for structured conversational context management and extends naturally to multi-agent settings.
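A conversation tree with context-isolated nodes, downstream passing on branch creation, and upstream merging on branch deletion can be pictured with a small data structure. This is a hypothetical sketch of the idea only: the class name, the `inherit_last` policy, and the summary-string merge are assumptions, not the authors' prototype.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison avoids recursing through parent/child links
class Node:
    name: str
    context: List[str] = field(default_factory=list)  # this node's local window only
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    volatile: bool = False  # transient branch that must be merged or discarded

    def branch(self, name: str, inherit_last: int = 0, volatile: bool = False) -> "Node":
        """Downstream passing: the child starts with a copy of the last few messages."""
        passed = self.context[-inherit_last:] if inherit_last > 0 else []
        child = Node(name=name, context=list(passed), parent=self, volatile=volatile)
        self.children.append(child)
        return child

    def close(self, merge_summary: Optional[str] = None) -> None:
        """Upstream merging: optionally hand a summary to the parent, then detach."""
        if self.parent is None:
            raise ValueError("the root node cannot be closed")
        if merge_summary is not None:
            self.parent.context.append(merge_summary)
        self.parent.children.remove(self)
        self.parent = None


if __name__ == "__main__":
    root = Node("trip-planning")
    root.context += ["user: plan a week in Kyoto", "assistant: here is a draft"]
    side = root.branch("currency-question", inherit_last=1, volatile=True)
    side.context.append("user: what is the yen exchange rate?")
    side.close(merge_summary="note: user asked about exchange rates")
    print(root.context)
```

Messages added inside `side` never reach the root's window unless they are explicitly merged, which is the isolation property the abstract describes.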
cs.LG, cs.AI · Zihan Fang, Qianru Wang, Haonan An et al. · Mar 22, 2026

The paper addresses federated fine-tuning of Mixture-of-Experts (MoE) based large language models under non-IID data distributions, where direct parameter aggregation causes gating preference misalignment and expert semantic blurring. The proposed FedAlign-MoE framework introduces consistency-based gating distribution alignment using routing consistency weighting ($\omega_i(e) = s_i(e)/\sum_j s_j(e)$) and semantic-aware expert aggregation via region-conditioned gated weights ($\gamma_{i,j}(e)$). This matters because MoE architectures are increasingly vital for scaling LLMs efficiently, yet data heterogeneity across federated clients undermines their specialization benefits.

Large language models (LLMs) increasingly adopt Mixture-of-Experts (MoE) architectures to scale model capacity while reducing computation. Fine-tuning these MoE-based LLMs often requires access to distributed and privacy-sensitive data, making centralized fine-tuning impractical. Federated learning (FL) therefore provides a paradigm to collaboratively fine-tune MoE-based LLMs, enabling each client to integrate diverse knowledge without compromising data privacy. However, the integration of MoE-based LLM fine-tuning into FL encounters two critical aggregation challenges due to inherent data heterogeneity across clients: (i) divergent local data distributions drive clients to develop distinct gating preferences for localized expert selection, causing direct parameter aggregation to produce a "one-size-fits-none" global gating network, and (ii) same-indexed experts develop disparate semantic roles across clients, leading to expert semantic blurring and the degradation of expert specialization. To address these challenges, we propose FedAlign-MoE, a federated aggregation alignment framework that jointly enforces routing consistency and expert semantic alignment. Specifically, FedAlign-MoE aggregates gating behaviors by aligning routing distributions through consistency weighting and optimizes local gating networks through distribution regularization, maintaining cross-client stability without overriding discriminative local preferences. Meanwhile, FedAlign-MoE explicitly quantifies semantic consistency among same-indexed experts across clients and selectively aggregates updates from semantically aligned clients, ensuring stable and specialized functional roles for global experts. Extensive experiments demonstrate that FedAlign-MoE outperforms state-of-the-art benchmarks, achieving faster convergence and superior accuracy in non-IID federated environments.
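The routing consistency weighting quoted in the summary, $\omega_i(e) = s_i(e)/\sum_j s_j(e)$, normalizes each client's score for an expert by the total over clients. The sketch below computes those weights and uses them for a per-expert weighted average. How $s_i(e)$ is defined is not stated here, so the scores, the scalar "gating value" per expert, and the uniform fallback for an all-zero column are illustrative assumptions.

```python
from typing import List


def consistency_weights(scores: List[List[float]]) -> List[List[float]]:
    """scores[i][e] is client i's consistency score for expert e; returns omega[i][e]."""
    num_clients, num_experts = len(scores), len(scores[0])
    totals = [sum(client[e] for client in scores) for e in range(num_experts)]
    return [
        [
            # Assumption: fall back to uniform weights if no client has any score.
            client[e] / totals[e] if totals[e] > 0 else 1.0 / num_clients
            for e in range(num_experts)
        ]
        for client in scores
    ]


def aggregate(values: List[List[float]], omega: List[List[float]]) -> List[float]:
    """Per-expert weighted average of client values using omega[i][e]."""
    num_experts = len(values[0])
    return [
        sum(omega[i][e] * values[i][e] for i in range(len(values)))
        for e in range(num_experts)
    ]


if __name__ == "__main__":
    s = [[1.0, 3.0], [3.0, 1.0]]          # hypothetical scores: 2 clients, 2 experts
    omega = consistency_weights(s)
    gate_values = [[0.2, 0.8], [0.6, 0.4]]  # hypothetical per-expert gating values
    print(omega, aggregate(gate_values, omega))
```

For each expert the weights over clients sum to one, so a client whose routing is more consistent for that expert contributes proportionally more to the aggregated value.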
cs.CL, cs.AI, cs.DL · Sai Koneru, Jian Wu, Sarah Rajtmajer · Mar 22, 2026

Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is challenging due to document length and the distribution of scientific arguments across sections. This paper proposes a two-stage retrieve-and-extract pipeline that first links an abstract finding to its corresponding hypothesis, then extracts the statistical evidence supporting that hypothesis. Through controlled ablations varying context quantity ($k \in \{5, 10, 20\}$), retrieval quality (standard RAG, reranking, fine-tuned retriever), and oracle paragraph settings, the authors demonstrate that hypothesis extraction is primarily bounded by retrieval quality, while evidence extraction faces persistent extractor limitations even with perfect paragraph selection.

Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is central to the synthesis of empirical findings, but remains difficult due to document length and the distribution of scientific arguments across sections of the paper. The work studies a sequential full-text extraction setting, where the statement of a primary finding in an article's abstract is linked to (i) a corresponding hypothesis statement in the paper body and (ii) the statistical evidence that supports or refutes that hypothesis. This formulation induces a challenging within-document retrieval setting in which many candidate paragraphs are topically related to the finding but differ in rhetorical role, creating hard negatives for retrieval and extraction. Using a two-stage retrieve-and-extract framework, we conduct a controlled study of retrieval design choices, varying context quantity, context quality (standard Retrieval Augmented Generation, reranking, and a fine-tuned retriever paired with reranking), as well as an oracle paragraph setting to separate retrieval failures from extraction limits across four Large Language Model extractors. We find that targeted context selection consistently improves hypothesis extraction relative to full-text prompting, with gains concentrated in configurations that optimize retrieval quality and context cleanliness. In contrast, statistical evidence extraction remains substantially harder. Even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.
0
cs.AI cs.CL cs.DS · Zachary F. Mainen · Mar 22, 2026

This paper formalizes the transformer context window as an I/O page and proves that tool-augmented agents with indexed external memory achieve exponentially lower retrieval cost than sequential scanning: $O(\log_b N)$ versus $\Omega(N)$ page reads per query. The author validates these predictions experimentally across three content types and identifies "parametric memory competition" as a failure mode in which models bypass the retrieval protocol for familiar content.

Externalized reasoning is already exploited by transformer-based agents through chain-of-thought, but structured retrieval -- indexing over one's own reasoning state -- remains underexplored. We formalize the transformer context window as an I/O page and prove that tool-augmented agents with indexed external memory achieve exponentially lower retrieval cost than agents restricted to sequential scanning: $O(\log_b N)$ versus $\Omega(N)$ page reads per query, and $O(T \log_b T)$ versus $\Theta(T^2)$ cumulative cost over $T$ reasoning steps -- a gap that widens as deliberation deepens. We test these predictions on a controlled lookup benchmark across three content types -- random hashes, ordered integers, and encyclopedia entries -- varying store size from 50 to 5,000 items, and replicate key conditions across two model generations (GPT-4o-mini and GPT-5.4). On abstract content, the indexed agent achieves a median of 1 page read regardless of store size, confirming the $O(1)$ prediction. Sorted pages without an index fail to close the gap: the weaker model cannot sustain binary search at scale, and the stronger model achieves near-optimal $\log_2 N$ search but still loses to the index by $5\times$. On familiar content (encyclopedia entries), a competing failure mode emerges: the model recognizes the domain, bypasses the retrieval protocol, and generates answers from parametric memory, producing catastrophic token expenditure even when the index is sound. This parametric memory competition dissociates the two cognitive operations that indexing combines: understanding content (where language models excel) and following navigational protocols (where they fail when understanding tempts them to shortcut). The result argues for a separation of concerns: use language models for index construction, where semantic understanding helps, and deterministic algorithms for index traversal, where it hurts.
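The gap between $O(\log_b N)$ and $\Omega(N)$ page reads can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's benchmark: the page size, the fanout $b = 10$, the sorted integer keys, and both function names are hypothetical, and the index is a simple static $b$-ary descent rather than whatever structure the paper's agents use.

```python
import math


def sequential_page_reads(keys, target, page_size):
    """Scan fixed-size pages in order; count pages read until the target is seen."""
    reads = 0
    for start in range(0, len(keys), page_size):
        reads += 1
        if target in keys[start:start + page_size]:
            return reads
    return reads  # target absent: every page was read


def indexed_page_reads(keys, target, fanout):
    """Descend a static b-ary index over sorted keys, one page read per level.

    Assumes `keys` is sorted and contains `target`. Each index page splits
    the current range into at most `fanout` children; the final read is the
    leaf page that holds the target.
    """
    lo, hi = 0, len(keys)
    reads = 0
    while hi - lo > fanout:
        reads += 1  # read one index page of separators
        step = math.ceil((hi - lo) / fanout)
        child = lo
        # Advance to the last child whose first key is <= target.
        while child + step < hi and keys[child + step] <= target:
            child += step
        lo, hi = child, min(child + step, hi)
    reads += 1  # read the leaf page
    return reads


keys = list(range(5000))  # largest store size mentioned in the abstract
worst_sequential = sequential_page_reads(keys, 4999, page_size=10)
worst_indexed = indexed_page_reads(keys, 4999, fanout=10)
```

With these assumed parameters the sequential scan reads every one of the 500 pages to reach the last key, while the indexed descent needs 4 reads, matching $\lceil \log_{10} 5000 \rceil = 4$.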
0
cs.SE cs.AI · Miryala Sathvika, Rudra Dhar, Karthik Vaidhyanathan · Mar 22, 2026

The paper tackles the labor-intensive challenge of creating software architecture views, which are essential for documentation but often become outdated: 75% are never updated after creation. The authors conduct a large-scale empirical study evaluating whether LLMs and agentic approaches can automate view generation from source code, testing 3 LLMs across 3 prompting strategies and 2 agentic approaches on 340 repositories. This matters because, as systems grow more complex, automated view generation could bridge the gap between implementation and architectural documentation and reduce the manual burden that leads to outdated artifacts.

Architecture views are essential for software architecture documentation, yet their manual creation is labor-intensive and often leads to outdated artifacts. As systems grow in complexity, the automated generation of views from source code becomes increasingly valuable. Goal: We empirically evaluate the ability of LLMs and agentic approaches to generate architecture views from source code. Method: We analyze 340 open-source repositories across 13 experimental configurations using 3 LLMs with 3 prompting techniques and 2 agentic approaches, yielding 4,137 generated views. We evaluate the generated views by comparing them with the ground truth using automated metrics complemented by human evaluations. Results: Prompting strategies offer marginal improvements. Few-shot prompting reduces clarity failures by 9.2% compared to zero-shot baselines. The custom agentic approach consistently outperforms the general-purpose agent, achieving the best clarity (22.6% failure rate) and level-of-detail success (50%). Conclusions: LLM and agentic approaches demonstrate capabilities in generating syntactically valid architecture views. However, they consistently exhibit granularity mismatches, operating at the code level rather than at the level of architectural abstractions. This suggests that there is still a need for human expertise, positioning LLMs and agents as assistive tools rather than autonomous architects.
0
physics.geo-ph cs.AI · Feng Liu, Jian Xu, Xin Cui et al. · Mar 22, 2026

TRACE is a multi-agent LLM system designed to automate end-to-end seismological analysis, from raw waveform processing to physical mechanism inference. The framework addresses the longstanding bottleneck of expert-dependent interpretation in seismology by orchestrating modules for catalog construction, statistical analysis, and cross-perspective reasoning, demonstrated on two distinct tectonic environments: the 2019 Ridgecrest earthquake sequence and the 2025 Santorini-Kolumbo volcanic crisis.

Inferring the physical mechanisms that govern earthquake sequences from indirect geophysical observations remains difficult, particularly across tectonically distinct environments where similar seismic patterns can reflect different underlying processes. Current interpretations rely heavily on the expert synthesis of catalogs, spatiotemporal statistics, and candidate physical models, limiting reproducibility and the systematic transfer of insight across settings. Here we present TRACE (Trans-perspective Reasoning and Automated Comprehensive Evaluator), a multi-agent system that combines large language model planning with formal seismological constraints to derive auditable, physically grounded mechanistic inference from raw observations. Applied to the 2019 Ridgecrest sequence, TRACE autonomously identifies stress-perturbation-induced delayed triggering, resolving the cascading interaction between the Mw 6.4 and Mw 7.1 mainshocks; in the Santorini-Kolumbo case, the system identifies a structurally guided intrusion model, distinguishing fault-channeled episodic migration from the continuous propagation expected in homogeneous crustal failure. By providing a generalizable logical infrastructure for interpreting heterogeneous seismic phenomena, TRACE advances the field from expert-dependent analysis toward knowledge-guided autonomous discovery in Earth sciences.
0
cs.SE cs.AI · Octavian Untila · Mar 22, 2026

This paper reports that an autonomous AI ecosystem (SUBSTRATE S3) independently proposed Z3 SMT-based formal verification across six distinct domains (ranging from LLM-generated code to tool APIs to hardware assembly) without being explicitly instructed to do so. The author treats this convergence as evidence that formal verification "emerges" as a fundamental property of AI systems reasoning about safety, and then presents substrate-guard, a unified Python framework implementing Z3 verification across five AI output classes. The claim matters because, if true, it would suggest that AI systems naturally recognize the limitations of empirical testing and converge on mathematical proof as a safety mechanism.

An autonomous AI ecosystem (SUBSTRATE S3), generating product specifications without explicit instructions about formal methods, independently proposed the use of the Z3 SMT solver across six distinct domains of AI safety: verification of LLM-generated code, tool API safety for AI agents, post-distillation reasoning correctness, CLI command validation, hardware assembly verification, and smart contract safety. These convergent discoveries, occurring across 8 products over 13 days with Jaccard similarity below 15% between variants, suggest that formal verification is not merely a useful technique for AI safety but an emergent property of any sufficiently complex system reasoning about its own safety. We propose a unified framework (substrate-guard) that applies Z3-based verification across all six output classes through a common API, and evaluate it on 181 test cases across five implemented domains, achieving 100% classification accuracy with zero false positives and zero false negatives. Our framework detected real bugs that empirical testing would miss, including an INT_MIN overflow in branchless RISC-V assembly, and mathematically proved that unconstrained string parameters in tool APIs are formally unverifiable.
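The abstract does not say which routine contained the INT_MIN overflow. As an assumed illustration of this class of bug, the sketch below emulates a classic branchless absolute-value sequence on 32-bit signed integers in plain Python. It does not use Z3 or substrate-guard, and the function names are hypothetical.

```python
INT_MIN = -2**31
INT_MAX = 2**31 - 1


def to_int32(x):
    """Wrap a Python int to a signed 32-bit value, as a 32-bit register would."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def branchless_abs32(x):
    """Branchless |x| for 32-bit x: mask = x >> 31, result = (x ^ mask) - mask.

    Python's >> on negative ints is an arithmetic shift, so mask is -1 for
    negative inputs and 0 otherwise, mirroring an arithmetic right shift in
    assembly (an assumption about the original code).
    """
    mask = x >> 31
    return to_int32((x ^ mask) - mask)


# Ordinary inputs behave as expected.
ok_small = branchless_abs32(-5)
ok_max = branchless_abs32(INT_MAX)

# The single failing input out of 2**32: +2**31 is not representable in
# 32 bits, so the result wraps back around to INT_MIN.
edge = branchless_abs32(INT_MIN)
```

Only one of the $2^{32}$ possible inputs misbehaves, so randomly sampled tests are very unlikely to hit it, whereas a solver that reasons over all 32-bit values would be expected to report it as a counterexample to the property "the result is non-negative".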