Feed - arxlens

0

cs.SE cs.LG Idan Amit · Mar 22, 2026

This paper investigates which static analysis alert removals actually reduce bug rates—a critical question since developers constantly face noisy linting warnings. The author employs three complementary methods: a randomized controlled trial with 521 manual interventions, labeling functions to identify intervention-like events in 8,245 natural commits, and supervised learning to predict beneficial removals. The core finding is that removing complexity alerts (too-many-branches, too-many-nested-blocks) via method extraction reduces bug tendency by 4.1–5.5 percentage points, offering evidence-based guidance for prioritizing refactoring efforts.

Context: Static analysis captures software engineering knowledge and alerts on possibly problematic patterns. Previous work showed that they indeed have predictive power for various problems. However, the impact of removing the alerts is unclear. Aim: We would like to evaluate the impact of alert removals on code complexity and the tendency to bugs. Method: We evaluate the impact of removing alerts using three complementary methods. 1. We conducted a randomized controlled trial and built a dataset of 521 manual alert-removing interventions 2. We profiled intervention-like events using labeling functions. We applied these labeling functions to code commits, found intervention-like natural events, and used them to analyze the impact on the tendency to bugs. 3. We built a dataset of 8,245 alert removals, more than 15 times larger than our dataset of manual interventions. We applied supervised learning to the alert removals, aiming to predict their impact on the tendency to bugs. Results: We identified complexity-reducing interventions that reduce the probability of future bugs. Such interventions are relevant to 33\% of Python files and might reduce the tendency to bugs by 5.5 percentage points. Conclusions: We presented methods to evaluate the impact of interventions. The methods can identify a large number of natural interventions that are highly needed in causality research in many domains.

Read abstractHide abstract

0

Do Papers Match Code? A Benchmark and Framework for Paper-Code Consistency Detection in Bioinformatics Software

cs.LG cs.SE Tianxiang Xu, Xiaoyan Zhu, Xin Lai et al. · Mar 23, 2026

This paper addresses paper-code consistency detection in bioinformatics, tackling the reproducibility crisis where algorithmic descriptions in publications often diverge from software implementations. The authors introduce BioCon, a benchmark of 48 bioinformatics projects with expert-annotated sentence-code pairs, and propose a cross-modal framework using UniXcoder with weighted focal loss. While the task is important for computational biology reproducibility, claims of novelty require qualification given concurrent efforts in the broader scientific community.

Ensuring consistency between research papers and their corresponding software implementations is fundamental to software reliability and scientific reproducibility. However, this problem remains underexplored, particularly in the domain of bioinformatics, where discrepancies between methodological descriptions in papers and their actual code implementations are prevalent. To address this gap, this paper introduces a new task, namely paper-code consistency detection, and curates a collection of 48 bioinformatics software projects along with their associated publications. We systematically align sentence-level algorithmic descriptions from papers with function-level code snippets. Combined with expert annotations and a hybrid negative sampling strategy, we construct the first benchmark dataset in the bioinformatics domain tailored to this task, termed BioCon. Based on this benchmark, we further propose a cross-modal consistency detection framework designed to model the semantic relationships between natural language descriptions and code implementations. The framework adopts a unified input representation and leverages pre-trained models to capture deep semantic alignment between papers and code. To mitigate the effects of class imbalance and hard samples, we incorporate a weighted focal loss to enhance model robustness. Experimental results demonstrate that our framework effectively identifies consistency between papers and code in bioinformatics, achieving an accuracy of 0.9056 and an F1 score of 0.8011. Overall, this study opens a new research direction for paper-code consistency analysis and lays the foundation for automated reproducibility assessment and cross-modal understanding in scientific software.

Read abstractHide abstract

0

Reasoning Provenance for Autonomous AI Agents: Structured Behavioral Analytics Beyond State Checkpoints and Execution Traces

cs.AI cs.DC cs.SE Neelmani Vispute · Mar 23, 2026

As AI agents move from human-supervised copilots to fully autonomous infrastructure, organizations face a critical observability gap: existing systems capture computational state and execution traces but lack structured records of the agent's reasoning. This paper introduces the Agent Execution Record (AER), a schema-level primitive that captures intent, observation, and inference as first-class queryable fields at execution time. The core claim is that reasoning provenance cannot be faithfully reconstructed from state checkpoints due to fundamental non-identifiability (intent multiplicity, observation ambiguity, inference volatility). If validated, AERs would enable population-level behavioral analytics—systematic comparison of reasoning patterns across thousands of investigations, confidence calibration against expert judgments, and counterfactual regression testing via mock replay—that existing tooling achieves only through fragile post-hoc extraction.

As AI agents transition from human-supervised copilots to autonomous platform infrastructure, the ability to analyze their reasoning behavior across populations of investigations becomes a pressing infrastructure requirement. Existing operational tooling addresses adjacent needs effectively: state checkpoint systems enable fault tolerance; observability platforms provide execution traces for debugging; telemetry standards ensure interoperability. What current systems do not natively provide as a first-class, schema-level primitive is structured reasoning provenance -- normalized, queryable records of why the agent chose each action, what it concluded from each observation, how each conclusion shaped its strategy, and which evidence supports its final verdict. This paper introduces the Agent Execution Record (AER), a structured reasoning provenance primitive that captures intent, observation, and inference as first-class queryable fields on every step, alongside versioned plans with revision rationale, evidence chains, structured verdicts with confidence scores, and delegation authority chains. We formalize the distinction between computational state persistence and reasoning provenance, argue that the latter cannot in general be faithfully reconstructed from the former, and show how AERs enable population-level behavioral analytics: reasoning pattern mining, confidence calibration, cross-agent comparison, and counterfactual regression testing via mock replay. We present a domain-agnostic model with extensible domain profiles, a reference implementation and SDK, and outline an evaluation methodology informed by preliminary deployment on a production platformized root cause analysis agent.

Read abstractHide abstract

0

LLM-Based Test Case Generation in DBMS through Monte Carlo Tree Search

cs.SE cs.AI Yujia Chen, Yingli Zhou, Fangyuan Zhang et al. · Mar 23, 2026

MIST addresses the challenge of generating high-quality SQL test cases for Database Management Systems using lightweight Large Language Models. The framework combines a feature-guided synthesis stage that leverages hierarchical documentation structures with error feedback, and a Monte Carlo Tree Search-based mutation stage to overcome coverage plateaus. This two-pronged approach aims to achieve high code coverage in resource-constrained industrial environments where only small LLMs can be deployed locally.

Database Management Systems (DBMSs) are fundamental infrastructure for modern data-driven applications, where thorough testing with high-quality SQL test cases is essential for ensuring system reliability. Traditional approaches such as fuzzing can be effective for specific DBMSs, but adapting them to different proprietary dialects requires substantial manual effort. Large Language Models (LLMs) present promising opportunities for automated SQL test generation, but face critical challenges in industrial environments. First, lightweight models are widely used in organizations due to security and privacy constraints, but they struggle to generate syntactically valid queries for proprietary SQL dialects. Second, LLM-generated queries are often semantically similar and exercise only shallow execution paths, thereby quickly reaching a coverage plateau. To address these challenges, we propose MIST, an LLM-based test case generatIon framework for DBMS through Monte Carlo Tree search. MIST consists of two stages: Feature-Guided Error-Driven Test Case Synthetization, which constructs a hierarchical feature tree and uses error feedback to guide LLM generation, aiming to produce syntactically valid and semantically diverse queries for different DBMS dialects, and Monte Carlo Tree Search-Based Test Case Mutation, which jointly optimizes seed query selection and mutation rule application guided by coverage feedback, aiming at boosting code coverage by exploring deeper execution paths. Experiments on three widely-used DBMSs with four lightweight LLMs show that MIST achieves average improvements of 43.3% in line coverage, 32.3% in function coverage, and 46.4% in branch coverage compared to the baseline approach with the highest line coverage of 69.3% in the Optimizer module.

Read abstractHide abstract

0

Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation

cs.SE cs.AI Lingzhe Zhang, Tong Jia, Mingyu Wang et al. · Mar 23, 2026

This paper addresses the challenge of efficient failure management in LLM-based Multi-Agent Systems (MASs). Existing approaches rely on expensive per-trace reasoning with large judge LLMs, which is slow and unstable. The core contribution is EAGER, a framework that uses unsupervised reasoning-scoped contrastive learning to encode intra-agent and inter-agent dynamics into embeddings, enabling real-time step-wise failure detection and reflexive mitigation guided by historical patterns rather than costly LLM inference.

Large Language Models (LLM)-based Multi-Agent Systems (MASs) have emerged as a new paradigm in software system design, increasingly demonstrating strong reasoning and collaboration capabilities. As these systems become more complex and autonomous, effective failure management is essential to ensure reliability and availability. However, existing approaches often rely on per-trace reasoning, which leads to low efficiency, and neglect historical failure patterns, limiting diagnostic accuracy. In this paper, we conduct a preliminary empirical study to demonstrate the necessity, potential, and challenges of leveraging historical failure patterns to enhance failure management in MASs. Building on this insight, we propose \textbf{EAGER}, an efficient failure management framework for multi-agent systems based on reasoning trace representation. EAGER employs unsupervised reasoning-scoped contrastive learning to encode both intra-agent reasoning and inter-agent coordination, enabling real-time step-wise failure detection, diagnosis, and reflexive mitigation guided by historical failure knowledge. Preliminary evaluations on three open-source MASs demonstrate the effectiveness of EAGER and highlight promising directions for future research in reliable multi-agent system operations.

Read abstractHide abstract

0

LLM-Powered Workflow Optimization for Multidisciplinary Software Development: An Automotive Industry Case Study

cs.SE cs.AI Shuai Wang, Yinan Yu, Earl Barr et al. · Mar 22, 2026

This paper tackles the persistent bottleneck of Multidisciplinary Software Development (MSD), where domain experts and software developers must manually coordinate across heterogeneous artifacts and incompatible formalisms. The authors model MSD workflows as a directed dependency graph $\mathcal{G}=(\mathcal{V},\mathcal{R})$ and propose an iterative optimization framework that replaces manual translation nodes with LLM-powered services. This matters because their approach reduces per-API development time from approximately 5 hours to under 7 minutes while maintaining production-quality code, demonstrating that workflow-level automation—not just coding assistance—can unlock substantial efficiency gains in industrial settings.

Multidisciplinary Software Development (MSD) requires domain experts and developers to collaborate across incompatible formalisms and separate artifact sets. Today, even with AI coding assistants like GitHub Copilot, this process remains inefficient; individual coding tasks are semi-automated, but the workflow connecting domain knowledge to implementation is not. Developers and experts still lack a shared view, resulting in repeated coordination, clarification rounds, and error-prone handoffs. We address this gap through a graph-based workflow optimization approach that progressively replaces manual coordination with LLM-powered services, enabling incremental adoption without disrupting established practices. We evaluate our approach on \texttt{spapi}, a production in-vehicle API system at Volvo Group involving 192 endpoints, 420 properties, and 776 CAN signals across six functional domains. The automated workflow achieves 93.7\% F1 score while reducing per-API development time from approximately 5 hours to under 7 minutes, saving an estimated 979 engineering hours. In production, the system received high satisfaction from both domain experts and developers, with all participants reporting full satisfaction with communication efficiency.

Read abstractHide abstract

0

RuntimeSlicer: Towards Generalizable Unified Runtime State Representation for Failure Management

cs.SE cs.AI Lingzhe Zhang, Tong Jia, Weijie Hong et al. · Mar 23, 2026

Modern failure management pipelines tightly couple task-specific models with modality-specific encoders, blocking reuse across systems. RuntimeSlicer proposes a unified runtime state representation that encodes metrics, traces, and logs into a single embedding via Unified Runtime Contrastive Learning, then adapts to downstream tasks through State-Aware Task-Oriented Tuning. The core value is decoupling representation learning from failure management tasks—if it generalizes, teams could freeze the embedding backbone and ship lightweight task heads.

Modern software systems operate at unprecedented scale and complexity, where effective failure management is critical yet increasingly challenging. Metrics, traces, and logs provide complementary views of system runtime behavior, but existing failure management approaches typically rely on task-oriented pipelines that tightly couple modality-specific preprocessing, representation learning, and downstream models, resulting in limited generalization across tasks and systems. To fill this gap, we propose RuntimeSlicer, a unified runtime state representation model towards generalizable failure management. RuntimeSlicer pre-trains a task-agnostic representation model that directly encodes metrics, traces, and logs into a single, aligned system-state embedding capturing the holistic runtime condition of the system. To train RuntimeSlicer, we introduce Unified Runtime Contrastive Learning, which integrates heterogeneous training data sources and optimizes complementary objectives for cross-modality alignment and temporal consistency. Building upon the learned system-state embeddings, we further propose State-Aware Task-Oriented Tuning, which performs unsupervised partitioning of runtime states and enables state-conditioned adaptation for downstream tasks. This design allows lightweight task-oriented models to be trained on top of the unified embedding without redesigning modality-specific encoders or preprocessing pipelines. Preliminary experiments on the AIOps 2022 dataset demonstrate the feasibility and effectiveness of RuntimeSlicer for system state modeling and failure management tasks.

Read abstractHide abstract

0

DomAgent: Leveraging Knowledge Graphs and Case-Based Reasoning for Domain-Specific Code Generation

cs.AI cs.SE Shuai Wang, Dhasarathy Parthasarathy, Robert Feldt et al. · Mar 22, 2026

DomAgent addresses the challenge of generating code for specialized domains like truck control systems or data science libraries, where generic LLMs often fail due to lack of domain knowledge. The system combines structured knowledge graphs (top-down reasoning) with case-based retrieval (bottom-up learning) through a novel DomRetriever module that iteratively refines context via LLM-based review. Experiments on both the DS-1000 benchmark and a real-world truck software dataset demonstrate substantial improvements, enabling small 7B-8B parameter models to approach or exceed the performance of proprietary systems like GPT-4o.

Large language models (LLMs) have shown impressive capabilities in code generation. However, because most LLMs are trained on public domain corpora, directly applying them to real-world software development often yields low success rates, as these scenarios frequently require domain-specific knowledge. In particular, domain-specific tasks usually demand highly specialized solutions, which are often underrepresented or entirely absent in the training data of generic LLMs. To address this challenge, we propose DomAgent, an autonomous coding agent that bridges this gap by enabling LLMs to generate domain-adapted code through structured reasoning and targeted retrieval. A core component of DomAgent is DomRetriever, a novel retrieval module that emulates how humans learn domain-specific knowledge, by combining conceptual understanding with experiential examples. It dynamically integrates top-down knowledge-graph reasoning with bottom-up case-based reasoning, enabling iterative retrieval and synthesis of structured knowledge and representative cases to ensure contextual relevance and broad task coverage. DomRetriever can operate as part of DomAgent or independently with any LLM for flexible domain adaptation. We evaluate DomAgent on an open benchmark dataset in the data science domain (DS-1000) and further apply it to real-world truck software development tasks. Experimental results show that DomAgent significantly enhances domain-specific code generation, enabling small open-source models to close much of the performance gap with large proprietary LLMs in complex, real-world applications. The code is available at: https://github.com/Wangshuaiia/DomAgent.

Read abstractHide abstract

0

LLM-based Automated Architecture View Generation: Where Are We Now?

cs.SE cs.AI Miryala Sathvika, Rudra Dhar, Karthik Vaidhyanathan · Mar 22, 2026

The paper tackles the labor-intensive challenge of creating software architecture views, which are essential for documentation but often become outdated—75\% are never updated after creation. The authors conduct a large-scale empirical study evaluating whether LLMs and agentic approaches can automate view generation from source code, testing 3 LLMs across 3 prompting strategies and 2 agentic approaches on 340 repositories. This matters because as systems grow complex, automated view generation could bridge the gap between implementation and architectural documentation, potentially alleviating the manual burden that leads to outdated artifacts.

Architecture views are essential for software architecture documentation, yet their manual creation is labor intensive and often leads to outdated artifacts. As systems grow in complexity, the automated generation of views from source code becomes increasingly valuable. Goal: We empirically evaluate the ability of LLMs and agentic approaches to generate architecture views from source code. Method: We analyze 340 open-source repositories across 13 experimental configurations using 3 LLMs with 3 prompting techniques and 2 agentic approaches, yielding 4,137 generated views. We evaluate the generated views by comparing them with the ground-truth using a combination of automated metrics complemented by human evaluations. Results: Prompting strategies offer marginal improvements. Few-shot prompting reduces clarity failures by 9.2% compared to zero-shot baselines. The custom agentic approach consistently outperforms the general-purpose agent, achieving the best clarity (22.6% failure rate) and level-of-detail success (50%). Conclusions: LLM and agentic approaches demonstrate capabilities in generating syntactically valid architecture views. However, they consistently exhibit granularity mismatches, operating at the code level rather than architectural abstractions. This suggests that there is still a need for human expertise, positioning LLMs and agents as assistive tools rather than autonomous architects.

Read abstractHide abstract

0

Emergent Formal Verification: How an Autonomous AI Ecosystem Independently Discovered SMT-Based Safety Across Six Domains

cs.SE cs.AI Octavian Untila · Mar 22, 2026

This paper reports that an autonomous AI ecosystem (SUBSTRATE S3) independently discovered the need for Z3 SMT-based formal verification across six distinct domains—ranging from LLM code to tool APIs to hardware assembly—without being explicitly instructed to do so. The authors treat this convergence as evidence that formal verification "emerges" as a fundamental property of AI systems reasoning about safety. They then present substrate-guard, a unified Python framework implementing Z3 verification across five AI output classes. The claim matters because if true, it would suggest AI systems naturally recognize the limitations of empirical testing and converge on mathematical proof as a safety mechanism.

An autonomous AI ecosystem (SUBSTRATE S3), generating product specifications without explicit instructions about formal methods, independently proposed the use of Z3 SMT solver across six distinct domains of AI safety: verification of LLM-generated code, tool API safety for AI agents, post-distillation reasoning correctness, CLI command validation, hardware assembly verification, and smart contract safety. These convergent discoveries, occurring across 8 products over 13 days with Jaccard similarity below 15% between variants, suggest that formal verification is not merely a useful technique for AI safety but an emergent property of any sufficiently complex system reasoning about its own safety. We propose a unified framework (substrate-guard) that applies Z3-based verification across all six output classes through a common API, and evaluate it on 181 test cases across five implemented domains, achieving 100% classification accuracy with zero false positives and zero false negatives. Our framework detected real bugs that empirical testing would miss, including an INT_MIN overflow in branchless RISC-V assembly and mathematically proved that unconstrained string parameters in tool APIs are formally unverifiable.

Read abstractHide abstract

Nothing here yet