Effective Strategies for Asynchronous Software Engineering Agents

cs.CL cs.AI Jiayi Geng, Graham Neubig · Mar 23, 2026
AI Review
Plain-language introduction

CAID tackles long-horizon software engineering tasks where single agents struggle with accuracy and wall-clock time. The core idea is Centralized Asynchronous Isolated Delegation: a manager decomposes tasks into dependency graphs and delegates to multiple engineer agents working in isolated git worktrees, integrating progress via branch-and-merge. The system improves accuracy by 26.7% absolute on PaperBench and 14.3% on Commit0, demonstrating that structured coordination grounded in SWE primitives outperforms simply scaling single-agent iteration budgets.
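
The delegation loop described above can be sketched with the Python standard library. This is a minimal illustration, not the authors' implementation: the task names, the `engineer` stand-in, and the `max_engineers` parameter are all hypothetical, and the real system runs LLM agents in git worktrees rather than sleeping coroutines. It shows the two properties the summary emphasizes: tasks whose dependencies are satisfied run concurrently, while integration happens one result at a time.

```python
import asyncio
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
TASKS = {
    "tensor_ops": set(),
    "autodiff": {"tensor_ops"},
    "optim": {"autodiff"},
    "docs": set(),
}


async def engineer(task: str) -> str:
    """Stand-in for an engineer agent working in its own isolated workspace."""
    await asyncio.sleep(0.01)  # placeholder for real agent work
    return task


async def manager(tasks: dict[str, set[str]], max_engineers: int = 2) -> list[str]:
    """Delegate ready tasks concurrently; integrate finished work one at a time."""
    sorter = TopologicalSorter(tasks)
    sorter.prepare()  # also raises CycleError if the graph has a cycle
    slots = asyncio.Semaphore(max_engineers)
    merged: list[str] = []
    running: set[asyncio.Task] = set()

    async def run(task: str) -> str:
        async with slots:  # cap the number of engineers working at once
            return await engineer(task)

    while sorter.is_active():
        # Launch every task whose dependencies have already been integrated.
        for task in sorter.get_ready():
            running.add(asyncio.create_task(run(task)))
        done, running = await asyncio.wait(
            running, return_when=asyncio.FIRST_COMPLETED
        )
        for finished in done:
            name = finished.result()
            merged.append(name)  # a real system would merge and run tests here
            sorter.done(name)    # unblocks tasks that depend on `name`
    return merged


if __name__ == "__main__":
    print(asyncio.run(manager(TASKS)))
```

The order of independent tasks in the result may vary between runs, but a task never appears before the tasks it depends on.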

Critical review
Verdict
Bottom line

CAID presents a compelling case for branch-and-merge coordination as a foundational primitive for multi-agent software engineering. The paper demonstrates consistent gains across three LLMs and two benchmarks, with particularly strong improvements for weaker models (MiniMax 2.5 jumps from 10.4% to 36.7% on PaperBench). The central insight—that physical isolation via git worktrees prevents the interference failures that plague unstructured multi-agent systems—is well-supported by ablation studies comparing 'soft' versus worktree isolation. However, the gains come at significant cost: multi-agent execution increases API costs substantially without reducing wall-clock time proportionally, since integration remains sequential and test-gated. The paper is honest about these trade-offs in Section 6, though the strategic advice to 'adopt coordinated multi-agent execution from the outset' rather than using single-agent-first fallbacks ignores scenarios where task complexity is initially unknown.

“On PaperBench, we observe that multi-agent coordination yields large gains for weaker single-agent runs: MiniMax 2.5 reaches 36.7% under multi-agent execution, while its single-agent score is only 10.4%”
paper · Section 3.6
“On PaperBench, the pattern differs. Soft isolation drops to 55.5%, below the single-agent score of 57.2%, while 'worktree isolation' reaches 63.3%”
paper · Section 4.1
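
The scores quoted above are easier to compare as percentage-point differences. The short calculation below uses only the numbers in the two quotes; the variable names are illustrative.

```python
# Scores (percent) quoted from Sections 3.6 and 4.1 of the paper.
minimax_single, minimax_multi = 10.4, 36.7
pb_single, pb_soft, pb_worktree = 57.2, 55.5, 63.3


def delta(new: float, old: float) -> float:
    """Absolute change in percentage points, rounded to one decimal place."""
    return round(new - old, 1)


print(delta(minimax_multi, minimax_single))  # MiniMax 2.5: multi-agent vs single-agent
print(delta(pb_soft, pb_single))             # soft isolation vs single-agent
print(delta(pb_worktree, pb_single))         # worktree isolation vs single-agent
```

This gives +26.3 points for MiniMax 2.5 under multi-agent execution, -1.7 points for soft isolation relative to the single-agent score, and +6.1 points for worktree isolation.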
What holds up

The worktree isolation ablation is the strongest empirical result: physical separation via git worktrees is necessary for open-ended tasks, while 'soft' managerial separation suffices only when dependencies are explicit (Commit0) but fails on PaperBench. The analysis of iteration scaling in Figure 2 is rigorous—showing that doubling single-agent iterations from 100 to 200 yields marginal or negative gains ('Δ from 100 to 200 iterations remains small for GLM 4.7 and MiniMax 2.5, and becomes negative for Claude Sonnet 4.5'), whereas CAID achieves substantially larger improvements. Figure 4's trajectory analysis compellingly illustrates that delegation quality matters more than raw module coverage: runs assigning critical files like autodiff.py succeed, while those missing them fail regardless of activity elsewhere. The use of established SWE primitives (git merge, dependency graphs, asyncio) provides a principled foundation rather than ad-hoc coordination.

“doubling the iteration limit yields only marginal improvements in the final performance and, in some cases, even degraded results”
paper · Section 3.7
“the performance difference between CAID Run 1 (8.7% pass rate) and CAID Run 2 (34.3%) is not simply due to the number of modules implemented, but to which modules are assigned”
paper · Section 4.3
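
To make the worktree isolation concrete, the sketch below builds the git commands one engineer's workspace might involve. It is an assumption-laden illustration rather than the paper's code: the branch naming scheme, the sibling-directory layout, and the `worktree_commands` helper are hypothetical. Only the git subcommands themselves (`git worktree add -b`, `git merge --no-ff`, `git worktree remove`) are standard. The function returns argument lists instead of running them, so it can be inspected safely.

```python
from pathlib import Path


def worktree_commands(
    repo: Path, engineer_id: str, base: str = "main"
) -> dict[str, list[str]]:
    """Build (but do not run) the git commands for one isolated workspace."""
    branch = f"engineer/{engineer_id}"                    # hypothetical naming scheme
    workdir = repo.parent / f"{repo.name}-{engineer_id}"  # hypothetical layout
    git = ["git", "-C", str(repo)]
    return {
        # New branch checked out in its own directory, starting from `base`.
        "create": git + ["worktree", "add", "-b", branch, str(workdir), base],
        # Integrate the engineer's branch into whatever is checked out in `repo`.
        "merge": git + ["merge", "--no-ff", branch],
        # Delete the working directory once the branch has been integrated.
        "cleanup": git + ["worktree", "remove", str(workdir)],
    }


if __name__ == "__main__":
    for step, cmd in worktree_commands(Path("/tmp/repo"), "e1").items():
        print(step, " ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Because every engineer edits files in a separate directory on a separate branch, concurrent edits cannot overwrite each other before the merge step.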
Main concerns

The cost-performance trade-off is stark and potentially limiting: CAID 'introduces non-trivial coordination overhead' with 'higher API cost than single-agent baselines' and wall-clock runtime that 'is not substantially reduced despite parallel execution' because integration remains sequential. The central manager relies on 'prompt engineering heuristics rather than learned delegation policies,' which creates a bottleneck: when delegation misidentifies critical dependencies (Figure 4, Run 1), the system fails despite available compute. The treatment of the 'Single-Agent + CAID' sequential strategy (Section 3.6) is something of a strawman: showing that running both in sequence is inefficient does not rule out adaptive strategies that attempt a single agent first on simple tasks. Most critically, the evaluation focuses exclusively on SWE tasks with executable test suites; as noted in Section 6, 'not all long-horizon shared-artifact tasks possess such clearly defined boundaries or objective verification mechanisms,' limiting generalization to domains like document synthesis or research planning.

“multi-agent execution consistently incurs higher API cost than single-agent baselines, and wall-clock runtime is not substantially reduced despite parallel execution”
paper · Section 6
“task assignment relies primarily on prompt engineering heuristics rather than learned delegation policies”
paper · Section 6
“Extending CAID to non-coding domains—such as document synthesis, research planning, or multimodal artifact construction—will require adapting isolation mechanisms”
paper · Section 6
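
One way to see why sequential integration caps the wall-clock benefit is Amdahl's law, used here as an analogy rather than as the paper's own analysis. If only a fraction p of the work can be parallelized across n engineers, the speedup is 1 / ((1 - p) + p / n). The fractions below are illustrative; the paper does not report such a breakdown.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup when only part of the work scales with workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative fractions only, not measurements from the paper.
    for fraction in (0.5, 0.8, 0.95):
        speedups = [round(amdahl_speedup(fraction, n), 2) for n in (2, 4, 8)]
        print(fraction, speedups)
```

With half the work serialized in planning, merging, and test gating, no number of engineers can deliver even a 2x speedup, which is consistent with the review's point that parallel execution alone does not shorten runtime much.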
Evidence and comparison

The evidence supports the core claim that branch-and-merge coordination improves over single-agent baselines within the same OpenHands framework (v1.11.0). The controlled comparison is methodologically sound—holding constant the agent substrate while varying only coordination mechanisms. However, the paper lacks direct comparison to other multi-agent systems like MetaGPT or ChatDev, citing them as addressing different aspects (role-based pipelines versus execution isolation). The evaluation uses PaperBench Code-Dev, a 'more lightweight variant' of full PaperBench, which limits claims about full paper replication. The judge model (GPT-5-mini) for PaperBench evaluation introduces potential circularity concerns given OpenAI's involvement in that benchmark, though the Code-Dev variant's rubric-based grading mitigates this somewhat. The commit0-lite subset is standard for leaderboard comparisons, but full results are relegated to appendices.

“We build CAID using the open-source OpenHands agent SDK (Wang et al., 2024, 2025b) (v1.11.0)”
paper · Section 3.4
“We further release a variant of PaperBench called PaperBench Code-Dev for more lightweight evaluation”
Starace et al., 2025 · PaperBench abstract
Reproducibility

Reproducibility is strong: the code is available at https://github.com/JiayiGeng/async-swe-agents, built on the established OpenHands SDK. Hyperparameters are clearly specified: single-agent runs use max_iterations=100, while multi-agent runs use max_iterations=50 for the manager and 80 for engineers, with 22 implementation rounds total. The paper evaluates on three models (Claude Sonnet 4.5, GLM 4.7, MiniMax 2.5), providing coverage of both closed and open-source options. Full per-repository results are included in Appendix B (Tables 4-7). The git worktree mechanism is deterministic and well-documented in Table 1. However, the prompt engineering details (Appendix A) are only summarized without full verbatim prompts in the main text, and the dependency graph construction relies on manager LLM inferences that may vary across runs. The use of LLMSummarizingCondenser for context compression introduces some non-determinism not quantified in the evaluation.

“max_iterations=100 on both Commit0 and PaperBench. For multi-agent runs, we set max_iterations=50 for the central manager and max_iterations=80 for each software-engineer agent”
paper · Section 3.4
“We use LLMSummarizingCondenser to periodically summarize prior interaction rounds”
paper · Section 2.4
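
For anyone re-running the experiments, the iteration budgets quoted above could be collected in a small configuration object. The values come from Section 3.4 as quoted; the class name, field names, and the `multi_agent_budget` helper are hypothetical and are not part of the OpenHands SDK. The helper also assumes each engineer is invoked once, which may not match the paper's round structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Iteration budgets quoted from Section 3.4; field names are hypothetical."""

    single_agent_max_iterations: int = 100
    manager_max_iterations: int = 50
    engineer_max_iterations: int = 80

    def multi_agent_budget(self, engineers: int) -> int:
        """Total iterations for one manager plus `engineers` single invocations.

        Assumes each engineer runs once; repeated rounds would raise the total.
        """
        if engineers < 0:
            raise ValueError("engineers must be non-negative")
        return self.manager_max_iterations + engineers * self.engineer_max_iterations


if __name__ == "__main__":
    config = RunConfig()
    print(config.multi_agent_budget(engineers=2))
```

Under these assumptions, two engineers already exceed the doubled single-agent budget of 200 iterations, which is one reason cost comparisons between the two setups need care.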
Abstract

AI agents have become increasingly capable at isolated software engineering (SWE) tasks such as resolving issues on GitHub. Yet long-horizon tasks involving multiple interdependent subtasks still pose challenges both with respect to accuracy, and with respect to timely completion. A natural approach to solving these long-horizon tasks in a timely manner is asynchronous multi-agent collaboration, where multiple agents work on different parts of the task at the same time. But effective application of multi-agent systems has proven surprisingly difficult: concurrent edits by multiple agents interfere with each other, dependencies are difficult to synchronize, and combining partial progress into a coherent whole is challenging. On the other hand, human developers have long relied on mature collaboration infrastructure to manage these challenges in large software projects. Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous execution, and isolated workspaces. CAID constructs dependency-aware task plans through a central manager, executes subtasks concurrently in isolated workspaces, and consolidates progress via structured integration with executable test-based verification. In empirical evaluation, we find that CAID improves accuracy over single-agent baselines by 26.7% absolute on paper reproduction tasks (PaperBench) and 14.3% on Python library development tasks (Commit0). Through systematic analysis, we find that branch-and-merge is a central coordination mechanism for multi-agent collaboration, and that SWE primitives such as git worktree, git commit, and git merge enable it to be realized in a reliable and executable manner.
