LLM-based Automated Architecture View Generation: Where Are We Now?
The paper tackles the labor-intensive task of creating software architecture views, which are essential for documentation but often become outdated: 75% are never updated after creation. The authors conduct a large-scale empirical study of whether LLMs and agentic approaches can automate view generation from source code, testing 3 LLMs with 3 prompting strategies, plus 2 agentic approaches, on 340 repositories. This matters because, as systems grow more complex, automated view generation could bridge the gap between implementation and architectural documentation and ease the manual burden that leads to outdated artifacts.
The study presents a rigorous large-scale evaluation demonstrating that while LLMs can produce syntactically valid architecture views, substantial quality gaps persist, particularly in completeness and consistency, where failure rates exceed 70% even for the best approaches. The custom-built ArchView agent outperforms general-purpose agents and prompting strategies, achieving a 22.6% clarity failure rate and 50% level-of-detail success, yet the overall results suggest LLMs remain assistive tools rather than autonomous architects. As the authors conclude, "they consistently exhibit granularity mismatches, operating at the code level rather than architectural abstractions."
The methodology is robust: 340 repositories and 13 experimental configurations yield 4,137 generated views, which provides strong statistical power. The hybrid evaluation combines automated LLM-as-a-Judge metrics for clarity, completeness, and consistency with human evaluation for accuracy and level of detail, and so captures both structural and semantic dimensions. The clear performance hierarchy provides actionable evidence that domain-specific agentic workflows outperform generic coding agents. As the authors report, "The custom agentic approach consistently outperforms the general-purpose agent, achieving the best clarity (22.6% failure rate) and level-of-detail success (50%)."
Despite syntactic validity, all approaches struggle with architectural abstraction, consistently operating at code-level granularity rather than capturing high-level architectural concerns. Completeness and consistency remain critical weaknesses; the authors note that "while 266 of the 340 generated views closely matched the ground truth in terms of clarity (AV), they struggled significantly with completeness and consistency." Additionally, the LLM-as-a-Judge evaluation method raises validity concerns: the authors admit the evaluator was "too strict occasionally (26%), frequently rejecting valid architectural abstractions." Structural similarity metrics also prove misleading, as the failing general-purpose agent achieved moderate SSIM (0.55) despite near-total semantic failure.
The evidence convincingly supports the central claim that current LLMs generate syntactically valid but semantically flawed views, with the divergence between SSIM and LLM quality metrics revealing critical gaps in pixel-based evaluation approaches. The authors observe that "GPA achieved moderate visual similarity (0.498-0.594) but worst semantic correctness (0.067-0.132), confirming pixel-level similarity does not guarantee architectural accuracy." The comparison to related work is fair and well-positioned: the authors distinguish their contribution from prior low-level UML generation studies by explicitly targeting high-level architectural abstractions, correctly identifying this as the first systematic evaluation of LLM-based architecture view generation from source code.
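To make the SSIM point concrete, the following is a minimal sketch of the standard SSIM formula computed over a single global window. It is an illustration only: the review does not say which SSIM implementation the authors used, practical implementations usually average SSIM over sliding local windows, and the function names and toy 8x8 "diagrams" here are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM for two equal-length flat lists of grayscale pixels."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    # Standard stabilizing constants from the SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x, mu_y = mean(x), mean(y)
    var_x = mean((a - mu_x) ** 2 for a in x)
    var_y = mean((b - mu_y) ** 2 for b in y)
    cov_xy = mean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def toy_diagram(label_pixels, size=8):
    """White canvas with one black box outline plus a few black 'label' pixels."""
    img = [[255.0] * size for _ in range(size)]
    for i in range(1, size - 1):
        img[1][i] = img[size - 2][i] = 0.0  # top and bottom edges
        img[i][1] = img[i][size - 2] = 0.0  # left and right edges
    for row, col in label_pixels:
        img[row][col] = 0.0
    return [p for row in img for p in row]  # flatten row by row


# Hypothetical toy inputs: the same box drawn twice with different "labels".
reference = toy_diagram([(3, 2), (3, 3)])
relabelled = toy_diagram([(4, 4), (4, 5)])

identical_score = global_ssim(reference, reference)
relabelled_score = global_ssim(reference, relabelled)
print(f"identical:  {identical_score:.3f}")
print(f"relabelled: {relabelled_score:.3f}")
```

In this toy case the second image keeps the layout but changes what the "label" says, and the score stays high because most pixels still agree. That is the kind of gap between visual and semantic similarity the quoted finding describes.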
The study demonstrates strong reproducibility practices with a comprehensive replication package available on Zenodo containing all datasets, prompts, and scripts. The authors explicitly state their commitment: "A comprehensive replication package, including all datasets, prompts, and scripts to ensure reproducibility and support future research." The use of established datasets (Migliorini et al.'s 15,000-view repository) and standardized non-parametric statistical tests (Friedman, Wilcoxon signed-rank) supports replication, though the complexity of the hierarchical summarization pipeline and dependency on specific proprietary LLM API versions may pose challenges for exact reproduction as models evolve or become unavailable.
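For readers unfamiliar with the Friedman test mentioned above, here is a minimal sketch of its test statistic in plain Python. The scores are made-up toy values, not results from the paper, and the review does not state which metrics or configurations the authors fed into the test. This version omits the tie correction that statistical libraries typically apply, and it does not compute a p-value (which would come from a chi-square distribution with k - 1 degrees of freedom).

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared_rank
        i = j + 1
    return ranks


def friedman_statistic(blocks):
    """Friedman chi-square statistic (no tie correction).

    blocks: one row per subject (e.g. repository), one column per treatment
    (e.g. configuration); every row must have the same length.
    """
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical toy data: 4 repositories scored under 3 configurations.
toy_scores = [
    [0.2, 0.5, 0.7],
    [0.1, 0.4, 0.6],
    [0.3, 0.3, 0.8],
    [0.2, 0.6, 0.9],
]
toy_statistic = friedman_statistic(toy_scores)
print(f"Friedman statistic: {toy_statistic:.3f}")
```

A common pattern is to run the Friedman test first across all configurations and then apply pairwise Wilcoxon signed-rank tests when it indicates a difference; whether the authors followed exactly that sequence is not stated in this review.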
Architecture views are essential for software architecture documentation, yet their manual creation is labor intensive and often leads to outdated artifacts. As systems grow in complexity, the automated generation of views from source code becomes increasingly valuable. Goal: We empirically evaluate the ability of LLMs and agentic approaches to generate architecture views from source code. Method: We analyze 340 open-source repositories across 13 experimental configurations using 3 LLMs with 3 prompting techniques and 2 agentic approaches, yielding 4,137 generated views. We evaluate the generated views by comparing them with the ground-truth using a combination of automated metrics complemented by human evaluations. Results: Prompting strategies offer marginal improvements. Few-shot prompting reduces clarity failures by 9.2% compared to zero-shot baselines. The custom agentic approach consistently outperforms the general-purpose agent, achieving the best clarity (22.6% failure rate) and level-of-detail success (50%). Conclusions: LLM and agentic approaches demonstrate capabilities in generating syntactically valid architecture views. However, they consistently exhibit granularity mismatches, operating at the code level rather than architectural abstractions. This suggests that there is still a need for human expertise, positioning LLMs and agents as assistive tools rather than autonomous architects.