AdditiveLLM2: A Multi-modal Large Language Model for Additive Manufacturing
AdditiveLLM2 is a domain-adapted multi-modal LLM for additive manufacturing built by fine-tuning Gemma 3 12B on ~50 million tokens from open-access AM journal articles. The work addresses the challenge of specializing general LLMs for technical domains without consuming context window space (as with RAG) or requiring massive datasets. Using domain adaptive pretraining (DAPT) for both text and vision plus visual instruction tuning (VIT), the authors demonstrate that even relatively small curated datasets can yield domain expertise exceeding 90% accuracy on AM knowledge tasks.
The paper presents a competent but limited demonstration of domain adaptation for additive manufacturing. The three-stage training pipeline (text DAPT → image DAPT → VIT) is methodologically sound and the open-sourced AdditiveLLM2-OA dataset represents a genuine contribution. However, the small dataset size (~50M tokens vs. billions in comparable work), reliance on synthetic data generation (GPT-OSS) for training labels and evaluation, and limited statistical testing (5 trials) temper the claims of robust domain specialization. The work succeeds as a proof-of-concept for efficient specialization but lacks the empirical rigor to fully support claims of outperforming generalist models on complex engineering reasoning.
The multi-stage adaptation strategy is well-motivated and appropriately executed using LoRA ($r=16$, $\alpha=32$) to preserve computational efficiency. Section 4.3 describes freezing language weights during vision DAPT, which is sound practice to prevent catastrophic forgetting. The Additive-Manufacturing-Benchmark is a comprehensive evaluation suite covering both language tasks (multiple choice, short answer) and vision tasks (FDM defect detection, LPBF anomaly identification), with tasks sourced from published experimental data like MeltpoolNet and Peregrine. The authors appropriately note that "the base model was not selected as the 'best' model for a specific task in any of the cases," showing consistent gains from domain adaptation.
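To put the LoRA setting in perspective, here is a minimal sketch of what $r=16$, $\alpha=32$ implies under the standard LoRA formulation, where a frozen weight $W$ is augmented as $W + \frac{\alpha}{r} BA$. The 4096 x 4096 layer size is a hypothetical value chosen for illustration, not a dimension taken from Gemma 3, and the review does not state which modules the authors adapted.

```python
def lora_scaling(r: int = 16, alpha: int = 32) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    return alpha / r


def lora_param_count(d_out: int, d_in: int, r: int = 16) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


# Hypothetical layer size, for illustration only (not taken from Gemma 3).
D_OUT, D_IN = 4096, 4096

full = D_OUT * D_IN
added = lora_param_count(D_OUT, D_IN)
print(f"scaling alpha/r = {lora_scaling()}")
print(f"full matrix: {full:,} params, LoRA adds: {added:,} ({added / full:.2%})")
```

For a layer of this assumed size, the low-rank update trains under 1% of the parameters of the full matrix, which is the efficiency argument behind the choice.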
The primary concern is dataset scale and quality. The authors use only ~50 million tokens (1,704 journal articles) compared to Gururangan et al.'s 24 billion tokens for DAPT, raising questions about coverage of AM knowledge. The paper acknowledges this discrepancy but attributes success to the narrower domain scope; this explanation remains speculative. More critically, the results show a regression mid-pipeline: "performance at the DAPT image training stage noticeably decreases" in both language and vision tasks, suggesting training instability or data quality issues during the vision adaptation phase. The reliance on GPT-OSS 120B to generate VIT examples from figure captions makes the training data dependent on another model's output, and using GPT-OSS 20B to grade short answers introduces evaluator bias. Sample sizes for benchmark tasks are modest (127 questions, 100 FDM samples), and with only 5 evaluation trials, the statistical significance of reported improvements is unclear.
The evidence supports modest domain adaptation gains but not revolutionary performance. The comparison to Gururangan et al. (24B tokens on RoBERTa-base) is fair in spirit, but the roughly 480-fold difference in dataset size (24B vs. ~50M tokens) makes direct comparison difficult. The authors argue that AM is a narrower domain but provide no evidence that 50M tokens capture sufficient AM knowledge relative to the broader domains in prior work. The benchmark mixes evaluation modes: multiple choice achieves 93% accuracy post-adaptation (from 88% baseline), but short answer scoring relies on automated rubrics graded by GPT-OSS, which may favor models trained on GPT-OSS-generated data. Notably absent are comparisons to commercial models like GPT-4V or Claude on the same benchmark, which would contextualize whether domain adaptation outperforms simply using capable generalist VLMs with good prompting. The melt pool prediction task uses RMSE, but the error magnitudes relative to process tolerances are not discussed.
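A rough check of the significance concern can be sketched as follows. It assumes, without confirmation from the paper, that the 88% and 93% multiple-choice accuracies were each measured on the 127-question set, that the two results can be treated as independent samples, and that a single trial is representative. The function names are illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> float:
    """Pooled two-proportion z-statistic for the difference k2/n2 - k1/n1."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (k2 / n2 - k1 / n1) / se


N = 127                        # assumed size of the multiple-choice set
base = round(0.88 * N)         # assumed correct answers for the base model
adapted = round(0.93 * N)      # assumed correct answers after adaptation

low, high = wilson_interval(adapted, N)
z_stat = two_proportion_z(base, N, adapted, N)
print(f"adapted accuracy interval: [{low:.3f}, {high:.3f}]")
print(f"z-statistic for the improvement: {z_stat:.2f}")
```

Under these assumptions the z-statistic comes out at about 1.3, below the conventional 1.96 threshold, and the baseline accuracy falls inside the interval for the adapted model. A paired test on per-question outcomes across all 5 trials would be more appropriate if the authors released them.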
Reproducibility is mixed. The AdditiveLLM2-OA dataset is publicly hosted on HuggingFace with extracted text, images, and VIT examples, which is excellent. The base model (Gemma 3 12B IT) and LoRA configuration ($r=16$, $\alpha=32$) are clearly specified. However, critical details are missing: random seeds for the 5 evaluation trials, exact train/validation splits beyond "95% train and 5% validation," and the specific prompt templates used for VIT. The use of GPT-OSS (20B and 120B) to generate training labels and evaluate answers creates a barrier to full reproduction since this requires access to specific model versions. Training took ~36 hours per stage on 3x NVIDIA A6000 GPUs, but batch sizes and learning rates are not specified, making the training runs difficult to replicate exactly. The code for the training pipeline and benchmark evaluation is not mentioned as publicly available.
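The missing split details are inexpensive to report. As a minimal sketch, a 95/5 split becomes reproducible once the seed and the unit of splitting are stated. Splitting at the article level, the seed value, and the function name are all assumptions made for illustration; the paper only states the 95%/5% proportions.

```python
import random


def split_train_validation(items, validation_fraction=0.05, seed=0):
    """Deterministically shuffle items and hold out a validation fraction.

    The seed and the article-level granularity are illustrative assumptions.
    """
    rng = random.Random(seed)          # local RNG, so global state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder IDs standing in for the 1,704 journal articles.
article_ids = [f"article_{i:04d}" for i in range(1704)]
train, validation = split_train_validation(article_ids, seed=0)
print(len(train), len(validation))
```

Publishing the seed together with the resulting ID lists would let others reconstruct the exact split regardless of library version.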
This work presents AdditiveLLM2, a multi-modal, domain-adapted large language model built upon the instruction-tuned variant of the Gemma 3 model using a relatively small dataset of around 50 million tokens. The dataset (AdditiveLLM2-OA) consists of open-access additive manufacturing journal articles with data extracted for the domain adaptive pretraining and visual instruction tuning processes. Various stages of the developed model are evaluated with the Additive-Manufacturing-Benchmark, which consists of additive manufacturing domain-specific tasks compiled from published resources. AdditiveLLM2 exhibits proficiency in both language and vision-based tasks, achieving accuracies upwards of 90% in general additive manufacturing knowledge. This domain adaptive pretraining and instruction tuning strategy outlines an accessible method for specializing large language models to a domain such as additive manufacturing.