SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection

Kexian Tang, Jiani Wang, Shaowen Wang, Kaifeng Lyu · Mar 23, 2026 · cs.LG, cs.AI, cs.CL
AI Review
Plain-language introduction

Large language models often lack coverage in specialized, data-scarce domains where web text is limited. This paper proposes SPA (Scaling Prompt-engineered Augmentation), a baseline that generates large-scale synthetic corpora using just seven carefully designed prompt templates grounded in cognitive learning strategies (Concept Learning, Critical Thinking, and Generative Learning). The core finding is that this simple approach consistently outperforms complex RL-based methods like SEAL and multi-stage pipelines like EntiGraph across Wikipedia QA, long-document comprehension, and multi-hop reasoning benchmarks, suggesting that careful prompt design combined with straightforward scaling is surprisingly effective for knowledge injection.

Critical review
Verdict
Bottom line

SPA establishes itself as a compelling and rigorous baseline for synthetic data generation in knowledge injection tasks. The paper demonstrates through token-matched evaluations that a fixed set of seven human-curated prompts, when scaled to hundreds of millions of tokens, can match or exceed the performance of significantly more complex methods including reinforcement learning-based approaches (SEAL) and entity-graph pipelines (EntiGraph). The work provides credible empirical evidence for two key limitations of prior approaches: diversity collapse in RL-based generators at scale and the diminishing returns of multi-stage prompting compared to single-stage augmentation with optimized prompts.

“careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective”
SPA paper · Abstract
“SPA consistently improves with scale and achieves the highest accuracy at moderate-to-large token budgets”
SPA paper · Section 5.1
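The single-stage recipe described above (a fixed set of prompts applied repeatedly to the source documents until a token budget is met) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three template strings are hypothetical placeholders for the paper's seven templates (reproduced in its Appendix A), `generate` is any caller-supplied text generator, and the whitespace token count is an assumption standing in for a real tokenizer.

```python
import itertools
from typing import Callable, List, Sequence

# Hypothetical stand-ins; the paper's actual seven templates are in its Appendix A.
TEMPLATES = [
    "Explain the key concepts in the following passage:\n{doc}",
    "Critically examine the claims in the following passage:\n{doc}",
    "Write questions and answers that cover the following passage:\n{doc}",
]


def count_tokens(text: str) -> int:
    """Crude whitespace count (assumption); a real run would use the model tokenizer."""
    return len(text.split())


def augment(
    docs: Sequence[str],
    generate: Callable[[str], str],
    token_budget: int,
    templates: Sequence[str] = TEMPLATES,
) -> List[str]:
    """Cycle over (document, template) pairs until the token budget is reached."""
    corpus: List[str] = []
    used = 0
    pairs = itertools.cycle(list(itertools.product(docs, templates)))
    for doc, template in pairs:
        if used >= token_budget:
            break
        sample = generate(template.format(doc=doc))
        corpus.append(sample)
        # Count at least one token per sample so the loop always terminates.
        used += max(1, count_tokens(sample))
    return corpus
```

In practice `generate` would sample from an LLM with nonzero temperature, so repeated passes over the same (document, template) pair yield different text; with a deterministic generator the loop would simply duplicate samples.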
What holds up

The scaling analysis is thorough and convincing: SPA shows consistent performance gains when scaling synthetic data up to 4000× the original corpus size on SQuAD (reaching 91.27% accuracy), while SEAL saturates at 74.23%. The diversity evaluation (Table 3) supports the claim that RL-based methods suffer diversity collapse: SEAL has a Self-BLEU score of 0.0058 and a Compression Ratio of 19.25, compared to SPA's 0.0010 and 4.38, respectively (on both metrics, higher values indicate more repetitive text). The results are consistent across SQuAD, QuALITY, and MultiHop-RAG and across two model families (Qwen2.5-7B and Meta-Llama-3-8B), which suggests that the method generalizes beyond a specific architecture or domain.

“SEAL exhibits substantially lower diversity than all other methods across all four metrics on SQuAD”
SPA paper · Table 3
“SEAL's performance saturates while SPA continues to improve, ultimately outperforming SEAL by a large margin at 120M tokens (91.27% vs. 74.23%)”
SPA paper · Section 5.1
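To make the Compression Ratio figure concrete, here is a minimal sketch of how such a diversity score can be computed: the size of the raw corpus divided by its compressed size, so that highly repetitive corpora score higher. The use of `zlib` and of newline-joined samples are assumptions for illustration; the paper's exact compressor and preprocessing are not stated in this review.

```python
import zlib
from typing import List


def compression_ratio(samples: List[str]) -> float:
    """Raw byte size divided by compressed byte size; higher means more redundancy."""
    raw = "\n".join(samples).encode("utf-8")
    compressed = zlib.compress(raw, 9)  # assumption: zlib at maximum compression
    return len(raw) / len(compressed)


if __name__ == "__main__":
    repetitive = ["the cat sat on the mat"] * 50
    varied = [f"sample {i} mentions topic {i * 7919 % 101} and item {i * i}" for i in range(50)]
    print("repetitive:", round(compression_ratio(repetitive), 2))
    print("varied:    ", round(compression_ratio(varied), 2))
```

A generator that collapses onto a few phrasings behaves like the `repetitive` list here: its output compresses very well, which pushes the ratio up.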
Main concerns

While highly effective, SPA relies on seven manually curated prompts designed around cognitive science principles (Concept Learning, Critical Thinking, Generative Learning). This raises questions about how well the method extends to radically different domains where similar pedagogical expertise is unavailable; the paper itself acknowledges in Section 6.3 that "the optimal prompt configuration does not transfer across tasks." Additionally, the benchmarks use different generator models (Qwen2.5-7B for SQuAD, gpt-oss-120b for QuALITY, GPT-4o-mini for MultiHop-RAG), which makes it harder to separate SPA's contribution from the capabilities of the underlying generator, though Table 2 partially addresses this with controlled generator comparisons. The evaluation also focuses on factual QA tasks, leaving open questions about performance in domains that require intensive numerical reasoning or involve rapidly evolving knowledge.

“However, the optimal prompt configuration does not transfer across tasks”
SPA paper · Section 6.3
“However, there may exist more challenging scenarios that we do not cover, such as domains requiring intensive numerical reasoning or rapidly evolving knowledge”
SPA paper · Section 7
Evidence and comparison

The evidence strongly supports the claim that SPA outperforms prior work at scale. In token-matched evaluations, SPA reaches 91.27% on SQuAD versus 90.25% for Active Reading and 74.23% for SEAL, and 57.03% on QuALITY versus 56.22% for EntiGraph. The comparison with SEAL is nuanced: SEAL initially outperforms SPA at small scales (as expected for an RL method that optimizes downstream task accuracy) but shows diminishing returns due to diversity collapse, whereas SPA keeps benefiting from continued scaling. The analysis of Active Reading in Section 5.2 provides compelling evidence that multi-stage prompting can underperform single-stage approaches when the intermediate strategy generation is suboptimal: SPA's fixed prompts achieve higher average strategy effectiveness than Active Reading's document-specific strategies.

“SPA consistently achieves the strongest performance across all three benchmarks”
SPA paper · Table 1
“Across all five documents, SPA's prompts consistently achieve higher average QA accuracy than the strategies produced by Active Reading”
SPA paper · Section 5.2
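The token-matched comparison and the reported margins can be illustrated with a short sketch. The truncation helper is a hypothetical, simplified stand-in for whatever budget-matching procedure the paper uses (whitespace tokens are an assumption), and the accuracy figures are the ones quoted in this review. The implied seed corpus size assumes the 4000× and 120M-token figures describe the same SQuAD run.

```python
from typing import Dict, List


def truncate_to_budget(samples: List[str], budget: int) -> List[str]:
    """Keep whole samples, in order, while the running token count stays within budget."""
    kept: List[str] = []
    used = 0
    for sample in samples:
        n = len(sample.split())  # assumption: whitespace tokens, not model tokens
        if used + n > budget:
            break
        kept.append(sample)
        used += n
    return kept


# Accuracy figures (percent) as quoted in the review.
SQUAD = {"SPA": 91.27, "Active Reading": 90.25, "SEAL": 74.23}
QUALITY = {"SPA": 57.03, "EntiGraph": 56.22}


def margins_over(scores: Dict[str, float], method: str = "SPA") -> Dict[str, float]:
    """Percentage-point lead of `method` over every other entry."""
    return {k: round(scores[method] - v, 2) for k, v in scores.items() if k != method}


if __name__ == "__main__":
    print(margins_over(SQUAD))    # {'Active Reading': 1.02, 'SEAL': 17.04}
    print(margins_over(QUALITY))  # {'EntiGraph': 0.81}
    # 120M synthetic tokens at 4000x the seed corpus implies about 30,000 seed tokens.
    print(120_000_000 // 4000)    # 30000
```

Read this way, the lead over SEAL on SQuAD is about 17 points, while the leads over Active Reading and EntiGraph are roughly one point or less.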
Reproducibility

The authors support reproducibility by releasing code at https://github.com/Tangkexian/SPA and providing exhaustive experimental details. All seven prompt templates are reproduced verbatim in Appendix A with both instruction-tuned and base model variants, and hyperparameters are detailed in Appendix B (SQuAD) and Appendix D (QuALITY), including learning rate schedules, batch sizes, and context lengths. However, exact reproduction may be hindered by the choice of generators (GPT-4o-mini is proprietary, and gpt-oss-120b, although openly released, is costly to run) and by the subjective, cognitively grounded process of prompt design, which lacks an automated derivation procedure for new domains.

“Our code is available at https://github.com/Tangkexian/SPA”
SPA paper · Abstract
“We list the seven prompt templates used in SPA, ordered according to the learning levels described in Section 3”
SPA paper · Appendix A
Abstract

While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.
