Generalizable Self-Evolving Memory for Automatic Prompt Optimization

cs.CL Guanbao Liang, Yuanchen Bei, Sheng Zhou, Yuheng Qin, Huan Zhou, Bingxin Jia, Bin Li, Jiajun Bu · Mar 23, 2026
Plain-language introduction

MemAPO addresses a critical limitation in automatic prompt optimization (APO): existing methods frame optimization as an isolated search for task-specific prompts, preventing knowledge reuse across tasks. The paper proposes reframing APO as a continual experience accumulation process using a dual-memory mechanism, with Correct-Template Memory ($\mathcal{E}_{\mathrm{CTM}}$) for successful strategies and Error-Pattern Memory ($\mathcal{E}_{\mathrm{EPM}}$) for failure modes, that enables cross-task generalization while reducing optimization costs by approximately 57% compared to strong baselines.

Critical review
Verdict
Bottom line

The dual-memory approach is conceptually sound and empirically effective, achieving the best average performance (70.7%) across six diverse reasoning benchmarks on GPT-4o-mini, outperforming TextGrad by 7.1%. However, the evaluation relies on single-run experiments without variance reporting, and the cost advantages stem partly from the architectural shift toward amortized memory rather than per-task optimization, complicating direct comparisons.
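To give a sense of scale for the missing variance estimates, the sketch below computes the sampling error of an accuracy measured on a finite test set. It is an illustration rather than the paper's analysis: the function names are hypothetical, the normal approximation is an assumption, and 0.707 is borrowed from the reported average purely as an example value, not as any single task's accuracy. The sizes 100 and 251 are the test-set extremes noted under Main concerns.

```python
import math


def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy estimated from n_samples items."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


def ci95_half_width(accuracy: float, n_samples: int) -> float:
    """Half-width of a 95% confidence interval (normal approximation)."""
    return 1.96 * accuracy_standard_error(accuracy, n_samples)


if __name__ == "__main__":
    # 0.707 is an illustrative accuracy, not a per-task result from the paper.
    for n in (100, 251):
        half_width = ci95_half_width(0.707, n)
        print(f"n={n}: +/- {100 * half_width:.1f} points")
```

Under these assumptions, a single accuracy measured on 100 items carries roughly 9 points of sampling uncertainty in either direction, and about 5.6 points on 251 items. Per-task differences of a few points are therefore hard to interpret without repeated runs or reported intervals.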

“On GPT-4o-mini, our method achieves the best average performance 70.7% across all tasks, outperforming the strongest baseline TextGrad by 7.1%”
Section 4.2
What holds up

The dual-memory design is well-motivated by schema theory and validated through ablations showing both CTM and EPM contribute independently to performance gains (CTM improves AQuA-RAT by 18.8 points, EPM by 16.2 points). The framework successfully demonstrates cross-domain generalization when mixing mathematical and knowledge-intensive tasks, where baselines degrade significantly while MemAPO maintains improvements. The cost reduction is substantial, reducing API calls from thousands (TextGrad) to hundreds per task.

“CTM improves the performance by 18.8 and 21.6 on AQuA-RAT and Gaokao MathQA, respectively, while EPM yields improvements of 16.2 and 20.8”
Section 4.3 · Table 3
Main concerns

The experimental protocol lacks statistical rigor: "In the experiments we report, we conducted a single run." This absence of variance estimates undermines reliability claims. The evaluation uses small test sets (100-251 samples per task), raising concerns about whether the results hold at larger scale. The template update mechanism employs a strict verification step requiring updated strategies to succeed on all 3 sampled cases, which may overly constrain memory evolution but receives no sensitivity analysis. Additionally, the framework is limited to textual reasoning without multimodal support.
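The strictness of the verification step can be made concrete with a small sketch. The code below is not the authors' implementation: `passes_verification`, `acceptance_probability`, the `solve` callable, and the `min_pass_rate` parameter are hypothetical names introduced here, and exact string matching stands in for whatever correctness check the paper uses. Setting `min_pass_rate=1.0` mirrors the all-cases rule described above, while lower values show the kind of relaxation a sensitivity analysis could explore.

```python
from typing import Callable, Sequence, Tuple

Case = Tuple[str, str]  # (question, expected answer)


def passes_verification(
    solve: Callable[[str, str], str],
    strategy: str,
    cases: Sequence[Case],
    min_pass_rate: float = 1.0,
) -> bool:
    """Accept an updated strategy only if it clears min_pass_rate on the cases.

    min_pass_rate=1.0 corresponds to the strict all-cases rule; smaller values
    are a hypothetical relaxation, not something the paper reports.
    """
    if not cases:
        raise ValueError("need at least one verification case")
    # solve(strategy, question) is a stand-in for a model call.
    passed = sum(solve(strategy, q).strip() == a.strip() for q, a in cases)
    return passed / len(cases) >= min_pass_rate


def acceptance_probability(per_case_success: float, n_cases: int = 3) -> float:
    """Chance that all n_cases succeed, assuming the cases are independent."""
    return per_case_success ** n_cases
```

If each sampled case succeeds with probability 0.8 and the cases are treated as independent, the all-three rule accepts an update only about half the time (0.8^3 = 0.512), which illustrates why the choice of threshold and sample count could materially affect how quickly memory evolves.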

“In the experiments we report, we conducted a single run.”
Appendix A.3
“The current framework focuses on textual reasoning tasks and prompt optimization in language-based settings, without explicitly modeling multimodal information”
Limitations section · Section 5
Evidence and comparison

Performance claims are supported by comparisons across logical, mathematical, and knowledge-intensive domains on two model families (GPT-4o-mini and Qwen3-8B). However, the cost comparison favors MemAPO by design: it amortizes memory construction across the training set, while traditional methods optimize from scratch for each task. The cross-domain generalization test (Table 2) mixes Gaokao MathQA and History tasks, showing MemAPO avoids the 11.1% degradation seen in SPO, though broader multi-domain testing would strengthen the claim.
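Why the accounting matters can be seen with a little arithmetic. The sketch below compares total API calls when every task is optimized from scratch against a scheme that pays a one-time memory-construction cost and then a smaller per-task cost. All function names and all numbers are hypothetical placeholders chosen for illustration; none are taken from the paper.

```python
def per_task_cost(calls_per_task: int, n_tasks: int) -> int:
    """Total calls when every task is optimized from scratch."""
    return calls_per_task * n_tasks


def amortized_cost(build_calls: int, calls_per_task: int, n_tasks: int) -> int:
    """Total calls when a shared memory is built once and then reused."""
    return build_calls + calls_per_task * n_tasks


def break_even_tasks(build_calls: int, scratch_calls: int, reuse_calls: int) -> int:
    """Smallest task count at which the amortized scheme costs no more."""
    if scratch_calls <= reuse_calls:
        raise ValueError("reuse must be cheaper per task than optimizing from scratch")
    saving_per_task = scratch_calls - reuse_calls
    # Ceiling division on integers: build_calls / saving_per_task, rounded up.
    return max(1, -(-build_calls // saving_per_task))


if __name__ == "__main__":
    # Hypothetical figures: 5000 calls to build memory, 2000 per task from
    # scratch, 300 per task when reusing memory.
    n = break_even_tasks(5000, 2000, 300)
    print(n, per_task_cost(2000, n), amortized_cost(5000, 300, n))
```

With these made-up figures the amortized scheme only becomes cheaper from the third task onward, so whether the construction cost is counted, and over how many tasks it is spread, can change the headline comparison.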

“This indicates that a specific optimized prompt struggles to reconcile the conflicting reasoning demands of different domains. In contrast, MemAPO maintains improvements on two tasks”
Section 4.2 · Table 2 caption
Reproducibility

Implementation details are thorough: all hyperparameters (top-3 templates, $\theta_{\mathrm{corr}}=0.3$, max 3 reflection retries) and the complete set of meta-prompts (Figures 5-11) are provided. The setup uses standard APIs and ChromaDB vector stores. However, the single-run protocol prevents assessment of result stability. While the paper states "The codes and experiment results will be released publicly," no repository URL is provided in the text, blocking immediate reproduction. The reliance on proprietary models (GPT-4o-mini, GPT-5-chat) and specific embedding models (Qwen3-Embedding-8B) is documented.
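For anyone attempting a reimplementation before the code is released, the reported settings could be pinned in one place, as in the sketch below. The class and field names are hypothetical, and only the three values quoted above are included. The role of $\theta_{\mathrm{corr}}$ is not described in this review, so the field simply records the value without interpreting it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedHyperparameters:
    """Values quoted in the review; field names are illustrative only."""

    top_k_templates: int = 3         # "top-3 templates"
    theta_corr: float = 0.3          # threshold written as theta_corr in the paper
    max_reflection_retries: int = 3  # "max 3 reflection retries"

    def __post_init__(self) -> None:
        if self.top_k_templates < 1:
            raise ValueError("top_k_templates must be at least 1")
        if self.max_reflection_retries < 0:
            raise ValueError("max_reflection_retries must be non-negative")


if __name__ == "__main__":
    print(ReportedHyperparameters())
```

Freezing the dataclass keeps the settings immutable during a run, which makes it easier to log exactly which configuration produced a given result.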

“we conducted a single run”
Appendix A.3
“The codes and experiment results will be released publicly with clear documentation”
Appendix A.7
Abstract

Automatic prompt optimization is a promising approach for adapting large language models (LLMs) to downstream tasks, yet existing methods typically search for a specific prompt specialized to a fixed task. This paradigm limits generalization across heterogeneous queries and prevents models from accumulating reusable prompting knowledge over time. In this paper, we propose MemAPO, a memory-driven framework that reconceptualizes prompt optimization as generalizable and self-evolving experience accumulation. MemAPO maintains a dual-memory mechanism that distills successful reasoning trajectories into reusable strategy templates while organizing incorrect generations into structured error patterns that capture recurrent failure modes. Given a new prompt, the framework retrieves both relevant strategies and failure patterns to compose prompts that promote effective reasoning while discouraging known mistakes. Through iterative self-reflection and memory editing, MemAPO continuously updates its memory, enabling prompt optimization to improve over time rather than restarting from scratch for each task. Experiments on diverse benchmarks show that MemAPO consistently outperforms representative prompt optimization baselines while substantially reducing optimization cost.
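The retrieve-and-compose step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `MemoryEntry`, `DualMemory`, and `compose_prompt` are hypothetical, the prompt wording is invented, and token-overlap (Jaccard) similarity is a simple stand-in for the embedding-based retrieval over a vector store that the review mentions.

```python
from dataclasses import dataclass, field
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for embedding similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


@dataclass
class MemoryEntry:
    trigger: str  # text describing when the entry applies
    content: str  # strategy template or error-pattern description


@dataclass
class DualMemory:
    correct_templates: List[MemoryEntry] = field(default_factory=list)
    error_patterns: List[MemoryEntry] = field(default_factory=list)

    @staticmethod
    def _top_k(entries: List[MemoryEntry], query: str, k: int) -> List[MemoryEntry]:
        # sorted() is stable, so ties keep insertion order.
        ranked = sorted(entries, key=lambda e: jaccard(e.trigger, query), reverse=True)
        return ranked[:k]

    def compose_prompt(self, query: str, k: int = 3) -> str:
        """Build a prompt from the k most similar strategies and error patterns."""
        strategies = self._top_k(self.correct_templates, query, k)
        pitfalls = self._top_k(self.error_patterns, query, k)
        lines = ["Strategies that worked on similar problems:"]
        lines += [f"- {e.content}" for e in strategies]
        lines.append("Mistakes to avoid:")
        lines += [f"- {e.content}" for e in pitfalls]
        lines.append(f"Question: {query}")
        return "\n".join(lines)
```

A usage example: after appending a few `MemoryEntry` objects to each list, `DualMemory.compose_prompt("solve this ratio word problem", k=1)` returns a prompt containing the single closest strategy and the single closest error pattern, followed by the question. Keeping the two memories separate lets the prompt both encourage known-good reasoning and warn against recurrent mistakes.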
