Generalizable Self-Evolving Memory for Automatic Prompt Optimization
MemAPO addresses a critical limitation in automatic prompt optimization (APO): existing methods frame optimization as an isolated search for task-specific prompts, preventing knowledge reuse across tasks. The paper proposes reframing APO as a continual experience accumulation process using a dual-memory mechanism—Correct-Template Memory ($\mathcal{E}_{\mathrm{CTM}}$) for successful strategies and Error-Pattern Memory ($\mathcal{E}_{\mathrm{EPM}}$) for failure modes—that enables cross-task generalization while reducing optimization costs by approximately 57% compared to strong baselines.
The dual-memory approach is conceptually sound and empirically effective, achieving the best average performance (70.7%) across six diverse reasoning benchmarks on GPT-4o-mini, outperforming TextGrad by 7.1%. However, the evaluation relies on single-run experiments without variance reporting, and the cost advantages stem partly from the architectural shift toward amortized memory rather than per-task optimization, complicating direct comparisons.
The dual-memory design is well-motivated by schema theory and validated through ablations showing both CTM and EPM contribute independently to performance gains (CTM improves AQuA-RAT by 18.8 points, EPM by 16.2 points). The framework successfully demonstrates cross-domain generalization when mixing mathematical and knowledge-intensive tasks, where baselines degrade significantly while MemAPO maintains improvements. The cost reduction is substantial, reducing API calls from thousands (TextGrad) to hundreds per task.
The experimental protocol lacks statistical rigor: "In the experiments we report, we conducted a single run." Without variance estimates, the reliability claims are hard to assess. The test sets are small (100-251 samples per task), which limits statistical power and leaves scalability untested. The template update mechanism uses a strict verification step that requires updated strategies to succeed on all 3 sampled cases; this may overly constrain memory evolution, yet it receives no sensitivity analysis. Additionally, the framework is limited to textual reasoning without multimodal support.
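To make the two statistical concerns above concrete, here is a minimal sketch, not taken from the paper. It assumes test items and verification cases are independent, and it uses the reported 70.7% average purely as an illustrative accuracy level. The function names are hypothetical.

```python
import math


def accuracy_std_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n independent test items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def pass_all_probability(p_single: float, k: int = 3) -> float:
    """Chance that a candidate passes all k checks, assuming independent cases."""
    return p_single ** k


if __name__ == "__main__":
    # Test-set sizes quoted in the review; 0.707 is only an illustrative accuracy.
    for n in (100, 251):
        se = accuracy_std_error(0.707, n)
        print(f"n={n}: standard error ~ {100 * se:.1f} points")
    # Hypothetical per-case success rates for an updated template.
    for p in (0.7, 0.8, 0.9):
        print(f"p={p}: passes all 3 with probability {pass_all_probability(p):.3f}")
```

Under these assumptions the standard error is roughly 4.6 points at n = 100 and roughly 2.9 points at n = 251, so per-task differences of a few points are hard to separate from sampling noise on a single run. The second function shows why the all-of-3 gate deserves a sensitivity analysis: a template that succeeds on 90% of cases would still be rejected about 27% of the time.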
Performance claims are supported by comparisons across logical, mathematical, and knowledge-intensive domains on two model families (GPT-4o-mini and Qwen3-8B). However, the cost comparison favors MemAPO by design: it amortizes memory construction across the training set, while traditional methods optimize from scratch per task. The cross-domain generalization test (Table 2) mixes Gaokao MathQA and History tasks and shows that MemAPO avoids the 11.1% degradation seen in SPO, though broader multi-domain testing would strengthen the claim.
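The amortization point can be illustrated with a small sketch. It assumes a simple cost model in which a one-time memory-construction cost is shared evenly across tasks. The function name and all call counts below are illustrative placeholders, not figures from the paper.

```python
def calls_per_task(one_time_calls: int, per_task_calls: int, num_tasks: int) -> float:
    """Average API calls per task when a one-time cost is shared by num_tasks tasks."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    return one_time_calls / num_tasks + per_task_calls


if __name__ == "__main__":
    # Placeholder numbers chosen only to show the shape of the comparison.
    for tasks in (1, 3, 6):
        shared = calls_per_task(one_time_calls=1200, per_task_calls=100, num_tasks=tasks)
        scratch = calls_per_task(one_time_calls=0, per_task_calls=2000, num_tasks=tasks)
        print(f"{tasks} task(s): shared memory {shared:.0f} vs from scratch {scratch:.0f}")
```

Under this model the reported per-task cost of a memory-based method depends on how many tasks share the memory, which is why a per-task comparison against methods that restart for every task is not like-for-like.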
Implementation details are thorough: all hyperparameters (top-3 templates, $\theta_{\mathrm{corr}}=0.3$, max 3 reflection retries) and the complete set of meta-prompts (Figures 5-11) are provided. The setup uses standard APIs and ChromaDB vector stores. However, the single-run protocol prevents assessment of result stability. While the paper states "The codes and experiment results will be released publicly," no repository URL is provided in the text, blocking immediate reproduction. The reliance on proprietary models (GPT-4o-mini, GPT-5-chat) and specific embedding models (Qwen3-Embedding-8B) is documented.
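The retry limit and the all-of-3 verification mentioned in this review can be pictured with the following sketch. It is an assumption that the reflection retries wrap the template update in this way; the review does not say so. `revise` and `solve` are hypothetical stand-ins for LLM calls, and `Case` is a hypothetical data shape.

```python
from typing import Callable, Optional, Sequence, Tuple

Case = Tuple[str, str]  # (question, expected answer); hypothetical structure

MAX_RETRIES = 3  # "max 3 reflection retries" as quoted in the review


def verified_update(
    template: str,
    cases: Sequence[Case],
    revise: Callable[[str, int], str],
    solve: Callable[[str, str], str],
    max_retries: int = MAX_RETRIES,
) -> Optional[str]:
    """Return a revised template that solves every case, or None if retries run out."""
    for attempt in range(1, max_retries + 1):
        candidate = revise(template, attempt)
        # Strict gate: one failed case rejects the candidate.
        if all(solve(candidate, q) == expected for q, expected in cases):
            return candidate
    return None


if __name__ == "__main__":
    cases = [("2+2", "4"), ("3+3", "6"), ("5+1", "6")]
    answers = dict(cases)

    def toy_revise(template: str, attempt: int) -> str:
        # Pretend the second reflection produces a usable strategy.
        return template + (" Verify the arithmetic." if attempt >= 2 else "")

    def toy_solve(template: str, question: str) -> str:
        return answers[question] if "Verify" in template else "?"

    print(verified_update("Solve step by step.", cases, toy_revise, toy_solve))
```

Returning `None` when no candidate passes leaves the existing template unchanged, which is one plausible reading of how a strict gate would constrain memory evolution.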
Automatic prompt optimization is a promising approach for adapting large language models (LLMs) to downstream tasks, yet existing methods typically search for a specific prompt specialized to a fixed task. This paradigm limits generalization across heterogeneous queries and prevents models from accumulating reusable prompting knowledge over time. In this paper, we propose MemAPO, a memory-driven framework that reconceptualizes prompt optimization as generalizable and self-evolving experience accumulation. MemAPO maintains a dual-memory mechanism that distills successful reasoning trajectories into reusable strategy templates while organizing incorrect generations into structured error patterns that capture recurrent failure modes. Given a new prompt, the framework retrieves both relevant strategies and failure patterns to compose prompts that promote effective reasoning while discouraging known mistakes. Through iterative self-reflection and memory editing, MemAPO continuously updates its memory, enabling prompt optimization to improve over time rather than restarting from scratch for each task. Experiments on diverse benchmarks show that MemAPO consistently outperforms representative prompt optimization baselines while substantially reducing optimization cost.
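The retrieve-and-compose step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bag-of-words `embed` function stands in for a real embedding model and vector store, and `Memory`, `compose_prompt`, and the prompt wording are hypothetical. Only the dual-memory idea and the top-3 retrieval setting come from the text above.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Memory:
    """List of (trigger text, stored guidance) pairs with similarity lookup."""
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, trigger: str, guidance: str) -> None:
        self.entries.append((trigger, guidance))

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        q = embed(query)
        scored = [(cosine(q, embed(trig)), guide) for trig, guide in self.entries]
        scored = [s for s in scored if s[0] > 0.0]  # drop unrelated entries
        scored.sort(key=lambda s: s[0], reverse=True)
        return [guide for _, guide in scored[:k]]


def compose_prompt(query: str, ctm: Memory, epm: Memory, k: int = 3) -> str:
    """Combine retrieved strategies and failure patterns with the query."""
    strategies = ctm.retrieve(query, k)
    pitfalls = epm.retrieve(query, k)
    parts = []
    if strategies:
        parts.append("Useful strategies:\n" + "\n".join(f"- {s}" for s in strategies))
    if pitfalls:
        parts.append("Known mistakes to avoid:\n" + "\n".join(f"- {p}" for p in pitfalls))
    parts.append("Question: " + query)
    return "\n\n".join(parts)


if __name__ == "__main__":
    ctm, epm = Memory(), Memory()  # correct-template and error-pattern memories
    ctm.add("train speed distance time word problem",
            "Write distance = speed * time before substituting numbers.")
    ctm.add("dynasty emperor history date",
            "Anchor the event to a dynasty first, then narrow the date.")
    epm.add("speed distance time units",
            "Mixing minutes and hours when computing speed.")
    print(compose_prompt(
        "A train covers 120 km in 90 minutes. What is its speed in km per hour?",
        ctm, epm))
```

In this toy run the history template is filtered out because it shares no terms with the query, while the speed strategy and the unit-mixing error pattern are both attached to the prompt. The self-reflection and memory-editing steps that populate the two memories are outside the scope of this sketch.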