Mixture of Chapters: Scaling Learnt Memory in Transformers
This paper tackles the lack of explicit memory mechanisms in transformers by introducing Mixture of Chapters (MoC): a learned bank of 262K latent memory tokens accessed via cross-attention. To scale memory without prohibitive costs, the authors partition the bank into chapters and route each input sequence to a sparse subset of chapters (top-64), reducing the cost of memory attention from $O(L \cdot N_m)$ to $O(L \cdot k \cdot T)$. The work demonstrates that explicit associative memory can serve as a new axis of scaling, showing improved knowledge retention when transitioning from pretraining to instruction fine-tuning.
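The size of the stated reduction can be read off directly from the two bounds. Writing the bank of $N_m$ tokens as $C$ chapters of $T$ tokens each (so $N_m = C \cdot T$), with $k$ chapters selected per sequence of length $L$, the ratio of dense to routed memory-attention cost is a simple identity, not a figure reported by the paper:

$$\frac{L \cdot N_m}{L \cdot k \cdot T} = \frac{C \cdot T}{k \cdot T} = \frac{C}{k}$$

With $k = 64$, the saving is therefore $C / 64$ for whatever chapter count $C$ the paper uses; this summary does not state $C$ or $T$ individually, so no concrete factor is given here.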
The paper presents a compelling architectural innovation that combines learned latent memory with sparse routing inspired by Mixture-of-Experts. The iso-FLOP experimental design and detailed FLOP accounting lend credibility to the claim that explicit memory provides complementary capacity to dense parameters. However, the evaluation is limited to small-scale models (202M–371M parameters) trained on only 9.6B tokens, and the empirical comparison to competing memory architectures like Product Key Memory or Memory Layers at Scale is absent beyond literature citations.
The chapter-based routing mechanism is technically sound and well-motivated by the need to scale memory capacity independently of compute. The mathematical formulation is clear: partitioning memory $M$ into $C$ chapters of size $T$ (where $N_m = C \cdot T$) and selecting a subset of chapters via $\mathcal{S} = \text{TopK}(p, k)$ over the router's chapter scores $p$ preserves the associative retrieval properties of attention while controlling costs. The retention experiments provide strong evidence for the paper's central claim: under heavy instruction fine-tuning (230M tokens), the dense baseline suffers catastrophic forgetting on ARC-Challenge (-6.69 pp) and BoolQ (-6.24 pp), while MoC remains comparatively stable (-2.68 pp and +0.24 pp respectively).
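The routing step described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the mean-pooled router input, the absence of learned query/key/value projections, the function names, and the toy sizes are all assumptions made only to show how $\mathcal{S} = \text{TopK}(p, k)$ restricts cross-attention to $k \cdot T$ memory tokens.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def route_chapters(seq, router_weights, k):
    """Sequence-level routing: pool the tokens, score every chapter, keep the top k.

    Mean pooling is an assumption; the review only says chapters are chosen
    once per sequence.
    """
    d = len(seq[0])
    pooled = [sum(tok[i] for tok in seq) / len(seq) for i in range(d)]
    p = softmax([dot(pooled, w) for w in router_weights])  # chapter scores p
    top = sorted(range(len(p)), key=lambda c: p[c], reverse=True)[:k]
    return sorted(top), p

def memory_cross_attention(seq, memory, selected, T):
    """Each token attends only to the k*T memory tokens of the selected chapters.

    Memory tokens serve as both keys and values here (no projections), which
    is a simplification for illustration.
    """
    d = len(seq[0])
    active = [memory[c * T + t] for c in selected for t in range(T)]
    out = []
    for q in seq:
        w = softmax([dot(q, m) / math.sqrt(d) for m in active])
        out.append([sum(wi * m[i] for wi, m in zip(w, active)) for i in range(d)])
    return out, len(active)

random.seed(0)
C, T, k, L, d = 8, 4, 2, 5, 16  # toy sizes, far smaller than the paper's
memory = [[random.gauss(0, 1) for _ in range(d)] for _ in range(C * T)]
router_weights = [[random.gauss(0, 1) for _ in range(d)] for _ in range(C)]
seq = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]

selected, p = route_chapters(seq, router_weights, k)
out, n_active = memory_cross_attention(seq, memory, selected, T)
print(len(selected), n_active, len(out), len(out[0]))  # prints: 2 8 5 16
```

In this toy setting each token scores 8 memory tokens instead of all 32, which mirrors the $O(L \cdot k \cdot T)$ versus $O(L \cdot N_m)$ distinction. Because the selection is made once per sequence, every token in the sequence shares the same active chapters.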
The empirical validation is narrow, relying on only four benchmarks (MMLU, ARC-C, BoolQ, OpenBookQA) without probing long-context retrieval, factual recall precision, or robustness to distribution shift. The lack of direct empirical comparison to Product Key Memory (Lample et al.) or Memory Layers at Scale (Berges et al.), the paper's closest architectural relatives, makes it difficult to assess whether chapter routing outperforms existing sparse memory lookup mechanisms. Additionally, the router's sequence-level granularity (selecting chapters once per sequence rather than per token) may limit expressiveness, and the paper does not analyze load-balancing collapse or router specialization despite using auxiliary losses. The claim that freezing the memory bank during IFT yields performance comparable to updating it ("post-IFT benchmark scores remain within noise") suggests the bank is effectively static after pretraining, raising questions about the mechanism's utility for continual learning beyond the initial anchoring effect.
The iso-FLOP methodology is rigorous, with analytic calculations showing MoC uses 1.378T FLOPs per sequence compared to 1.487T for the vanilla iso-FLOP baseline. The evidence supports the claim that explicit memory improves retention under continued training, as shown by the smaller performance deltas after IFT. However, the paper does not empirically demonstrate superiority over other memory-augmented architectures cited as related work (e.g., Memformer, PKM, or Titans). The comparison is limited to dense transformers, leaving open whether the gains come from the memory mechanism specifically or simply from increased parameter count (371M total vs 202M dense), despite the FLOP-matching effort.
The paper provides substantial implementation detail including full hyperparameter tables (learning rates, model dimensions, optimizer settings), analytic FLOP derivations, and a GitHub code link. The FLOP accounting is transparent, breaking down costs for self-attention ($6.5 \times 10^9$ per layer), router computation ($7.1 \times 10^6$), and memory attention ($2.6 \times 10^{10}$). However, the exact data preprocessing steps, evaluation prompts for the benchmarks, and random seed information are not specified. The heavy compute requirements (8 GPUs for pretraining) may limit independent reproduction, though the methodology is sufficiently documented that the experiments could be replicated with equivalent resources.
Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso-FLOP settings) on pre-training and instruction fine-tuning across relevant benchmarks. Our models surpass iso-FLOP baselines, suggesting scope for a new axis of scaling and demonstrating that explicit associative memory provides complementary capacity to what is captured implicitly in model parameters. Additionally, we observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., pretraining to instruction fine-tuning).