Mixture of Chapters: Scaling Learnt Memory in Transformers

cs.LG, cs.AI, cs.CL · Tasmay Pankaj Tibrewal, Pritish Saha, Ankit Meda, Kunal Singh, Pradeep Moturi · Mar 22, 2026

Plain-language introduction

This paper tackles the lack of explicit memory mechanisms in transformers by introducing Mixture of Chapters (MoC): a learned bank of 262K latent memory tokens accessed via cross-attention. To scale memory without prohibitive cost, the authors partition the bank into chapters and route each input sequence to a sparse subset of them (top-64), reducing the memory attention cost from $O(L \cdot N_m)$ to $O(L \cdot k \cdot T)$, where $L$ is the sequence length, $N_m$ the number of memory tokens, $k$ the number of selected chapters, and $T$ the chapter size. The work argues that explicit associative memory can serve as a new axis of scaling and reports improved knowledge retention when moving from pretraining to instruction fine-tuning.

Critical review

Verdict

Bottom line

The paper presents a compelling architectural innovation that combines learned latent memory with sparse routing inspired by Mixture-of-Experts. The iso-FLOP experimental design and detailed FLOP accounting lend credibility to the claim that explicit memory provides complementary capacity to dense parameters. However, the evaluation is limited to small-scale models (202M–371M parameters) trained on only 9.6B tokens, and there is no empirical comparison to competing memory architectures such as Product Key Memory or Memory Layers at Scale, which appear only as literature citations.

“This enables scaling to 262K memory tokens while maintaining tractable computation.”

paper · Abstract

“We compare (i) Vanilla (iso-FLOP): a dense transformer baseline compute-matched to our memory model during pretraining”

paper · Section 6.1

What holds up

The chapter-based routing mechanism is technically sound and well-motivated by the need to scale memory capacity independently of compute. The mathematical formulation is clear: partitioning memory $M$ into $C$ chapters of size $T$ (where $N_m = C \cdot T$) and selecting a subset of chapters from the router scores $p$ via $\mathcal{S} = \text{TopK}(p, k)$ preserves the associative retrieval properties of attention while controlling costs. The retention experiments provide strong evidence for the paper's central claim: under heavy instruction fine-tuning (230M tokens), the dense baseline forgets substantially on ARC-Challenge (-6.69 pp) and BoolQ (-6.24 pp), while MoC remains comparatively stable (-2.68 pp and +0.24 pp, respectively).
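To make the formulation concrete, here is a minimal sketch of sequence-level chapter routing followed by cross-attention over the selected chapters. It is not the authors' implementation: the mean-pooled linear router, the omission of query/key/value projections, the helper names, and the tiny dimensions are all assumptions chosen for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route_chapters(seq, router_w, k):
    """Pick k chapters once per sequence (S = TopK(p, k)).

    Assumption: the router mean-pools the sequence and applies one linear
    score per chapter; the paper's router may differ.
    """
    d = len(seq[0])
    pooled = [sum(tok[i] for tok in seq) / len(seq) for i in range(d)]
    p = softmax([dot(pooled, w) for w in router_w])  # one probability per chapter
    chosen = sorted(range(len(p)), key=lambda c: p[c], reverse=True)[:k]
    return sorted(chosen), p


def memory_cross_attention(seq, memory, chapter_size, chosen):
    """Each input token attends only to the tokens of the selected chapters.

    Assumption: memory tokens act as both keys and values, with no projections.
    """
    d = len(seq[0])
    selected = [memory[c * chapter_size + t] for c in chosen for t in range(chapter_size)]
    out = []
    for q in seq:
        weights = softmax([dot(q, m) / math.sqrt(d) for m in selected])
        out.append([sum(w * m[i] for w, m in zip(weights, selected)) for i in range(d)])
    return out, len(selected)


# Toy sizes (hypothetical): d = width, L = sequence length,
# C chapters of T tokens each, k chapters selected per sequence.
rng = random.Random(0)
d, L, C, T, k = 8, 5, 6, 4, 2
memory = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(C * T)]  # N_m = C * T
router_w = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(C)]
seq = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]

chosen, p = route_chapters(seq, router_w, k)
out, n_attended = memory_cross_attention(seq, memory, T, chosen)
print(f"selected chapters: {chosen}; tokens attended per query: {n_attended} of {C * T}")
```

Each query scores only $k \cdot T$ memory tokens instead of all $N_m$, which is where the cost reduction comes from; because routing happens once per sequence, every token in the sequence shares the same chapters.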

“This reduces memory attention cost from $O(L N_m)$ to $O(L k T)$ per memory layer while preserving attention-based associative retrieval.”

paper · Section 5.2
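The quoted reduction can be checked in one line from the partition $N_m = C \cdot T$. The numeric instance is only illustrative: $k = 64$ is read from the review's "top-64", while the chapter count $C = 4096$ (hence $T = 64$, with $N_m = 262{,}144$ standing in for "262K") is an assumption, since neither $C$ nor $T$ is stated here.

$$
\frac{L \cdot k \cdot T}{L \cdot N_m} = \frac{k \cdot T}{C \cdot T} = \frac{k}{C}, \qquad \text{e.g. } \frac{64}{4096} = \frac{1}{64}.
$$

Under those assumed values, each memory layer would attend to $k \cdot T = 4096$ memory tokens per sequence rather than all 262,144. The saving depends only on the fraction of chapters selected, not on the chapter size.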

“Vanilla (iso-FLOP) ARC-C $\Delta$ (pp) -6.69... MoC ARC-C $\Delta$ (pp) -2.68”

paper · Table 2
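The retention gap implied by these deltas is simple arithmetic. The snippet below uses only the four figures quoted in this review; the dictionary layout and function name are illustrative.

```python
# Post-IFT accuracy changes in percentage points, as quoted in this review.
deltas_pp = {
    "ARC-C": {"vanilla_iso_flop": -6.69, "moc": -2.68},
    "BoolQ": {"vanilla_iso_flop": -6.24, "moc": 0.24},
}


def retention_gap(row):
    """Percentage points by which MoC's change exceeds the baseline's (positive = MoC retains more)."""
    return round(row["moc"] - row["vanilla_iso_flop"], 2)


gaps = {name: retention_gap(row) for name, row in deltas_pp.items()}
for name, gap in gaps.items():
    print(f"{name}: MoC retains {gap:.2f} pp more than the iso-FLOP baseline")
```

These are differences between single reported numbers; without seeds or variance estimates they say nothing about statistical significance.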

Main concerns

The empirical validation is narrow: it relies on only four benchmarks (MMLU, ARC-C, BoolQ, OpenBookQA) and does not probe long-context retrieval, factual recall precision, or robustness to distribution shift. There is no direct empirical comparison to Product Key Memory (Lample et al.) or Memory Layers at Scale (Berges et al.), the paper's closest architectural relatives, which makes it difficult to assess whether chapter routing outperforms existing sparse memory lookup mechanisms. Additionally, the router's sequence-level granularity (selecting chapters once per sequence rather than per token) may limit expressiveness, and the paper does not analyze load-balancing collapse or router specialization despite using auxiliary losses. Finally, the finding that freezing the memory bank during IFT performs on par with updating it ("post-IFT benchmark scores remain within noise") suggests the bank is effectively static after pretraining, which raises questions about the mechanism's utility for continual learning beyond the initial anchoring effect.

“We build on this direction, but use a learned latent-token memory bank accessed via cross-attention and scale it with sequence-level chapter routing.”

paper · Section 3

“The curves overlap closely, and post-IFT benchmark scores remain within noise across these settings”

paper · Section 6.2

Evidence and comparison

The iso-FLOP methodology is rigorous, with analytic calculations showing that MoC uses 1.378T forward+backward FLOPs per sequence, slightly below the 1.487T of the vanilla iso-FLOP baseline. The evidence supports the claim that explicit memory improves retention under continued training, as shown by the smaller performance deltas after IFT. However, the paper does not empirically demonstrate superiority over other memory-augmented architectures cited as related work (e.g., Memformer, PKM, or Titans). The comparison is limited to dense transformers, which leaves open whether the gains come from the memory mechanism specifically or simply from the larger parameter count (371M total vs 202M dense), despite the FLOP-matching effort.

“Forward+Backward FLOPs: Vanilla (iso-FLOP) 1.487T; Mixture of Chapters (MoC) 1.378T”

paper · Appendix A.2
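As a quick check on how closely the two models are compute-matched, the quoted figures can be compared directly. This assumes "T" denotes $10^{12}$ FLOPs; the variable names are illustrative.

```python
# Forward+backward FLOPs per sequence as quoted from Appendix A.2 (assuming "T" = 1e12).
vanilla_iso_flop = 1.487e12
moc = 1.378e12

# Negative means MoC spends less compute than the baseline.
relative_gap = (moc - vanilla_iso_flop) / vanilla_iso_flop
print(f"MoC vs. vanilla iso-FLOP: {relative_gap:+.1%}")
```

On these numbers the match is approximate rather than exact, and the mismatch runs in the baseline's favor, so the compute comparison is not biased toward MoC.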

“Our work is closest to learned, scalable internal memory modules integrated into transformers. Product Key Memory (PKM) provides a classic recipe for large trainable key-value memory... Memory Layers at Scale shows that trainable memory layers can add substantial capacity”

paper · Section 3

Reproducibility

The paper provides substantial implementation detail including full hyperparameter tables (learning rates, model dimensions, optimizer settings), analytic FLOP derivations, and a GitHub code link. The FLOP accounting is transparent, breaking down costs for self-attention ($6.5 \times 10^9$ per layer), router computation ($7.1 \times 10^6$), and memory attention ($2.6 \times 10^{10}$). However, the exact data preprocessing steps, evaluation prompts for the benchmarks, and random seed information are not specified. The heavy compute requirements (8 GPUs for pretraining) may limit independent reproduction, though the methodology is sufficiently documented that the experiments could be replicated with equivalent resources.

“Training duration: 9600 steps; Optimizer: AdamW; Learning rates: base model $3\times 10^{-4}$, memory layers $6\times 10^{-4}$, memory bank $6\times 10^{-4}$”

paper · Appendix A.4.2

“Code is available at https://github.com/Tasmay-Tibrewal/Memory”

paper · Footnote 2
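The component figures from the FLOP accounting can be put side by side to see where the memory pathway's cost sits. The review labels only the self-attention number as per layer; treating all three as directly comparable is an assumption of this sketch.

```python
# Component costs as quoted in this review (assumed to be in comparable units).
flops = {
    "self_attention": 6.5e9,
    "router": 7.1e6,
    "memory_attention": 2.6e10,
}

memory_vs_self = flops["memory_attention"] / flops["self_attention"]
router_vs_memory = flops["router"] / flops["memory_attention"]

print(f"memory attention / self-attention: {memory_vs_self:.1f}x")
print(f"router / memory attention: {router_vs_memory:.2e}")
```

If the assumption holds, the router is negligible next to the memory attention it gates, and the memory attention itself is the dominant added cost.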

Abstract

Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that transformer layers query via cross-attention to retrieve stored knowledge. To scale memory capacity without prohibitive attention costs, we propose chapter-based routing inspired by Mixture-of-Experts architectures, partitioning the memory bank into chapters and training a router to select relevant subsets per input. This enables scaling to 262K memory tokens while maintaining tractable computation. We evaluate our approach against standard transformers (in iso-FLOP settings) on pre-training and instruction fine-tuning across relevant benchmarks. Our models surpass iso-FLOP baselines suggesting scope for a new axis of scaling, demonstrating that explicit associative memory provides complementary capacity to what is captured implicitly in model parameters. Additionally, we observe improved knowledge retention under continued training, with robustness to forgetting when transitioning between training phases (e.g., pretraining to instruction fine-tuning).
