A Comparative Analysis of LLM Memorization at Statistical and Internal Levels: Cross-Model Commonalities and Model-Specific Signatures

cs.CL cs.LG Bowen Chen, Namgi Han, Yusuke Miyao · Mar 23, 2026
AI Review
Plain-language introduction

This paper presents a large-scale comparative study of memorization across six open LLM families (Pythia, OLMo1/2/3, OpenLLaMA, StarCoder) ranging from 1B to 32B parameters. By analyzing both statistical patterns and internal mechanisms (attention heads, layer decoding), it identifies universal behaviors—such as log-linear scaling of memorization rates with model size and high compressibility of memorized sequences—while revealing family-specific signatures in memorization structure. The work bridges isolated findings from single-model studies to establish general principles of how transformers memorize training data.

Critical review
Verdict
Bottom line

The paper provides a valuable cross-model analysis of LLM memorization, successfully identifying both universal scaling laws and family-specific architectural signatures. The dual approach combining statistical analysis (compression ratios, frequency thresholds) with mechanistic interpretability (attention ablation, logit lens) offers a comprehensive view of memorization phenomena. The finding that 'memorization-important heads highly overlap within domains' while 'the distribution of those important heads differs between families' represents a meaningful contribution to understanding model-specific inductive biases.

“memorization-important heads highly overlap within domains, and a small subset of heads is important for most domains”
Chen et al. · Section 1
“the distribution of those important heads differs between families, showing a unique family-level feature”
Chen et al. · Section 1
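One way to make "highly overlap" concrete is a set-overlap measure. The sketch below uses Jaccard similarity over sets of (layer, head) pairs. The choice of metric, the function name, and the example head indices are all illustrative assumptions; the review does not state how the authors quantify overlap.

```python
def jaccard_overlap(heads_a, heads_b):
    """Jaccard similarity between two collections of (layer, head) pairs."""
    a, b = set(heads_a), set(heads_b)
    union = a | b
    if not union:
        # Two empty sets share nothing to compare; report zero overlap.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical important-head sets for two domains (not data from the paper).
code_heads = {(3, 1), (5, 7), (10, 2), (12, 4)}
wiki_heads = {(3, 1), (5, 7), (10, 2), (14, 0)}

print(jaccard_overlap(code_heads, wiki_heads))
```

A value near 1 would correspond to the "small subset of heads is important for most domains" reading, while comparing sets across families would probe the family-level difference.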
What holds up

The scale of the study is impressive, covering 20 models across six families with varying architectures and training corpora. The identification of log-linear scaling between model size and memorization rate and the high compressibility of memorized sequences (often requiring $\leq 50\%$ of original tokens) are robust findings supported by extensive data. The internal analysis revealing that 'memorized sequences exhibit lower similarity recovery compared to unmemorized sequences' provides concrete evidence for the hypothesis that memorization relies on specific computational pathways.

“Memorization rates universally scale log-linearly with parameter size. Furthermore, memorized sequences are intrinsically highly compressed (even using $\leq 50\%$ of original tokens could obtain the same memorized output)”
Chen et al. · Section 1
“Crucially, however, memorized sequences exhibit lower similarity recovery compared to unmemorized sequences, showing that memorized sequences are more sensitive to the perturbation”
Chen et al. · Section 4.4
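To illustrate what "log-linear scaling" means here, the following sketch fits rate = intercept + slope * log10(parameters) by ordinary least squares. The functional form with a base-10 logarithm, the function name, and the numbers are assumptions chosen for illustration; they are synthetic and are not results from the paper.

```python
import math


def fit_log_linear(param_counts, rates):
    """Least-squares fit of rate = intercept + slope * log10(param_count)."""
    if len(param_counts) != len(rates) or len(rates) < 2:
        raise ValueError("need at least two (size, rate) pairs of equal length")
    xs = [math.log10(n) for n in param_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(rates) / len(rates)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rates))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic example: rates generated from an exact log-linear rule.
sizes = [1e9, 7e9, 32e9]
synthetic_rates = [-0.035 + 0.005 * math.log10(n) for n in sizes]

intercept, slope = fit_log_linear(sizes, synthetic_rates)
print(f"intercept={intercept:.4f}, slope per decade={slope:.4f}")
```

Under this form, each tenfold increase in parameter count adds a constant amount (the slope) to the memorization rate, rather than multiplying it.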
Main concerns

The study is necessarily limited to models with publicly available training data, which excludes popular families such as Llama, Qwen, and DeepSeek and limits how far the claims about universal LLM behavior generalize. The sampling methodology (300k sequences per domain) may introduce selection bias, and the head ablation study uses smaller samples for larger models ($2{,}500$ vs $10{,}000$ examples) because of computational constraints, which may reduce statistical power. Additionally, the frequency analysis relies on Infini-gram, whose n-gram database uses the Llama-2 tokenizer rather than each target model's own tokenizer; this mismatch introduces approximation errors that the authors acknowledge but do not fully quantify.

“most of the famous open models (Qwen, Deepseek, Llama, etc) do not release their pre-training data, so our experiments are not feasible for those LLMs”
Chen et al. · Limitations section
“For large models (above 7b), we sample 2,500 memorized examples in each domain”
Chen et al. · Appendix A.1.1
“Infini-gram mainly uses the Llama-2 tokenizer for its n-gram database. On the contrary, we use the corresponding tokenizer of each model... the Llama-2 tokenizer tends to give shorter tokenized lists compared to the tokenizer of the original corresponding model”
Chen et al. · Appendix A.1.3
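The statistical-power concern can be sized with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that the estimated quantity behaves like a simple proportion from independent samples; the ablation metric in the paper may not have this form. Under that assumption, the standard error scales as 1/sqrt(n), so moving from 10,000 to 2,500 examples doubles it.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a sample proportion p estimated from n independent draws."""
    if not 0.0 <= p <= 1.0 or n <= 0:
        raise ValueError("p must be in [0, 1] and n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


# p = 0.5 is the worst case for a proportion; the ratio does not depend on p.
se_small = proportion_standard_error(0.5, 2_500)
se_large = proportion_standard_error(0.5, 10_000)

print(f"SE at n=2,500:  {se_small:.4f}")
print(f"SE at n=10,000: {se_large:.4f}")
print(f"ratio: {se_small / se_large:.2f}")
```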
Evidence and comparison

The paper positions itself against prior single-model studies and successfully demonstrates that findings like log-linear scaling hold across diverse architectures. However, comparisons to related work on mechanistic memorization are somewhat superficial—the paper acknowledges studies on knowledge neurons and intrinsic dimension but does not deeply engage with how their specific findings relate to the observed head importance distributions. The claim that 'the memorization structure is decided by the training recipe of each model family' is well-supported by the layer importance similarity heatmap, but the causal attribution to specific training components remains speculative.

“This suggests that the memorization structure of a model family is shared among its models, regardless of size. However, there does not exist a universal memorization structure that exists for all LLMs, and the memorization structure is decided by the training recipe (model, data, training algorithm) of each model family”
Chen et al. · Section 4.8
Reproducibility

The study relies exclusively on open-weight models with documented training data, which aids reproducibility. However, the authors note that 'to generate all memorization scores across 6 noise strengths for all models... it takes around 2 months for 8 A100 servers', creating a high barrier for independent verification. While the metrics are clearly defined—including the memorization score $M_i(X,Y)=\frac{\sum_{k=1}^{n}\mathbf{I}(x_{i,k}=y_{i,k})}{n}$ and residual noise injection $\tilde{\mathbf{H}}_{\ell}=\mathbf{H}_{\ell}+\boldsymbol{\varepsilon}$ with $\sigma_{\mathrm{eff}}=\alpha\cdot\operatorname{RMS}(\mathbf{H}_{\ell})$—the authors do not indicate whether code will be released. The limitation that 'Infini-gram does not provide a query API for OLMo3 models' also means frequency analyses cannot be fully replicated for the most recent models studied.

“To generate all memorization scores across 6 noise strengths for all models used in this study, it takes around 2 months for 8 A100 servers”
Chen et al. · Appendix A.1.1
“Infini-gram does not provide a query API for OLMo3 models”
Chen et al. · Section 4.3
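The two definitions quoted in this section are simple enough to sketch directly. The code below computes the memorization score as the fraction of matching token positions, and perturbs a hidden-state matrix with noise whose standard deviation is alpha times the RMS of that matrix. Treating the noise as zero-mean i.i.d. Gaussian, representing hidden states as nested lists, and all function names are assumptions for illustration; this is not the authors' implementation.

```python
import math
import random


def memorization_score(x_tokens, y_tokens):
    """Fraction of positions k where x_tokens[k] == y_tokens[k]."""
    if len(x_tokens) != len(y_tokens) or not y_tokens:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(x_tokens, y_tokens) if x == y)
    return matches / len(y_tokens)


def rms(hidden):
    """Root mean square over all entries of a 2-D list of floats."""
    values = [v for row in hidden for v in row]
    return math.sqrt(sum(v * v for v in values) / len(values))


def inject_residual_noise(hidden, alpha, rng):
    """Return hidden + eps, with eps ~ N(0, sigma_eff^2) per entry (assumed Gaussian)."""
    sigma_eff = alpha * rms(hidden)
    return [[v + rng.gauss(0.0, sigma_eff) for v in row] for row in hidden]


# Toy token ids and a toy 2x2 hidden state (hypothetical values).
print(memorization_score([1, 2, 3, 4], [1, 2, 0, 4]))

rng = random.Random(0)  # fixed seed so the perturbation is repeatable
hidden_state = [[3.0, 4.0], [0.0, 0.0]]
noisy_state = inject_residual_noise(hidden_state, alpha=0.1, rng=rng)
print(noisy_state)
```

Scaling the noise by the RMS of the layer keeps the perturbation strength comparable across layers and models whose activations differ in magnitude, which is presumably why the definition is relative rather than absolute.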
Abstract

Memorization is a fundamental component of intelligence for both humans and LLMs. However, while LLM performance scales rapidly, our understanding of memorization lags. Due to limited access to the pre-training data of LLMs, most previous studies focus on a single model series, leading to isolated observations among series, making it unclear which findings are general or specific. In this study, we collect multiple model series (Pythia, OpenLLaMA, StarCoder, OLMo1/2/3) and analyze their shared or unique memorization behavior at both the statistical and internal levels, connecting individual observations while showing new findings. At the statistical level, we reveal that the memorization rate scales log-linearly with model size, and memorized sequences can be further compressed. Further analysis demonstrated a shared frequency and domain distribution pattern for memorized sequences. However, different models also show individual features under the above observations. At the internal level, we find that LLMs can remove certain injected perturbations, while memorized sequences are more sensitive. By decoding middle layers and attention head ablation, we revealed the general decoding process and shared important heads for memorization. However, the distribution of those important heads differs between families, showing a unique family-level feature. Through bridging various experiments and revealing new findings, this study paves the way for a universal and fundamental understanding of memorization in LLMs.
