CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs
Cross-Layer Transcoders (CLTs) compress the attribution graphs used in mechanistic interpretability by sharing features across transformer layers, but their quadratic parameter scaling ($N_{\text{CLT}} \propto L^2$) makes training and analysis prohibitively expensive for most researchers. This paper introduces CLT-Forge, an open-source library that combines feature-sharded distributed training, compressed activation caching (int8/int4/int2 with zstd), automated interpretability pipelines, and integration with Circuit-Tracer to provide the first unified workflow for end-to-end CLT analysis at scale.
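The quadratic scaling mentioned above can be made concrete with a short count. The sketch below assumes the standard CLT layout from prior work, in which features encoded at layer $l$ decode to every layer from $l$ onward, giving $L(L+1)/2$ decoder blocks. The LLaMA 3.2 1B shape (16 layers, hidden size 2048) and the reading of "expansion factor 48" as 48 × hidden size features per layer are assumptions; the function name is hypothetical and not part of CLT-Forge.

```python
def clt_param_count(n_layers: int, d_model: int, features_per_layer: int) -> dict:
    """Count encoder and decoder weights of a CLT (biases and thresholds ignored)."""
    # One encoder matrix per layer: d_model -> features_per_layer.
    encoder = n_layers * d_model * features_per_layer
    # Features at layer l (0-indexed) write to layers l..n_layers-1, so source
    # layer l owns (n_layers - l) decoder matrices: L(L+1)/2 blocks in total.
    decoder_blocks = n_layers * (n_layers + 1) // 2
    decoder = decoder_blocks * features_per_layer * d_model
    return {"encoder": encoder, "decoder": decoder, "total": encoder + decoder}


# Assumed model shape (not stated in the review): 16 layers, d_model = 2048.
N_LAYERS, D_MODEL, EXPANSION = 16, 2048, 48
features_per_layer = EXPANSION * D_MODEL
total_features = N_LAYERS * features_per_layer

counts = clt_param_count(N_LAYERS, D_MODEL, features_per_layer)
print(f"total features : {total_features:,}")
print(f"encoder weights: {counts['encoder']:,}")
print(f"decoder weights: {counts['decoder']:,}")

# Doubling the depth roughly quadruples the decoder, which is the L^2 term.
deeper = clt_param_count(2 * N_LAYERS, D_MODEL, features_per_layer)
print(f"decoder growth at 2x depth: {deeper['decoder'] / counts['decoder']:.2f}x")
```

Under these assumptions the feature count comes out near the ~1.5M figure quoted in the review, and the decoder dominates the parameter budget, which is why the decoder is the natural target for sharding.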
CLT-Forge succeeds as an engineering contribution: it delivers a practical, modular toolkit that lowers barriers to CLT research. The library is well-architected, integrating training infrastructure (feature-wise GPU sharding), storage optimization (activation quantization), and analysis tools (attribution graphs via Circuit-Tracer, auto-interp, visualization) into a coherent pipeline. However, the novelty lies primarily in systems engineering rather than algorithmic advances: the underlying CLT formulation and attribution methods come from prior work (Ameisen et al., 2025; Lindsey et al., 2025a), and the paper positions itself as a unification of these methods rather than an improvement on them. The evaluation validates functionality but lacks ablation studies comparing design choices (e.g., caching vs. on-the-fly computation, different sharding strategies) or efficiency benchmarking against the EleutherAI baseline.
The technical implementation is sound and addresses real scalability bottlenecks. For a LLaMA 3.2 1B model with expansion factor 48 ($\sim$1.5M features), the authors demonstrate training across 8 GPUs with feature sharding (not FSDP), and int8 quantization reduces activation storage for 300M tokens from ~20TB to ~4TB. The authors note: "int8 offers a good balance, reducing storage 4–7× over a float16 baseline, int4 and int2 achieve 7–12× at the cost of higher quantization error." The modular design cleanly separates concerns (caching, training, auto-interpretability, attribution, visualization), and the integration with Circuit-Tracer (Hanna et al., 2025) enables pruning and intervention workflows that were previously inaccessible in open-source settings.
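To illustrate the storage trade-off, here is a minimal sketch of symmetric (absmax) int8 quantization followed by lossless compression. It is not CLT-Forge's caching code: the quantization scheme is an assumption, `zlib` from the standard library stands in for zstd, the activations are synthetic Gaussian samples, and all function names are hypothetical. The final calculation is a back-of-envelope estimate that assumes one float16 vector of size 2048 per token per layer across 16 layers; the review does not say exactly what is cached.

```python
import random
import struct
import zlib


def quantize_int8(values):
    """Symmetric absmax quantization: map floats to integer codes in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # fall back to 1.0 if all zero
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Invert quantize_int8 up to rounding error (at most scale / 2 per value)."""
    return [c * scale for c in codes]


def cache_size_tb(n_tokens, n_layers, d_model, bytes_per_value):
    """Raw cache size in terabytes (1 TB = 1e12 bytes)."""
    return n_tokens * n_layers * d_model * bytes_per_value / 1e12


random.seed(0)
acts = [random.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic stand-in activations

codes, scale = quantize_int8(acts)
restored = dequantize_int8(codes, scale)
max_err = max(abs(a - r) for a, r in zip(acts, restored))

fp16_bytes = struct.pack(f"{len(acts)}e", *acts)    # float16 baseline
int8_bytes = struct.pack(f"{len(codes)}b", *codes)  # one signed byte per value
packed = zlib.compress(int8_bytes, 9)               # zlib standing in for zstd

print(f"max abs error     : {max_err:.5f} (bound {scale / 2:.5f})")
print(f"float16 / int8    : {len(fp16_bytes) / len(int8_bytes):.1f}x")
print(f"float16 / int8+zlib: {len(fp16_bytes) / len(packed):.2f}x")

# Back-of-envelope float16 cache size under the assumptions stated above.
baseline_tb = cache_size_tb(300_000_000, 16, 2048, 2)
print(f"assumed float16 cache: {baseline_tb:.1f} TB")
```

Quantization alone gives exactly 2× over float16; any further gain comes from the entropy coder and depends on the distribution of the activations, so the ratio printed for synthetic Gaussian data says nothing about real ones. The reported drop from ~20TB to ~4TB corresponds to roughly 5×, inside the quoted 4–7× range.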
The evaluation is minimal: it reports that GPT-2 results match prior work (~0.8 explained variance) but provides no ablation studies on the claimed engineering improvements. The claim that the interface provides a "simplified and more easily extensible alternative to Neuronpedia" is subjective and unsupported by user studies or comparative analysis. More critically, the authors admit a fundamental limitation in the training objective: "We also found that directly optimizing replacement score led to unstable and noisy results, suggesting the need for more robust objectives or optimization schemes." This indicates that the current proxy loss (reconstruction error) may not align well with the actual goal of interpretability. Additionally, while the paper claims novelty as the "first unified library," EleutherAI's CLT-Training fork (2024) and Tigges (2025) provide overlapping functionality. The distinction (TopK vs. JumpReLU, activation sparsity) is real but nuanced, and direct comparisons are absent.
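The TopK vs. JumpReLU distinction can be seen on a toy vector of pre-activations. The sketch below is illustrative only: it uses a single scalar threshold (in practice JumpReLU thresholds are typically learned per feature), borrows the value 0.03 from the `jumprelu_init_threshold` setting quoted in this review, and uses hypothetical function names rather than any library's API.

```python
def jumprelu(pre_acts, threshold):
    """Pass a pre-activation through unchanged if it exceeds the threshold, else zero it."""
    return [x if x > threshold else 0.0 for x in pre_acts]


def topk(pre_acts, k):
    """Keep the k largest pre-activations (clipped at zero) and zero the rest."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [max(x, 0.0) if i in keep else 0.0 for i, x in enumerate(pre_acts)]


def n_active(acts):
    """Number of non-zero features (the L0 of the activation vector)."""
    return sum(1 for a in acts if a != 0.0)


pre = [0.50, 0.02, -0.10, 0.04, 0.01]

jr = jumprelu(pre, threshold=0.03)
tk = topk(pre, k=3)

print("JumpReLU:", jr, "active =", n_active(jr))
print("TopK    :", tk, "active =", n_active(tk))
```

With TopK the number of active features is fixed by `k` for every input, whereas with JumpReLU it varies with how many pre-activations clear the threshold, so sparsity has to be encouraged through the loss (for example via an L0 coefficient) rather than imposed by construction.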
The evidence supports basic functionality claims: the library trains CLTs on GPT-2 and LLaMA 1B, computes attribution graphs (Figure 2 shows the "greater-than" circuit with replacement score 0.77), and links to Harrasse et al. (2025) for multilingual applications. However, benchmarking is thin. There are no training speed comparisons against EleutherAI's Sparsify fork, no analysis of how quantization affects downstream attribution quality (only reconstruction error is reported), and no validation that feature-sharding matches DDP performance given identical hyperparameters. The comparison to related work in Section 2.1 accurately distinguishes their approach (e.g., "we do not impose activation sparsity, precluding the use of specialized kernels") but omits quantitative trade-offs in speed or memory efficiency.
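One of the missing checks, that a feature-sharded model computes the same output as an unsharded one, is cheap to state. The sketch below shows the general idea under stated assumptions: the feature axis is split into contiguous ranges, each shard decodes only its own features, and the partial outputs are summed (the role an all-reduce would play across GPUs). It is written in plain Python on toy sizes, all names are hypothetical, and it verifies forward-pass equivalence only, not training speed or convergence.

```python
import random


def decode(acts, w_dec, d_model):
    """Dense decoder: out[j] = sum over features f of acts[f] * w_dec[f][j]."""
    return [sum(a * row[j] for a, row in zip(acts, w_dec)) for j in range(d_model)]


def shard_bounds(n_features, n_shards):
    """Contiguous [start, stop) feature ranges, one per shard, sizes differing by at most 1."""
    base, extra = divmod(n_features, n_shards)
    bounds, start = [], 0
    for rank in range(n_shards):
        stop = start + base + (1 if rank < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def sharded_decode(acts, w_dec, d_model, n_shards):
    """Each shard decodes its own feature slice; partial outputs are then summed."""
    partials = [
        decode(acts[lo:hi], w_dec[lo:hi], d_model)
        for lo, hi in shard_bounds(len(acts), n_shards)
    ]
    # Element-wise sum over shards, standing in for an all-reduce across devices.
    return [sum(column) for column in zip(*partials)]


random.seed(0)
N_FEATURES, D_MODEL, N_SHARDS = 20, 4, 8
acts = [random.random() for _ in range(N_FEATURES)]
w_dec = [[random.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in range(N_FEATURES)]

full = decode(acts, w_dec, D_MODEL)
sharded = sharded_decode(acts, w_dec, D_MODEL, N_SHARDS)
max_diff = max(abs(x - y) for x, y in zip(full, sharded))
print(f"shard sizes: {[hi - lo for lo, hi in shard_bounds(N_FEATURES, N_SHARDS)]}")
print(f"max abs difference: {max_diff:.2e}")
```

Because the decoder output is a sum over features, splitting that sum across devices changes only the order of floating-point additions. A comparison against DDP would additionally need matched hyperparameters and wall-clock or memory measurements, which is the benchmarking this review finds missing.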
Reproducibility is strong in terms of code availability (MIT license, GitHub release) and documentation. The paper provides complete training configurations in Appendix A, including specific hyperparameters (lr=4e-4, l0_coefficient=2.0, jumprelu_init_threshold=0.03) and checkpoint selection criteria (checkpoint_l0=[10, 5], optimal_l0=5). However, practical reproducibility is limited by compute requirements: training a LLaMA 1B CLT requires 8× 80GB GPUs, and activation caching requires terabytes of storage even when compressed (4TB for 300M tokens). The low-rank finetuning option (mentioned in Section 3.3) mitigates this for adaptation, but training from scratch remains inaccessible to many academic labs. The authors note that they "are currently releasing low-rank finetuned versions of large-scale CLTs," which helps address this barrier.
Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, limiting their interpretability in practice. Cross-Layer Transcoders (CLTs) address this issue by sharing features across layers while preserving layer-specific decoding, yielding more compact representations, but remain difficult to train and analyze at scale. We introduce an open-source library for end-to-end training and interpretability of CLTs. Our framework integrates scalable distributed training with model sharding and compressed activation caching, a unified automated interpretability pipeline for feature analysis and explanation, attribution graph computation using Circuit-Tracer, and a flexible visualization interface. This provides a practical and unified solution for scaling CLT-based mechanistic interpretability. Our code is available at: https://github.com/LLM-Interp/CLT-Forge.