On the Challenges and Opportunities of Learned Sparse Retrieval for Code

cs.IR cs.CL Simon Lupart, Maxime Louis, Thibault Formal, Hervé Déjean, Stéphane Clinchant · Mar 23, 2026
Plain-language introduction

Code retrieval currently relies on dense embeddings, but this paper proposes SPLADE-Code, the first large-scale learned sparse retrieval (LSR) family for code search (600M–8B parameters). The authors address unique challenges including subword fragmentation, semantic gaps between natural language and code, and latency issues from long code documents. Their lightweight single-stage training achieves 75.4 nDCG@10 on MTEB Code under 1B parameters (state-of-the-art for that size) and 79.0 with 8B parameters, while enabling sub-millisecond retrieval via inverted indices.

Critical review
Verdict
Bottom line

SPLADE-Code presents a compelling case for LSR in code retrieval, backed by strong empirical results across diverse benchmarks (CoIR, MTEB-Code, CodeRAG, CPRet). The paper convincingly demonstrates that sparse retrieval can match or exceed dense retrieval effectiveness while offering superior efficiency and interpretability. However, the reliance on checkpoint merging for peak performance and the limited scale of latency experiments (1M passages) temper the practical impact claims.

“SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code)”
paper · Abstract
“Compared to dense, SPLADE-Code-8B outperforms them in both in-domain and out-of-domain evaluation”
paper · Section 4.2
What holds up

The controlled experimental design is rigorous: when trained on identical data without checkpoint merging, SPLADE-Code consistently matches or exceeds dense baselines across 22 datasets spanning 20+ programming languages (Table 2). The analysis of expansion tokens provides genuine insight. The paper shows that ~65% of the top-25 activated terms are expansions not present in the input, bridging lexical matching and semantic abstraction (Figure 4). The efficiency analysis, which demonstrates sub-millisecond retrieval under aggressive pruning (10, 100) with effectiveness largely maintained, is practically valuable.

“SPLADE-Code consistently matches or exceeds dense baselines across the four benchmarks”
paper · Table 2
“Among the top-25 activated terms, roughly 35% come from the input itself and 65% are expansion terms”
paper · Section 4.3
“SPLADE achieves below 1 ms per query on the CodeSearchNet 1M passage collection”
paper · Figure 3
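
To make the efficiency argument concrete, here is a minimal sketch of how pruned sparse vectors can be served from an inverted index. It is an illustration under stated assumptions, not the paper's pipeline: the review does not spell out what the pair (10, 100) refers to, so the sketch simply assumes one cap on query terms and one on document terms. The function names, toy vectors, and weights are all hypothetical, and the Seismic library used in the paper is not involved here.

```python
from collections import defaultdict
import heapq

def prune(vec, k):
    """Keep the k highest-weight terms of a sparse vector (term -> weight)."""
    return dict(heapq.nlargest(k, vec.items(), key=lambda kv: kv[1]))

def build_index(docs, doc_k):
    """Build an inverted index: term -> list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, weight in prune(vec, doc_k).items():
            index[term].append((doc_id, weight))
    return index

def search(index, query, query_k, top_n=10):
    """Score documents by dot product over the terms they share with the query."""
    scores = defaultdict(float)
    for term, q_weight in prune(query, query_k).items():
        # Only the posting lists of the surviving query terms are touched.
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    return heapq.nlargest(top_n, scores.items(), key=lambda kv: kv[1])

# Hypothetical sparse representations (term -> learned weight).
docs = {
    "d1": {"sort": 2.0, "list": 1.5, "ascending": 0.4},
    "d2": {"http": 2.2, "request": 1.8, "list": 0.3},
}
query = {"sort": 1.0, "list": 0.5, "order": 0.2}

index = build_index(docs, doc_k=2)
results = search(index, query, query_k=2)
```

In this toy case pruning removes the low-weight term "list" from d2, so only d1 is scored. The same effect explains the speed-up at scale: fewer surviving terms means fewer and shorter posting lists to traverse per query.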
Main concerns

The paper downplays the complexity of achieving peak performance. While the base training is 'lightweight,' the best results rely on weighted spherical merging of three checkpoints (base, second epoch, and 1024-length variant), with a fourth for the 0.6B model (Table 7). This multi-checkpoint ensemble obscures whether a single training run achieves the claimed SOTA results. Additionally, the strong reliance on English as the 'matching language' (Section 4.3) raises questions about performance on codebases with non-English identifiers or comments, which the benchmarks may not adequately cover.

“Model merging is done with weighted spherical merging from three checkpoints... Small models (0.6B) also contain a fourth checkpoint”
paper · Appendix A
“SPLADE-Code uses mostly English as matching language between textual instructions and code from different programming languages”
paper · Section 4.3
Evidence and comparison

The evidence supports the core claim that LSR can compete with dense retrieval for code. The comparison with C2-LLM appears fair: both use the CoIR dataset and checkpoint merging, and the controlled experiments (Table 2) isolate architectural differences from training data differences. However, the comparison against CodeR-all is imbalanced, since CodeR-all was trained on 5.2M samples versus 2.2M for SPLADE-Code. The paper acknowledges this limitation appropriately. The out-of-domain generalization results (Table 4) are strong, showing that SPLADE-Code outperforms C2-LLM by 5.8 and 6.1 nDCG@10 points on CodeRAG Bench.

“CodeR-all... was trained on 5.2M training samples (vs. 2.2M for all other models)”
paper · Section 4.2
“SPLADE-Code achieves higher performance than most models... surpasses C2-LLM by 5.8 and 6.1 nDCG@10 points on CodeRAG Bench”
paper · Table 4
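
Since every comparison above is stated in nDCG@10 points, a short worked example of the metric may help. This is a generic sketch of the standard definition with linear gain, not the benchmarks' evaluation code; the function names and the relevance labels are hypothetical. Reported scores such as 75.4 are this quantity averaged over queries and scaled by 100, so a 5.8-point gap corresponds to 0.058 on the 0 to 1 scale.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k ranked results."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical query with a single relevant snippet, retrieved at rank 3.
ranked = [0, 0, 1, 0, 0]
score = ndcg_at_k(ranked, all_relevances=[1], k=10)
```

Here the relevant snippet at rank 3 earns a gain of 1/log2(4) = 0.5, while the ideal ranking would place it first for a gain of 1, so the query scores 0.5.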
Reproducibility

Reproducibility is good in parts but has gaps. The paper provides detailed hyperparameters (Table 7), including LoRA rank 64, temperature 300 for the KLD loss, and batch size 256, and the authors release models on Hugging Face (naver/splade-code-0.6B, naver/splade-code-8B). However, the checkpoint merging procedure, which is crucial for the reported SOTA numbers, depends on a specific checkpoint selection and on weighted spherical merging whose implementation details are not given. The latency experiments use the Seismic library with specific pruning parameters ($k=1000$, query_cut=500), but neither the exact merging weights nor the full training code is specified, which may block exact reproduction of the best results.

“We train our model on CoIR... with LoRA rank of 64, a batch size of 256 with 7 negatives per query, and a max length of 512”
paper · Appendix A
“naver/splade-code-0.6B, naver/splade-code-8B”
paper · Title/Header
“Model merging is done with weighted spherical merging from three checkpoints”
paper · Appendix A
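
To show why the missing details matter, here is one plausible reading of "weighted spherical merging", sketched on flat parameter vectors. Everything in it is an assumption: the paper's actual procedure, weights, and checkpoint order are not given. The sketch folds checkpoints together pairwise with spherical linear interpolation (slerp), and the names `slerp` and `merge_checkpoints` are hypothetical. A real implementation would presumably apply this per tensor to model or LoRA weights, and would need to handle zero-norm tensors, which this sketch does not.

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between two flat parameter vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos)))  # angle between the vectors
    if omega < 1e-6:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    w_a = math.sin((1 - t) * omega) / math.sin(omega)
    w_b = math.sin(t * omega) / math.sin(omega)
    return [w_a * x + w_b * y for x, y in zip(a, b)]

def merge_checkpoints(checkpoints, weights):
    """Fold checkpoints pairwise; each new one gets its share of the running weight."""
    merged, total = checkpoints[0], weights[0]
    for ckpt, w in zip(checkpoints[1:], weights[1:]):
        total += w
        merged = slerp(merged, ckpt, w / total)
    return merged

# Toy two-dimensional "checkpoints" with equal weights (assumed values).
mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
merged = merge_checkpoints([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [1.0, 1.0, 1.0])
```

Even in this toy setting the result depends on the weights and on the order in which checkpoints are folded in, which is exactly the information a reader would need to reproduce the merged models.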
Abstract

Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.
