Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis

cs.CL Tae-Eun Song · Mar 23, 2026
AI Review
Plain-language introduction

This paper introduces Cross-Context Verification (CCV), a black-box method for detecting LLM benchmark contamination by solving the same coding problem $N$ times in isolated sessions and measuring solution diversity. The key insight is that memorized solutions are deterministic while genuine reasoning produces natural variation. The paper pairs this with Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that uses strict information restriction to prevent confirmation bias. As coding benchmarks face credibility crises from solution leakage, this work targets the urgent need to distinguish reasoning from recall in SWE-bench evaluations.

Critical review
Verdict
Bottom line

The paper presents a methodologically promising approach with strong theoretical grounding and encouraging preliminary results, but its empirical validation is severely limited: 9 problems (3 contaminated, 6 genuine) evaluated on a single model (Claude Opus 4.6). The statistical test shows perfect separation ($U=0$, $p\approx 0.012$, $r=1.0$) for this sample, yet the claim that contamination is binary ($\text{CS} \geq 0.6$ vs. $<0.6$) is overstated given the small $n$. The pilot experiment, in which breaking information restriction produces 100% sycophantic confirmation, supports HCCA's reliance on information restriction, but it also shows how fragile the method is once that restriction lapses. The work is a valuable proof of concept; its practical utility awaits validation at scale.

“Mann-Whitney U=0, p≈0.012, r=1.0”
Song et al., Table 1 · Section 6.1
“binary contamination—models either recall perfectly or not at all”
Song et al. · Section 7
“The Verifier exhibited systematic sycophancy... confirmed 100% of Worker findings across all 3 artifacts (15/15 findings)”
Song et al. · Section 4.4
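
The reported statistics can be sanity-checked by brute force. The sketch below enumerates every way three "contaminated" scores could be ranked among nine problems and counts the rankings whose $U$ is at least as extreme as the observed one. It assumes no ties and a one-sided exact test, neither of which is stated in the review; the function names are illustrative and do not come from the paper.

```python
from itertools import combinations
from math import comb


def u_statistic(ranks_a, n_total):
    """Count (a, b) pairs where the group-A item is ranked below the group-B item."""
    a = set(ranks_a)
    ranks_b = [r for r in range(1, n_total + 1) if r not in a]
    return sum(1 for ra in a for rb in ranks_b if ra < rb)


def exact_one_sided_p(n_a, n_b, u_obs):
    """P(U <= u_obs) when every rank assignment is equally likely (no ties assumed)."""
    n_total = n_a + n_b
    hits = sum(
        1
        for ranks_a in combinations(range(1, n_total + 1), n_a)
        if u_statistic(ranks_a, n_total) <= u_obs
    )
    return hits / comb(n_total, n_a)


def rank_biserial(n_a, n_b, u_obs):
    """Rank-biserial effect size: 1 means every A item outranks every B item."""
    return 1 - 2 * u_obs / (n_a * n_b)


p = exact_one_sided_p(n_a=3, n_b=6, u_obs=0)
r = rank_biserial(n_a=3, n_b=6, u_obs=0)
print(f"p = {p:.4f}, r = {r:.1f}")  # p = 0.0119, r = 1.0
```

Under these assumptions only 1 of the 84 possible rankings gives $U=0$, so $1/84 \approx 0.012$ is the smallest p-value a 3-versus-6 design can produce. The reported value is therefore a floor set by the sample size, not a measure of how wide the separation is.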
What holds up

The core intuition, that memorized recall produces deterministic outputs while reasoning produces diversity, is well motivated by the cited prior work and by the literature on hardware-level non-determinism. The HCCA architecture's emphasis on information restriction over structural complexity is supported by converging evidence: CCR shows that context separation works (a 4.0-point F1 improvement, 28.6% vs. 24.6%), D-CCR shows that repetition without new information degrades performance (F1 $0.376 \to 0.303$, $p<0.001$), and the pilot experiment shows that structural roles without information restriction yield zero filtering benefit. The scoring design is transparent: diversity is a weighted combination of AST similarity ($0.4$), BLEU ($0.3$), and edit distance ($0.3$), and the contamination score is $\text{CS} = 0.5\bar{g} + 0.3(1-\text{diversity}) + 0.2(1-\sigma_g)$, with weights the paper says are derived from diagnostic reasoning. The discovery of contamination-flaw composites (e.g., astropy-7606 with $\text{CS}=0.641$) demonstrates the method's ability to detect nuanced cases that pure solution-leakage screening would misclassify.

“CCR reached an F1 of 28.6%, outperforming SR (24.6%, p=0.008)”
Song (2026a) · Abstract
“Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants... F1 = 0.303, p<0.001”
Song (2026b) · Abstract
“CS=0.5\bar{g}+0.3(1-\text{diversity})+0.2(1-\sigma_g)”
Song et al. · Section 3.3
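
A minimal sketch of the scoring arithmetic quoted above may help readers probe it. Several details are assumptions made for illustration, not taken from the paper: that $\bar{g}$ and $\sigma_g$ are the mean and (population) standard deviation of per-trial similarity to the gold patch, that diversity is one minus the weighted similarity, and that "edit distance" enters as a normalized similarity in $[0, 1]$. All function names are hypothetical.

```python
from statistics import mean, pstdev

# Weights as quoted in the review; how they combine is an assumption.
SIMILARITY_WEIGHTS = {"ast": 0.4, "bleu": 0.3, "edit": 0.3}


def diversity(ast_sim, bleu_sim, edit_sim):
    """Assumed form: 1 minus the weighted mean of three similarities in [0, 1]."""
    weighted = (
        SIMILARITY_WEIGHTS["ast"] * ast_sim
        + SIMILARITY_WEIGHTS["bleu"] * bleu_sim
        + SIMILARITY_WEIGHTS["edit"] * edit_sim
    )
    return 1.0 - weighted


def contamination_score(gold_sims, div):
    """CS = 0.5 * g_bar + 0.3 * (1 - diversity) + 0.2 * (1 - sigma_g)."""
    g_bar = mean(gold_sims)      # assumed: mean similarity to the gold patch
    sigma_g = pstdev(gold_sims)  # assumed: population std. dev. across trials
    return 0.5 * g_bar + 0.3 * (1.0 - div) + 0.2 * (1.0 - sigma_g)


# Five identical trials that all reproduce the gold patch.
recalled = contamination_score([1.0] * 5, diversity(1.0, 1.0, 1.0))

# Five varied trials that only loosely resemble the gold patch.
varied = contamination_score([0.2, 0.3, 0.1, 0.25, 0.15], diversity(0.4, 0.3, 0.5))

# Five identical trials that do not match the gold patch at all.
identical_non_gold = contamination_score([0.0] * 5, 0.0)

print(round(recalled, 3), round(varied, 3), round(identical_non_gold, 3))
```

Under these assumptions, perfectly consistent outputs that miss the gold patch entirely still score 0.5, because the diversity and variance terms contribute 0.3 and 0.2 on their own. That sits just under a 0.6 cutoff, which illustrates why the weights and thresholds deserve scrutiny.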
Main concerns

The sample of 9 problems (45 total trials) is insufficient to support the strong claims about binary contamination and perfect discriminative power. With only 3 contaminated problems, the claim that contamination is binary (models either achieve $\text{CS} \geq 0.95$ or $\leq 0.53$) could easily be an artifact of the small sample or of problem selection. The reasoning classifier achieves 100% accuracy (15/15 NO_REASONING, 30/30 FULL_REASONING), but it is evaluated on the same data used for contamination detection, which creates a circular-validation risk; the paper claims behavioral independence, but this is not established on held-out data. The scoring weights are described as 'set a priori... not tuned on experimental data', which readers cannot verify. The 'perfect separation' claim ($r=1.0$) is technically accurate for this sample but potentially misleading about real-world robustness: the one problem that falls in the intermediate range (astropy-7606, $\text{CS}=0.641$) is flagged as a composite edge case requiring manual correction. Finally, all experiments use Claude Opus 4.6 only, leaving cross-model generalizability open.

“circular validation... but is evaluated on the same data used for detection”
Song et al. · Section 8
“Weights were set a priori based on the diagnostic reasoning above, not tuned on experimental data”
Song et al. · Section 3.3
“With only n=3 contaminated problems... we cannot rule out intermediate cases at larger scale”
Song et al. · Section 6.2
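
To put a number on the small-sample worry: when all $n$ cases in a group are classified correctly, the exact (Clopper-Pearson) two-sided 95% lower bound on the true rate has the closed form $(\alpha/2)^{1/n}$. The sketch below applies that to the 3/3 contaminated and 6/6 genuine problems. Treating problems as independent draws is an assumption of this illustration, and the calculation is not an analysis from the paper.

```python
def perfect_score_lower_bound(n, alpha=0.05):
    """Clopper-Pearson two-sided lower bound on the true rate when all n cases succeed."""
    if n <= 0:
        raise ValueError("n must be positive")
    return (alpha / 2) ** (1.0 / n)


for label, n in [("contaminated (3/3)", 3), ("genuine (6/6)", 6)]:
    print(f"{label}: true rate >= {perfect_score_lower_bound(n):.3f}")
# contaminated (3/3): true rate >= 0.292
# genuine (6/6): true rate >= 0.541
```

Read this way, a perfect result on three contaminated problems is compatible with a detector that catches under a third of contaminated cases, which is the gap between "perfect separation in this sample" and a claim about discriminative power.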
Evidence and comparison

The paper situates CCV against existing methods (CAP, SWE-bench+, perplexity-based) in Table 2, correctly noting that CCV uniquely 'observes gen.' (generation process) while others analyze artifacts like paraphrase consistency or n-gram overlap. The cited works—particularly Aleithan et al. (2024) finding 32.67% solution leakage and Liang et al. (2025) showing 76% file-path recall accuracy—establish the contamination problem CCV addresses. However, the paper does not empirically compare CCV against CAP or perplexity methods on the same problems, leaving unclear whether CCV's diversity metrics would catch cases these methods miss (e.g., paraphrased memorization) or vice versa. The comparison to Song (2026b) is rigorous and shows that HCCA's structural division succeeds where D-CCR's multi-turn repetition fails, establishing that information restriction—not structural complexity or repetition—is the operative mechanism. The false positive detection claim (33% of prior labels) is supported only by one case (astropy-13236), which is suggestive but not conclusive.

“32.67% of the successful patches involve 'cheating' as the solutions were directly provided in the issue report”
Aleithan et al. (2024)
“SoTA models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure”
Liang et al. (2025)
“HCCA's independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss”
Song et al. · Section 7
Reproducibility

Experimental details are thoroughly documented: Docker containers (python:3.11-slim), temperature 0, 5 trials per problem, session isolation protocols including git history purging, Claude Opus 4.6, and specific problem IDs (9 SWE-bench Verified problems). The code and data are released with an anonymized link. However, several factors would block independent reproduction: (1) Claude Opus 4.6 is a specific model version that may not remain available; (2) the $\text{CS} = 0.5\bar{g} + 0.3(1-\text{diversity}) + 0.2(1-\sigma_g)$ formula weights and the $0.6$/$0.8$ thresholds, while claimed as a priori, cannot be verified as untuned; (3) the reasoning classifier ($<300$ tokens and 'diff' prefix vs. 'Looking at'/'The issue is') would be brittle to prompt variations or models trained to generate fake reasoning; (4) the 'memory-of-non-gold' correction for test flaw scores depends on manual thresholds ($\text{diversity} < 0.05 \wedge \bar{g} < 0.5$) that may not generalize. The HCCA implementation requires 6 terminal sessions with specific information flow constraints ($T5 \to T2/T3/T4 \to T1$) that would be difficult to replicate without exact orchestration code.

“Claude Opus 4.6 (claude-opus-4-6), temperature 0.0, 5 trials per problem, 45 total trials”
Song et al. · Section 5
“6 terminals across 4 layers... Terminal state is persisted via markdown files”
Song et al. · Appendix C
“NO_REASONING: Output begins with 'diff or patch content, total tokens <300... FULL_REASONING: Output begins with analytical phrases”
Song et al. · Section 3.4
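
The brittleness concern in point (3) is easy to see once the rule is written down. The sketch below is a hypothetical reconstruction from the fragments quoted in this review only: the prefix lists are limited to the phrases mentioned here, whitespace-separated words stand in for model tokens, and the `UNCLASSIFIED` fallback is an addition for illustration. The paper's actual classifier may differ.

```python
# Prefixes limited to those quoted in the review; the real lists may be longer.
PATCH_PREFIXES = ("diff",)
ANALYTICAL_PREFIXES = ("Looking at", "The issue is")


def classify_reasoning(output, max_patch_tokens=300):
    """Label a model output by its opening words and rough length."""
    text = output.lstrip()
    n_tokens = len(text.split())  # assumption: whitespace words approximate tokens
    if text.startswith(PATCH_PREFIXES) and n_tokens < max_patch_tokens:
        return "NO_REASONING"
    if text.startswith(ANALYTICAL_PREFIXES):
        return "FULL_REASONING"
    return "UNCLASSIFIED"  # fallback added for this sketch


samples = [
    "diff --git a/x.py b/x.py\n-old\n+new",
    "Looking at the traceback, the bug is in the parser.",
    "Let me look at the traceback first.",  # same intent, different opening
]
for s in samples:
    print(classify_reasoning(s))
```

A one-phrase change in how the model opens its answer moves the third sample out of both classes, and a model that prepends "The issue is" to a memorized patch would be labeled FULL_REASONING. This is why the 45/45 accuracy needs confirmation on other prompts and models.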
Abstract

LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods--paraphrase consistency, n-gram overlap, perplexity analysis--never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney $U=0$, $p \approx 0.012$, $r = 1.0$). Key findings: (1) contamination is binary--models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA's independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss. A pilot experiment extending HCCA to multi-stage verification (Worker to Verifier to Director) yields a negative result--100% sycophantic confirmation--providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.
