Context Selection for Hypothesis and Statistical Evidence Extraction from Full-Text Scientific Articles

cs.CL cs.AI cs.DL Sai Koneru, Jian Wu, Sarah Rajtmajer · Mar 22, 2026
Plain-language introduction

Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is challenging due to document length and the distribution of scientific arguments across sections. This paper proposes a two-stage retrieve-and-extract pipeline that first links an abstract finding to its corresponding hypothesis, then extracts the statistical evidence supporting that hypothesis. Through controlled ablations varying context quantity ($k \in \{5, 10, 20\}$), retrieval quality (standard RAG, reranking, fine-tuned retriever), and oracle paragraph settings, the authors demonstrate that hypothesis extraction is primarily bounded by retrieval quality, while evidence extraction faces persistent extractor limitations even with perfect paragraph selection.
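
To make the two-stage idea concrete, here is a minimal sketch of a retrieve-then-extract loop. It is not the authors' implementation: the token-overlap scorer stands in for their retriever and reranker, the extractor is a hypothetical stub in place of an LLM call, and the example paragraphs are invented for illustration.

```python
import re
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, paragraph: str) -> int:
    """Toy relevance score: number of shared tokens (a stand-in for a real retriever)."""
    q, p = Counter(tokenize(query)), Counter(tokenize(paragraph))
    return sum((q & p).values())


def retrieve_top_k(query: str, paragraphs: List[str], k: int) -> List[str]:
    """Stage 1: keep the k paragraphs that score highest against the query."""
    ranked = sorted(paragraphs, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def retrieve_and_extract(
    finding: str,
    paragraphs: List[str],
    k: int,
    extract: Callable[[str, List[str]], str],
) -> str:
    """Stage 2: hand only the selected context to the extractor."""
    return extract(finding, retrieve_top_k(finding, paragraphs, k))


def first_paragraph_extractor(finding: str, context: List[str]) -> str:
    """Hypothetical extractor stub; a real system would prompt an LLM here."""
    return context[0] if context else ""


# Invented example document, not taken from the paper's dataset.
paragraphs = [
    "We recruited 200 participants online.",
    "We hypothesize that sleep loss reduces memory recall.",
    "Prior work links sleep and mood.",
]
finding = "Sleep loss reduces memory recall."

for k in (5, 10, 20):  # the context sizes varied in the paper
    print(k, retrieve_and_extract(finding, paragraphs, k, first_paragraph_extractor))
```

Swapping `overlap_score` for a stronger scorer changes retrieval quality while leaving `k` fixed, which is the kind of separation the ablations rely on.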

Critical review
Verdict
Bottom line

The paper presents a rigorous controlled study of retrieval configuration for scientific claim trace extraction, with careful experimental design that uses oracle contexts to separate retrieval failures from extraction limits. The key finding, that hypothesis extraction benefits strongly from targeted context selection while statistical evidence extraction remains difficult even with gold paragraphs, is well supported by the results. However, the work is limited to social and behavioral science (SBS) articles with a single annotated claim trace per document, and the exclusion of tables and figures may artificially constrain the evidence extraction ceiling.

“Even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.”
paper · Abstract
What holds up

The experimental methodology is robust, particularly the use of oracle paragraph settings to establish upper bounds on extraction performance independent of retrieval errors. The fine-tuned retriever trained with hard negatives sampled from within the same document effectively addresses the challenge of rhetorically similar but functionally distinct paragraphs. The component-level evaluation schema for evidence extraction appropriately handles the hybrid numeric-textual nature of statistical reports, assigning partial credit for correctly identified variables and relationships even when numerical details differ.

“To establish an upper bound under perfect paragraph selection, we provide the extractor with gold paragraphs... This isolates extractor limitations from retrieval failures.”
paper · Section 4.2
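
The partial-credit idea behind component-level evaluation can be sketched as follows. This is an illustration under assumptions, not the paper's scoring schema: the component names (`variables`, `relationship`, `statistic`, `p_value`), the exact-match rule, and the example values are all hypothetical.

```python
from typing import Dict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def component_score(pred: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of gold components that the prediction reproduces after normalization."""
    if not gold:
        return 0.0
    hits = sum(
        1 for key, value in gold.items()
        if normalize(pred.get(key, "")) == normalize(value)
    )
    return hits / len(gold)


# Invented example: the prediction gets variables and direction right
# but misreports the test statistic.
gold = {
    "variables": "sleep loss, recall",
    "relationship": "negative",
    "statistic": "t(98) = 2.31",
    "p_value": "p = .02",
}
pred = {
    "variables": "Sleep loss, recall",
    "relationship": "negative",
    "statistic": "t(98) = 2.13",
    "p_value": "p = .02",
}

score = component_score(pred, gold)
print(score)
```

Scoring per component means a prediction with one wrong number still earns credit for the parts it identified correctly, instead of being scored as a complete miss.
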
Main concerns

The study is limited to social and behavioral science (SBS) disciplines, so its findings may not generalize to fields with different writing conventions. The dataset contains only a single annotated claim trace per paper, even though documents often contain multiple hypotheses, and the preprocessing step excludes tables, figures, and equations, potentially omitting statistical evidence reported in non-prose formats. Additionally, the semantic similarity threshold used for hypothesis evaluation ($\tau = 0.89$) can miss subtle but important semantic shifts, as illustrated in Figure 6, where high lexical overlap masked a construct change from 'alternative option' to 'conspiracy-statement response'.

“Our experiments are limited to claim traces extracted from empirical research in the SBS disciplines... Our dataset contains a single annotated claim trace per paper and is not exhaustive...”
paper · Limitations section
“High lexical/semantic overlap ('covert influence') but the prediction shifts the core construct and mechanism from alternative option to conspiracy-statement response...”
paper · Figure 6
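
The threshold concern can be illustrated with a small sketch. It assumes a bag-of-words cosine as a stand-in for whatever similarity model the paper actually uses, and the two sentences are invented; only the threshold value $\tau = 0.89$ comes from the paper.

```python
import math
import re
from collections import Counter

TAU = 0.89  # threshold reported in the paper


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity over token counts (a toy stand-in for an embedding model)."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_match(pred: str, gold: str, tau: float = TAU) -> bool:
    """A prediction counts as correct when its similarity to the gold text reaches tau."""
    return bow_cosine(pred, gold) >= tau


# Invented sentences: one word changes the construct being measured.
gold = "higher stress predicts lower scores on the memory task among older adults"
pred = "higher stress predicts lower scores on the attention task among older adults"

print(round(bow_cosine(pred, gold), 3), is_match(pred, gold))
```

Eleven of twelve tokens are shared, so the pair clears the threshold even though the outcome construct differs, which mirrors the failure pattern the review attributes to Figure 6.
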
Evidence and comparison

The evidence supports the central claims: Table 1 shows consistent gains for hypothesis extraction as retrieval quality improves (Fine-tuned Retriever+Reranker $k=5$ achieves mean F1 0.50 vs Full-text 0.39), while evidence extraction plateaus at moderate F1 values even in oracle settings (0.47–0.55). The comparison to full-text prompting is fair and demonstrates that simply fitting entire documents into long context windows is insufficient for reliable extraction. The related work adequately covers scientific information extraction and RAG configuration studies, though the paper could more explicitly contrast with concurrent work on long-context LLM capabilities.

“For evidence extraction, oracle contexts reveal a different bottleneck: even with perfect paragraph selection, performance remains moderate (oracle Evidence F1 0.47-0.55 across models; Table 1).”
paper · Section 6.3
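
For scale, the hypothesis-extraction gap quoted above (mean F1 of 0.50 for Fine-tuned Retriever+Reranker at $k=5$ versus 0.39 for full-text prompting) works out as follows; the arithmetic uses only those two reported figures.

$$
\Delta_{\text{abs}} = 0.50 - 0.39 = 0.11, \qquad \Delta_{\text{rel}} = \frac{0.50 - 0.39}{0.39} \approx 0.28
$$

That is an absolute gain of 0.11 F1, or roughly 28% relative to the full-text baseline.
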
Reproducibility

The authors provide code, prompts, and experiment configurations at a public repository, and report detailed hyperparameters including the SciBERT retriever configuration and the calibrated similarity threshold ($\tau = 0.89$) for hypothesis evaluation. However, reproduction depends on access to specific LLM versions (GPT-4o-mini, Gemini-2.5-Flash, GPT-OSS), and the hosted models among them may be updated or deprecated; the fine-tuned retriever also requires the specific SCORE dataset split. The exclusion of non-prose content (tables, figures) during preprocessing is clearly documented, though this design choice means the oracle setting does not represent a true ceiling for evidence extraction in real documents where statistical details often appear in tabular form.

“Code, prompts, and experiment configurations are available at https://github.com/SaiDileepKoneru/ScientificClaimTraceExtraction.git”
paper · Footnote 1
“We select a similarity threshold... choosing the value that maximizes F1... We use the resulting threshold, $\tau = 0.89$”
paper · Section 5.1
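
Choosing a threshold that maximizes F1 can be sketched as a simple sweep over candidate values. The labeled similarity scores and the candidate grid below are invented for illustration, so the selected value is not expected to reproduce the paper's $\tau = 0.89$.

```python
from typing import List, Sequence, Tuple

ScoredPair = Tuple[float, bool]  # (similarity score, human label: is it a true match?)


def f1_at(threshold: float, scored: Sequence[ScoredPair]) -> float:
    """F1 of the rule 'score >= threshold means match' against the human labels."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def calibrate(scored: Sequence[ScoredPair], candidates: Sequence[float]) -> float:
    """Return the candidate threshold with the highest F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: f1_at(t, scored))


# Invented calibration set.
scored: List[ScoredPair] = [
    (0.95, True),
    (0.91, True),
    (0.90, False),
    (0.85, False),
    (0.80, True),
    (0.60, False),
]
candidates = [0.60, 0.80, 0.85, 0.90, 0.91, 0.95]

best = calibrate(scored, candidates)
print(best, f1_at(best, scored))
```

A threshold tuned this way is only as good as the calibration pairs, so it can still admit high-overlap predictions that change the underlying construct.
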
Abstract

Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is central to the synthesis of empirical findings, but remains difficult due to document length and the distribution of scientific arguments across sections of the paper. We study a sequential full-text extraction setting, where the statement of a primary finding in an article's abstract is linked to (i) a corresponding hypothesis statement in the paper body and (ii) the statistical evidence that supports or refutes that hypothesis. This formulation induces a challenging within-document retrieval setting in which many candidate paragraphs are topically related to the finding but differ in rhetorical role, creating hard negatives for retrieval and extraction. Using a two-stage retrieve-and-extract framework, we conduct a controlled study of retrieval design choices, varying context quantity, context quality (standard Retrieval Augmented Generation, reranking, and a fine-tuned retriever paired with reranking), as well as an oracle paragraph setting to separate retrieval failures from extraction limits across four Large Language Model extractors. We find that targeted context selection consistently improves hypothesis extraction relative to full-text prompting, with gains concentrated in configurations that optimize retrieval quality and context cleanliness. In contrast, statistical evidence extraction remains substantially harder. Even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.
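
The within-document hard negatives described in the abstract can be sketched as training triplets drawn from a single article. This is a minimal illustration, assuming uniform random sampling of non-gold paragraphs; the paper's actual sampling strategy, the function name `build_triplets`, and the example document are assumptions made here.

```python
import random
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (query finding, positive paragraph, negative paragraph)


def build_triplets(
    finding: str,
    paragraphs: List[str],
    gold_index: int,
    n_negatives: int,
    seed: int = 0,
) -> List[Triplet]:
    """Pair the gold paragraph with negatives sampled from the same document."""
    rng = random.Random(seed)  # local generator keeps sampling reproducible
    pool = [p for i, p in enumerate(paragraphs) if i != gold_index]
    negatives = rng.sample(pool, min(n_negatives, len(pool)))
    return [(finding, paragraphs[gold_index], neg) for neg in negatives]


# Invented document: every paragraph is on topic, but only one states the hypothesis.
doc = [
    "Sleep loss is common among shift workers.",
    "We hypothesize that sleep loss reduces memory recall.",
    "Recall was lower after sleep loss, t(98) = 2.31, p = .02.",
    "Future work should examine sleep loss in adolescents.",
]
triplets = build_triplets("Sleep loss reduces memory recall.", doc, gold_index=1, n_negatives=2)

for query, positive, negative in triplets:
    print(negative)
```

Because the negatives share the article's topic and vocabulary, a retriever trained on such triplets has to learn rhetorical role and cannot rely on topical similarity alone.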
