Context Selection for Hypothesis and Statistical Evidence Extraction from Full-Text Scientific Articles
Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is challenging due to document length and the distribution of scientific arguments across sections. This paper proposes a two-stage retrieve-and-extract pipeline that first links an abstract finding to its corresponding hypothesis, then extracts the statistical evidence supporting that hypothesis. Through controlled ablations varying context quantity ($k \in \{5, 10, 20\}$), retrieval quality (standard RAG, reranking, fine-tuned retriever), and oracle paragraph settings, the authors demonstrate that hypothesis extraction is primarily bounded by retrieval quality, while evidence extraction faces persistent extractor limitations even with perfect paragraph selection.
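The context-selection step summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the bag-of-words cosine scorer stands in for the paper's dense retrievers, and the function names, toy paragraphs, and query are hypothetical. It shows only how the context quantity $k$ determines whether the gold paragraph reaches the extractor.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately crude stand-in)."""
    return text.lower().split()


def bow_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, paragraphs, k):
    """Indices of the k paragraphs most similar to the query, best first."""
    ranked = sorted(
        range(len(paragraphs)),
        key=lambda i: bow_cosine(query, paragraphs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy document: three topically related paragraphs,
# only one of which plays the rhetorical role of a hypothesis statement.
paragraphs = [
    "prior work links stress to sleep quality",          # background
    "we hypothesize that stress reduces sleep quality",  # gold hypothesis
    "participants completed a survey about sleep",       # method
]
query = "stress reduces sleep quality in adults"  # abstract finding
gold_index = 1

retrieved = top_k(query, paragraphs, k=1)
gold_retrieved = gold_index in retrieved
```

In an oracle setting the retrieval step is skipped and the extractor receives `paragraphs[gold_index]` directly, so any remaining error is attributable to extraction rather than retrieval.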
The paper presents a rigorous controlled study of retrieval configuration for scientific claim trace extraction, with a careful experimental design that uses oracle contexts to separate retrieval failures from extraction limits. The key finding, that hypothesis extraction benefits strongly from targeted context selection while statistical evidence extraction remains difficult even with gold paragraphs, is well supported by the results. However, the work is limited to social and behavioral science (SBS) articles with a single annotated claim trace per document, and the exclusion of tables and figures may artificially lower the evidence extraction ceiling.
The experimental methodology is robust, particularly the use of oracle paragraph settings to establish upper bounds on extraction performance independent of retrieval errors. The fine-tuned retriever trained with hard negatives sampled from within the same document effectively addresses the challenge of rhetorically similar but functionally distinct paragraphs. The component-level evaluation schema for evidence extraction appropriately handles the hybrid numeric-textual nature of statistical reports, assigning partial credit for correctly identified variables and relationships even when numerical details differ.
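To make the idea of component-level partial credit concrete, here is a minimal sketch. It is not the paper's evaluation schema: the four components, the equal weighting, the 1% numeric tolerance, and all names are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical decomposition of one statistical-evidence statement."""
    variables: frozenset         # e.g. {"condition", "belief score"}
    relationship: str            # e.g. "positive", "negative", "none"
    statistic: Optional[float]   # e.g. a reported coefficient
    p_value: Optional[float]


def numbers_match(gold, pred, rel_tol=0.01):
    """True if both values are absent, or both present and within tolerance."""
    if gold is None or pred is None:
        return gold is None and pred is None
    return abs(gold - pred) <= rel_tol * max(abs(gold), 1e-12)


def component_score(gold, pred):
    """Equal-weight average of four component scores, each in [0, 1]."""
    union = gold.variables | pred.variables
    # Jaccard overlap gives partial credit for partially correct variables.
    var_score = len(gold.variables & pred.variables) / len(union) if union else 1.0
    rel_score = 1.0 if gold.relationship == pred.relationship else 0.0
    stat_score = 1.0 if numbers_match(gold.statistic, pred.statistic) else 0.0
    p_score = 1.0 if numbers_match(gold.p_value, pred.p_value) else 0.0
    return (var_score + rel_score + stat_score + p_score) / 4


gold = Evidence(frozenset({"condition", "belief score"}), "positive", 0.42, 0.003)
# Prediction with the right variables and direction but a wrong statistic.
pred = Evidence(frozenset({"condition", "belief score"}), "positive", 0.24, 0.003)

score = component_score(gold, pred)  # three of four components agree
```

Under these assumptions, a prediction that identifies the variables and relationship correctly but misreports the statistic still earns most of the credit, which is the behavior the review describes.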
The study is limited to social and behavioral science (SBS) disciplines, so its findings may not generalize to fields with different writing conventions. The dataset contains only a single annotated claim trace per paper, despite documents often containing multiple hypotheses, and the preprocessing step excludes tables, figures, and equations, potentially omitting statistical evidence reported in non-prose formats. Additionally, the semantic similarity threshold used for hypothesis evaluation ($\tau = 0.89$) can miss subtle but important semantic shifts, as illustrated in Figure 6, where high lexical overlap masked a construct change from 'alternative option' to 'conspiracy-statement response'.
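The threshold concern can be seen in a small sketch of similarity-based matching. The review does not state which embedding model or similarity measure the paper uses, so cosine similarity over embedding vectors is an assumption here, and the three-dimensional vectors are hypothetical toy values rather than real sentence embeddings.

```python
import math

TAU = 0.89  # similarity threshold reported for hypothesis evaluation


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_match(gold_vec, pred_vec, tau=TAU):
    """Count a predicted hypothesis as correct if similarity reaches tau."""
    return cosine(gold_vec, pred_vec) >= tau


# Toy vectors: the second axis stands for a shift in the measured construct.
gold_vec = [1.0, 0.0, 0.0]
near_vec = [0.9, 0.3, 0.0]  # small shift along the second axis
far_vec = [0.8, 0.6, 0.0]   # larger shift along the second axis

near_sim = cosine(gold_vec, near_vec)
far_sim = cosine(gold_vec, far_vec)
```

With these toy values, `near_vec` clears the threshold despite differing from the gold vector along one dimension, while `far_vec` does not. A single scalar cutoff cannot tell whether the difference it tolerates is a harmless paraphrase or a change of construct.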
The evidence supports the central claims: Table 1 shows consistent gains for hypothesis extraction as retrieval quality improves (Fine-tuned Retriever+Reranker $k=5$ achieves mean F1 0.50 vs Full-text 0.39), while evidence extraction plateaus at moderate F1 values even in oracle settings (0.47–0.55). The comparison to full-text prompting is fair and demonstrates that simply fitting entire documents into long context windows is insufficient for reliable extraction. The related work adequately covers scientific information extraction and RAG configuration studies, though the paper could more explicitly contrast with concurrent work on long-context LLM capabilities.
The authors provide code, prompts, and experiment configurations at a public repository, and report detailed hyperparameters including the SciBERT retriever configuration and the calibrated similarity threshold ($\tau = 0.89$) for hypothesis evaluation. However, reproduction depends on access to specific LLM versions (GPT-4o-mini, Gemini-2.5-Flash, GPT-OSS), and the hosted models among them may be updated or deprecated; the fine-tuned retriever also requires the specific SCORE dataset split. The exclusion of non-prose content (tables, figures) during preprocessing is clearly documented, though this design choice means the oracle setting does not represent a true ceiling for evidence extraction in real documents, where statistical details often appear in tabular form.
Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is central to the synthesis of empirical findings, but remains difficult due to document length and the distribution of scientific arguments across sections of the paper. We study a sequential full-text extraction setting, where the statement of a primary finding in an article's abstract is linked to (i) a corresponding hypothesis statement in the paper body and (ii) the statistical evidence that supports or refutes that hypothesis. This formulation induces a challenging within-document retrieval setting in which many candidate paragraphs are topically related to the finding but differ in rhetorical role, creating hard negatives for retrieval and extraction. Using a two-stage retrieve-and-extract framework, we conduct a controlled study of retrieval design choices, varying context quantity, context quality (standard Retrieval-Augmented Generation, reranking, and a fine-tuned retriever paired with reranking), as well as an oracle paragraph setting to separate retrieval failures from extraction limits across four Large Language Model extractors. We find that targeted context selection consistently improves hypothesis extraction relative to full-text prompting, with gains concentrated in configurations that optimize retrieval quality and context cleanliness. In contrast, statistical evidence extraction remains substantially harder. Even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.