cs.CL · cs.AI · cs.DL
Sai Koneru, Jian Wu, Sarah Rajtmajer · Mar 22, 2026
Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is challenging due to document length and the distribution of scientific arguments across sections. This paper proposes a two-stage retrieve-and-extract pipeline that first links an abstract finding to its corresponding hypothesis, then extracts the statistical evidence supporting that hypothesis. Through controlled ablations varying context quantity ($k \in \{5, 10, 20\}$), retrieval quality (standard RAG, reranking, fine-tuned retriever), and oracle paragraph settings, the authors demonstrate that hypothesis extraction is primarily bounded by retrieval quality, while evidence extraction faces persistent extractor limitations even with perfect paragraph selection.
Extracting hypotheses and their supporting statistical evidence from full-text scientific articles is central to the synthesis of empirical findings, but remains difficult due to document length and the distribution of scientific arguments across sections of the paper. We study a sequential full-text extraction setting, where the statement of a primary finding in an article's abstract is linked to (i) a corresponding hypothesis statement in the paper body and (ii) the statistical evidence that supports or refutes that hypothesis. This formulation induces a challenging within-document retrieval setting in which many candidate paragraphs are topically related to the finding but differ in rhetorical role, creating hard negatives for retrieval and extraction. Using a two-stage retrieve-and-extract framework, we conduct a controlled study of retrieval design choices, varying context quantity and context quality (standard Retrieval-Augmented Generation, reranking, and a fine-tuned retriever paired with reranking), and we include an oracle paragraph setting to separate retrieval failures from extraction limits across four Large Language Model extractors. We find that targeted context selection consistently improves hypothesis extraction relative to full-text prompting, with gains concentrated in configurations that optimize retrieval quality and context cleanliness. In contrast, statistical evidence extraction remains substantially harder. Even with oracle paragraphs, performance remains moderate, indicating persistent extractor limitations in handling hybrid numeric-textual statements rather than retrieval failures alone.
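To make the two-stage setting concrete, here is a minimal sketch of sequential retrieve-and-extract context selection. It is not the authors' implementation. The Jaccard token-overlap scorer is a toy stand-in for the retrievers the abstract names, and the function names (`retrieve_top_k`, `build_prompt`), the example paragraphs, and the choice to query the second stage with the retrieved hypothesis are all assumptions made for illustration. The LLM extractor call itself is left out; only the prompt it would receive is built.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; a toy stand-in for a real encoder."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query: str, paragraph: str) -> float:
    """Jaccard overlap between query and paragraph tokens (toy relevance)."""
    q, p = tokenize(query), tokenize(paragraph)
    if not q or not p:
        return 0.0
    return len(q & p) / len(q | p)


def retrieve_top_k(query: str, paragraphs: list[str], k: int) -> list[str]:
    """Keep the k paragraphs that score highest against the query."""
    ranked = sorted(paragraphs, key=lambda p: score(query, p), reverse=True)
    return ranked[:k]


def build_prompt(task: str, query: str, context: list[str]) -> str:
    """Assemble the text a (hypothetical) LLM extractor would be given."""
    joined = "\n\n".join(context)
    return f"Task: {task}\nQuery: {query}\nContext:\n{joined}"


# Invented toy document body, for illustration only.
paragraphs = [
    "We hypothesize that sleep deprivation reduces working memory accuracy.",
    "Participants were recruited from a university subject pool.",
    "Sleep-deprived participants showed lower accuracy, t(48) = 2.31, p = .025.",
]
finding = "Sleep deprivation reduced working memory accuracy."

# Stage 1: link the abstract finding to a hypothesis paragraph.
hypothesis_context = retrieve_top_k(finding, paragraphs, k=1)
hypothesis_prompt = build_prompt(
    "Extract the hypothesis.", finding, hypothesis_context
)

# Stage 2 (assumed design): query with the hypothesis to find its evidence.
remaining = [p for p in paragraphs if p not in hypothesis_context]
evidence_context = retrieve_top_k(hypothesis_context[0], remaining, k=1)
evidence_prompt = build_prompt(
    "Extract the statistical evidence.", hypothesis_context[0], evidence_context
)
```

Varying `k` in this sketch corresponds to the context-quantity ablation. Replacing `retrieve_top_k` with a function that returns the gold paragraphs would correspond to the oracle setting, which isolates what the extractor can do when retrieval is perfect.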