When Documents Disagree: Measuring Institutional Variation in Transplant Guidance with Retrieval-Augmented Language Models

cs.IR cs.AI Yubo Li, Ramayya Krishnan, Rema Padman · Mar 23, 2026
Plain-language introduction

This paper introduces a scalable framework to measure institutional variation in solid-organ transplant patient education materials using retrieval-augmented generation (RAG). The authors ground 1,115 patient questions across 102 handbooks from 23 U.S. centers, then classify answer pairs into a five-label taxonomy (Absent, Consistent, Complementary, Divergent, Contradictory). The work exposes critical information gaps: 96.2% of question-handbook pairs miss relevant content, and 20.8% of non-absent pairs show clinically meaningful divergence, with reproductive health nearly absent (95.1%) across all materials.

Critical review
Verdict
Bottom line

The paper presents a well-conceived framework that makes cross-center variation measurable at scale. The five-label taxonomy is clearly operationalized and the analysis across four dimensions (question, topic, organ, center) is comprehensive. The findings—that divergence concentrates in monitoring and lifestyle topics while coverage gaps dominate reproductive health—are clinically significant and actionable. The methodology represents a meaningful advance over prior descriptive studies of handbook variation. However, the reliance on LLM-based evaluation without reported human expert validation introduces uncertainty about classification accuracy, particularly for distinguishing Divergent from Complementary guidance where clinical nuance matters.

“Five-Label Comparison Taxonomy... Divergent: Substantive, clinically meaningful differences (e.g., different thresholds or timelines)”
Li et al., Table 2 · Section 2.4.2
“Among the 68,019 non-absent pairs... Divergent 14,132 (20.8%)... Contradictory 143 (0.2%)”
Li et al., Table 3 · Section 3.1
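
As a quick check on the quoted Table 3 figures, the minimal Python sketch below recomputes each label's share of the 68,019 non-absent pairs. Only the three counts come from the quotation above; the function and variable names are illustrative. The sketch does not attempt to reproduce the paper's per-topic or per-center $R_{\text{div}}$, whose exact definition is not restated in this review.

```python
def share_of_non_absent(count: int, non_absent_total: int) -> float:
    """Return a label's share of non-absent pairwise comparisons (0..1)."""
    if non_absent_total <= 0:
        raise ValueError("non_absent_total must be positive")
    if not 0 <= count <= non_absent_total:
        raise ValueError("count must lie between 0 and non_absent_total")
    return count / non_absent_total


# Counts quoted from Li et al., Table 3 (Section 3.1).
NON_ABSENT_PAIRS = 68_019
DIVERGENT_PAIRS = 14_132
CONTRADICTORY_PAIRS = 143

divergent_share = share_of_non_absent(DIVERGENT_PAIRS, NON_ABSENT_PAIRS)
contradictory_share = share_of_non_absent(CONTRADICTORY_PAIRS, NON_ABSENT_PAIRS)

print(f"Divergent:     {100 * divergent_share:.1f}%")
print(f"Contradictory: {100 * contradictory_share:.1f}%")
```

With the quoted counts this gives 20.8% and 0.2%, matching the percentages reported in the table.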
What holds up

The five-label consistency taxonomy (Absent, Consistent, Complementary, Divergent, Contradictory) is well defined, with concrete examples that enable reproducible classification. The scale of the study (102 handbooks across 23 centers and 5 organ types) provides statistical power for organ- and topic-level comparisons. Monitoring & Follow-up exhibits the highest mean divergence rate ($R_{\text{div}}=0.277$), while Reproductive Health shows both the highest consistency rate ($R_{\text{con}}=0.315$) and the lowest divergence prevalence (17.5%). The Reproductive Health pattern is well explained by that topic's extreme absence rate (95.1%), which leaves few answered pairs to compare. Center-level divergence rates range from $R_{\text{div}}=0.139$ to $0.255$, and the profiles are stable and interpretable, supporting the claim that variation reflects systematic institutional differences rather than noise.

“Reproductive Health... R_{\text{con}}=0.315... %Div=17.5%”
Li et al., Table 6 · Section 3.4
“center-level divergence profiles are stable and interpretable... divergence rates ranged from 0.139 to 0.255 across centers”
Li et al. · Section 4.1
Main concerns

The LLM-as-judge methodology raises significant validity concerns. Although the authors conducted sample annotation and agreement checks, the main paper reports neither inter-annotator agreement statistics nor validation against human clinicians for the pairwise comparison task. The distinction between Divergent and Complementary depends on subtle clinical judgments that may not align with physician assessments. The retrieval pipeline processes only textual content, so it misses tables and figures that could answer questions currently coded as coverage gaps; the authors acknowledge this limitation but do not quantify it. The claim that heterogeneity reflects systematic institutional differences is plausible, but alternative explanations (authorship conventions, document length, center size) are not formally tested. The paper also lacks explicit discussion of how retrieval errors could affect downstream classification: if the hybrid retrieval fails to find relevant passages, answers may be incorrectly labeled Absent or Divergent.

“Several limitations should be acknowledged. Although we conducted sample annotation and agreement checks, the LLM-based pairwise judge may still introduce systematic classification biases”
Li et al. · Section 4.2
“our pipeline currently processes only the textual content of handbooks, yet many handbooks also contain tables, figures, and infographics”
Li et al. · Section 4.2
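
To make the missing agreement statistic concrete, the sketch below computes Cohen's kappa between an LLM judge and one human annotator over the same answer pairs. It is a minimal illustration under stated assumptions: the paper does not say which agreement measure was used, the function name is illustrative, and the two label lists are hypothetical toy data, not results from the study.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in all_labels) / (n * n)

    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels for six answer pairs (not data from the paper).
llm_judge = ["Divergent", "Complementary", "Consistent",
             "Divergent", "Complementary", "Consistent"]
clinician = ["Divergent", "Divergent", "Consistent",
             "Divergent", "Complementary", "Consistent"]

kappa = cohens_kappa(llm_judge, clinician)
print(f"kappa = {kappa:.2f}")
```

Reporting kappa per label pair, especially for Divergent versus Complementary, would show directly whether the judge's hardest distinction tracks clinician judgment.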
Evidence and comparison

The quantitative evidence supports the main claims that heterogeneity concentrates in specific topics and organs. The finding that kidney and lung show the highest divergence prevalence (39.8% and 41.8%) while pancreas shows the lowest (11.0%) is well supported by the pairwise comparison matrices. However, the comparison to prior work could be strengthened. The authors cite Mace et al. (2025) as finding significant variation using NLP and generative methods, but they neither compare their quantitative metrics directly to that work nor explain why their approach yields different or more precise estimates. The claim that divergence reflects systematic institutional differences, which the abstract attributes to patient diversity, is stated but not validated against covariates such as center volume, region, or patient demographics.

“Kidney... %Div=39.8%... Lung... %Div=41.8%... Pancreas... %Div=11.0%”
Li et al., Table 5 · Section 3.3
“a comparative analysis of transplant handbooks using NLP and generative methods found significant variation in the availability and interpretation of clinical guidance”
Li et al. · Section 1
Reproducibility

Reproducibility is substantially compromised. The paper does not provide a code repository, detailed hyperparameters for the generation model beyond temperature 0, or the full prompts used for pairwise comparison. The 102 handbooks are proprietary documents shared by transplant centers via Transplants.org and cannot be publicly released, which is understandable but blocks independent reproduction. The LlamaParse extraction pipeline and hybrid retrieval configuration (BM25 + BAAI/bge-large-en-v1.5, fused with reciprocal rank fusion, RRF) are described at a high level but without implementation details such as chunk overlap, FAISS index parameters, or reranking top-k settings. The benchmark question set is described as publicly sourced from forums but is not provided. To reproduce this study, an independent investigator would need to re-curate similar data and re-implement substantial portions of the pipeline without guidance on validation thresholds or error-handling procedures.

“Qwen3-14B at temperature 0... k_{\text{RRF}}=60”
Li et al. · Section 2.3
“corpus of transplant patient education handbooks that were generously shared by U.S. transplant centers and assembled by the non-profit Transplants.org”
Li et al. · Section 2.1
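
The RRF fusion step mentioned in this section can be illustrated with a minimal sketch. It assumes the standard reciprocal rank fusion formula, in which each chunk scores the sum of 1/(k + rank) over the rankers that returned it, with k = 60 as quoted. The chunk identifiers and the two input rankings are hypothetical, and nothing here reflects the authors' actual implementation, which is not released.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of chunk ids; ranks are 1-based, best first."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk missing from a ranker simply gets no contribution from it.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical rankings from a sparse (BM25) and a dense (embedding) retriever.
bm25_ranking = ["c1", "c2", "c3"]
dense_ranking = ["c3", "c1", "c4"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], k=60)
print(fused)
```

Even with the fusion constant reported, results depend on how many candidates each retriever contributes and how chunks are formed, which is why the unreported chunking and top-k settings matter for reproduction.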
Abstract

Patient education materials for solid-organ transplantation vary substantially across U.S. centers, yet no systematic method exists to quantify this heterogeneity at scale. We introduce a framework that grounds the same patient questions in different centers' handbooks using retrieval-augmented language models and compares the resulting answers using a five-label consistency taxonomy. Applied to 102 handbooks from 23 centers and 1,115 benchmark questions, the framework quantifies heterogeneity across four dimensions: question, topic, organ, and center. We find that 20.8% of non-absent pairwise comparisons exhibit clinically meaningful divergence, concentrated in condition monitoring and lifestyle topics. Coverage gaps are even more prominent: 96.2% of question-handbook pairs miss relevant content, with reproductive health at 95.1% absence. Center-level divergence profiles are stable and interpretable, where heterogeneity reflects systematic institutional differences, likely due to patient diversity. These findings expose an information gap in transplant patient education materials, with document-grounded medical question answering highlighting opportunities for content improvement.
