TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols
Time toxicity—the cumulative healthcare contact days imposed by clinical trial participation—is an important patient-centric metric buried in dense Schedule of Assessments (SoA) tables. This work proposes TimeTox, a Gemini-based LLM pipeline that extracts time toxicity from protocol PDFs at scale, comparing a single-pass architecture against a two-stage structure-then-count approach. The authors deploy their system on 644 real-world oncology protocols and find that synthetic benchmark accuracy is a poor predictor of real-world reliability, a lesson critical for clinical NLP deployment.
The paper delivers a pragmatic, production-ready system for a clinically meaningful extraction task, with its central insight—that stability on heterogeneous real-world data trumps synthetic accuracy—deserving attention from the clinical NLP community. However, the validation strategy relies on a small synthetic corpus for accuracy claims and lacks formal benchmarking against other LLM families or human annotators, limiting confidence in absolute error rates despite strong evidence of reproducibility.
The position-based consensus mechanism is a clever solution to arm-name instability across LLM runs, using sorted 12-month contact day counts to align arms without relying on string matching. The scale of real-world testing—644 protocols spanning 11 disease sites—demonstrates feasibility, and the authors' intellectual honesty about the synthetic-real performance gap strengthens the work. The pipeline's 95.3% clinically acceptable reproducibility (IQR $\leq$ 3 days) on real protocols is a compelling result for deployment.
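To make the position-based idea concrete, here is a minimal sketch of how arms might be aligned by rank rather than by name. It is an illustration under stated assumptions, not the authors' implementation: the function name `position_consensus`, the dict layout, the use of the median as the aggregate, and the requirement that every run returns the same number of arms are all hypothetical choices made for this example.

```python
from statistics import median


def position_consensus(runs):
    """Align arms across runs by rank of their 12-month count, then take medians.

    `runs` is a list of runs; each run is a list of dicts with a free-text
    'name' and 'days', a list of cumulative contact-day counts whose last
    entry is the 12-month value. (Hypothetical layout for illustration.)
    """
    n_arms = len(runs[0])
    if any(len(run) != n_arms for run in runs):
        raise ValueError("runs disagree on the number of arms")

    # Sort each run by the 12-month count so that position, not name,
    # identifies an arm.
    ranked = [sorted(run, key=lambda arm: arm["days"][-1]) for run in runs]

    consensus = []
    for position in range(n_arms):
        same_rank = [run[position] for run in ranked]
        n_timepoints = len(same_rank[0]["days"])
        consensus.append({
            "name": same_rank[0]["name"],  # label taken from the first run
            "days": [median(arm["days"][t] for arm in same_rank)
                     for t in range(n_timepoints)],
        })
    return consensus


# Three runs that name and order the same two arms differently.
example_runs = [
    [{"name": "Arm A", "days": [2, 4, 10]}, {"name": "Control", "days": [1, 2, 5]}],
    [{"name": "Placebo", "days": [1, 2, 6]}, {"name": "Experimental", "days": [2, 5, 10]}],
    [{"name": "Arm 1: Drug X", "days": [2, 4, 11]}, {"name": "Arm 2", "days": [1, 3, 5]}],
]
consensus = position_consensus(example_runs)
```

In this toy input the arm names never match across runs, yet sorting on the 12-month count still pairs the low-burden arms together and the high-burden arms together.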
The synthetic ground truth comprises only 20 schedules (240 comparisons), providing limited coverage of real-world formatting heterogeneity that the paper itself identifies as the primary challenge. The vanilla pipeline showed substantial systematic overcounting on synthetic data (median signed error +4.0 days, mean +6.9 days), yet this bias is neither corrected nor quantified on real-world protocols since no ground truth exists. The position-based consensus risks cross-arm contamination when multiple arms have similar contact-day counts—the authors note 43.8% of adjacent same-type arm pairs had values within 3 days. Additionally, the evaluation is restricted to Google’s Gemini family, leaving open whether GPT-4 or Claude would exhibit the same stability-accuracy trade-off, and the real-world validation relies on stability as a proxy for accuracy without formal human verification.
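The contamination concern can be illustrated with a small check that flags adjacent arms whose 12-month counts are close enough that run-to-run noise could swap their ranks. This is a sketch only: the helper names and the 3-day tolerance as a default argument are assumptions for the example, and it ignores the arm-type grouping the authors mention.

```python
def near_tie_pairs(twelve_month_days, tolerance=3):
    """Return adjacent pairs (after sorting) whose counts differ by <= tolerance."""
    ordered = sorted(twelve_month_days)
    return [(low, high)
            for low, high in zip(ordered, ordered[1:])
            if high - low <= tolerance]


def near_tie_fraction(twelve_month_days, tolerance=3):
    """Fraction of adjacent pairs that are near-ties; 0.0 if there are no pairs."""
    n_pairs = len(twelve_month_days) - 1
    if n_pairs <= 0:
        return 0.0
    return len(near_tie_pairs(twelve_month_days, tolerance)) / n_pairs


# Hypothetical protocol with three arms: two of them are only 2 days apart.
example_counts = [40, 24, 26]
flagged = near_tie_pairs(example_counts)
fraction = near_tie_fraction(example_counts)
```

A protocol with flagged pairs is one where rank-based alignment may mix values from different arms, so such cases would be natural candidates for manual review.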
The comparative evidence between the two architectures is methodically presented, using MAE and clinically acceptable accuracy ($\pm 3$ days) on the synthetic set, and IQR-based stability metrics on real data. However, the reliance on synthetic schedules for accuracy claims is problematic given the authors' own finding that synthetic performance does not correlate with real-world stability. The related work discusses prior LLM extraction tasks but provides no direct empirical comparison to non-LLM baselines (e.g., rule-based table parsers) or to alternative prompting strategies like chain-of-thought, making it difficult to assess whether the vanilla pipeline is optimal or simply the first that proved sufficiently stable.
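For readers unfamiliar with the metrics, the following sketch shows one plausible way to compute them. The function names are hypothetical, and the quartile convention (Python's "inclusive" method) is an assumption: the review does not state which IQR definition the authors used, and with only 3 runs the choice of convention affects the value.

```python
from statistics import mean, quantiles


def mae(predicted, truth):
    """Mean absolute error in contact days."""
    return mean(abs(p - t) for p, t in zip(predicted, truth))


def acceptable_rate(predicted, truth, tolerance=3):
    """Share of comparisons whose absolute error is within +/- tolerance days."""
    hits = sum(abs(p - t) <= tolerance for p, t in zip(predicted, truth))
    return hits / len(truth)


def iqr(values):
    """Interquartile range across repeated runs (inclusive quartile method)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Synthetic-style accuracy check: errors of 0, 2 and 4 days.
example_mae = mae([10, 12, 20], [10, 10, 16])
example_rate = acceptable_rate([10, 12, 20], [10, 10, 16])

# Real-world-style stability check: no ground truth, only agreement across runs.
stable_iqr = iqr([10, 10, 10])
wobbly_iqr = iqr([10, 10, 12])
```

The two groups of metrics answer different questions: MAE and the acceptable rate need ground truth, whereas the IQR only measures agreement between runs, so a pipeline can be perfectly stable and still systematically biased.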
The paper provides model identifiers (Gemini 2.5 Flash for summary extraction, Gemini 3.0 Flash for the time toxicity extraction stage), hyperparameters (temperature 0.1, top-p 0.95), and states that code is available on GitHub. However, the prompts are only excerpted (the vanilla prompt is described as 76 lines long but is not shown in full), and the exact JSON schemas used to force structured output are not provided. The real-world protocol dataset is derived from public ClinicalTrials.gov PDFs, though the specific 644-protocol subset is not itemized. The synthetic schedule generator is described as 1,284 lines of code, which should aid reproduction of the validation set.
Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google's Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy ($\pm$3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR $\leq$ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.