TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols

cs.CL cs.LG Saketh Vinjamuri, Marielle Fis Loperena, Marie C. Spezia, Ramez Kouzy · Mar 22, 2026


AI Review

Plain-language introduction

Time toxicity—the cumulative healthcare contact days imposed by clinical trial participation—is an important patient-centric metric buried in dense Schedule of Assessments (SoA) tables. This work proposes TimeTox, a Gemini-based LLM pipeline that extracts time toxicity from protocol PDFs at scale, comparing a single-pass architecture against a two-stage structure-then-count approach. The authors deploy their system on 644 real-world oncology protocols and find that synthetic benchmark accuracy is a poor predictor of real-world reliability, a lesson critical for clinical NLP deployment.

Critical review

Verdict

Bottom line

The paper delivers a pragmatic, production-ready system for a clinically meaningful extraction task, with its central insight—that stability on heterogeneous real-world data trumps synthetic accuracy—deserving attention from the clinical NLP community. However, the validation strategy relies on a small synthetic corpus for accuracy claims and lacks formal benchmarking against other LLM families or human annotators, limiting confidence in absolute error rates despite strong evidence of reproducibility.

“Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment”

Vinjamuri et al. · Abstract

What holds up

The position-based consensus mechanism is a clever solution to arm-name instability across LLM runs, using sorted 12-month contact day counts to align arms without relying on string matching. The scale of real-world testing—644 protocols spanning 11 disease sites—demonstrates feasibility, and the authors' intellectual honesty about the synthetic-real performance gap strengthens the work. The pipeline's 95.3% clinically acceptable reproducibility (IQR $\leq$ 3 days) on real protocols is a compelling result for deployment.

“Perfect Stability (IQR = 0): 82.0% ... Clinically Acceptable Accuracy (IQR $\leq$ 3): 95.3%”

Vinjamuri et al. · Section 6.3
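The stability categories quoted above can be illustrated with a short sketch. It assumes each arm has one 12-month contact-day count per run and that quartiles use the "inclusive" method; the review does not state the paper's exact quartile convention, so the function names `iqr` and `stability_label` and the quartile method are illustrative assumptions, not the authors' implementation.

```python
import statistics


def iqr(values):
    """Interquartile range of repeated extractions for one arm.

    Assumption: inclusive quartiles; the paper's convention is not stated.
    """
    if len(values) < 2:
        return 0.0
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def stability_label(run_values, tolerance=3):
    """Classify an arm as perfectly stable, clinically acceptable, or unstable."""
    spread = iqr(run_values)
    if spread == 0:
        return "perfect"        # IQR = 0
    if spread <= tolerance:
        return "acceptable"     # IQR <= 3 days
    return "unstable"


# Hypothetical 12-month contact-day counts from three runs of the same arm.
print(stability_label([10, 10, 10]))
print(stability_label([10, 12, 14]))
print(stability_label([10, 10, 20]))
```

Under this quartile convention and with only three runs, the IQR equals half the range between the lowest and highest run, so a different convention could move borderline arms between categories.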

“position-based matching leverages the insight that, even when arm names vary, the relative ordering of arms by time-toxicity burden is stable”

Vinjamuri et al. · Section 7.2
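A minimal sketch of position-based matching, under stated assumptions, may make the idea concrete. It assumes every run returns the same number of arms, that each arm is a dict with a `name` and numeric timepoint counts, and that the consensus value is the per-position median. The field name `m12`, the function `position_consensus`, and the use of the median are hypothetical choices for illustration, not details confirmed by the paper.

```python
from statistics import median


def position_consensus(runs, sort_key="m12"):
    """Align arms across runs by rank of their 12-month count, then take medians.

    runs: list of runs; each run is a list of arm dicts such as
          {"name": "Arm A", "m12": 30}. Names may differ between runs.
    """
    n_arms = len(runs[0])
    if any(len(run) != n_arms for run in runs):
        raise ValueError("runs disagree on the number of arms")

    # Sort each run independently so that position i means "i-th lowest burden".
    ordered = [sorted(run, key=lambda arm: arm[sort_key]) for run in runs]

    consensus = []
    for position in range(n_arms):
        matched = [run[position] for run in ordered]
        timepoints = [k for k in matched[0] if k != "name"]
        entry = {"names": [arm["name"] for arm in matched]}
        for tp in timepoints:
            entry[tp] = median(arm[tp] for arm in matched)
        consensus.append(entry)
    return consensus


# Hypothetical output of three runs with inconsistent arm names.
runs = [
    [{"name": "Arm A", "m12": 30}, {"name": "Control", "m12": 18}],
    [{"name": "Standard of care", "m12": 19}, {"name": "Experimental", "m12": 30}],
    [{"name": "Arm B (control)", "m12": 18}, {"name": "Arm A", "m12": 31}],
]
result = position_consensus(runs)
```

Because alignment depends only on rank, two arms with nearly equal counts can swap positions between runs, which is the contamination risk raised under the main concerns.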

Main concerns

The synthetic ground truth comprises only 20 schedules (240 comparisons), providing limited coverage of the real-world formatting heterogeneity that the paper itself identifies as the primary challenge. The vanilla pipeline showed substantial systematic overcounting on synthetic data (median signed error +4.0 days, mean +6.9 days), yet this bias is neither corrected nor quantified on real-world protocols, because no ground truth exists for them. The position-based consensus risks cross-arm contamination when multiple arms have similar contact-day counts: the authors note that 7 of 16 adjacent same-type arm pairs (43.8%) had values within 3 days of each other. Additionally, the evaluation is restricted to Google’s Gemini family, leaving open whether GPT-4 or Claude would exhibit the same stability-accuracy trade-off. Finally, the real-world validation uses stability as a proxy for accuracy without formal human verification, so a pipeline that is consistently wrong would still score well.

“Overall, the median signed error was +4.0 days (mean +6.9 days), indicating a systematic tendency toward overcounting”

Vinjamuri et al. · Section 5.2.1
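For readers less familiar with these metrics, the sketch below shows how signed error, MAE, and the $\pm 3$-day acceptability rate relate to one another. The function name `error_summary` and the example numbers are invented for illustration and are not taken from the paper's data.

```python
from statistics import mean, median


def error_summary(predicted, truth, tolerance=3):
    """Summarize extraction error against ground-truth contact-day counts."""
    errors = [p - t for p, t in zip(predicted, truth)]  # positive = overcount
    abs_errors = [abs(e) for e in errors]
    within = sum(a <= tolerance for a in abs_errors)
    return {
        "median_signed_error": median(errors),
        "mean_signed_error": mean(errors),
        "mae": mean(abs_errors),
        "pct_within_tolerance": 100 * within / len(errors),
    }


# Illustrative values only: four comparisons, one large overcount.
summary = error_summary(predicted=[12, 20, 33, 41], truth=[10, 20, 30, 30])
print(summary)
```

A mean signed error well above the median, as in the quoted figures, suggests a right-skewed error distribution in which a minority of large overcounts pulls the mean upward.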

“7 of 16 adjacent same-type arm pairs (43.8%) had 12-month contact day values within 3 days of each other, indicating that close-value pairs are common”

Vinjamuri et al. · Section 7.2.1
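The close-value check behind this figure can be sketched as follows, assuming it compares neighbouring arms after sorting by 12-month contact days within one group of same-type arms. The function `close_adjacent_pairs` and the sample values are hypothetical; the paper's exact grouping rules are not described in this review.

```python
def close_adjacent_pairs(values, tolerance=3):
    """Return adjacent pairs (after sorting) whose counts differ by <= tolerance.

    values: 12-month contact-day counts for arms of the same type.
    Returns (close_pairs, total_adjacent_pairs).
    """
    ordered = sorted(values)
    pairs = list(zip(ordered, ordered[1:]))
    close = [(a, b) for a, b in pairs if b - a <= tolerance]
    return close, len(pairs)


# Hypothetical counts for four same-type arms in one protocol.
close, total = close_adjacent_pairs([30, 18, 41, 20])
print(f"{len(close)} of {total} adjacent pairs within 3 days")
```

Pairs flagged by such a check are the ones where rank order could plausibly flip between runs, so they are natural candidates for manual review.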

“real-world validation relied on stability testing and informal spot checking against manually reviewed protocols rather than a comprehensive gold standard”

Vinjamuri et al. · Section 9.3

Evidence and comparison

The comparison between the two architectures is presented methodically, using MAE and clinically acceptable accuracy ($\pm 3$ days) on the synthetic set and IQR-based stability metrics on real data. However, the reliance on synthetic schedules for accuracy claims is problematic given the authors' own finding that synthetic accuracy is a poor predictor of real-world reliability. The related work discusses prior LLM extraction tasks but provides no direct empirical comparison to non-LLM baselines (e.g., rule-based table parsers) or to alternative prompting strategies such as chain-of-thought, making it difficult to assess whether the vanilla pipeline is optimal or simply the first that proved sufficiently stable.

“Table 3: Validation Results on 20 Synthetic Schedules (240 Comparisons)”

Vinjamuri et al. · Section 5.2

“Our work extends this line of research to a specific, previously unautomated task... that requires both visual table parsing and multi-step arithmetic reasoning”

Vinjamuri et al. · Section 9.4

Reproducibility

The paper provides model identifiers (Gemini 2.5 Flash for summary extraction, Gemini 3.0 Flash as the base model for extraction), sampling parameters (temperature 0.1, top-p 0.95), and states that code is available on GitHub. However, the prompts are only excerpted (the vanilla prompt is described as 76 lines long but is not shown in full), and the exact JSON schemas used to force structured output are not provided. The real-world protocols are drawn from public ClinicalTrials.gov PDFs, but the specific 644-protocol subset is not itemized. The synthetic schedule generator is described as 1,284 lines of code; if released as stated, it should aid reproduction of the validation set.

“Google's Gemini 3.0 Flash (model ID: gemini-3-flash-preview) as the base model with temperature 0.1”

Vinjamuri et al. · Section 4

“The complete pipeline code, synthetic schedule generator, and analysis scripts are available at [GitHub URL]”

Vinjamuri et al. · Section 10

Abstract

Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google's Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy ($\pm$3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR $\leq$ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
