Multimodal Survival Analysis with Locally Deployable Large Language Models

cs.LG cs.AI Moritz Gögl, Christopher Yau · Mar 23, 2026
Plain-language introduction

This paper addresses multimodal survival analysis for clinical data, integrating pathology text, tabular covariates, and gene expression using locally deployable LLMs. The core innovation is a teacher-student distillation framework that trains a compact 1.5B-parameter causal LLM to jointly produce calibrated survival curves and concise prognosis explanations. This matters because cloud-hosted medical AI raises privacy concerns, yet heavyweight local models are impractical for many institutions.

Critical review
Verdict
Bottom line

The paper presents a technically sound approach to privacy-preserving multimodal survival analysis. The dual-pathway architecture (hidden-state survival estimation plus verbalized generation) and the exploration of early/late fusion strategies demonstrate careful engineering. However, the "locally deployable" framing is weakened by the training pipeline's reliance on a 32B-parameter teacher model (DeepSeek-R1 Distill Qwen-32B), which is queried offline for distillation and which most institutions cannot host locally. The evaluation is also limited to a single TCGA cohort without external validation.

“We first query a larger teacher LLM (here: DeepSeek-R1 Distill Qwen-32B) offline with two prompts”
Gögl & Yau, Sec. 3.3 · Section 3.3
“The teacher achieved substantially better verbalized performance, reflecting its larger model capacity”
Gögl & Yau, Sec. 4.2 · Section 4.2
What holds up

The multimodal fusion architecture is flexible, offering both early concatenation and late gated fusion options; the latter allows the covariate and gene-expression heads to be pre-trained separately from the LLM. The calibration correction mechanism, which masks text loss contributions when teacher predictions conflict with observed outcomes, is a principled attempt to handle teacher miscalibration. The empirical results show consistent gains from multimodal fusion over unimodal baselines, and the authors candidly report failure modes in their qualitative analysis.

“we optionally mask out text loss contributions for samples whose teacher 3-year survival estimate, TEACHER_PROB, is inconsistent with the observed status”
Gögl & Yau, Sec. 3.3 · Section 3.3
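The masking rule quoted above can be illustrated with a minimal sketch. This is not the authors' code: the function names, the 3-year horizon argument, and the treatment of patients censored before the horizon (kept, because their status at 3 years is unknown) are assumptions made here. The 50% threshold follows the binary cutoff described in this review.

```python
def text_loss_mask(teacher_prob, time_years, event, horizon=3.0, threshold=0.5):
    """Return 1.0 to keep a sample's text loss, 0.0 to mask it out.

    teacher_prob: teacher's stated survival probability at the horizon
    time_years:   observed follow-up time
    event:        True if death was observed, False if censored
    """
    if event and time_years <= horizon:
        # Death observed before the horizon: a consistent teacher
        # should state a survival probability below the threshold.
        return 1.0 if teacher_prob < threshold else 0.0
    if time_years > horizon:
        # Patient known to be alive at the horizon: a consistent
        # teacher should state a probability at or above the threshold.
        return 1.0 if teacher_prob >= threshold else 0.0
    # Censored before the horizon: status unknown, so keep the sample
    # (an assumption of this sketch, not stated in the review).
    return 1.0


def masked_text_loss(per_sample_losses, masks):
    """Average the text loss over the samples that survive masking."""
    kept = sum(masks)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_sample_losses, masks)) / kept
```

A hard mask of this kind discards a sample entirely once the teacher lands on the wrong side of the threshold, which is the behaviour the review later calls crude compared with probabilistic weighting.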
“Late fusion generally improved hidden-state and combined performance”
Gögl & Yau, Sec. 4.2 · Section 4.2
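To make the idea of late gated fusion concrete, the sketch below fuses per-modality hazard logits over discrete time bins and converts them into a survival curve. The review does not describe the paper's actual gate, so the softmax over scalar gate scores, the modality names, and all function names are hypothetical and chosen purely for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def late_gated_fusion(modality_logits, gate_scores):
    """Fuse per-modality hazard logits with softmax gate weights.

    modality_logits: dict name -> list of hazard logits, one per time bin
    gate_scores:     dict name -> scalar score (hypothetical learned parameter)
    """
    names = sorted(modality_logits)
    weights = softmax([gate_scores[n] for n in names])
    n_bins = len(modality_logits[names[0]])
    return [
        sum(w * modality_logits[n][k] for w, n in zip(weights, names))
        for k in range(n_bins)
    ]


def survival_from_hazard_logits(logits):
    """Discrete-time survival: S(t_k) = prod_{j<=k} (1 - h_j)."""
    curve, s = [], 1.0
    for z in logits:
        s *= 1.0 - sigmoid(z)
        curve.append(s)
    return curve
```

Because each modality head emits its own logits before the gate is applied, the covariate and gene-expression heads can be trained on their own first, which is the property the review highlights.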
Main concerns

The most significant issue is the performance gap between verbalized and hidden-state predictions: the verbalized pathway achieves $C^{\text{td}} = 0.626$ versus $0.765$ for the hidden-state pathway (discrete model, standard configuration), raising questions about the practical utility of the generated text beyond explainability. The calibration correction uses a crude binary threshold (50%) rather than probabilistic weighting. Additionally, the teacher model requires substantial compute (32B parameters), which complicates the "locally deployable" narrative for the training phase. The evaluation relies solely on TCGA data without cross-dataset validation. Finally, very long reports can degrade textual coherence and occasionally cause the stated 3-year probability to be omitted.

“hidden-state predictions were consistently stronger than verbalized ones”
Gögl & Yau, Sec. 4.2 · Table 1
“very long reports can challenge the model: textual coherence may drop and the stated 3-year probability may occasionally be omitted”
Gögl & Yau, Sec. 4.2 · Section 4.2
“Both outputs exhibit degraded English fluency; the right example further fails to provide an explicit verbalized probability”
Gögl & Yau, Appendix A · Appendix A
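For readers unfamiliar with the metric, the $C^{\text{td}}$ values cited in this section can be read as the fraction of comparable patient pairs that a model ranks correctly. The sketch below assumes $C^{\text{td}}$ denotes the usual time-dependent concordance index (Antolini-style), which the review does not spell out. It ignores tied event times and counts tied predictions as half-concordant, both of which are simplifying conventions chosen here.

```python
import math


def concordance_td(times, events, surv_at):
    """Time-dependent concordance over comparable pairs.

    times:   observed follow-up times
    events:  1/True if the event was observed, 0/False if censored
    surv_at: callable (i, t) -> predicted survival probability of subject i at time t

    A pair (i, j) is comparable when subject i has an observed event and
    subject j is still under observation after that time. The pair is
    concordant when i's predicted survival at i's event time is lower than j's.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                s_i = surv_at(i, times[i])
                s_j = surv_at(j, times[i])
                if s_i < s_j:
                    concordant += 1.0
                elif s_i == s_j:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Toy usage with exponential curves S_i(t) = exp(-rate_i * t); the rates are made up.
rates = [1.0, 0.5, 0.1]
score = concordance_td(
    [1.0, 2.0, 3.0], [1, 1, 0], lambda i, t: math.exp(-rates[i] * t)
)
```

A value of 0.5 corresponds to random ranking, which puts the verbalized 0.626 and the hidden-state 0.765 in perspective.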
Evidence and comparison

The comparison to BERTSurv is potentially unfair: because no public implementation exists, the authors reimplemented it in-house, and the reimplementation lacks native gene expression handling, instead concatenating offline-learned GE latents to text embeddings. The teacher model achieves strong verbalized performance ($C^{\text{td}} = 0.746$), which the authors attribute to its larger capacity and which suggests the 1.5B student has significant headroom for improvement. The convex combination of hidden-state and verbalized curves yields consistent but very small gains (e.g., $C^{\text{td}}$ of 0.766 versus 0.765 for hidden-state alone), a marginal improvement given the complexity of the text generation pathway.

“The teacher achieved substantially better verbalized performance”
Gögl & Yau, Sec. 4.2 · Section 4.2
“0.765 (hidden-state) vs 0.626 (verbalized) vs 0.766 (combined)”
Gögl & Yau, Table 1 · Table 1
“no public BERTSurv implementation is available, we re-implemented it in-house”
Gögl & Yau, Appendix C.4 · Appendix C.4
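The convex combination of the two curves mentioned in this section amounts to a pointwise weighted average. The following sketch shows the operation; the mixing weight `alpha` and the function name are hypothetical, and the review does not state how the paper selects the weight.

```python
def combine_curves(hidden_curve, verbal_curve, alpha):
    """Pointwise convex combination of two survival curves on the same time grid.

    alpha = 1.0 returns the hidden-state curve unchanged;
    alpha = 0.0 returns the verbalized curve unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(hidden_curve) != len(verbal_curve):
        raise ValueError("curves must share the same time grid")
    return [
        alpha * h + (1.0 - alpha) * v
        for h, v in zip(hidden_curve, verbal_curve)
    ]
```

Since a convex combination of two non-increasing curves bounded in [0, 1] is itself non-increasing and bounded in [0, 1], the result remains a valid survival curve. A weight close to 1 essentially reproduces the hidden-state curve, which would be consistent with the near-identical 0.766 and 0.765 scores.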
Reproducibility

The paper provides detailed hyperparameters in Appendix C.4, including learning rates, the layer freezing strategy (all but the last 18 layers frozen), and token truncation (820 tokens). However, no code repository or model weights are mentioned. Reproduction would require substantial compute for the 32B teacher model and careful replication of the TCGA preprocessing pipeline, which involves specific covariate grouping and gene expression imputation. The calibration correction threshold and the exponential fitting of teacher outputs depend on implementation details that are not fully specified (e.g., the exact regex patterns for probability extraction). Random seeds and exact training duration are not reported.

“we freeze all but the last 18 layers of the student LLM during fine-tuning, truncate pathology reports to 820 tokens”
Gögl & Yau, Appendix C.4 · Appendix C.4
“The remaining 8,902 samples were split into training/validation/test sets in proportions 70/10/20%”
Gögl & Yau, Appendix B · Appendix B
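To show why the unspecified extraction and fitting details matter for reproduction, here is one plausible way to turn a verbalized 3-year probability into a full curve. Everything in it is an assumption: the regex, the function names, and the single-parameter exponential form $S(t) = e^{-\lambda t}$ with $\lambda = -\ln(p)/3$ are illustrative choices, not the paper's documented procedure.

```python
import math
import re

# Hypothetical pattern: a number with an optional decimal part followed by '%'.
PROB_PATTERN = re.compile(r"(\d{1,3}(?:\.\d+)?)\s*%")


def extract_probability(text):
    """Return the first percentage in the text as a fraction, or None if absent."""
    match = PROB_PATTERN.search(text)
    if match is None:
        return None
    value = float(match.group(1)) / 100.0
    return value if 0.0 <= value <= 1.0 else None


def exponential_rate(prob_at_horizon, horizon=3.0):
    """Solve S(horizon) = exp(-rate * horizon) = p for the rate."""
    # Clamp away from 0 and 1 so the logarithm stays finite.
    p = min(max(prob_at_horizon, 1e-6), 1.0 - 1e-6)
    return -math.log(p) / horizon


def exponential_curve(rate, times):
    """Evaluate S(t) = exp(-rate * t) on a list of times (in years)."""
    return [math.exp(-rate * t) for t in times]
```

A different pattern (for example, one that also accepts "0.65" without a percent sign) or a different parametric family would change the derived curves, so exact replication depends on the authors releasing these details.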
Abstract

We study multimodal survival analysis integrating clinical text, tabular covariates, and genomic profiles using locally deployable large language models (LLMs). As many institutions face tight computational and privacy constraints, this setting motivates the use of lightweight, on-premises models. Our approach jointly estimates calibrated survival probabilities and generates concise, evidence-grounded prognosis text via teacher-student distillation and principled multimodal fusion. On a TCGA cohort, it outperforms standard baselines, avoids reliance on cloud services and associated privacy concerns, and reduces the risk of hallucinated or miscalibrated estimates that can be observed in base LLMs.
