Multimodal Survival Analysis with Locally Deployable Large Language Models
This paper addresses multimodal survival analysis for clinical data, integrating pathology text, tabular covariates, and gene expression using locally deployable LLMs. The core innovation is a teacher-student distillation framework that trains a compact 1.5B-parameter causal LLM to jointly produce calibrated survival curves and concise prognosis explanations. This matters because cloud-hosted medical AI raises privacy concerns, yet heavyweight local models are impractical for many institutions.
The paper presents a technically sound approach to privacy-preserving multimodal survival analysis. The dual-pathway architecture (hidden-state survival estimation plus verbalized generation) and the exploration of early and late fusion strategies demonstrate careful engineering. However, the "locally deployable" framing is undermined by reliance on a 32B-parameter teacher model (DeepSeek-R1 Distill Qwen-32B) for distillation, which most institutions cannot host locally. The evaluation is also limited to a single TCGA cohort without external validation.
The multimodal fusion architecture is well designed, offering both early concatenation and late gated fusion options that allow pre-training of covariate and gene-expression heads separately from the LLM. The calibration correction mechanism, which masks text loss contributions when teacher predictions conflict with observed outcomes, is a principled attempt to handle teacher miscalibration. The empirical results show consistent gains from multimodal fusion over unimodal baselines, and the authors honestly report failure modes in qualitative analysis.
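To make the late gated fusion idea concrete, here is a minimal sketch in plain Python. It assumes a per-dimension sigmoid gate computed from the concatenated text and auxiliary (covariate or gene-expression) embeddings. The function name `gated_late_fusion`, the gate parameterization, and the requirement that both embeddings share a dimension are illustrative assumptions, not details taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_late_fusion(text_vec, aux_vec, gate_weights, gate_bias):
    """Blend two equal-length embeddings with a per-dimension sigmoid gate.

    Hypothetical parameterization (not from the paper):
    gate_weights[d] is a weight vector over the concatenation
    [text_vec; aux_vec], and gate_bias[d] is the matching bias.
    """
    if len(text_vec) != len(aux_vec):
        raise ValueError("embeddings must share a dimension")
    concat = list(text_vec) + list(aux_vec)
    fused = []
    for d in range(len(text_vec)):
        pre_activation = sum(w * x for w, x in zip(gate_weights[d], concat))
        gate = sigmoid(pre_activation + gate_bias[d])
        # gate near 1 keeps the text embedding, gate near 0 keeps the auxiliary one
        fused.append(gate * text_vec[d] + (1.0 - gate) * aux_vec[d])
    return fused


text = [1.0, 2.0]
aux = [3.0, 4.0]
zero_weights = [[0.0] * 4, [0.0] * 4]
# With zero weights and zero bias the gate is 0.5, so the result is the average.
averaged = gated_late_fusion(text, aux, zero_weights, [0.0, 0.0])
```

Because the gate only mixes two already-computed embeddings, the auxiliary encoder can in principle be trained on its own first and plugged in afterwards, which is the property the review highlights.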
The most significant issue is the performance gap between verbalized and hidden-state predictions: the verbalized pathway achieves $C^{\text{td}} = 0.626$ versus $0.765$ for hidden-state (discrete model, standard configuration), raising questions about the practical utility of the generated text beyond explainability. The calibration correction uses a crude binary threshold (50%) rather than probabilistic weighting. Additionally, the teacher model requires substantial compute (32B parameters), complicating the "locally deployable" narrative for the training phase. The evaluation relies solely on TCGA data without cross-dataset validation, and very long reports cause coherence degradation and missing probability statements.
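The difference between a binary 50% mask and probabilistic weighting can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it assumes the teacher's prediction is summarized as the probability of surviving past the patient's observed time, and that "conflict" means the teacher assigned less than 50% probability to what was actually observed. The function name `text_loss_weight` and the `soft` flag are hypothetical.

```python
def text_loss_weight(teacher_surv_at_t: float, event_observed: bool, soft: bool = False) -> float:
    """Weight applied to one patient's text (distillation) loss.

    teacher_surv_at_t: teacher's predicted probability of surviving past
        the patient's observed time t (assumed summary of the teacher output).
    event_observed: True if death was observed at t, False if censored at t.
    soft: if True, return a probabilistic weight instead of a 0/1 mask.
    """
    if not 0.0 <= teacher_surv_at_t <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Probability the teacher assigned to the outcome that actually occurred.
    if event_observed:
        agreement = 1.0 - teacher_surv_at_t
    else:
        agreement = teacher_surv_at_t
    if soft:
        return agreement  # probabilistic weighting
    return 1.0 if agreement >= 0.5 else 0.0  # binary threshold at 50%


# Teacher was 90% confident the patient would survive past t, but death was observed.
hard = text_loss_weight(0.9, event_observed=True)
graded = text_loss_weight(0.9, event_observed=True, soft=True)
```

With the hard mask, a teacher that is wrong at 51% confidence is treated the same as one that is wrong at 99%, whereas the soft variant down-weights the text loss in proportion to the disagreement.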
The comparison to BERTSurv is potentially unfair, since the authors reimplemented it without native gene expression handling and instead concatenated offline-learned GE latents to the text embeddings. The teacher model achieves surprisingly strong verbalized performance ($C^{\text{td}} = 0.746$), suggesting the 1.5B student has significant headroom for improvement. The convex combination of hidden-state and verbalized curves provides small but consistent gains (e.g., $C^{\text{td}} = 0.766$ versus $0.765$ for hidden-state alone), though the improvement is marginal given the complexity of the text generation pathway.
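For reference, a convex combination of two survival curves evaluated on the same time grid can be sketched as below. The function name `combine_curves` and the mixing weight `alpha` are illustrative; the review does not state how the paper chooses the weight.

```python
def combine_curves(hidden, verbalized, alpha):
    """Pointwise convex combination of two survival curves.

    hidden, verbalized: survival probabilities on a shared time grid.
    alpha: weight on the hidden-state curve, in [0, 1] (hypothetical name).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(hidden) != len(verbalized):
        raise ValueError("curves must share a time grid")
    return [alpha * h + (1.0 - alpha) * v for h, v in zip(hidden, verbalized)]


hidden_curve = [1.0, 0.8, 0.5]
verbalized_curve = [1.0, 0.6, 0.3]
mixed = combine_curves(hidden_curve, verbalized_curve, alpha=0.5)
```

If both inputs are non-increasing and bounded in [0, 1], the mixture is too, so the result remains a valid survival curve for any `alpha` in [0, 1].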
The paper provides detailed hyperparameters in Appendix C.4, including learning rates, layer freezing strategy (last 18 layers), and token truncation (820 tokens). However, no code repository or model weights are mentioned. Reproduction would require substantial compute for the 32B teacher model and careful replication of the TCGA preprocessing pipeline, which involves specific covariate grouping and gene expression imputation. The calibration correction threshold and exponential fitting of teacher outputs introduce implementation details not fully specified (e.g., exact regex patterns for probability extraction). Random seeds and exact training duration are not reported.
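To illustrate why the unspecified details matter, the sketch below shows one plausible way to extract probability statements from teacher text and fit an exponential survival curve to them. Everything here is an assumption: the regex `PROB_PATTERN`, the phrasing it expects ("N-year ... P%"), the helper names, and the choice of a least-squares fit of $-\log S(t) = \lambda t$ through the origin. The paper's actual patterns and fitting procedure may differ.

```python
import math
import re

# Hypothetical pattern: matches phrases such as "3-year survival: 50%".
PROB_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*-?\s*year[^0-9%]*?(\d+(?:\.\d+)?)\s*%",
    re.IGNORECASE,
)


def extract_survival_points(text):
    """Return (years, probability) pairs found in the text; empty if none."""
    return [
        (float(m.group(1)), float(m.group(2)) / 100.0)
        for m in PROB_PATTERN.finditer(text)
    ]


def fit_exponential_rate(points, eps=1e-6):
    """Fit S(t) = exp(-rate * t) by least squares on -log S(t) = rate * t."""
    if not points:
        raise ValueError("no probability statements to fit")
    numerator = 0.0
    denominator = 0.0
    for t, s in points:
        s = min(max(s, eps), 1.0)  # clamp to avoid log(0)
        numerator += t * (-math.log(s))
        denominator += t * t
    if denominator == 0.0:
        raise ValueError("need at least one point with t > 0")
    return numerator / denominator


report = "Estimated 1-year survival: 80%. Estimated 3-year survival: 50%."
points = extract_survival_points(report)
rate = fit_exponential_rate(points)
```

A generation with no matching statement yields an empty list here, so a reproduction would also need to decide how such samples are handled.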
We study multimodal survival analysis integrating clinical text, tabular covariates, and genomic profiles using locally deployable large language models (LLMs). As many institutions face tight computational and privacy constraints, this setting motivates the use of lightweight, on-premises models. Our approach jointly estimates calibrated survival probabilities and generates concise, evidence-grounded prognosis text via teacher-student distillation and principled multimodal fusion. On a TCGA cohort, it outperforms standard baselines, avoids reliance on cloud services and associated privacy concerns, and reduces the risk of hallucinated or miscalibrated estimates that can be observed in base LLMs.