Parameter-Efficient Fine-Tuning for Medical Text Summarization: A Comparative Study of LoRA, Prompt Tuning, and Full Fine-Tuning
Medical text summarization helps clinicians process millions of biomedical articles, but fine-tuning large language models demands prohibitive resources. This paper compares Low-Rank Adaptation (LoRA), Prompt Tuning, and full fine-tuning across Flan-T5-Small, Base, and Large on PubMed summarization. The counter-intuitive finding is that updating fewer than 1% of parameters via LoRA consistently outperforms full fine-tuning, suggesting that low-rank constraints provide effective regularization.
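To make the "fewer than 1% of parameters" figure concrete, the following minimal sketch counts the parameters LoRA adds to a single weight matrix. It assumes the standard LoRA formulation, in which a frozen weight is augmented by a trainable product of two rank-r factors. The 1024 x 1024 shape, the list of ranks, and the function names are illustrative assumptions, not the paper's reported configuration.

```python
def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns an update B @ A with A of shape (rank, d_in) and
    B of shape (d_out, rank), so it adds rank * (d_in + d_out) parameters.
    """
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Added LoRA parameters relative to the frozen matrix they adapt."""
    return lora_added_params(d_in, d_out, rank) / (d_in * d_out)


if __name__ == "__main__":
    # Illustrative shape only (assumption): a square 1024 x 1024 projection.
    d = 1024
    for rank in (4, 8, 16, 32, 64):
        added = lora_added_params(d, d, rank)
        print(f"r={rank:>2}: {added:>7} added params, "
              f"{trainable_fraction(d, d, rank):.2%} of this matrix")
```

The per-matrix fraction is typically higher than the whole-model fraction, because LoRA is usually applied to only some matrices while embeddings and the remaining layers stay frozen.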
The paper presents a competent empirical comparison with statistical rigor (multiple seeds, significance tests), but its central claim, that LoRA outperforms full fine-tuning, warrants cautious interpretation. The finding that LoRA achieves 43.52±0.18 ROUGE-1 versus 40.67±0.21 for full fine-tuning on Flan-T5-Large is statistically significant, but the effect size must be weighed against evaluation limitations. The work is sound as a systematic benchmark within its scope, though generalization claims beyond the Flan-T5 architecture and PubMed dataset remain unvalidated.
The experimental design demonstrates proper methodological hygiene: three random seeds per configuration with mean±stderr reporting, paired t-tests for statistical significance, and comprehensive hyperparameter sensitivity analyses. The LoRA rank sweep (r ∈ {4,8,16,32,64}) showing diminishing returns beyond r=16 is a useful practical finding. The efficiency comparison is clearly presented, establishing LoRA's utility for resource-constrained deployment. The paper honestly reports limitations including single-dataset evaluation and lack of human assessment.
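The reporting scheme described here (mean±stderr over seeds plus a paired t-test) can be sketched with the Python standard library alone. The per-seed scores below are hypothetical placeholders, not values from the paper, and the helper names are assumptions. The constant is the two-sided 5% critical value of Student's t with two degrees of freedom, which is what a paired test over three seeds uses.

```python
import math
import statistics

# Two-sided 5% critical value of Student's t with 2 degrees of freedom.
T_CRIT_DF2 = 4.303


def mean_stderr(scores):
    """Return (mean, standard error of the mean) using the sample std (n - 1)."""
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for seed-matched scores."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_d, se_d = mean_stderr(diffs)
    if se_d == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean_d / se_d, len(diffs) - 1


if __name__ == "__main__":
    # Hypothetical ROUGE-1 scores for seeds 0, 1, 2 (NOT from the paper).
    lora = [43.2, 43.5, 43.9]
    full = [40.4, 40.6, 41.0]
    for name, scores in (("LoRA", lora), ("Full FT", full)):
        m, se = mean_stderr(scores)
        print(f"{name}: {m:.2f} ± {se:.2f}")
    t, df = paired_t(lora, full)
    print(f"paired t = {t:.2f} (df = {df}), significant at 5%: {abs(t) > T_CRIT_DF2}")
```

Pairing by seed only makes sense if the same seed controls the same sources of randomness (data order, initialization of new parameters) in both methods; whether the paper pairs runs this way is not stated in the review.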
The central claim that LoRA outperforms full fine-tuning, while statistically significant in these experiments, contradicts extensive prior work showing full fine-tuning typically achieves superior downstream performance when resources permit. The paper offers post-hoc speculation about "implicit regularization" and "catastrophic forgetting," but this explanation is untested. An alternative hypothesis—suboptimal hyperparameter tuning for full fine-tuning—is not ruled out. The study truncates PubMed articles averaging 3,100 words to 512 tokens, fundamentally altering the summarization task from long-document to medium-text summarization. With only three random seeds, variance estimates may be unreliable. The effect size also decreases with model scale (+4.57 for Small, +2.85 for Large), suggesting the phenomenon may attenuate.
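A back-of-the-envelope sketch shows how much of an average article survives the 512-token cutoff. The tokens-per-word ratio is an assumption: subword tokenizers commonly produce somewhat more than one token per English word, and biomedical vocabulary may push the ratio higher. The review does not give the paper's actual tokenizer statistics, so the ratios below are illustrative only.

```python
def retained_fraction(avg_words: float, max_tokens: int, tokens_per_word: float) -> float:
    """Approximate fraction of an article kept after truncating to max_tokens."""
    total_tokens = avg_words * tokens_per_word
    return min(1.0, max_tokens / total_tokens)


if __name__ == "__main__":
    AVG_WORDS = 3100   # average article length cited in the review
    MAX_TOKENS = 512   # truncation length cited in the review
    # Assumed tokens-per-word ratios, for illustration only.
    for ratio in (1.0, 1.3, 1.5):
        kept = retained_fraction(AVG_WORDS, MAX_TOKENS, ratio)
        print(f"tokens/word={ratio}: about {kept:.0%} of the article is kept")
```

Under these assumed ratios, roughly one sixth or less of the average article reaches the model, which is why the truncation changes the nature of the task rather than merely trimming it.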
The evidence supports the quantitative claims within the experimental scope, but the comparison to related work has gaps. The paper cites Aghajanyan et al. (2021) to support the "low-rank constraint as implicit regularization" hypothesis, but that work addresses the intrinsic dimensionality of fine-tuning rather than LoRA specifically. The citation to Van Veen et al. (2024), used to claim that LLMs can "outperform human experts," refers to a Research Square preprint rather than peer-reviewed work. The authors appropriately note that their results are specific to Flan-T5 and may not transfer to decoder-only architectures (LLaMA, GPT), a crucial limitation given the field's shift toward these models.
Reproducibility is moderately supported. The code repository is publicly linked (https://github.com/eracoding/llm-medical-summarization), datasets are standard (PubMed), and model checkpoints are from Hugging Face. Table 1 documents hyperparameters for training, though the complete table is not provided in the extracted text. Critical missing details include: full optimization hyperparameters (learning rate schedules, warmup steps, batch size specifics), whether gradient accumulation was used, and whether early stopping based on validation metrics was applied. The paper states training was conducted on an NVIDIA RTX A6000 with 48GB memory, but does not report training duration or convergence curves. Without these details, assessing training stability and reproducing the claimed results would require substantial effort.
Fine-tuning large language models for domain-specific tasks such as medical text summarization demands substantial computational resources. Parameter-efficient fine-tuning (PEFT) methods offer promising alternatives by updating only a small fraction of parameters. This paper compares three adaptation approaches, Low-Rank Adaptation (LoRA), Prompt Tuning, and Full Fine-Tuning, across the Flan-T5 model family on the PubMed medical summarization dataset. Through experiments with multiple random seeds, we demonstrate that LoRA consistently outperforms full fine-tuning, achieving 43.52±0.18 ROUGE-1 on Flan-T5-Large with only 0.6% trainable parameters compared to 40.67±0.21 for full fine-tuning. Sensitivity analyses examine the impact of LoRA rank and prompt token count. Our findings suggest the low-rank constraint provides beneficial regularization, challenging assumptions about the necessity of full parameter updates. Code is available at https://github.com/eracoding/llm-medical-summarization.