Parameter-Efficient Fine-Tuning for Medical Text Summarization: A Comparative Study of LoRA, Prompt Tuning, and Full Fine-Tuning

cs.CL cs.AI Ulugbek Shernazarov, Rostislav Svitsov, Bin Shi · Mar 23, 2026

Plain-language introduction

Medical text summarization helps clinicians process millions of biomedical articles, but fine-tuning large language models demands prohibitive resources. This paper compares Low-Rank Adaptation (LoRA), Prompt Tuning, and full fine-tuning across Flan-T5-Small, Base, and Large on PubMed summarization. The counter-intuitive finding is that updating fewer than 1% of parameters via LoRA consistently outperforms full fine-tuning, suggesting that low-rank constraints provide effective regularization.

Critical review

Verdict

Bottom line

The paper presents a competent empirical comparison with statistical rigor (multiple seeds, significance tests), but its central claim, that LoRA outperforms full fine-tuning, warrants cautious interpretation. The reported gap on Flan-T5-Large (43.52±0.18 ROUGE-1 for LoRA versus 40.67±0.21 for full fine-tuning, a difference of 2.85 points) is statistically significant, but the effect size must be weighed against the evaluation's limitations. The work is sound as a systematic benchmark within its scope, though claims of generalization beyond the Flan-T5 architecture and the PubMed dataset remain unvalidated.

“LoRA obtains 43.52±0.18 ROUGE-1 while updating only 0.6% of parameters, substantially exceeding the 40.67±0.21 achieved by full fine-tuning”

paper · Section 4.1

“The performance gap is statistically significant (p<0.01, paired t-test)”

paper · Section 4.2
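
For readers unfamiliar with the procedure, the following is a minimal sketch of a paired t-test over seeds, using only the Python standard library. The per-seed ROUGE-1 values are hypothetical placeholders, not numbers from the paper; only the mechanics of the test are illustrated. With three seeds the test has two degrees of freedom, so the two-sided critical value at the 0.01 level is about 9.925.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences are constant; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# HYPOTHETICAL per-seed ROUGE-1 scores, paired by seed (not from the paper).
lora_scores = [43.4, 43.5, 43.7]
full_ft_scores = [40.5, 40.7, 40.8]

t_stat, dof = paired_t_statistic(lora_scores, full_ft_scores)
T_CRIT_P01_DF2 = 9.925  # two-sided critical value for alpha = 0.01, df = 2
print(f"t = {t_stat:.2f}, df = {dof}, "
      f"significant at 0.01: {abs(t_stat) > T_CRIT_P01_DF2}")
```

Because the threshold is so high at two degrees of freedom, a significant result here depends heavily on the per-seed differences being nearly constant, which is one reason a larger number of seeds would make the conclusion more robust.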

What holds up

The experimental design demonstrates proper methodological hygiene: three random seeds per configuration with mean±stderr reporting, paired t-tests for statistical significance, and comprehensive hyperparameter sensitivity analyses. The LoRA rank sweep (r ∈ {4,8,16,32,64}) showing diminishing returns beyond r=16 is a useful practical finding. The efficiency comparison is clearly presented, establishing LoRA's utility for resource-constrained deployment. The paper honestly reports limitations including single-dataset evaluation and lack of human assessment.

“Performance improves substantially from r=4 to r=16, with diminishing returns beyond r=16”

paper · Section 4.3

“LoRA requires only 4.7 million trainable parameters compared to 783 million for full fine-tuning (166× reduction)”

paper · Table 3
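
The quoted efficiency figures can be checked with a few lines of arithmetic, and the same reasoning shows why the trainable count grows linearly with rank. The sketch below assumes the standard LoRA formulation, in which a d×k weight update is factored into two matrices of shapes d×r and r×k. The 1024×1024 layer shape is an illustrative assumption, not a dimension taken from the paper.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters when a d x k update is factored as (d x r) @ (r x k)."""
    return r * (d + k)

# Figures quoted from the paper's Table 3.
LORA_TRAINABLE = 4.7e6
FULL_FT_TRAINABLE = 783e6

reduction = FULL_FT_TRAINABLE / LORA_TRAINABLE
fraction = LORA_TRAINABLE / FULL_FT_TRAINABLE
print(f"reduction: {reduction:.1f}x, trainable fraction: {fraction:.2%}")

# ASSUMED layer shape, for illustration only (not taken from the paper).
D = K = 1024
for r in (4, 8, 16, 32, 64):  # rank values from the sweep in Section 4.3
    adapted = lora_params(D, K, r)
    dense = D * K
    print(f"r={r:>2}: {adapted:>7,} trainable vs {dense:,} dense "
          f"({adapted / dense:.2%})")
```

The ratio works out to roughly 166.6, in line with the quoted 166× and the 0.6% trainable fraction cited elsewhere in the review.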

Main concerns

The central claim that LoRA outperforms full fine-tuning, while statistically significant in these experiments, runs counter to extensive prior work showing that full fine-tuning typically achieves superior downstream performance when resources permit. The paper offers post-hoc speculation about "implicit regularization" and "catastrophic forgetting," but this explanation is untested. An alternative hypothesis, that the hyperparameters for full fine-tuning were suboptimally tuned, is not ruled out. The study truncates PubMed articles averaging 3,100 words to 512 tokens, which fundamentally alters the task from long-document to medium-text summarization. With only three random seeds, variance estimates may be unreliable, since each paired t-test then rests on just two degrees of freedom. The ROUGE-1 advantage of LoRA also shrinks with model scale (+4.57 for Small, +2.85 for Large), suggesting the phenomenon may attenuate in larger models.

“Articles average approximately 3,100 words...Input sequences were tokenized...with a maximum input length of 512 tokens and maximum output length of 150 tokens”

paper · Section 3.1
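
To get a feel for how much of each article survives truncation, the following back-of-the-envelope sketch converts the quoted word count into tokens. The tokens-per-word ratios are assumptions (subword tokenizers generally produce at least one token per English word, and biomedical vocabulary may push the ratio higher), so the result is an order-of-magnitude estimate rather than a measurement.

```python
AVG_WORDS_PER_ARTICLE = 3100   # quoted from Section 3.1
MAX_INPUT_TOKENS = 512         # quoted from Section 3.1

def retained_fraction(avg_words: int, max_tokens: int, tokens_per_word: float) -> float:
    """Estimated share of an average article that fits in the input window."""
    if tokens_per_word <= 0:
        raise ValueError("tokens_per_word must be positive")
    est_tokens = avg_words * tokens_per_word
    return min(1.0, max_tokens / est_tokens)

# ASSUMED tokens-per-word ratios; the paper does not report one.
for ratio in (1.0, 1.3, 1.6):
    share = retained_fraction(AVG_WORDS_PER_ARTICLE, MAX_INPUT_TOKENS, ratio)
    print(f"tokens/word={ratio}: about {share:.0%} of the article is seen")
```

Even under the most generous assumption of one token per word, the model sees only about a sixth of an average article, which supports reading the benchmark as a medium-text task.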

“We propose three complementary mechanisms. First, the low-rank constraint functions as implicit regularization...Second, LoRA preserves pre-trained representations...Third, the smaller parameter space may yield smoother loss landscapes”

paper · Section 5

“The relative improvement of LoRA over Full FT is actually largest on Small (+4.57) and smallest on Large (+2.85)”

paper · Section 5

Evidence and comparison

The evidence supports the quantitative claims within the experimental scope, but the comparison to related work has gaps. The paper cites Aghajanyan et al. (2021) to support the "low-rank constraint as implicit regularization" hypothesis, but that work discusses the intrinsic dimensionality of fine-tuning rather than LoRA specifically. The citation to Van Veen et al. (2024), used to claim that LLMs can "outperform human experts," refers to a Research Square preprint rather than peer-reviewed work. The authors appropriately note that their results are specific to Flan-T5 and may not transfer to decoder-only architectures (LLaMA, GPT), a crucial limitation given the field's shift toward these models.

“Second, results are specific to the Flan-T5 architecture; other model families (e.g., LLaMA, GPT, Mistral) may exhibit different patterns with PEFT methods”

paper · Section 5

“Van Veen et al. [9] conducted an extensive evaluation showing that large language models adapted to clinical text can match or exceed the performance of human medical experts on summarization tasks”

paper · Section 2

Reproducibility

Reproducibility is moderately supported. The code repository is publicly linked (https://github.com/eracoding/llm-medical-summarization), datasets are standard (PubMed), and model checkpoints are from Hugging Face. Table 1 documents hyperparameters for training, though the complete table is not provided in the extracted text. Critical missing details include: full optimization hyperparameters (learning rate schedules, warmup steps, batch size specifics), whether gradient accumulation was used, and whether early stopping based on validation metrics was applied. The paper states training was conducted on an NVIDIA RTX A6000 with 48GB memory, but does not report training duration or convergence curves. Without these details, assessing training stability and reproducing the claimed results would require substantial effort.

“Code is available at https://github.com/eracoding/llm-medical-summarization”

paper · Abstract

“Table 1 presents the complete training hyperparameters and configuration details used in our experiments”

paper · Section 3
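
One way to make such gaps explicit is a small run manifest that records what is known and flags what is not. The sketch below is hypothetical: the `RunManifest` class and its field names are invented for illustration and do not come from the paper or its repository. The pre-filled values are those mentioned in this review, and `None` marks details the review lists as missing.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class RunManifest:
    """Hypothetical checklist of details needed to reproduce one training run."""
    # Reported in the paper, per the review above.
    gpu: Optional[str] = "NVIDIA RTX A6000 (48GB)"
    num_seeds: Optional[int] = 3
    max_input_tokens: Optional[int] = 512
    max_output_tokens: Optional[int] = 150
    # Flagged by the review as missing or incomplete; None marks "unknown".
    lr_schedule: Optional[str] = None
    warmup_steps: Optional[int] = None
    batch_size: Optional[int] = None
    gradient_accumulation_steps: Optional[int] = None
    early_stopping: Optional[bool] = None
    training_hours: Optional[float] = None

def missing_fields(manifest: RunManifest) -> list[str]:
    """Names of fields that still have no recorded value."""
    return [f.name for f in fields(manifest) if getattr(manifest, f.name) is None]

print(missing_fields(RunManifest()))
```

A manifest like this, published alongside each result row, would let a reader see at a glance which settings must be guessed when rerunning an experiment.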

Abstract

Fine-tuning large language models for domain-specific tasks such as medical text summarization demands substantial computational resources. Parameter-efficient fine-tuning (PEFT) methods offer promising alternatives by updating only a small fraction of parameters. This paper compares three adaptation approaches – Low-Rank Adaptation (LoRA), Prompt Tuning, and Full Fine-Tuning – across the Flan-T5 model family on the PubMed medical summarization dataset. Through experiments with multiple random seeds, we demonstrate that LoRA consistently outperforms full fine-tuning, achieving 43.52±0.18 ROUGE-1 on Flan-T5-Large with only 0.6% trainable parameters compared to 40.67±0.21 for full fine-tuning. Sensitivity analyses examine the impact of LoRA rank and prompt token count. Our findings suggest the low-rank constraint provides beneficial regularization, challenging assumptions about the necessity of full parameter updates. Code is available at https://github.com/eracoding/llm-medical-summarization
