Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLMs

cs.CL, cs.AI, cs.LG · Mariela M. Nina, Caio Veloso Costa, Lilian Berton, Didier A. Vega-Oliveros · Mar 22, 2026
AI Review
Plain-language introduction

This paper addresses computational barriers for Brazilian Portuguese question answering by systematically evaluating Parameter-Efficient Fine-Tuning (PEFT) methods on BERTimbau models using the SQuAD-BR dataset. The authors test LoRA, DoRA, QLoRA, and QDoRA across the Base (110M) and Large (335M) variants and show that, on BERTimbau-Large, LoRA reaches 95.8% of the full fine-tuning F1 while reducing training time by 73.5%. A key finding is that PEFT methods need a substantially higher learning rate ($2\times 10^{-4}$) than standard BERT fine-tuning to reach their best results. The larger model is also more resilient to 4-bit quantization.

Critical review
Verdict
Bottom line

The paper delivers a solid empirical contribution validating three hypotheses regarding PEFT efficiency, scale robustness to 4-bit quantization, and optimization sensitivity. The experimental design is rigorous, with full fine-tuning baselines re-executed under identical hardware conditions for fair comparison. The striking finding that full fine-tuning of BERTimbau-Large collapses at high learning rates ($lr=2\times 10^{-4}$, F1=3.02) while PEFT methods remain stable (LoRA: F1=81.32) provides compelling evidence that low-rank updates act as implicit regularizers. The exploratory evaluation of generative LLMs is appropriately scoped as supplementary analysis.

“In BERTimbau-Large, full fine-tuning collapses almost completely (F1=3.02 with $lr=2\times 10^{-4}$), while PEFT methods remain stable.”
paper · Section 6
“LoRA achieves 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5% (F1=81.32 vs 84.86)”
paper · Abstract
What holds up

The core empirical findings regarding LoRA's efficiency gains are robust and well-documented across 40 experimental configurations. The quantization resilience analysis, which shows BERTimbau-Large losing only 4.83 F1 points with QLoRA versus 9.56 for BERTimbau-Base, provides actionable guidance for practitioners with consumer-grade GPUs. The memory consumption benchmarks are comprehensive: QLoRA reduces peak GPU memory by 86.9% for Base (1,897 MB vs 14,493 MB) and 81.9% for Large (3,281 MB vs 18,125 MB), leaving ample headroom for training the Large model on a single GPU with 20 GB of VRAM.

“QLoRA on Large shows a degradation of only $-4.83$ F1 points (vs $-9.56$ on Base), an approximate $2\times$ difference”
paper · Section 5.2
“QLoRA achieves the largest reductions: 86.9% for Base (1,897 MB vs. 14,493 MB) and 81.9% for Large (3,281 MB vs. 18,125 MB)”
paper · Section 5.3
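
The derived percentages quoted above can be re-computed from the raw figures as a quick sanity check. The sketch below uses only numbers that appear in the quoted passages; the helper name `pct_reduction` and the variable names are illustrative and do not come from the paper.

```python
def pct_reduction(new: float, old: float) -> float:
    """Percentage by which `new` is smaller than `old`."""
    return 100.0 * (1.0 - new / old)


# F1 on BERTimbau-Large: full fine-tuning vs LoRA (figures from the abstract).
f1_full_ft, f1_lora = 84.86, 81.32
retention = 100.0 * f1_lora / f1_full_ft

# F1 points lost under QLoRA (figures from Section 5.2).
drop_base, drop_large = 9.56, 4.83
resilience_ratio = drop_base / drop_large

# Peak GPU memory in MB, QLoRA vs full fine-tuning (figures from Section 5.3).
mem_cut_base = pct_reduction(1897, 14493)
mem_cut_large = pct_reduction(3281, 18125)

print(f"LoRA retains {retention:.1f}% of full fine-tuning F1")
print(f"Base loses {resilience_ratio:.2f}x as many F1 points as Large")
print(f"Memory reduction: Base {mem_cut_base:.1f}%, Large {mem_cut_large:.1f}%")
```

The results agree with the reported 95.8%, the "approximate $2\times$" difference, and the 86.9% and 81.9% memory reductions.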
Main concerns

The comparison with generative LLMs (Tucano, Sabiá), while explicitly framed as exploratory, suffers from a structural mismatch: decoder-only autoregressive models are evaluated on an extractive span-selection benchmark designed for encoder architectures, which compromises the validity of direct F1 comparisons. The dismissal of DoRA also appears premature. Its 28% training overhead might be justified in scenarios that need better optimization stability, or under hyperparameter configurations the grid search did not cover, since the rank was fixed at $r=16$. Additionally, the study's scope is limited to a single dataset (SQuAD-BR), leaving open questions about generalization to other Portuguese QA domains or more challenging reasoning tasks.

“Unlike BERTimbau, these models are designed for free-text generation, which introduces a structural mismatch with SQuAD's exact-span extraction evaluation protocol.”
paper · Section 5.5
“DoRA introduces a consistent temporal overhead of approximately 28% in BERTimbau-Large without clear improvements in F1 or Exact Match.”
paper · Section 6
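
To make the structural mismatch concrete, here is a minimal sketch of SQuAD-style Exact Match and token-level F1. It is a simplification, not the official evaluation script: normalization here only lowercases and strips ASCII punctuation, and the gold answer and both predictions are invented for illustration rather than taken from SQuAD-BR.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(pred: str, gold: str) -> int:
    """1 if the normalized token sequences are identical, else 0."""
    return int(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold span."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: both predictions contain the right answer.
gold = "em 1822"
extractive_pred = "em 1822"
generative_pred = "O Brasil se tornou independente em 1822."

for name, pred in [("extractive", extractive_pred), ("generative", generative_pred)]:
    print(name, exact_match(pred, gold), round(token_f1(pred, gold), 3))
```

In this toy case the free-text prediction is factually correct, yet it scores zero on Exact Match and loses more than half of its F1, because every extra token lowers precision. This is the kind of penalty that makes direct F1 comparisons between span extractors and generators hard to interpret.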
Evidence and comparison

The evidence strongly supports the central claims about PEFT efficiency relative to full fine-tuning, with comparisons conducted under identical hardware (NVIDIA RTX A4500) and software conditions. The authors appropriately note minor discrepancies between their re-executed baselines and original BERTimbau results due to "distinct software versions and random seeds," which slightly weakens absolute performance claims but preserves the validity of relative comparisons. The paper adequately contextualizes its work against prior PEFT literature (LoRA, DoRA, QLoRA). However, the generative model evaluation lacks the methodological rigor of the main experiments, focusing primarily on resource trade-offs rather than architectural suitability.

“The Full FT baseline values reported here are the result of our own re-execution under identical hardware and software conditions, enabling a fair comparison; they are close to, though not identical to, the original values reported by Souza et al. (2020), as we expect minor differences due to distinct software versions and random seeds.”
paper · Section 5.1
Reproducibility

The paper provides strong reproducibility documentation including exact hyperparameters: LoRA rank $r=16$, scaling factor $\alpha=32$, target modules (Q, K, V, O projections), dropout 0.1, and specific learning rates ($2\times 10^{-4}$ vs $4.25\times 10^{-5}$). The authors report optimizer details (AdamW, weight decay 0.01), batch sizes (16 for Base, 8 for Large), sequence length (384), gradient clipping (norm 1.0), and complete software stack versions (PyTorch 2.1.0, Transformers 4.36.0, PEFT 0.7.1, bitsandbytes 0.41.0, CUDA 12.2). A GitHub repository is referenced for prompts. The primary gap is the absence of reported random seeds, which could affect exact replication of F1 scores.

“Software stack: CUDA 12.2, PyTorch 2.1.0, Transformers 4.36.0, PEFT 0.7.1, and bitsandbytes 0.41.0.”
paper · Section 4.3
“LoRA rank $r=16$, scaling factor $\alpha=32$, target modules (the query, key, value, and output projection matrices of the attention mechanism), and a dropout rate of 0.1.”
paper · Section 4.2
“https://github.com/GPAM-ai/Efficient-FineTunning-QA-PEFT.git”
paper · Footnote 2
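
As a rough illustration of what the reported LoRA settings imply, the sketch below estimates how many adapter parameters $r=16$ adds when applied to the four attention projections. It assumes the standard BERT shapes (hidden size 768 with 12 layers for Base, 1024 with 24 layers for Large) and counts only the two low-rank factors per adapted matrix. The trainable-parameter totals reported in the paper may differ, for example because the QA head is also trained. `EncoderShape` and `lora_adapter_params` are illustrative names, not part of the PEFT library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderShape:
    hidden_size: int
    num_layers: int


def lora_adapter_params(shape: EncoderShape, rank: int, matrices_per_layer: int = 4) -> int:
    """Parameters in the low-rank factors A (rank x d) and B (d x rank).

    Assumes every adapted projection is square (d x d), which holds for the
    query, key, value, and attention-output projections in BERT.
    """
    per_matrix = rank * (shape.hidden_size + shape.hidden_size)
    return per_matrix * matrices_per_layer * shape.num_layers


# Assumed standard BERT shapes; not stated in the review itself.
BASE = EncoderShape(hidden_size=768, num_layers=12)
LARGE = EncoderShape(hidden_size=1024, num_layers=24)

RANK, ALPHA = 16, 32  # values reported in Section 4.2

base_adapter = lora_adapter_params(BASE, RANK)
large_adapter = lora_adapter_params(LARGE, RANK)
scaling = ALPHA / RANK  # standard LoRA scales the update B @ A by alpha / r

print(f"Base adapters:  {base_adapter:,} parameters")
print(f"Large adapters: {large_adapter:,} parameters")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapters amount to roughly 1% of the 110M and 335M backbone parameters, which helps explain why memory use and training time fall so sharply.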
Abstract

Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M, Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5% (F1=81.32 vs 84.86); (2) higher learning rates (2e-4) substantially improve PEFT performance, with F1 gains of up to +19.71 points over standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese QA with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with Green AI principles. An exploratory evaluation of Tucano and Sabiá on the same extractive QA benchmark shows that while generative models can reach competitive F1 scores with LoRA fine-tuning, they require up to $4.2\times$ more GPU memory and $3\times$ more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.
