Efficient Fine-Tuning Methods for Portuguese Question Answering: A Comparative Study of PEFT on BERTimbau and Exploratory Evaluation of Generative LLMs
This paper addresses computational barriers for Brazilian Portuguese question answering by systematically evaluating Parameter-Efficient Fine-Tuning (PEFT) methods on BERTimbau models using the SQuAD-BR dataset. The authors test LoRA, DoRA, QLoRA, and QDoRA across Base (110M) and Large (335M) variants, demonstrating that LoRA achieves 95.8% of full fine-tuning performance while reducing training time by 73.5%. A key finding is that PEFT methods require substantially higher learning rates ($2\times 10^{-4}$) than standard BERT fine-tuning to achieve optimal results. A second is that larger models are more resilient to 4-bit quantization.
The paper delivers a solid empirical contribution validating three hypotheses regarding PEFT efficiency, scale robustness to 4-bit quantization, and optimization sensitivity. The experimental design is rigorous, with full fine-tuning baselines re-executed under identical hardware conditions for fair comparison. The striking finding that full fine-tuning of BERTimbau-Large collapses at high learning rates ($lr=2\times 10^{-4}$, F1=3.02) while PEFT methods remain stable (LoRA: F1=81.32) provides compelling evidence that low-rank updates act as implicit regularizers. The exploratory evaluation of generative LLMs is appropriately scoped as supplementary analysis.
The core empirical findings regarding LoRA's efficiency gains are robust and well-documented across 40 experimental configurations. The quantization resilience analysis, which shows that BERTimbau-Large loses only 4.83 F1 points with QLoRA versus 9.56 for the Base model, provides actionable guidance for practitioners with consumer-grade GPUs. The memory consumption benchmarks are comprehensive: QLoRA reduces peak GPU memory by 86.9% for Base (1,897 MB vs 14,493 MB) and 81.9% for Large (3,281 MB vs 18,125 MB), enabling training of the Large model on a single GPU with 20 GB of VRAM.
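The percentages quoted above follow directly from the raw figures. As a quick sanity check, the short sketch below recomputes them from the numbers cited in this review; the helper name `pct_reduction` is illustrative and does not come from the paper.

```python
def pct_reduction(baseline: float, reduced: float) -> float:
    """Percentage by which `reduced` is smaller than `baseline`."""
    return 100.0 * (1.0 - reduced / baseline)


# Peak GPU memory in MB (full fine-tuning vs QLoRA), as quoted above.
mem_base = pct_reduction(14_493, 1_897)
mem_large = pct_reduction(18_125, 3_281)

# F1 points lost under QLoRA (Base vs Large), as quoted above.
loss_ratio = 9.56 / 4.83

print(f"Base memory reduction:    {mem_base:.1f}%")
print(f"Large memory reduction:   {mem_large:.1f}%")
print(f"Base/Large F1-loss ratio: {loss_ratio:.2f}x")
```

The recomputed reductions round to 86.9% and 81.9%, and the F1-loss ratio of about 1.98 is consistent with the description of the Large model as roughly twice as resilient to quantization.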
The comparison with generative LLMs (Tucano, Sabiá), while explicitly framed as exploratory, suffers from a structural mismatch: decoder-only autoregressive models are evaluated on an extractive span-selection benchmark designed for encoder architectures, compromising the validity of direct F1 comparisons. The dismissal of DoRA appears premature: the 28% training overhead might be justified in scenarios requiring better optimization stability, or under hyperparameter configurations other than the fixed rank ($r=16$) used in the grid search. Additionally, the study's scope is limited to a single dataset (SQuAD-BR), leaving open questions about generalization to other Portuguese QA domains or more challenging reasoning tasks.
The evidence strongly supports the central claims about PEFT efficiency relative to full fine-tuning, with comparisons conducted under identical hardware (NVIDIA RTX A4500) and software conditions. The authors appropriately note minor discrepancies between their re-executed baselines and original BERTimbau results due to "distinct software versions and random seeds," which slightly weakens absolute performance claims but preserves the validity of relative comparisons. The paper adequately contextualizes its work against prior PEFT literature (LoRA, DoRA, QLoRA). However, the generative model evaluation lacks the methodological rigor of the main experiments, focusing primarily on resource trade-offs rather than architectural suitability.
The paper provides strong reproducibility documentation including exact hyperparameters: LoRA rank $r=16$, scaling factor $\alpha=32$, target modules (Q, K, V, O projections), dropout 0.1, and specific learning rates ($2\times 10^{-4}$ vs $4.25\times 10^{-5}$). The authors report optimizer details (AdamW, weight decay 0.01), batch sizes (16 for Base, 8 for Large), sequence length (384), gradient clipping (norm 1.0), and complete software stack versions (PyTorch 2.1.0, Transformers 4.36.0, PEFT 0.7.1, bitsandbytes 0.41.0, CUDA 12.2). A GitHub repository is referenced for prompts. The primary gap is the absence of reported random seeds, which could affect exact replication of F1 scores.
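To make the reported LoRA settings concrete, the sketch below estimates how many adapter parameters rank $r=16$ adds when applied to the Q, K, V, and O projections. It assumes the standard BERT shapes for BERTimbau (12 layers with hidden size 768 for Base, 24 layers with hidden size 1024 for Large), which are not stated in this review. It counts only the low-rank matrices, ignoring the QA head and any other trainable parameters, and the function name is illustrative.

```python
def lora_adapter_params(hidden: int, layers: int, rank: int = 16,
                        modules_per_layer: int = 4) -> int:
    """Count LoRA parameters for square (hidden x hidden) projections.

    Each adapted projection gains A (rank x hidden) and B (hidden x rank).
    """
    per_module = rank * hidden + hidden * rank
    return layers * modules_per_layer * per_module


# Assumed standard BERT shapes (an assumption, not taken from the paper).
base = lora_adapter_params(hidden=768, layers=12)
large = lora_adapter_params(hidden=1024, layers=24)

# LoRA scales the low-rank update by alpha / r; here alpha=32 and r=16.
scaling = 32 / 16

print(f"Base adapter params:    {base:,}")
print(f"Large adapter params:   {large:,}")
print(f"Update scaling alpha/r: {scaling}")
```

Under these assumptions the adapters come to roughly 1.2M parameters for Base and 3.1M for Large, about 1% of the 110M and 335M backbones, with the low-rank update scaled by a factor of 2.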
Although large language models have transformed natural language processing, their computational costs create accessibility barriers for low-resource languages such as Brazilian Portuguese. This work presents a systematic evaluation of Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques applied to BERTimbau for Question Answering on SQuAD-BR, the Brazilian Portuguese translation of SQuAD v1. We evaluate 40 configurations combining four PEFT methods (LoRA, DoRA, QLoRA, QDoRA) across two model sizes (Base: 110M, Large: 335M parameters). Our findings reveal three critical insights: (1) LoRA achieves 95.8% of baseline performance on BERTimbau-Large while reducing training time by 73.5% (F1=81.32 vs 84.86); (2) higher learning rates ($2\times 10^{-4}$) substantially improve PEFT performance, with F1 gains of up to +19.71 points over standard rates; and (3) larger models show twice the quantization resilience (loss of 4.83 vs 9.56 F1 points). These results demonstrate that encoder-based models can be efficiently fine-tuned for extractive Brazilian Portuguese QA with substantially lower computational cost than large generative LLMs, promoting more sustainable approaches aligned with Green AI principles. An exploratory evaluation of Tucano and Sabiá on the same extractive QA benchmark shows that while generative models can reach competitive F1 scores with LoRA fine-tuning, they require up to 4.2$\times$ more GPU memory and 3$\times$ more training time than BERTimbau-Base, reinforcing the efficiency advantage of smaller encoder-based architectures for this task.