ViCLSR: A Supervised Contrastive Learning Framework with Natural Language Inference for Natural Language Understanding Tasks

cs.CL cs.AI cs.LG Tin Van Huynh, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen · Mar 22, 2026


AI Review

Plain-language introduction

ViCLSR adapts supervised contrastive learning (SimCSE-style) to Vietnamese NLU by converting NLI entailment and contradiction pairs into positive and negative training signals. Built on XLM-R Large (550M), the framework improves sentence embeddings for low-resource Vietnamese, reporting gains of +6.97% F1 over PhoBERT on ViNLI and state-of-the-art results across five downstream tasks including fact-checking and machine reading comprehension.
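The conversion of NLI pairs into contrastive training signals described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the row fields (`premise`, `hypothesis`, `label`), the function name `build_triplets`, and the choice to pair every entailment hypothesis with every contradiction hypothesis of the same premise are all assumptions. The review only confirms that entailment pairs act as positives, contradiction pairs as negatives, and neutral pairs are dropped.

```python
from collections import defaultdict

def build_triplets(nli_rows):
    """Turn labeled NLI rows into (anchor, positive, negative) triplets.

    Assumed scheme: the premise is the anchor, an entailed hypothesis is
    the positive, a contradicted hypothesis is the negative, and neutral
    rows are skipped.
    """
    by_premise = defaultdict(lambda: {"entailment": [], "contradiction": []})
    for row in nli_rows:
        label = row["label"]
        if label in ("entailment", "contradiction"):
            by_premise[row["premise"]][label].append(row["hypothesis"])

    triplets = []
    for premise, hyps in by_premise.items():
        # A premise lacking either label yields no triplet.
        for pos in hyps["entailment"]:
            for neg in hyps["contradiction"]:
                triplets.append((premise, pos, neg))
    return triplets

# Hypothetical toy rows (English used for readability).
rows = [
    {"premise": "A man plays a guitar.",
     "hypothesis": "A man plays an instrument.", "label": "entailment"},
    {"premise": "A man plays a guitar.",
     "hypothesis": "A man is asleep.", "label": "contradiction"},
    {"premise": "A man plays a guitar.",
     "hypothesis": "The man is a professional.", "label": "neutral"},
]
triplets = build_triplets(rows)
```

In this toy input the neutral row contributes nothing, which is the behavior the review later flags as a possible limitation.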

Critical review

Verdict

Bottom line

The paper presents a solid empirical study applying supervised contrastive learning to Vietnamese, yielding consistent improvements on five benchmarks. However, the technical contribution is largely incremental—repackaging SimCSE for Vietnamese NLI data—and key comparisons are confounded by model capacity. While ViCLSR outperforms the monolingual PhoBERT by wide margins, it edges past CafeBERT (also XLM-R Large with continued pretraining) by less than 1.5% on all tasks, suggesting the gains stem more from the powerful multilingual backbone and task-specific fine-tuning than from a novel architectural insight.

“ViCLSR significantly outperforms the powerful monolingual pre-trained model PhoBERT on five benchmark NLU datasets such as ViNLI (+6.97% F1)”

Abstract

“Compared to CafeBERT ↑0.80 ↑0.79 ↑1.34 ↑1.33”

Table 4

What holds up

The experimental rigor is commendable: comprehensive ablation studies identify $\tau=0.05$ as the best-performing temperature on most datasets and show that an auxiliary MLM objective is unnecessary (the w/o MLM variant performs best on most tasks). The alignment and uniformity analysis (Figure 6) provides geometric evidence that ViCLSR separates negative pairs better (Alignment-C = 0.6102) and distributes embeddings more evenly (Uniformity = -2.5012) than the baselines. The attention visualizations (Figure 8) qualitatively support the claim that contrastive training sharpens semantic focus on relevant tokens such as "guitar" and "nhạc cụ" (instrument).

“The best overall performance is observed at $\tau = 0.05$, achieving the highest accuracy on most datasets”

Section 5.1, Table 5

“excluding the MLM objective (w/o MLM) produces the highest accuracy on most tasks”

Section 5.1, Table 6

“ViCLSR achieves an Alignment - C score of 0.6102, significantly outperforming all other models”

Section 5.3
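For readers unfamiliar with the alignment and uniformity scores cited above, the sketch below uses the definitions common in the contrastive learning literature: alignment is the mean squared distance between normalized embeddings of paired sentences, and uniformity is the log of the mean Gaussian potential over all pairs of embeddings. Whether the paper computes its "Alignment - C" score in exactly this way (for example, over contradiction pairs) is an assumption, and the function names are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(pairs):
    """Mean squared distance between normalized embeddings of each pair.

    Low values are desirable for positive pairs; if the same formula is
    applied to contradiction pairs (an assumption about the paper's
    variant), higher values indicate better separation.
    """
    total = sum(sq_dist(l2_normalize(a), l2_normalize(b)) for a, b in pairs)
    return total / len(pairs)

def uniformity(embeddings, t=2.0):
    """Log of the mean exp(-t * squared distance) over distinct pairs.

    More negative values mean the embeddings are spread more evenly.
    Needs at least two embeddings.
    """
    z = [l2_normalize(e) for e in embeddings]
    vals = [
        math.exp(-t * sq_dist(z[i], z[j]))
        for i in range(len(z))
        for j in range(i + 1, len(z))
    ]
    return math.log(sum(vals) / len(vals))
```

Both functions normalize their inputs first, so the scores depend only on the direction of each embedding, not its magnitude.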

Main concerns

The comparison with DiffCSE is misleading: ViCLSR uses XLM-R Large (550M parameters) while DiffCSE uses RoBERTa-base (125M), as shown in Table 3, so the roughly 24.5-point gap on ViNLI (Table 4) reflects model capacity at least as much as methodology. The deliberate exclusion of neutral NLI pairs ("neutral cases... are excluded from the contrastive dataset to avoid introducing noise") discards an entire label class, roughly one-third of the examples if the labels are balanced, which may limit the model's ability to handle nuanced inference. Additionally, the technical novelty is limited: the paper essentially applies the SimCSE supervised objective to Vietnamese without significant architectural innovation or adaptation beyond data preparation.

“Model... DiffCSE 125M... ViCLSR 550M”

Table 3

“neutral cases, which do not provide unambiguous similarity or dissimilarity cues, are excluded from the contrastive dataset to avoid introducing noise”

Section 3.2

“Compared to DiffCSE ↑24.51 ↑24.50”

Table 4
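To make the novelty concern concrete, the SimCSE-style supervised objective referred to above can be written in a few lines. The sketch below is a plain-Python rendering of the commonly cited form, in which each anchor is scored against every positive and every hard negative in the batch and cosine similarities are divided by the temperature $\tau$. It is not taken from the authors' code; the function names and the list-of-lists embedding format are assumptions, and a real implementation would use a tensor library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def sup_simcse_loss(anchors, positives, negatives, tau=0.05):
    """Mean InfoNCE loss over a batch of (anchor, positive, negative) rows.

    For anchor i the target is positive i; all other positives and all
    negatives in the batch act as contrast candidates.
    """
    losses = []
    for i, h in enumerate(anchors):
        logits = [cosine(h, p) / tau for p in positives]
        logits += [cosine(h, n) / tau for n in negatives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Hypothetical 2-d embeddings: positives match anchors, negatives oppose them.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]
good = sup_simcse_loss(anchors, positives, negatives)
bad = sup_simcse_loss(anchors, negatives, positives)  # roles swapped
```

With a small temperature such as 0.05, similarity differences are amplified twentyfold before the softmax, which is why the swapped case is penalized so heavily relative to the matched one.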

Evidence and comparison

The evidence supports the claim that supervised contrastive learning improves Vietnamese NLU, but baseline comparisons require qualification. While the 6.97% F1 gain over PhoBERT is substantial, the marginal 0.79% gain over CafeBERT (which also builds on XLM-R Large) suggests that continued domain pretraining and contrastive fine-tuning offer comparable benefits. The paper omits comparisons against modern embedding models (e.g., E5, GTE) and provides limited discussion of why CafeBERT—a Vietnamese-adapted XLM-R—performs so closely to the contrastively trained ViCLSR despite lacking the explicit semantic alignment objective.

“Compared to PhoBERT ↑6.93 ↑6.97... Compared to CafeBERT ↑0.80 ↑0.79”

Table 4

“CafeBERT... builds upon the multilingual XLM-R_{Large} architecture through continued pretraining on an 18GB corpus of Vietnamese text”

Section 4.2

Reproducibility

The authors commit to releasing the model upon acceptance and document hyperparameters clearly: contrastive pretraining uses learning rate $1\text{e-5}$, batch size 32, and 10 epochs, while downstream fine-tuning uses $3\text{e-5}$ for base models and $1\text{e-5}$ for large models. They utilize standard HuggingFace Transformers. However, concrete implementation details—such as maximum sequence length, dynamic hard negative sampling, or specific data augmentation pipelines—are not provided, and the current repository is unavailable. The reliance on specific Vietnamese NLI datasets (ViNLI and XNLI-Vi) is transparent, enabling independent reproduction if the checkpoint is released.

“The training configuration included a learning rate of 1e-5, a train batch size of 32, and a total of 10 training epochs”

Section 4.3

“We will provide an access link to it as soon as the article is accepted”

Page 2 · Footnote
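The reported hyperparameters can be collected into a small configuration object, which also makes the gaps visible. This is a hedged sketch: the class and field names are hypothetical, the temperature and MLM settings are assumed to carry over from the best ablation results, and values the review says are not reported are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ContrastiveConfig:
    """Contrastive pretraining settings as reported in Section 4.3."""
    learning_rate: float = 1e-5
    train_batch_size: int = 32
    num_epochs: int = 10
    # Assumed from the ablations (best temperature, no auxiliary MLM).
    temperature: float = 0.05
    use_mlm: bool = False
    # Not reported according to the review; left unset on purpose.
    max_seq_length: Optional[int] = None

def finetune_lr(model_size: str) -> float:
    """Downstream fine-tuning learning rate by model size, as reported."""
    rates = {"base": 3e-5, "large": 1e-5}
    return rates[model_size]

config = ContrastiveConfig()
```

Anyone attempting a reproduction would need to choose `max_seq_length` and any negative-sampling details themselves, which is the reproducibility gap noted above.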

Abstract

High-quality text representations are crucial for natural language understanding (NLU), but low-resource languages like Vietnamese face challenges due to limited annotated data. While pre-trained models like PhoBERT and CafeBERT perform well, their effectiveness is constrained by data scarcity. Contrastive learning (CL) has recently emerged as a promising approach for improving sentence representations, enabling models to effectively distinguish between semantically similar and dissimilar sentences. We propose ViCLSR (Vietnamese Contrastive Learning for Sentence Representations), a novel supervised contrastive learning framework specifically designed to optimize sentence embeddings for Vietnamese, leveraging existing natural language inference (NLI) datasets. Additionally, we propose a process to adapt existing Vietnamese datasets for supervised learning, ensuring compatibility with CL methods. Our experiments demonstrate that ViCLSR significantly outperforms the powerful monolingual pre-trained model PhoBERT on five benchmark NLU datasets such as ViNLI (+6.97% F1), ViWikiFC (+4.97% F1), ViFactCheck (+9.02% F1), UIT-ViCTSD (+5.36% F1), and ViMMRC2.0 (+4.33% Accuracy). ViCLSR shows that supervised contrastive learning can effectively address resource limitations in Vietnamese NLU tasks and improve sentence representation learning for low-resource languages. Furthermore, we conduct an in-depth analysis of the experimental results to uncover the factors contributing to the superior performance of contrastive learning models. ViCLSR is released for research purposes in advancing natural language processing tasks.
