Benchmarking Bengali Dialectal Bias: A Multi-Stage Framework Integrating RAG-Based Translation and Human-Augmented RLAIF

cs.CL cs.AI cs.CY K. M. Jubair Sami, Dipto Sumit, Ariyan Hossain, Farig Sadeque · Mar 22, 2026

AI Review

Plain-language introduction

This paper tackles the problem of measuring dialectal bias in LLMs for Bengali, a low-resource language with nine major regional variants. The authors propose a two-phase framework combining RAG-based translation to create dialectal benchmarks with an RLAIF-inspired evaluation protocol that uses CoT-first reasoning and multi-judge validation. They expose the catastrophic failure of traditional metrics like BLEU and WER for agglutinative dialectal Bengali, showing that LLM-as-judge better predicts human quality assessments.

Critical review

Verdict

Bottom line

The paper makes a solid methodological contribution by exposing how traditional n-gram and subword embedding metrics collapse when confronted with Bengali's agglutinative informality and non-standardized orthography. Their finding that LLM-as-judge outperforms legacy metrics (CCC = 0.506 for Gemma-3-27B-IT vs. 0.358 for BERTScore and 0.065 for BLEU) is well-supported by human correlation data from 125 annotated samples. The benchmark results are stark: responses to Chittagong dialect score 5.44/10 vs. 7.68/10 for Tangail, confirming that dialectal divergence correlates with performance degradation. However, the central claim that "increased model scale does not consistently mitigate this bias" (Abstract) is substantiated only by a scatter of model-wise averages rather than systematic scaling analysis, and the paper overstates the generalizability of findings from a single low-resource language.

“Gemma-3-27B-IT achieves CCC 0.506; BLEU achieves CCC 0.065”

Sami et al., Sec. 4.1 · Table 3

“responses to the highly divergent Chittagong dialect score 5.44/10, compared to 7.68/10 for Tangail”

Sami et al., Abstract · Results
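To make the CCC figures above easier to interpret, here is a minimal sketch of Lin's concordance correlation coefficient next to Pearson correlation. The ratings are hypothetical and are not taken from the paper; the function names `pearson` and `ccc` are illustrative. The sketch shows that a judge whose scores track human ratings perfectly but sit three points higher still earns a Pearson correlation of 1.0, while CCC penalizes the offset.

```python
from statistics import fmean


def pearson(x, y):
    """Pearson correlation: measures linear association only."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def ccc(x, y):
    """Lin's concordance correlation coefficient (population moments).

    Unlike Pearson, the denominator includes the squared difference of
    means, so systematic offsets and scale mismatches lower the score.
    """
    n = len(x)
    mx, my = fmean(x), fmean(y)
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)


# Hypothetical 0-10 ratings, invented for illustration only.
human = [2.0, 4.0, 5.0, 7.0, 9.0]
judge_shifted = [h + 3.0 for h in human]  # same ordering, constant +3 offset

print("Pearson:", round(pearson(human, judge_shifted), 3))
print("CCC:    ", round(ccc(human, judge_shifted), 3))
```

Because CCC measures agreement rather than mere association, it is the stricter test for a judge whose absolute scores are meant to be used directly.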

What holds up

The methodological insight that traditional metrics fail on Bengali dialects due to spacing inconsistencies and agglutination is rigorous and well-demonstrated. The authors correctly note that n-gram metrics "completely fail due to tokenization artifacts, even when sentences are semantically identical," offering the example of "ভালা লাগে না" vs "ভালালাগেনা" (Sec. 3.1.3). Their RLAIF evaluation framework is theoretically grounded in the Constitutional AI approach of Bai et al. (2022), and it follows Zheng et al. (2023) in requiring Chain-of-Thought reasoning before score assignment. Using Lin's Concordance Correlation Coefficient (1989) rather than Pearson correlation for judge validation represents best practice, and the CCC of 0.861 between the Gemini and GPT-OSS judges supports their multi-judge protocol.

“traditional n-gram/word boundary metrics (BLEU, WER) often completely fail due to tokenization artifacts, even when sentences are semantically identical”

Sami et al., Sec. 3.1.3 · Translation Quality Evaluation

“CCC = 0.8614 between Gemini and GPT-OSS”

Sami et al., Sec. 3.4 · Multi-Judge Validation

“Language models are more likely to suggest that speakers of African American English be assigned less prestigious jobs, be convicted of crimes, and be sentenced to death”

Hofmann et al., arXiv:2403.00742 · Abstract
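The tokenization failure quoted from Sec. 3.1.3 can be reproduced with the paper's own example pair. The sketch below is an assumption-laden simplification: it computes clipped n-gram precision only (no brevity penalty or smoothing, so it is not full BLEU), and the helper names `ngrams` and `overlap_precision` are illustrative. It contrasts whitespace-tokenized unigram overlap with character-bigram overlap after stripping spaces.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_precision(hyp_tokens, ref_tokens, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp = Counter(ngrams(hyp_tokens, n))
    ref = Counter(ngrams(ref_tokens, n))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


reference = "ভালা লাগে না"   # spaced form from the paper's example
hypothesis = "ভালালাগেনা"    # unspaced form from the paper's example

# Word level: the unspaced form is a single token that matches nothing.
word_p1 = overlap_precision(hypothesis.split(), reference.split(), 1)

# Character level with spaces removed: the two strings coincide.
ref_chars = list(reference.replace(" ", ""))
hyp_chars = list(hypothesis)
char_p2 = overlap_precision(hyp_chars, ref_chars, 2)

print("word unigram precision:", word_p1)
print("char bigram precision: ", char_p2)
```

Word-level overlap drops to zero even though the two strings differ only in spacing, which is the artifact the authors describe.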

Main concerns

First, the human validation sample is small: only 125 translation pairs (25 per dialect) for metric validation, and 100 samples per confidence level for RLAIF calibration. A claim of superiority over legacy metrics warrants larger-scale validation. Second, the Critical Bias Sensitivity (CBS) metric (Equation 3) lacks statistical rigor: the threshold of 4.0 for "severe bias" is presented as an example without theoretical or empirical justification, yet the metric's value depends directly on this arbitrary cutoff. Third, per-dialect CCC variation for the Gemma judge is extreme (0.729 for Mymensingh vs. 0.186 for Noakhali), which suggests that judge calibration varies dramatically with dialect complexity, a confound the main validation claims do not address. Finally, reliance on proprietary models (Gemini 2.5 Flash, Gemini Embedding-001) with no released code or open data undermines reproducibility.

“N=125, 25 per dialect”

Sami et al., Sec. 3.1.3 · Human Correlation Analysis

“Critical Set denotes rows where Score < Threshold, e.g., 4.0”

Sami et al., Sec. 3.4 · CBS equation

“CCC = 0.729 for Mymensingh vs. 0.186 for Noakhali”

Sami et al., Sec. 4.1 · LLM Judge Scores subsection
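The threshold concern can be made concrete with a small sensitivity sweep. This sketch does not implement the paper's Equation 3, which is not reproduced in this review. It uses only the quoted critical-set definition (Score < Threshold) and reports the share of responses that fall into that set. The scores and the function name `critical_fraction` are hypothetical.

```python
def critical_fraction(scores, threshold):
    """Share of scores strictly below the threshold (the 'critical set')."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s < threshold for s in scores) / len(scores)


# Hypothetical judge scores on a 0-10 scale, invented for illustration.
scores = [2.5, 3.5, 3.9, 4.0, 4.1, 4.5, 6.0, 7.5, 8.0, 9.0]

# Sweep the cutoff half a point either side of the paper's example value.
sweep = {t: critical_fraction(scores, t) for t in (3.5, 4.0, 4.5)}

for threshold, share in sweep.items():
    print(f"threshold {threshold}: {share:.0%} of responses are critical")
```

When many scores cluster near the cutoff, moving it by half a point changes the size of the critical set several-fold, so any metric built on that set would benefit from a reported sensitivity analysis.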

Evidence and comparison

The evidence supports the core conclusion that dialectal bias exists and correlates with linguistic divergence, but comparisons to prior work require scrutiny. The paper correctly positions itself relative to Hofmann et al. (2024), extending dialect prejudice research from English AAE to Bengali contexts, and appropriately distinguishes itself from Wasi et al. (2025) by focusing on regional rather than religious dialectal variation. However, the claim that LLM-as-judge is superior relies heavily on a proprietary embedding model (Gemini Embedding-001) that exhibits saturation effects: cosine similarities are compressed to an average of 0.972 and discriminate poorly (CCC = 0.074 vs. human judgments). The comparison to open-source alternatives like L3Cube SBERT (CCC = 0.358) is fair, but the paper does not adequately address whether the LLM judge's modest CCC of 0.506 is sufficient for high-stakes deployment, despite the CBS metric's attempt to capture this.

“Gemini Embedding-001 yields uniformly high similarities across all five dialects... confirming macro-level semantic preservation... however, this compressed dynamic range is insufficient to discriminate within-dialect quality variation, as reflected in a poor CCC of 0.074”

Sami et al., Sec. 4.1 · Gemini Embedding Saturation subsection

“While new dialectal resources are emerging, such as Vashantor... dialectal variation remains broadly underexplored”

Sami et al., Sec. 2.2 · Bengali NLP and Dialectal Variation

Reproducibility

Reproducibility is severely compromised. The paper provides no link to code, datasets, prompts, or model outputs. Critical components rely on proprietary or unreleased resources: Gemini 2.5 Flash as the primary judge, Gemini Embedding-001 for evaluation, and the Vashantor validation split (1,250 pairs are reserved, but no distribution mechanism is mentioned). The 4,000 gold-labeled question sets are claimed as a contribution but are not released. RAG pipeline hyperparameters (number of retrieved examples, weighting schemes for hybrid retrieval) are described in prose without the specific values needed for replication. The 19 evaluated LLMs are listed in Appendix D, but inference parameters (temperature, sampling strategies) are not documented. For a paper presenting itself as establishing a "validated translation quality evaluation method" and "gold-standard benchmark," the absence of open artifacts undermines both validation and future benchmarking efforts.

“The system identifies relevant sentence pairs by fusing dense and sparse retrieval methods... applies adaptive weighting based on the query length”

Sami et al., Sec. 3.1.2 · Retrieval Module

“The 19 open-weight LLMs evaluated for dialectal bias detection span the following model families...”

Sami et al., Appendix D · Evaluated LLMs
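To show why the missing retrieval values matter, here is a minimal sketch of hybrid dense and sparse score fusion with a query-length-dependent weight. Everything specific in it is an assumption: the paper states only that dense and sparse retrieval are fused with adaptive weighting based on query length. The weight schedule, the 0.3 and 0.7 bounds, the length cutoffs, the direction of the adaptation, and the names `dense_weight`, `fuse`, and `top_k` are all hypothetical.

```python
def min_max(scores):
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def dense_weight(query_len, short=5, long=20):
    """Hypothetical schedule: short queries lean on sparse (lexical) match,
    long queries lean on dense (semantic) match. Not the paper's values."""
    if query_len <= short:
        return 0.3
    if query_len >= long:
        return 0.7
    return 0.3 + 0.4 * (query_len - short) / (long - short)


def fuse(dense, sparse, query_len):
    """Weighted sum of normalized dense and sparse scores per candidate."""
    w = dense_weight(query_len)
    d, s = min_max(dense), min_max(sparse)
    return [w * a + (1 - w) * b for a, b in zip(d, s)]


def top_k(dense, sparse, query_len, k):
    """Indices of the k best candidates under the fused score."""
    fused = fuse(dense, sparse, query_len)
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:k]


# Hypothetical scores for three candidate sentence pairs.
dense = [0.9, 0.2, 0.5]    # e.g. cosine similarities
sparse = [1.0, 8.0, 4.0]   # e.g. BM25-style lexical scores

print("short query:", top_k(dense, sparse, query_len=3, k=2))
print("long query: ", top_k(dense, sparse, query_len=25, k=2))
```

The same candidates are ranked differently depending on the weight schedule alone, so the retrieved few-shot examples, and therefore the translations, cannot be reproduced unless the schedule and the number of retrieved examples are reported.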

Abstract

Large language models (LLMs) frequently exhibit performance biases against regional dialects of low-resource languages. However, frameworks to quantify these disparities remain scarce. We propose a two-phase framework to evaluate dialectal bias in LLM question-answering across nine Bengali dialects. First, we translate and gold-label standard Bengali questions into dialectal variants adopting a retrieval-augmented generation (RAG) pipeline to prepare 4,000 question sets. Since traditional translation quality evaluation metrics fail on unstandardized dialects, we evaluate fidelity using an LLM-as-a-judge, which human correlation confirms outperforms legacy metrics. Second, we benchmark 19 LLMs across these gold-labeled sets, running 68,395 RLAIF evaluations validated through multi-judge agreement and human fallback. Our findings reveal severe performance drops linked to linguistic divergence. For instance, responses to the highly divergent Chittagong dialect score 5.44/10, compared to 7.68/10 for Tangail. Furthermore, increased model scale does not consistently mitigate this bias. We contribute a validated translation quality evaluation method, a rigorous benchmark dataset, and a Critical Bias Sensitivity (CBS) metric for safety-critical applications.
