Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects

cs.CL cs.CV Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta, Rubaya Tabassum, Ariful Ekraj Hridoy, Mehraj Mahmood, Mahbub E Sobhani, Md. Tarek Hasan, Swakkhar Shatabda · Mar 22, 2026
Plain-language introduction

BanglaVerse introduces a culturally grounded benchmark evaluating vision-language models on Bengali culture across standard Bangla, four historically linked languages, and five regional dialects. Built from 1,152 manually curated images expanded to ~32.3K artifacts, the work reveals that standard Bangla evaluation substantially overestimates model capabilities compared to dialectal settings. The core finding—that missing cultural knowledge, not visual grounding alone, is the primary bottleneck—challenges conventional multimodal evaluation practices for underrepresented languages.

Critical review
Verdict
Bottom line

The paper presents a methodologically sound empirical study with rigorous annotation protocols ($\kappa=0.87$ inter-annotator agreement) and commendable dialectal coverage. The central claim that "evaluating only standard Bangla underestimates how fragile multilingual VLMs can be when the same cultural content is expressed through dialectal and cross-lingual variation" (Section 1) is convincingly demonstrated across the Gemma, Qwen, and GPT-4.1 families. However, the dialect variants are generated synthetically with a fine-tuned Qwen2.5-3B-Instruct rather than collected from native speakers, which raises authenticity concerns, and the base dataset of 1,152 images is modest compared to existing Bangla VQA resources (Table 1).

“evaluating only standard Bangla underestimates how fragile multilingual VLMs can be when the same cultural content is expressed through dialectal and cross-lingual variation”
Sayeedi et al., Section 1
“Bengali VQA 2.0 ... 13,046 ... ChitroJera ... 15,292 ... BanglaVerse (Ours) ... 32,256”
Sayeedi et al., Table 1
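A quick sanity check on the scale figures quoted above: the sketch below divides the Table 1 total by the base image count from the abstract. It assumes that the 32,256 in Table 1 is the same quantity as the "~32.3K artifacts" in the abstract; how the per-image count breaks down across tasks, languages, and dialects is not stated in this review and is not inferred here.

```python
# Figures taken from the abstract (1,152 images) and the Table 1 quote (32,256).
# Assumption: both numbers describe the same benchmark release.
base_images = 1_152
total_artifacts = 32_256

# divmod returns (quotient, remainder); a zero remainder means every image
# contributes the same number of artifacts.
per_image, remainder = divmod(total_artifacts, base_images)

print(f"artifacts per image: {per_image}, remainder: {remainder}")
```

The division is exact, which is consistent with a uniform expansion of each image into a fixed set of artifacts.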
What holds up

The multi-stage annotation pipeline with cross-verification and adjudication establishes strong reliability, evidenced by an inter-annotator agreement of $\kappa=0.87$ (Section 3.2) and by 94.0% (376 of 400) of the sampled dialectal artifacts being rated "Flawless / Highly Authentic" in human evaluation (Appendix B). The controlled experimental design, which holds images constant while varying linguistic forms, effectively isolates dialectal robustness, demonstrating that "dialectal robustness is substantially weaker for free-form generation than for answer-constrained reasoning" (Section 5, RQ1). The counter-intuitive prompting analysis showing that "few-shot prompting consistently degrades VQA performance" while "CoT yields the largest gains in knowledge-heavy domains" (Appendix C.2) offers valuable practical insights for deploying VLMs in cultural contexts.

“To assess reliability, we computed inter-annotator agreement ... and obtained a score of $\kappa=0.87$”
Sayeedi et al., Section 3.2
“2 (Flawless / Highly Authentic) ... 376 ... 94.0%”
Sayeedi et al., Appendix B, Table 6
“dialectal robustness is substantially weaker for free-form generation than for answer-constrained reasoning”
Sayeedi et al., Section 5, RQ1
“few-shot prompting consistently degrades VQA performance ... CoT yields the largest gains in knowledge-heavy domains like National Achievements (+6.97)”
Sayeedi et al., Appendix C.2
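For readers unfamiliar with the agreement statistic cited above, the following is a minimal sketch of Cohen's kappa for two annotators. It is not the authors' code: the function name `cohens_kappa` and the toy ratings on a 3-point scale are invented for illustration, and the paper's actual annotation data is not reproduced here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings (0, 1, 2) for ten items; not data from the paper.
annotator_1 = [2, 2, 2, 1, 2, 0, 2, 2, 1, 2]
annotator_2 = [2, 2, 2, 1, 2, 0, 2, 1, 1, 2]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Because kappa discounts agreement expected by chance, it sits below raw percentage agreement whenever one label dominates, which is worth keeping in mind when 94% of items fall in a single category.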
Main concerns

The primary limitation is scale and authenticity: the 1,152 base images are few compared to existing resources such as ChitroJera (Table 1), and the dialectal variants are generated with a fine-tuned model rather than collected from native speech communities (Section 3.3), which may introduce synthetic artifacts. The paper acknowledges that "$f(d)$ is a heuristic proxy" for domain difficulty (Section 5, RQ3), yet draws strong conclusions about knowledge versus grounding without ablation studies or controlled manipulation of visual versus textual difficulty. Additionally, potential data contamination is not addressed: evaluated models like GPT-4.1-mini and Gemini may have encountered the culturally significant images during pretraining, particularly given that Wikipedia and Banglapedia are cited as sources (Section 3.1). The few-shot degradation finding, while interesting, lacks a mechanistic explanation beyond speculation about "entity bias" (Appendix C.2).

“We first fine-tuned the Qwen2.5-3B-Instruct model on ... BanglaDial ... and later used to convert the source Bangla captions and VQA items into dialect-specific forms”
Sayeedi et al., Section 3.3
“We note $f(d)$ is a heuristic proxy; low performance in Politics could also reflect weak entity recognition or answer-option confusability rather than a clean 'knowledge versus grounding' distinction”
Sayeedi et al., Section 5, RQ3
“We collected images ... from ... Wikipedia ... Banglapedia”
Sayeedi et al., Section 3.1
Evidence and comparison

The comparison to related work in Table 1 is accurate and comprehensive, correctly positioning BanglaVerse as the only benchmark combining VQA, captioning, cultural awareness, multilingual coverage, and dialectal diversity. Evidence supports the claim that "Hindi and Urdu often preserve cultural meaning better in caption generation than their lower VQA scores alone would suggest" (Section 5, RQ2), though no statistical significance tests are reported across the ~32.3K artifacts. The analysis of historically linked languages is descriptive rather than theoretically grounded in linguistics: the historical connection between Bangla and Hindi/Urdu is asserted (Section 2.1) but not empirically validated as a causal factor for transfer performance. The claim that "the main bottleneck is missing cultural knowledge rather than visual grounding alone" (Abstract) relies partly on correlational domain-level patterns rather than controlled interventions.

“BanglaVerse (Ours) ... ✓ ... ✓ ... 32,256 ... ✓ ... ✓ ... ✓ ... ✓”
Sayeedi et al., Table 1
“Hindi's average caption quality (53.08) is higher than Bangla's (50.59), even though its VQA is lower”
Sayeedi et al., Section 5, RQ2
“the main bottleneck is missing cultural knowledge rather than visual grounding alone”
Sayeedi et al., Abstract
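Because the benchmark holds images and questions fixed while varying the language, a paired test is a natural fit for the missing significance analysis. The sketch below shows one option, an exact two-sided McNemar test on per-item correctness under two conditions. This is a suggestion, not something the paper reports: the function name `mcnemar_exact` and the toy correctness flags are hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired per-item correctness flags.

    Returns (items only A got right, items only B got right, p-value).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both conditions must cover the same items")
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b  # discordant pairs; concordant pairs carry no signal
    if n == 0:
        return only_a, only_b, 1.0
    k = min(only_a, only_b)
    # Under the null, each discordant pair favours either condition with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Hypothetical flags for ten VQA items answered in standard Bangla and in one
# dialect; not results from the paper.
standard_correct = [True] * 8 + [False] * 2
dialect_correct = [True] * 3 + [False] * 7
only_std, only_dia, p_value = mcnemar_exact(standard_correct, dialect_correct)
print(only_std, only_dia, p_value)
```

The test uses only the items on which the two conditions disagree, so it directly addresses whether a drop from standard Bangla to a dialect exceeds what item-level noise would produce.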
Reproducibility

The paper provides substantial implementation detail: Appendix A lists fine-tuning hyperparameters for the dialect generation model (Table 4: LoRA rank $r=16$, $\alpha=32$, learning rate $5\times 10^{-5}$, 3 epochs, 8,600 steps), Appendix B describes human evaluation protocols (3-point scale, 400 samples, $\kappa=0.89$ inter-annotator agreement), and Appendix C includes full prompt templates. However, critical reproducibility gaps remain: random seeds are not specified for either dialect generation or VLM evaluation, computational resource requirements (GPU type, hours) are omitted, and although the GitHub URL appears in a footnote, the repository contents and data splits are not described. The "Human Consistency Score ranging from 90.89 to 92.01" (Section 4.3) is reported without confidence intervals or variance estimates, making reliability assessment difficult. The fine-tuning dataset BanglaDial is cited but its license and availability are not specified.

“LoRA Rank (r) ... 16 ... LoRA Alpha ($\alpha$) ... 32 ... Learning Rate ... $5\times 10^{-5}$ ... Training Epochs ... 3 ... 8,600 global training steps”
Sayeedi et al., Appendix A, Table 4
“The LLM-as-a-Judge achieved a Human Consistency Score ranging from 90.89 to 92.01 out of 100”
Sayeedi et al., Section 4.3
“Cohen's Kappa ($\kappa$) of 0.89, indicating near-perfect agreement”
Sayeedi et al., Appendix B
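To illustrate the kind of uncertainty estimate the review finds missing, here is a minimal sketch of a seeded percentile bootstrap for the mean of per-item scores. It assumes per-item consistency scores on a 0 to 100 scale are available, which this review does not confirm; the function name `bootstrap_ci` and the toy scores are invented for the example.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-item scores; not values from the paper.
toy_scores = [88, 95, 91, 93, 90, 94, 89, 92, 96, 87]
ci_low, ci_high = bootstrap_ci(toy_scores)
print(f"mean = {mean(toy_scores):.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Passing the seed explicitly also addresses the seed-reporting gap noted above: two runs with the same seed return the same interval.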
Abstract

Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, with knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.
