Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects
BanglaVerse introduces a culturally grounded benchmark evaluating vision-language models on Bengali culture across standard Bangla, four historically linked languages, and five regional dialects. The benchmark is built from 1,152 manually curated images expanded to ~32.3K artifacts, and the work shows that standard Bangla evaluation substantially overestimates model capability relative to dialectal settings. The core finding, that missing cultural knowledge rather than visual grounding alone is the primary bottleneck, challenges conventional multimodal evaluation practices for underrepresented languages.
The paper presents a methodologically sound empirical study with rigorous annotation protocols ($\kappa=0.87$ inter-annotator agreement) and commendable dialectal coverage. The central claim that "evaluating only standard Bangla underestimates how fragile multilingual VLMs can be when the same cultural content is expressed through dialectal and cross-lingual variation" (Section 1) is convincingly demonstrated across Gemma, Qwen, and GPT-4.1 families. However, the reliance on synthetic dialect generation via fine-tuned Qwen2.5-3B-Instruct—rather than native speaker collection—introduces authenticity concerns, and the base dataset of 1,152 images is modest compared to existing Bangla VQA resources (Table 1).
The multi-stage annotation pipeline with cross-verification and adjudication establishes strong reliability, evidenced by a Cohen's kappa of $\kappa=0.87$ (Section 3.2) and 94% of dialectal artifacts rated "Flawless / Highly Authentic" in human evaluation (Appendix B). The controlled experimental design, which holds images constant while varying linguistic forms, effectively isolates dialectal robustness, demonstrating that "dialectal robustness is substantially weaker for free-form generation than for answer-constrained reasoning" (Section 5, RQ1). The counter-intuitive prompting analysis showing that "few-shot prompting consistently degrades VQA performance" while "CoT yields the largest gains in knowledge-heavy domains" (Appendix C.2) offers valuable practical insights for deploying VLMs in cultural contexts.
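For readers less familiar with the agreement statistic cited here, the following is a minimal sketch of how Cohen's kappa is computed for two annotators. It is not the authors' code; the function name `cohens_kappa` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels (hypothetical, not taken from the paper).
annotator_1 = ["ok", "ok", "bad", "ok", "bad", "ok", "ok", "bad", "ok", "ok"]
annotator_2 = ["ok", "ok", "bad", "ok", "ok", "ok", "ok", "bad", "ok", "bad"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Because kappa discounts agreement expected by chance, the toy example's 80% raw agreement yields a noticeably lower kappa, which is why a value of 0.87 indicates strong reliability.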
The primary limitation is scale and authenticity: 1,152 base images is small compared to synthetic alternatives like ChitroJera (Table 1), and dialectal variants are generated via fine-tuned models rather than collected from native speech communities (Section 3.3), potentially introducing synthetic artifacts. The paper acknowledges that "$f(d)$ is a heuristic proxy" for domain difficulty (Section 5, RQ3), yet draws strong conclusions about knowledge versus grounding without ablation studies or controlled manipulation of visual versus textual difficulty. Additionally, potential data contamination is not addressed—evaluated models like GPT-4.1-mini and Gemini may have encountered the culturally significant images during pretraining, particularly given that Wikipedia and Banglapedia are cited as sources (Section 3.1). The few-shot degradation finding, while interesting, lacks mechanistic explanation beyond speculation about "entity bias" (Appendix C.2).
The comparison to related work in Table 1 is accurate and comprehensive, correctly positioning BanglaVerse as the only benchmark combining VQA, captioning, cultural awareness, multilingual coverage, and dialectal diversity. Evidence supports the claim that "Hindi and Urdu often preserve cultural meaning better in caption generation than their lower VQA scores alone would suggest" (Section 5, RQ2), though statistical significance testing is absent across the 32K+ artifacts. The analysis of historically linked languages is descriptive rather than theoretically grounded in linguistics—the historical connection between Bangla and Hindi/Urdu is asserted (Section 2.1) but not empirically validated as a causal factor for transfer performance. The claim that "the main bottleneck is missing cultural knowledge rather than visual grounding alone" (Abstract) relies partly on correlational domain-level patterns rather than controlled interventions.
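One way the missing significance testing could be addressed, given that the design holds images constant across linguistic variants, is a paired test on per-item correctness. The sketch below uses an exact McNemar test; it is a suggestion from this review rather than anything the paper reports, and the function name `mcnemar_exact` and the toy correctness flags are hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-item correctness flags.

    Returns (items only A got right, items only B got right, p-value).
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("paired inputs must have the same length")
    # Only discordant pairs carry information about a difference.
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0
    k = min(only_a, only_b)
    # Under H0, each discordant pair favours either condition with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Toy flags for the same 10 items under two conditions (hypothetical data).
standard = [True, True, True, True, True, True, True, True, False, False]
dialect = [True, True, True, False, False, False, False, False, True, False]
only_std, only_dia, p_value = mcnemar_exact(standard, dialect)
print(only_std, only_dia, p_value)
```

A paired test of this kind would let the standard-versus-dialect and cross-lingual gaps be reported with p-values rather than raw score differences alone.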
The paper provides substantial implementation detail: Appendix A lists fine-tuning hyperparameters for the dialect generation model (Table 4: LoRA rank $r=16$, $\alpha=32$, learning rate $5\times 10^{-5}$, 3 epochs, 8,600 steps), Appendix B describes human evaluation protocols (3-point scale, 400 samples, $\kappa=0.89$ inter-annotator agreement), and Appendix C includes full prompt templates. However, critical reproducibility gaps remain: random seeds are specified for neither dialect generation nor VLM evaluation, computational resource requirements (GPU type, hours) are omitted, and while the GitHub URL appears in a footnote, the repository contents and data splits are not described. The "Human Consistency Score ranging from 90.89 to 92.01" (Section 4.3) lacks confidence intervals or variance estimates, making reliability assessment difficult. The fine-tuning dataset BanglaDial is cited but its license and availability are not specified.
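To make the request for confidence intervals and fixed seeds concrete, the sketch below computes a seeded percentile-bootstrap interval for a mean score. It is illustrative only: the function name `bootstrap_ci`, its defaults, and the toy scores are assumptions of this review, not values or procedures from the paper.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `scores`.

    Returns (point estimate, lower bound, upper bound).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lo, hi


# Toy per-sample scores (hypothetical, not the paper's data).
scores = [92, 88, 95, 90, 91, 93, 89, 94, 90, 92]
point, lower, upper = bootstrap_ci(scores)
print(f"mean = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Reporting the seed alongside an interval of this form would allow readers to judge whether the gap between 90.89 and 92.01 is meaningful.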
Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, with knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.