Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models

cs.CL Abdul-Salem Beibitkhan · Mar 22, 2026
AI Review
Plain-language introduction

This paper investigates whether cross-lingual transfer (CLT)—prompting models to translate queries to English, reason in English, then translate answers back—can bridge the performance gap for low-resource languages. The authors benchmark eight LLMs across 2,000 responses in Kazakh and Mongolian, finding that CLT selectively benefits bilingual models (+2.2–4.3pp) but not English-first architectures, while revealing a concerning "fluency illusion" where models appear fluent in LRLs while producing less accurate content.

Critical review
Verdict
Bottom line

The paper makes a valid observation about architectural differences in how models handle low-resource languages (LRLs), but its conclusions are constrained by severe methodological limitations. The central claim, that CLT benefits bilingual models (Qwen3, DeepSeek V3.2) more than English-first models, is supported by the Table 2 data, though the absolute gains are modest (under 5pp). However, the catastrophic failure of Aya Expanse (scoring 4.7–20% on LRLs vs. 82% on English) suggests implementation errors rather than genuine model limitations, especially given the paper's note that the model frequently produces Kyrgyz instead of Kazakh. The small benchmark (50 questions) and the lack of statistical testing further weaken confidence in the findings.

“Bilingual: C2 79.2, C3 81.3 (+2.2), C4 75.3, C5 79.7 (+4.3); Eng.-First: C2 88.2, C3 87.1 (-1.1)”
paper · Table 2
“Aya Expanse, explicitly designed for 100+ languages including Turkic family members, achieves 82.3% in English but collapses to 15.7% for Kazakh and 4.7% for Mongolian. The significant drop is mostly due to the model frequently producing Kyrgyz—a closely related but distinct language—instead of Kazakh.”
paper · Section 4.3
What holds up

The "fluency illusion" observation is the paper's strongest contribution: models maintain high fluency scores (87.5–89%) in Kazakh and Mongolian while accuracy drops 13–16 percentage points compared to English. This finding aligns with established research on hallucinations appearing plausible in non-English contexts. The disaggregation of results by model architecture (English-first vs. Bilingual vs. Multilingual) is methodologically sound and reveals meaningful differences in how explicit CLT scaffolding affects different training regimes.

“For Kazakh, Accuracy drops 13.0 percentage points (from 78.5% to 65.5%) while Fluency drops only 11.0pp (from 100% to 89.0%). The gap widens for Mongolian—Accuracy falls 16.0pp to 62.5% while Fluency remains at 87.5%.”
paper · Section 4.1
Main concerns

The benchmark is severely undersized at only 50 questions (10 per category), which gives insufficient statistical power for the claimed generalizations. The evaluation relies entirely on Claude Sonnet 4.5 as an LLM-as-judge, grading outputs that include those of its sibling model Claude Opus 4.5. This creates a clear conflict of interest, which the authors acknowledge only as "semi-automated grading" and do not address with any bias mitigation strategy. No inter-annotator agreement or human verification is reported. The Aya Expanse results are suspiciously poor (15.7% on Kazakh, 4.7% on Mongolian) given the model's explicit training for 100+ languages; the paper's explanation, that the model produces Kyrgyz instead of Kazakh, suggests possible prompt template errors or API misconfiguration rather than fundamental model failure. No statistical significance tests are reported for the CLT gains.

“Our study is limited by its benchmark size (50 questions) and reliance on semi-automated grading with a single-person team”
paper · Section 4.4
“Grading is performed semi-automatically using Claude Sonnet 4.5, following the LLM-as-judge paradigm”
paper · Section 2
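
To make the sample-size concern concrete, the following is a minimal sketch of two standard checks the review finds missing: a Wilson score interval for one model's accuracy on 50 questions, and an exact McNemar test for a paired comparison between a CLT condition and its baseline. It assumes each answer is scored simply as correct or incorrect, which may not match the paper's actual rubric. The function names and the example counts (40 of 50 correct, 5 versus 2 discordant questions) are illustrative and are not taken from the paper.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def mcnemar_exact_p(only_clt: int, only_baseline: int) -> float:
    """Two-sided exact McNemar p-value computed from the discordant pairs.

    only_clt: questions answered correctly under CLT but not under the baseline.
    only_baseline: questions answered correctly under the baseline but not CLT.
    """
    n = only_clt + only_baseline
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_clt, only_baseline)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the scale of the problem.
    low, high = wilson_interval(correct=40, total=50)
    print(f"95% interval for 40/50 correct: {low:.3f} to {high:.3f}")
    print(f"McNemar p-value for 5 vs 2 discordant: {mcnemar_exact_p(5, 2):.3f}")
```

Under these assumptions, a single model scoring 40 out of 50 has a 95% interval roughly 22 percentage points wide, far larger than the 2.2–4.3pp gains attributed to CLT. A paired test such as McNemar's is the more appropriate comparison here because every condition is run on the same 50 questions.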
Evidence and comparison

The comparison to related work adequately cites recent multilingual evaluations (MEGA, Belebele) and specific Kazakh/Mongolian studies. The paper appropriately positions itself as extending the work of Shi et al. (2023) on multilingual chain-of-thought by testing explicit translation pipelines. However, the evidence does not strongly support the title's framing of CLT as a "bridge" when gains are marginal (0.4–4.3pp) and limited to bilingual architectures. The paper also lacks a critical baseline: testing whether simple zero-shot translation to English (without the structured CLT pipeline) achieves similar results. That comparison would clarify whether the three-step pipeline is necessary or whether implicit English reasoning already occurs in bilingual models.
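
The missing baseline can be sketched as follows. This is only an illustration of the comparison being asked for, not the paper's implementation: the prompt wording is invented, the `Model` type is a stand-in for whatever API client the authors used, and `translate_only` reflects one possible reading of "simple zero-shot translation" (translate the question, then answer in a single step).

```python
from typing import Callable

# A model is any function mapping a prompt string to a completion string.
# In practice this would wrap an API client; here it is left abstract.
Model = Callable[[str], str]


def direct(model: Model, question: str) -> str:
    """Baseline: ask in the low-resource language with no English step."""
    return model(question)


def translate_only(model: Model, question: str, language: str) -> str:
    """Hypothetical missing baseline: translate the question, then answer once."""
    english_q = model(f"Translate this {language} question into English:\n{question}")
    return model(f"Answer the following question in {language}:\n{english_q}")


def clt_pipeline(model: Model, question: str, language: str) -> str:
    """Explicit three-step CLT: translate, reason in English, translate back."""
    english_q = model(f"Translate this {language} question into English:\n{question}")
    english_a = model(
        f"Answer the following question in English, reasoning step by step:\n{english_q}"
    )
    return model(f"Translate this English answer into {language}:\n{english_a}")
```

If `translate_only` recovered most of the gain seen with `clt_pipeline`, the benefit would be attributable to the English question rather than to the explicit English reasoning step. Running all three conditions on the same questions would separate these two explanations.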

Reproducibility

The authors provide a GitHub link for data and code, but critical experimental details are missing from the paper. It does not report temperature settings, exact API versions, or the specific prompt templates used for the CLT pipeline steps, all of which are essential given that prompt formatting heavily affects multilingual performance. The judge prompts and grading rubric for the LLM-as-judge evaluation are not provided either, making it impossible to verify whether the grading itself introduced bias. Without these details, and without human-verified labels for even a subset of responses, independent reproduction and validation of the Aya Expanse failure mode or the CLT gains are severely hampered.

“All data, code, and benchmark questions are available at https://github.com/abdulsal3m/left-behind-clt”
paper · Footnote 1
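
One way to address the missing human verification would be to have a human grade a subset of responses and report chance-corrected agreement with the LLM judge. The sketch below computes Cohen's kappa from scratch. The function name and the label encoding (1 for correct, 0 for incorrect) are assumptions made for illustration; nothing here is taken from the paper's repository.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Cohen's kappa between LLM-judge labels and human labels on the same items."""
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(judge)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical labels for four responses: 1 = correct, 0 = incorrect.
    judge_labels = [1, 1, 0, 0]
    human_labels = [1, 0, 0, 0]
    print(f"kappa = {cohens_kappa(judge_labels, human_labels):.2f}")
```

Reporting kappa separately for English, Kazakh, and Mongolian responses would also show whether the judge itself is less reliable in the low-resource languages, which bears directly on the accuracy gap the paper reports.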
Abstract

We investigate how large language models perform on low-resource languages by benchmarking eight LLMs across five experimental conditions in English, Kazakh, and Mongolian. Using 50 hand-crafted questions spanning factual, reasoning, technical, and culturally grounded categories, we evaluate 2,000 responses on accuracy, fluency, and completeness. We find a consistent performance gap of 13.8–16.7 percentage points between English and low-resource language conditions, with models maintaining surface-level fluency while producing significantly less accurate content. Cross-lingual transfer (prompting models to reason in English before translating back) yields selective gains for bilingual architectures (+2.2pp to +4.3pp) but provides no benefit to English-dominant models. Our results demonstrate that current LLMs systematically underserve low-resource language communities, and that effective mitigation strategies are architecture-dependent rather than universal.
