DATASHI: A Parallel English-Tashlhiyt Corpus for Orthography Normalization and Low-Resource Language Processing

cs.CL Nasser-Eddine Monir, Zakaria Baou · Mar 23, 2026
AI Review
Plain-language introduction

DATASHI is a new parallel corpus for Tashlhiyt, a critically under-resourced Amazigh language spoken by millions in Morocco but lacking standardized digital resources. The paper introduces 5,000 English–Tashlhiyt sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, designed to benchmark orthography normalization. Using this corpus, the authors evaluate five state-of-the-art LLMs (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) on the normalization task, finding that even the best model (Gemini-2.5-Pro) still has a word error rate (WER) of 35.5% and struggles with gemination and emphatic consonants.

Critical review
Verdict
Bottom line

The paper makes a timely contribution by creating the first dedicated parallel resource for Tashlhiyt orthography normalization and providing a phonologically informed diagnostic analysis of LLM errors. However, serious methodological limitations undercut the reliability of the evaluation: the test set is small (1,500 sentences), the text is elicited translationese rather than natural user-generated content, and the authors excluded outputs with invalid characters from their error analysis, which may make the reported error rates look lower than they really are. The evaluated models (for example GPT-5 and Gemini-2.5-Pro) are proprietary, versioned systems, so the results may be hard to reproduce once those versions change or are retired.

“During preliminary experiments, we observed that several LLMs occasionally introduced characters not included in the orthographic inventory specified in the prompt... they were excluded from the error analysis reported in this paper.”
Monir and Baou, Sec. 8 · Ethical Statement
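To make the concern about exclusion concrete, here is a minimal sketch of how dropping outputs that contain out-of-inventory characters can lower a corpus-level WER. It assumes exclusion happens at the level of whole outputs, which the paper does not spell out. The inventory, the helper names, and the sentence pairs are hypothetical placeholders, not the IRCAM-based character set or real Tashlhiyt data.

```python
import string


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists here)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Total word-level edits divided by total reference words."""
    edits = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return edits / words


def in_inventory(text, inventory):
    """True if every non-space character belongs to the allowed inventory."""
    return all(ch in inventory or ch.isspace() for ch in text)


# Hypothetical toy inventory: ASCII lowercase letters only.
INVENTORY = set(string.ascii_lowercase)

# Placeholder (reference, model output) pairs, not real Tashlhiyt.
pairs = [
    ("aa bb cc dd", "aa bb cc dd"),  # no errors
    ("aa bb cc dd", "aa bx cc dd"),  # one wrong word
    ("aa bb cc dd", "aš bš cc dd"),  # two wrong words, both with an invalid 'š'
]

# Keep only outputs whose characters all fall inside the inventory.
kept = [(ref, hyp) for ref, hyp in pairs if in_inventory(hyp, INVENTORY)]

print(f"WER on all outputs:      {corpus_wer(pairs):.3f}")
print(f"WER after the exclusion: {corpus_wer(kept):.3f}")
```

In this toy case the excluded output is also the worst one, so the filtered score looks better than the unfiltered score. Whether the same happens in the paper depends on how many outputs were dropped and how bad they were, which is why reporting both numbers would help.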
What holds up

The dual-track corpus design, which pairs non-standard community orthographies with expert-standardized references, is methodologically sound and directly addresses the 'rampant script variation' that has impeded prior Amazigh NLP work. The fine-grained phonological analysis of edit operations (deletions, substitutions, insertions) across geminates, emphatics, pharyngeals, and uvulars provides genuine diagnostic insight. For example, deletion errors concentrate heavily in geminated consonants (2,144 deletions for Gemini-2.5-Pro versus 4,056 for Mistral), which supports the view that consonant length contrasts pose the greatest challenge for LLMs.

“The concentration of deletion errors in gemination and emphatic consonants reflects the models' sensitivity to non-concatenative morphological and morphophonological structures characteristic of Amazigh languages.”
Monir and Baou, Table 10 · Section 6.2
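The kind of per-class edit-operation count discussed above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it aligns reference and output at the character level with a standard Levenshtein backtrace, and it treats a reference character as geminate when it sits next to an identical character, a heuristic chosen here for simplicity. The function names and the placeholder strings are hypothetical.

```python
from collections import Counter


def edit_ops(ref, hyp):
    """Return (op, ref_index) pairs from one minimal Levenshtein alignment."""
    n, m = len(ref), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match or substitution
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 1 if (i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]) else 0
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + cost:
            if cost:
                ops.append(("sub", i - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", i - 1))
            i -= 1
        else:
            ops.append(("ins", i))  # inserted before reference position i
            j -= 1
    return ops[::-1]


def is_geminate(ref, k):
    """Assumed heuristic: a character adjacent to an identical character."""
    left = k > 0 and ref[k - 1] == ref[k]
    right = k + 1 < len(ref) and ref[k + 1] == ref[k]
    return left or right


def count_ops(ref, hyp):
    """Count edit operations, split into 'geminate' and 'other'."""
    counts = Counter()
    for op, k in edit_ops(ref, hyp):
        # Insertions have no reference character, so they are classed 'other'.
        cls = "geminate" if op != "ins" and is_geminate(ref, k) else "other"
        counts[(op, cls)] += 1
    return counts


# Placeholder strings: the output drops one half of a doubled consonant.
print(count_ops("abba", "aba"))
```

Minimal alignments are not unique, so the split between deletions and substitutions can depend on tie-breaking in the backtrace. Any published table of this kind would ideally state which alignment convention was used.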
Main concerns

The corpus construction relies on semi-controlled elicitation from English templates rather than naturally occurring text, limiting ecological validity for real-world social media normalization tasks. The evaluation also suffers from statistical fragility: 1,500 sentences is modest for robust LLM benchmarking. The exclusion of invalid character generations (e.g., č, š) from the error analysis, which is mentioned only in the closing ethical statement, is a questionable decision that likely understates true error rates. Additionally, the comparison to prior work is necessarily thin given the resource scarcity, but the paper offers no baselines from simpler rule-based systems or smaller supervised models.

“The English sentences were created using a semi-controlled elicitation procedure: initial templates were drafted by the authors... then manually reviewed”
Monir and Baou, Sec. 3.1 · Corpus Creation
“they were excluded from the error analysis reported in this paper”
Monir and Baou, Sec. 8 · Ethical Statement
Evidence and comparison

The phonologically-conditioned error analysis substantiates the central claim that gemination is the primary source of normalization difficulty, with substitution and deletion errors dominating in this category across all evaluated models. However, the evidence does not fully support the broader claim of 'cross-lingual generalization' given the absence of cross-linguistic transfer experiments or comparison to models fine-tuned on related Amazigh varieties. The comparison to related work is necessarily limited by the field's nascent state, though the authors appropriately situate their contribution within surveys by Akallouch et al. (2025) and Besacier et al. (2014) on low-resource language technologies.

Reproducibility

The authors commit to releasing data and code via GitHub, which supports transparency. However, critical experimental details are omitted: API parameters (temperature, top-p, seed) for the LLM calls are not specified, and the exact prompt strings, although referenced, are relegated to the repository rather than included in the appendix. The paper uses proprietary models (GPT-5, Claude-Sonnet-4.5) that may not remain publicly accessible, so the results may not be reproducible outside the study's timeframe. Furthermore, the exclusion criteria for invalid character insertions are poorly documented, raising concerns about what exactly was measured in the reported WER and Levenshtein distance calculations.

“The prompt itself is multi-faceted: it opens with a direct instruction to normalize the input, immediately followed by an exhaustive character set... The core of the prompt details the specific IRCAM-based rules”
Monir and Baou, Sec. 5.3 · Experimental Setup
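The missing decoding details could be captured with a small run manifest stored next to each set of model outputs. The sketch below uses only the Python standard library and calls no vendor API. The field names, the model string, and the prompt text are hypothetical and do not come from the paper or its repository.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RunManifest:
    """Settings needed to rerun one evaluation condition."""
    model: str
    temperature: Optional[float]
    top_p: Optional[float]
    seed: Optional[int]
    n_shots: int
    prompt_sha256: str  # fingerprint of the exact prompt string


def make_manifest(model, prompt, temperature=None, top_p=None,
                  seed=None, n_shots=0):
    """Build a manifest, hashing the prompt so silent edits are detectable."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return RunManifest(model, temperature, top_p, seed, n_shots, digest)


# Hypothetical values for illustration only.
manifest = make_manifest(
    model="example-model",
    prompt="Normalize the following text ...",
    temperature=0.0,
    top_p=1.0,
    seed=1234,
    n_shots=3,
)
print(json.dumps(asdict(manifest), indent=2, sort_keys=True))
```

Leaving a field as `None` records that the provider default was used, which is itself useful information for anyone trying to repeat the runs.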
Abstract

DATASHI is a new parallel English-Tashlhiyt corpus that fills a critical gap in computational resources for Amazigh languages. It contains 5,000 sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, enabling systematic study of orthographic diversity and normalization. This dual design supports text-based NLP tasks - such as tokenization, translation, and normalization - and also serves as a foundation for read-speech data collection and multimodal alignment. Comprehensive evaluations with state-of-the-art Large Language Models (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) show clear improvements from zero-shot to few-shot prompting, with Gemini-2.5-Pro achieving the lowest word and character-level error rates and exhibiting robust cross-lingual generalization. A fine-grained analysis of edit operations - deletions, substitutions, and insertions - across phonological classes (geminates, emphatics, uvulars, and pharyngeals) further highlights model-specific sensitivities to marked Tashlhiyt features and provides new diagnostic insights for low-resource Amazigh orthography normalization.
