Ara-Best-RQ: Multi Dialectal Arabic SSL

cs.CL Haroun Elleuch, Ryan Whetten, Salima Mdhaffar, Yannick Estève, Fethi Bougares · Mar 23, 2026

Plain-language introduction

Ara-BEST-RQ introduces dedicated self-supervised speech models for Arabic dialects. The authors curate 5,640 hours of Creative Commons Arabic speech covering 20 dialects and train Conformer-based BEST-RQ models with up to 600M parameters. Their 300M model achieves state-of-the-art dialect identification performance while using fewer parameters than competing Whisper-based systems. This work helps close the gap for underrepresented Arabic dialects in speech technology.

Critical review

Verdict

Bottom line

The paper presents a valuable resource for Arabic speech processing by releasing the first large-scale multi-dialectal Arabic SSL models alongside a substantial crawled dataset. The core finding, that targeted pre-training on Arabic dialects outperforms generic multilingual models on dialect identification, is compelling and well demonstrated. However, erratic scaling behavior undermines the contribution: the 600M model underperforms the 300M model on dialect identification (91.05% versus 96.02% test accuracy), and the 300M model fails to converge on the larger combined dataset (validation loss 6.10 versus 3.86 on the crawled data alone). These issues point to training instabilities that the paper does not adequately explain.

“Crawled 300M ... 96.02 ... Crawled 600M ... 91.05”

Ara-BEST-RQ paper · Table 5

“300M ... Combined ... 6.10 ... 300M ... Crawled ... 3.86”

Ara-BEST-RQ paper · Table 4

What holds up

The 300M model's dialect identification results are impressive: it achieves 96.02% accuracy on the ADI-20 test set versus 94.83% for the Whisper-large-based system (637M parameters), while using under half the parameters. On ASR, the model consistently outperforms same-sized baselines (HuBERT-large, XLS-R-128) across Egyptian, Moroccan, and Tunisian benchmarks, with WER reductions of 10+ points on MGB-3 and MGB-5. The data curation methodology is thorough, with explicit filtering for offensive content and a deliberate decision not to source dialect tags from YouTube geotags, which the authors found unreliable.

“achieving new SoTA results while having less than half the parameters of the whisper-based system (637M)”

Ara-BEST-RQ paper · Table 5

“Ara-BEST-RQ crawled 300M ... 18.67 ... 30.85 ... 54.18 ... HuBERT-large ... 30.3 ... 52.54 ... 65.20”

Ara-BEST-RQ paper · Table 3

“We did not use the geotags provided by YouTube to source the dialect metadata, as we found them to be consistently unreliable”

Ara-BEST-RQ paper · Section 3.1
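
The headline margins in this section can be re-derived from the quoted table excerpts. The snippet below is a minimal arithmetic check, not the authors' evaluation code. It assumes the three numbers in the Table 3 excerpt are WER values listed in the same column order for both systems (the excerpt does not name the columns, so they are left unlabeled), and it treats "300M" as the nominal parameter count.

```python
# Figures copied from the Table 3 and Table 5 excerpts quoted in this review.
# Assumption: both WER tuples follow the same (unnamed) column order.
ARA_BEST_RQ_300M_WER = (18.67, 30.85, 54.18)
HUBERT_LARGE_WER = (30.3, 52.54, 65.20)

ARA_300M_DID_ACC = 96.02   # ADI-20 test accuracy, crawled 300M model
WHISPER_DID_ACC = 94.83    # Whisper-based system, as cited in the review
ARA_PARAMS_M = 300         # nominal size; exact count is not given in the review
WHISPER_PARAMS_M = 637     # from the quoted Table 5 discussion


def absolute_gaps(baseline, candidate):
    """Per-column absolute reduction (baseline minus candidate), in points."""
    return [round(b - c, 2) for b, c in zip(baseline, candidate)]


wer_gaps = absolute_gaps(HUBERT_LARGE_WER, ARA_BEST_RQ_300M_WER)
did_margin = round(ARA_300M_DID_ACC - WHISPER_DID_ACC, 2)
param_ratio = round(ARA_PARAMS_M / WHISPER_PARAMS_M, 3)

print("WER reductions vs HuBERT-large (points):", wer_gaps)
print("DID accuracy margin vs Whisper-based system (points):", did_margin)
print("Parameter ratio (300M / 637M):", param_ratio)
```

Under these figures every quoted WER gap against HuBERT-large exceeds 10 points, and the parameter ratio is below 0.5, which is consistent with the "less than half the parameters" claim.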

Main concerns

The scaling results are inverted and unexplained. The 600M model degrades to 91.05% accuracy on the DID test set, compared with 96.02% for the 300M model, and, remarkably, w2v-BERT 2.0 did not converge at all under the same recipe. The authors acknowledge that "the 600M variants do not perform as well" but offer no explanation for why doubling the parameter count would harm performance. Similarly, the 300M model trained on the combined dataset fails to converge (validation loss 6.10 versus 3.86 on the crawled data), which may indicate that the training recipe does not cope with the diversity of the larger corpus. These failures suggest the training regime is not robust across scales, yet the paper compares the best-performing 300M model against baselines without adequately contextualizing these instabilities.

“w2v-BERT 2.0 ... NC ... NC ... NC ... NC”

Ara-BEST-RQ paper · Table 5

“the 600M variants do not perform as well, especially on the test set”

Ara-BEST-RQ paper · Table 5

“The 300M model pretrained on the combined dataset fails to converge”

Ara-BEST-RQ paper · Table 4

“Train loss ... 6.61 ... Valid. loss ... 6.10”

Ara-BEST-RQ paper · Table 4

Evidence and comparison

The comparisons to HuBERT-large and XLS-R-128 are fair, since they use identical fine-tuning recipes with three-layer feedforward networks. However, the w2v-BERT 2.0 comparison uses a linear probe, which the authors say "provides better performance", rather than the same architecture, and this may confound the comparison. The DID evaluation is rigorous and uses the standard ADI-20 benchmark, but the ASR evaluation aggregates results across MSA (Common Voice) and dialectal datasets (MGB-3, MGB-5, TARIC-SLU), making it unclear whether the model excels specifically at dialectal speech or at Arabic ASR in general. The paper claims "family-targeted pre-training... significantly improves downstream performance" but does not isolate the effect of dialectal data from that of simply having more Arabic data.

“w2v-BERT 2.0, where a linear probe provides better performance”

Ara-BEST-RQ paper · Section 4.2.1

“family-targeted pre-training on Arabic dialects significantly improves downstream performance compared to multilingual or monolingual models trained on non-Arabic data”

Ara-BEST-RQ paper · Section 1
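
To make the probe-versus-feedforward concern concrete, the sketch below counts trainable parameters in the two kinds of downstream head. It is illustrative only: the encoder output width (1024), the hidden width (1024), and the use of 20 output classes (one per dialect) are assumptions chosen for the example, not values reported in the review, and the helper names are hypothetical.

```python
def linear_probe_params(d_model: int, n_classes: int) -> int:
    """Weights plus biases of a single linear layer on top of the encoder."""
    return d_model * n_classes + n_classes


def feedforward_head_params(d_model: int, hidden: int, n_classes: int,
                            n_layers: int = 3) -> int:
    """Weights plus biases of an n_layers-deep fully connected head."""
    sizes = [d_model] + [hidden] * (n_layers - 1) + [n_classes]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# Hypothetical dimensions, chosen only to show the order of magnitude.
D_MODEL, HIDDEN, N_CLASSES = 1024, 1024, 20

probe = linear_probe_params(D_MODEL, N_CLASSES)
ffn = feedforward_head_params(D_MODEL, HIDDEN, N_CLASSES)

print("linear probe parameters:", probe)
print("three-layer head parameters:", ffn)
print("ratio:", round(ffn / probe, 1))
```

With these assumed widths the three-layer head has roughly a hundred times more trainable parameters than the probe, so the two set-ups differ in downstream capacity as well as in the encoder being evaluated.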

Reproducibility

The authors pledge to release "models, code, and pre-processed datasets" at https://github.com/elyadata/Ara-BEST-RQ. Training details are reasonably complete: masking probability $p=0.15$ with mask length 4 frames (yielding ~60% masked frames), 4096 codebook entries, batch duration of 450 seconds, and hardware configurations (16×A100 for 300M, 32×H100 for 600M). However, critical hyperparameters including learning rate, optimizer, and training steps are omitted. The preprocessing pipeline using Silero VAD with 250ms segment merging is well-documented. The failure modes—300M not converging on combined data and 600M underperforming on DID—suggest significant hyperparameter sensitivity that future reproducers will need to navigate without explicit guidance from the authors.

“All models, code, and pre-processed datasets will be publicly released”

Ara-BEST-RQ paper · Abstract

“masking is applied with a mask length of 4 and probability 0.15 (resulting in a total mask of 60%”

Ara-BEST-RQ paper · Section 4.1

“batch duration of 450 seconds”

Ara-BEST-RQ paper · Section 4.1
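
The ~60% figure quoted above matches the product of the mask probability and the mask length (0.15 × 4). The sketch below contrasts that nominal value with the coverage obtained if every frame independently starts a 4-frame span with probability 0.15 and overlapping spans are allowed. This sampling scheme is an assumption made for illustration; the review does not say how the paper's implementation draws spans, so the lower figure is not a claim about the actual training run.

```python
import random


def nominal_mask_fraction(p: float, span: int) -> float:
    """Product rule: assumes masked spans never overlap."""
    return min(1.0, p * span)


def expected_mask_fraction(p: float, span: int) -> float:
    """Coverage when each frame independently starts a span (overlaps allowed).

    An interior frame is unmasked only if none of the `span` possible start
    positions covering it was selected.
    """
    return 1.0 - (1.0 - p) ** span


def simulate_mask_fraction(p: float, span: int,
                           n_frames: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the overlapping-span coverage."""
    rng = random.Random(seed)
    masked = [False] * n_frames
    for start in range(n_frames):
        if rng.random() < p:
            for t in range(start, min(start + span, n_frames)):
                masked[t] = True
    return sum(masked) / n_frames


P, SPAN = 0.15, 4  # values quoted from Section 4.1
print("nominal:  ", nominal_mask_fraction(P, SPAN))
print("expected: ", round(expected_mask_fraction(P, SPAN), 4))
print("simulated:", round(simulate_mask_fraction(P, SPAN), 4))
```

Under the overlapping-span assumption the effective coverage is closer to 48% than 60%, which is one reason reproducers may want the exact masking routine rather than only the two hyperparameters.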

Abstract

We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. Leveraging 5,640 hours of crawled Creative Commons speech and combining it with publicly available datasets, we pre-train conformer-based BEST-RQ models up to 600M parameters. Our models are evaluated on dialect identification (DID) and automatic speech recognition (ASR) tasks, achieving state-of-the-art performance on the former while using fewer parameters than competing models. We demonstrate that family-targeted pre-training on Arabic dialects significantly improves downstream performance compared to multilingual or monolingual models trained on non-Arabic data. All models, code, and pre-processed datasets will be publicly released to support reproducibility and further research in Arabic speech technologies.
