Ara-BEST-RQ: Multi Dialectal Arabic SSL
Ara-BEST-RQ introduces dedicated self-supervised speech models for Arabic dialects. The authors curate 5,640 hours of Creative Commons Arabic speech covering 20 dialects and train Conformer-based BEST-RQ models up to 600M parameters. Their 300M model achieves state-of-the-art dialect identification performance using fewer parameters than competing Whisper-based systems. This work helps close the gap for underrepresented Arabic dialects in speech technology.
The paper presents a valuable resource for Arabic speech processing by releasing the first large-scale multi-dialectal Arabic SSL models alongside a substantial crawled dataset. The core finding—that targeted pre-training on Arabic dialects outperforms generic multilingual models on dialect identification—is compelling and well-demonstrated. However, erratic scaling behavior undermines the contribution: the 600M model significantly underperforms the 300M model on dialect identification, and the 300M model fails to converge on the larger combined dataset. These issues suggest fundamental training instabilities that are not adequately explained.
The 300M model's dialect identification results are impressive: it achieves 96.02% accuracy on the ADI-20 test set versus 94.83% for Whisper-large while using under half the parameters. On ASR, the model consistently outperforms same-sized baselines (HuBERT-large, XLS-R-128) across Egyptian, Moroccan, and Tunisian benchmarks, with WER reductions of 10+ points on MGB-3 and MGB-5. The data curation methodology is thorough, with explicit filtering for offensive content and verification that dialect tags are not sourced from unreliable YouTube geotags.
The scaling results are inverted and unexplained. The 600M model degrades to 91.05% accuracy on the DID test set compared to the 300M model's 96.02%, and, remarkably, w2v-BERT 2.0 using the same recipe did not converge at all. The authors acknowledge that "the 600M variants do not perform as well" but offer no explanation for why doubling the parameter count would harm performance. Similarly, the 300M model trained on the combined dataset fails to converge (validation loss 6.61 versus 3.81 on crawled data), suggesting the model cannot handle the diversity of the larger corpus. These failures indicate the training regime may not be robust across scales, yet the paper compares the best-performing 300M model against baselines without adequately contextualizing these instabilities.
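To put the quoted accuracies on a common footing, the following minimal sketch converts them to error rates and relative changes. Only the three accuracy figures come from the review above; the dictionary keys and helper names are hypothetical and chosen purely for illustration.

```python
# Accuracies (%) on the ADI-20 test set, as quoted in the review above.
# The dictionary keys are illustrative labels, not official model names.
ACCURACY = {
    "ara-best-rq-300m": 96.02,
    "ara-best-rq-600m": 91.05,
    "whisper-large": 94.83,
}


def error_rate(accuracy_pct: float) -> float:
    """Return the error rate in percentage points."""
    return 100.0 - accuracy_pct


def relative_error_change(baseline_pct: float, candidate_pct: float) -> float:
    """Fractional change in error rate from baseline to candidate.

    Negative values mean the candidate makes fewer errors than the baseline.
    """
    base = error_rate(baseline_pct)
    return (error_rate(candidate_pct) - base) / base


if __name__ == "__main__":
    vs_whisper = relative_error_change(
        ACCURACY["whisper-large"], ACCURACY["ara-best-rq-300m"]
    )
    vs_300m = relative_error_change(
        ACCURACY["ara-best-rq-300m"], ACCURACY["ara-best-rq-600m"]
    )
    print(f"300M vs Whisper-large: {vs_whisper:+.1%} errors")
    print(f"600M vs 300M:          {vs_300m:+.1%} errors")
```

Read this way, the 300M model makes roughly 23% fewer errors than Whisper-large, while the 600M model makes a bit more than twice as many errors as the 300M model, which is why the inverted scaling deserves an explanation.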
The comparisons to HuBERT-large and XLS-R-128 are fair, using identical fine-tuning recipes with three-layer feedforward networks. However, the w2v-BERT 2.0 comparison uses a linear probe, which "provides better performance", rather than the same architecture, potentially confounding the comparison. While the DID evaluation is rigorous, using the standard ADI-20 benchmark, the ASR evaluation aggregates results across MSA (Common Voice) and dialectal datasets (MGB-3, MGB-5, TARIC-SLU), making it unclear whether the model specifically excels at dialectal speech or at Arabic ASR generally. The paper claims "family-targeted pre-training... significantly improves downstream performance" but does not isolate the effect of dialectal data versus simply more Arabic data.
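One way to see why differing probe heads can confound a comparison is to count their trainable parameters. The sketch below is illustrative only: the 1024-dimensional encoder output and 1024-unit hidden layers are assumptions made here, not values from the paper, and "three-layer" is read as three dense layers including the output layer. Only the 20 dialect classes come from the text above.

```python
from typing import Sequence


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a single fully connected layer."""
    return n_in * n_out + n_out


def head_params(layer_sizes: Sequence[int]) -> int:
    """Total parameters of a feedforward head given its layer widths.

    layer_sizes lists the input width, any hidden widths, and the output width.
    """
    return sum(
        dense_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )


# Hypothetical dimensions for illustration (NOT taken from the paper).
FEATURE_DIM = 1024   # assumed encoder output width
HIDDEN_DIM = 1024    # assumed hidden width of the feedforward head
NUM_DIALECTS = 20    # ADI-20 has 20 dialect classes

linear_probe = head_params([FEATURE_DIM, NUM_DIALECTS])
three_layer_head = head_params([FEATURE_DIM, HIDDEN_DIM, HIDDEN_DIM, NUM_DIALECTS])

if __name__ == "__main__":
    print(f"linear probe:     {linear_probe:,} parameters")
    print(f"three-layer head: {three_layer_head:,} parameters")
```

Under these assumed widths the two heads differ by about two orders of magnitude in trainable parameters, so a result obtained with one head is not directly comparable to a result obtained with the other.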
The authors pledge to release "models, code, and pre-processed datasets" at https://github.com/elyadata/Ara-BEST-RQ. Training details are reasonably complete: masking probability $p=0.15$ with a mask length of 4 frames (yielding ~60% masked frames), 4096 codebook entries, a batch duration of 450 seconds, and hardware configurations (16×A100 for 300M, 32×H100 for 600M). However, critical hyperparameters including learning rate, optimizer, and training steps are omitted. The preprocessing pipeline using Silero VAD with 250ms segment merging is well-documented. The failure modes (300M not converging on combined data and 600M underperforming on DID) suggest significant hyperparameter sensitivity that future reproducers will need to navigate without explicit guidance from the authors.
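The ~60% figure matches the simple product of the masking probability and the span length, 0.15 × 4. The sketch below reproduces that arithmetic and, as a cross-check, estimates the covered fraction when spans are allowed to overlap. It assumes that $p$ is the per-frame probability of starting a mask span and that start decisions are independent; the paper's actual masking implementation may differ, and the function names are hypothetical.

```python
import random


def naive_masked_fraction(p: float, span: int) -> float:
    """Upper-bound estimate that ignores overlap between spans."""
    return min(1.0, p * span)


def expected_masked_fraction(p: float, span: int) -> float:
    """Expected coverage when each frame independently starts a span.

    A frame stays unmasked only if none of the `span` possible start
    positions covering it fired, which happens with probability (1 - p) ** span.
    """
    return 1.0 - (1.0 - p) ** span


def simulate_masked_fraction(
    p: float, span: int, n_frames: int = 100_000, seed: int = 0
) -> float:
    """Monte Carlo check of the overlap-aware estimate."""
    rng = random.Random(seed)
    masked = [False] * n_frames
    for start in range(n_frames):
        if rng.random() < p:
            # Mask `span` consecutive frames, truncated at the sequence end.
            for t in range(start, min(start + span, n_frames)):
                masked[t] = True
    return sum(masked) / n_frames


if __name__ == "__main__":
    p, span = 0.15, 4  # values quoted in the review above
    print(f"naive (p * span):     {naive_masked_fraction(p, span):.3f}")
    print(f"overlap-aware:        {expected_masked_fraction(p, span):.3f}")
    print(f"simulated (seed = 0): {simulate_masked_fraction(p, span):.3f}")
```

Under this independence assumption the overlap-aware coverage is closer to 48% than 60%, so reproducers may want to check which convention the released code follows.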
We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. Leveraging 5,640 hours of crawled Creative Commons speech and combining it with publicly available datasets, we pre-train conformer-based BEST-RQ models up to 600M parameters. Our models are evaluated on dialect identification (DID) and automatic speech recognition (ASR) tasks, achieving state-of-the-art performance on the former while using fewer parameters than competing models. We demonstrate that family-targeted pre-training on Arabic dialects significantly improves downstream performance compared to multilingual or monolingual models trained on non-Arabic data. All models, code, and pre-processed datasets will be publicly released to support reproducibility and further research in Arabic speech technologies.