SLURP-TN : Resource for Tunisian Dialect Spoken Language Understanding
SLURP-TN introduces a Spoken Language Understanding (SLU) dataset for Tunisian Arabic, a low-resource dialect. The authors translate and record six domains from the English SLURP corpus with 55 speakers across 18 geographic regions, emphasizing gender balance and code-switching phenomena. The dataset provides approximately five hours of audio across three acoustic conditions (clean, noisy, headphone) to enable robust benchmarking of ASR and SLU systems for dialectal Arabic.
The paper presents a solid methodological contribution for Tunisian Arabic SLU, with careful attention to speaker diversity and acoustic variability. However, the dataset covers only six of eighteen original SLURP domains and relies on a single annotator for semantic labels, raising concerns about annotation consistency and limited domain generalization. The benchmarking results convincingly demonstrate that code-switching and dialectal variation remain challenging for current SoTA models, with w2v-BERT 2.0 achieving 33.6% WER and SENSE reaching 59.0% CVER on the test set.
The dataset design exhibits thoughtful demographic balancing: speakers represent eighteen distinct Tunisian regions with nearly equal gender distribution (50.4% male, 49.6% female in training). The inclusion of code-switching—present in 54% of utterances—reflects authentic Tunisian speech patterns, and the three acoustic conditions (clean, noisy, headphone) enable robustness testing. The public release via HuggingFace and detailed metadata (age, gender, region) facilitate reproducible research.
The scope is limited: only six of the eighteen SLURP domains are covered (one-third of the original domains), which constrains generalization to unseen intents. More critically, SLU annotation was performed by a single professional annotator with no reported inter-annotator agreement (IAA) metrics, so label reliability and consistency cannot be assessed. The training set is modest (2 hours 46 minutes, 2677 segments) and is likely insufficient for training robust neural SLU systems from scratch without heavy reliance on pre-trained models.
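To make the IAA concern concrete, the following is a minimal sketch of Cohen's kappa, one common agreement statistic that could be reported if a second annotator relabeled a sample of utterances. The paper does not describe any such procedure; the function name `cohen_kappa` and the intent labels are hypothetical and serve only as illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Hypothetical helper: the reviewed paper reports no second annotator,
    so this only illustrates the kind of statistic that is missing.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    all_labels = set(count_a) | set(count_b)
    expected = sum(count_a[l] * count_b[l] for l in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up labels for four utterances (not taken from SLURP-TN).
    annotator_1 = ["intent_a", "intent_a", "intent_b", "intent_b"]
    annotator_2 = ["intent_a", "intent_b", "intent_b", "intent_b"]
    print(cohen_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few intents dominate the label distribution. Even a small doubly annotated subset would allow such a figure to be reported.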
The experimental evidence supports the claim that semantic enrichment trades off against transcription accuracy: SENSE outperforms w2v-BERT 2.0 on semantic metrics (CoER/CVER) but lags on WER/CER, consistent with findings from the authors' prior work. However, the paper omits direct experimental comparison with TARIC-SLU—the only existing Tunisian SLU corpus—despite citing it as related work and highlighting its single-domain limitation. Comparisons to SpeechMASSIVE's Arabic subset (only 115 training utterances) correctly highlight SLURP-TN's relative scale, though the absolute size remains small compared to English SLU resources.
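For readers less familiar with the transcription metrics cited above, here is a minimal sketch of how WER and CER are conventionally computed from a Levenshtein alignment. It is not the authors' evaluation code (the paper uses SpeechBrain), it applies no text normalization, and the example strings are English placeholders rather than SLURP-TN data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference token
                            curr[j - 1] + 1,      # insertion of a hypothesis token
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # One deletion ("on") and one substitution ("light" -> "lights") over 4 words.
    print(wer("turn on the light", "turn the lights"))  # 0.5
```

Because both rates are normalized by reference length, they can exceed 1.0 when the hypothesis contains many insertions. Corpus-level figures are usually obtained by summing edit counts and reference lengths over all utterances rather than averaging per-utterance rates.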
The experimental setup is generally reproducible: the authors use the open-source SpeechBrain toolkit, provide hyperparameters (Adam with a learning rate of $10^{-5}$ for encoders, Adadelta with a learning rate of 1.0 for downstream layers, batch size 16, 250 epochs), and release both dataset and models on HuggingFace. However, critical details are missing: random seeds are not specified, exact training duration is not reported, and hardware specifications beyond GPU type (H100/A100) are absent. The single-annotator setup also means that the semantic annotation phase cannot be replicated without the original annotator's guidelines or IAA metrics.
Spoken Language Understanding (SLU) aims to extract the semantic information from the speech utterance of user queries. It is a core component in a task-oriented dialogue system. With the spectacular progress of deep neural network models and the evolution of pre-trained language models, SLU has obtained significant breakthroughs. However, only a few high-resource languages have taken advantage of this progress due to the absence of SLU resources. In this paper, we seek to mitigate this obstacle by introducing SLURP-TN. This dataset was created by recording 55 native speakers uttering sentences in Tunisian dialect, manually translated from six SLURP domains. The result is an SLU Tunisian dialect dataset that comprises 4165 sentences recorded into around 5 hours of acoustic material. We also develop a number of Automatic Speech Recognition and SLU models exploiting SLURP-TN. The dataset and baseline models are available at: https://huggingface.co/datasets/Elyadata/SLURP-TN.