TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild

cs.CL · cs.LG · eess.AS

Kai-Wei Chang, Yi-Cheng Lin, Huang-Cheng Chou, Wenze Ren, Yu-Han Huang, Yun-Shao Tsai, Chien-Cheng Chen, Yu Tsao, Yuan-Fu Liao, Shrikanth Narayanan, James Glass, Hung-yi Lee · Mar 23, 2026

Plain-language introduction

This paper introduces TaigiSpeech, the first intent recognition dataset for Taiwanese Hokkien, a low-resource language spoken by 65% of Taiwanese elders. With 3,000+ utterances from 21 elderly speakers across emergency and smart-home scenarios, it addresses a critical gap in speech technology for aging populations. The authors also propose keyword-based and audio-visual mining strategies to bootstrap training data from unlabeled video sources.

Critical review

Verdict

Bottom line

TaigiSpeech represents a valuable contribution to low-resource spoken language understanding, specifically targeting the underserved elderly demographic in Taiwan. However, the dataset's small scale (21 speakers, ~6 hours) and the severe domain mismatch between mined in-the-wild data and real-world recordings (performance drops of 20+ percentage points) highlight the challenges of the proposed approach. While the audio-visual mining strategy shows promise for scalable data collection, its near-random performance on direct transfer to the target domain limits its immediate practical utility.

“Experimental results show a substantial performance degradation when models trained on mined in-the-wild data are evaluated on real-world elderly recordings, indicating a significant domain mismatch.”

paper · Abstract

“under the simplified two-class setting, the model may fail to learn robust semantic meaning, instead relying on superficial patterns that do not generalize across domains”

paper · Section 6.1

What holds up

The dataset design is methodologically sound. The authors use imagined scenarios and optional video stimuli to elicit naturalistic speech from elderly participants, aiming to capture the acoustic characteristics of distress calls and functional commands. The paper reports the domain mismatch candidly: models trained on mined drama data degrade significantly on elderly recordings, which is important empirical evidence for the field. The comparison with existing corpora (Tables 2 and 3) effectively contextualizes TaigiSpeech as the only dataset combining a low-resource language, elderly speakers, and emergency scenarios.

“Participants are encouraged to respond freely rather than read fixed scripts to increase linguistic diversity and better reflect real-world usage.”

paper · Section 3.2

“This finding underscores the necessity of TaigiSpeech as a realistic benchmark for evaluating and advancing practical spoken intent recognition systems in Taiwanese Hokkien.”

paper · Section 6.1

Main concerns

The primary limitation is scale: with only 21 speakers and ~3,000 utterances, the dataset is small compared to high-resource alternatives like SLURP or Speech Commands. The data mining strategies also face significant challenges. Keyword mining produces highly imbalanced intent distributions (e.g., only 55 true positives for CANCEL_ALERT out of 946 retrieved segments), while audio-visual mining achieves near-chance accuracy (~50-54%) on binary emergency detection when transferred to the target domain without fine-tuning. The paper also notes that fine-grained emergency labels are hard to obtain under audio-visual mining, so that setting is effectively limited to binary emergency detection rather than the full 8-intent task.

“For example, under Keyword Match Mining, the top-performing WavLM-large model drops from 92.36% on Drama to 70.00% on TaigiSpeech”

paper · Section 6.1

“we found that under the audio-visual mining setting, obtaining fine-grained emergency labels (e.g., distinguishing breath-related emergencies from fall events) remains challenging”

paper · Section 4.2
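The figures cited in this section can be turned into two simple quantities: the retrieval precision of keyword mining for CANCEL_ALERT and the size of the cross-domain accuracy drop. The sketch below only does that arithmetic on the numbers quoted above. The variable names are illustrative, and it assumes the 946 retrieved segments are all the segments retrieved for that one intent.

```python
# Numbers quoted in the review; variable names are illustrative only.
retrieved_segments = 946   # segments retrieved for CANCEL_ALERT (assumed scope)
true_positives = 55        # segments that actually express the intent

# Retrieval precision: fraction of retrieved segments that are correct.
precision = true_positives / retrieved_segments

# WavLM-large under Keyword Match Mining (Section 6.1 quote).
drama_acc = 92.36          # accuracy on Drama, in percent
taigispeech_acc = 70.00    # accuracy on TaigiSpeech, in percent

absolute_drop = drama_acc - taigispeech_acc   # in percentage points
relative_drop = absolute_drop / drama_acc     # as a fraction of Drama accuracy

print(f"CANCEL_ALERT retrieval precision: {precision:.1%}")
print(f"Absolute drop: {absolute_drop:.2f} percentage points")
print(f"Relative drop: {relative_drop:.1%}")
```

Roughly one retrieved segment in seventeen is a true CANCEL_ALERT example, which illustrates why the mined intent distribution is so imbalanced.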

Evidence and comparison

The authors provide thorough comparisons with existing spoken intent datasets (Table 2) and Taiwanese Hokkien corpora (Table 3), clearly positioning their contribution as the first targeting elderly speakers with scenario-driven expressive speech. The experimental evidence supports the claim that SSL models (HuBERT/WavLM) substantially outperform lightweight models (MatchboxNet) and cascaded ASR+LLM approaches (Table 9). However, the analysis of why audio-visual mining fails to generalize does not go beyond the hypothesis about superficial patterns, and the bootstrapped confidence intervals suggest high variance in some conditions.

“Experimental results indicate that this cascaded ASR+LLM approach achieves reasonable performance. However, it still requires substantially larger model parameters and computational resources, and its performance remains inferior to end-to-end speech classification models such as HuBERT and WavLM.”

paper · Section 6.4
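For readers unfamiliar with bootstrapped confidence intervals, the following is a minimal sketch of a percentile bootstrap for classification accuracy. The review does not say which bootstrap variant the paper uses, so the percentile method, the function name `bootstrap_accuracy_ci`, and the synthetic outcomes are all assumptions made for illustration.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy.

    `correct` is a list of 0/1 outcomes, one per test utterance.
    Hypothetical helper; not the paper's implementation.
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        # Resample utterances with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic outcomes (not from the paper): 70 correct out of 100.
outcomes = [1] * 70 + [0] * 30
point = sum(outcomes) / len(outcomes)
low, high = bootstrap_accuracy_ci(outcomes)
print(f"accuracy = {point:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

This version resamples individual utterances. With only 21 speakers, resampling whole speakers instead would likely give wider and arguably more honest intervals, since utterances from the same speaker are not independent.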

Reproducibility

The dataset will be released under CC BY 4.0, and the paper provides detailed appendices describing recording environments, text prompt generation, and scenario video prompts. The use of standard architectures (MatchboxNet, HuBERT, WavLM) and clear train/val/test splits (Table 8) facilitates reproducibility. However, the implementation details of the mining pipeline (specific LLM prompts for pseudo-labeling, exact PE-AV retrieval thresholds) are not fully specified, and the paper acknowledges using a smaller-scale exploration of 28k clips rather than the full 7,000-hour corpus due to computational constraints.

“TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages.”

paper · Abstract

“Due to computational constraints, rather than directly utilizing the full 7,000 hours of the drama dataset, we conduct a smaller-scale exploration using a dataset derived from keyword-matched queries”

paper · Section 5.1

Abstract

Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (aka Taiwanese Hokkien/Southern Min), which is a low-resource and primarily spoken language. The dataset is collected from older adults, comprising 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword match data mining with LLM pseudo labeling via an intermediate language and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset can be found on https://kwchang.org/taigispeech.
