TaigiSpeech: A Low-Resource Real-World Speech Intent Dataset and Preliminary Results with Scalable Data Mining In-the-Wild
This paper introduces TaigiSpeech, the first intent recognition dataset for Taiwanese Hokkien—a low-resource language spoken by 65% of Taiwanese elders. With 3,000+ utterances from 21 elderly speakers across emergency and smart-home scenarios, it addresses a critical gap in speech technology for aging populations. The authors also propose keyword-based and audio-visual mining strategies to bootstrap training data from unlabeled video sources.
TaigiSpeech represents a valuable contribution to low-resource spoken language understanding, specifically targeting the underserved elderly demographic in Taiwan. However, the dataset's small scale (21 speakers, ~6 hours) and the severe domain mismatch between mined in-the-wild data and real-world recordings (performance drops of 20+ percentage points) highlight the challenges of the proposed approach. While the audio-visual mining strategy shows promise for scalable data collection, its near-random performance on direct transfer to the target domain limits its immediate practical utility.
The dataset design is methodologically sound. The authors employ imagined scenarios and optional video stimuli to elicit naturalistic speech from elderly participants, capturing genuine acoustic characteristics of distress and functional commands. The paper's honest reporting of domain mismatch issues—showing that models trained on mined drama data degrade significantly on elderly recordings—provides important empirical evidence for the field. The comprehensive comparison with existing corpora (Tables 2 and 3) effectively contextualizes TaigiSpeech as the only dataset combining low-resource language, elderly speakers, and emergency scenarios.
The primary limitation is scale: with only 21 speakers and ~3,000 utterances, the dataset is small compared to high-resource alternatives like SLURP or Speech Commands. The data mining strategies also face significant challenges. Keyword mining produces highly imbalanced intent distributions with low precision for some intents (e.g., only 55 true positives for CANCEL_ALERT out of 946 retrieved segments), while audio-visual mining achieves near-chance accuracy (~50-54%) on binary emergency detection when transferred to the target domain without fine-tuning. The paper notes that fine-grained intent classification remains challenging under audio-visual mining, so that strategy is effectively evaluated on binary emergency detection rather than on the full 8-intent task.
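To put the CANCEL_ALERT figure in perspective, the retrieval precision implied by the numbers quoted above can be worked out directly. The snippet below is a minimal sketch: the function name `retrieval_precision` is an illustrative choice rather than anything from the paper, and the only inputs are the two counts reported in this review.

```python
def retrieval_precision(true_positives: int, retrieved: int) -> float:
    """Fraction of retrieved segments that actually carry the target intent."""
    if retrieved <= 0:
        raise ValueError("retrieved must be a positive count")
    if not 0 <= true_positives <= retrieved:
        raise ValueError("true_positives must lie between 0 and retrieved")
    return true_positives / retrieved


# Counts quoted in the review for the CANCEL_ALERT intent.
cancel_alert_precision = retrieval_precision(true_positives=55, retrieved=946)
print(f"CANCEL_ALERT keyword-mining precision: {cancel_alert_precision:.1%}")
# prints "CANCEL_ALERT keyword-mining precision: 5.8%"
```

In other words, roughly 17 of every 18 segments retrieved for this intent would have to be filtered out or relabeled before they could serve as training data.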
The authors provide thorough comparisons with existing spoken intent datasets (Table 2) and Taiwanese Hokkien corpora (Table 3), clearly positioning their contribution as the first targeting elderly speakers with scenario-driven expressive speech. The experimental evidence supports the claim that SSL models (HuBERT/WavLM) substantially outperform lightweight models (MatchboxNet) and cascaded ASR+LLM approaches (Table 9). However, the analysis of why audio-visual mining fails to generalize—beyond hypothesizing about superficial patterns—remains limited, and the bootstrapped confidence intervals suggest high variance in some conditions.
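For readers unfamiliar with bootstrapped confidence intervals, the following is a minimal sketch of the percentile bootstrap applied to classification accuracy. The review does not state which bootstrap variant the authors use, so the procedure, the function name `bootstrap_accuracy_ci`, and the toy data are all assumptions made purely for illustration.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    `correct` is a list of 0/1 flags, one per test utterance
    (1 = prediction matched the reference label).
    """
    n = len(correct)
    if n == 0:
        raise ValueError("need at least one prediction")
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_resamples):
        # Resample the test set with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        accuracies.append(sum(sample) / n)
    accuracies.sort()
    lower = accuracies[int((alpha / 2) * n_resamples)]
    upper = accuracies[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy data (not from the paper): 52 correct predictions out of 100.
toy_flags = [1] * 52 + [0] * 48
low, high = bootstrap_accuracy_ci(toy_flags)
print(f"accuracy = 0.52, 95% CI = [{low:.2f}, {high:.2f}]")
```

With a small test set the interval is wide, which is why a near-chance result such as 50-54% on a binary task is hard to distinguish from random guessing.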
The dataset will be released under CC BY 4.0, and the paper provides detailed appendices describing recording environments, text prompt generation, and scenario video prompts. The use of standard architectures (MatchboxNet, HuBERT, WavLM) and clear train/val/test splits (Table 8) facilitates reproducibility. However, the implementation details of the mining pipeline (specific LLM prompts for pseudo-labeling, exact PE-AV retrieval thresholds) are not fully specified, and the paper acknowledges using a smaller-scale exploration of 28k clips rather than the full 7,000-hour corpus due to computational constraints.
Speech technologies have advanced rapidly and serve diverse populations worldwide. However, many languages remain underrepresented due to limited resources. In this paper, we introduce TaigiSpeech, a real-world speech intent dataset in Taiwanese Taigi (also known as Taiwanese Hokkien or Southern Min), a low-resource and primarily spoken language. The dataset is collected from older adults and comprises 21 speakers with a total of 3k utterances. It is designed for practical intent detection scenarios, including healthcare and home assistant applications. To address the scarcity of labeled data, we explore two data mining strategies with two levels of supervision: keyword-match data mining with LLM pseudo-labeling via an intermediate language, and an audio-visual framework that leverages multimodal cues with minimal textual supervision. This design enables scalable dataset construction for low-resource and unwritten spoken languages. TaigiSpeech will be released under the CC BY 4.0 license to facilitate broad adoption and research on low-resource and unwritten languages. The project website and the dataset are available at https://kwchang.org/taigispeech.