Assessing the Ability of Neural TTS Systems to Model Consonant-Induced F0 Perturbation
This paper investigates whether neural text-to-speech (TTS) systems capture consonant-induced F0 perturbation, a fine-grained phonetic effect in which voiceless obstruents raise and voiced obstruents lower fundamental frequency relative to sonorants. The authors propose a segmental-level prosodic probing framework that compares Tacotron 2 and FastSpeech 2 against natural speech, stratifying by lexical frequency to test memorization versus abstraction. This matters because TTS evaluation often misses sub-phonemic articulatory detail that distinguishes human-like phonetic competence from surface pattern matching.
The paper presents compelling evidence that current TTS architectures rely on lexical memorization rather than abstract phonetic encoding, though the primary experiment is limited to two architectures and a single-speaker corpus. The finding that both the autoregressive and the non-autoregressive model fail to generalize perturbation effects to low-frequency words is robust and well supported, but broader claims about neural TTS systems would benefit from testing more diverse architectures such as diffusion or flow-based models.
The frequency-stratified design effectively distinguishes memorization from generalization, showing that both models reproduce expected F0 perturbation patterns (voiceless > sonorant > voiced) for high-frequency words while failing for low-frequency items. The use of sonorants as a baseline aligns with established phonetic practice, and the GAMM statistical approach appropriately handles time-varying trajectories with proper autocorrelation modeling.
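The logic of the frequency-stratified comparison can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes each token has already been reduced to a single onset F0 value in Hz, whereas the paper models full trajectories with GAMMs. The names `Token`, `perturbation_by_stratum`, and `shows_expected_order`, and all the toy values, are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Token:
    onset_class: str  # "voiceless", "voiced", or "sonorant"
    freq_band: str    # lexical frequency stratum, e.g. "high" or "low"
    f0_hz: float      # onset F0 summary (assumed measure, not from the paper)


def perturbation_by_stratum(tokens):
    """Return {freq_band: {onset_class: mean F0 minus sonorant mean F0}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for t in tokens:
        grouped[t.freq_band][t.onset_class].append(t.f0_hz)

    result = {}
    for band, classes in grouped.items():
        if "sonorant" not in classes:
            continue  # no baseline available for this stratum
        baseline = mean(classes["sonorant"])
        result[band] = {
            cls: mean(vals) - baseline
            for cls, vals in classes.items()
            if cls != "sonorant"
        }
    return result


def shows_expected_order(diffs):
    """True if voiceless > sonorant baseline (0) > voiced in one stratum."""
    return diffs.get("voiceless", 0.0) > 0.0 > diffs.get("voiced", 0.0)


# Toy values chosen only to illustrate the two outcomes described above.
toy = [
    Token("voiceless", "high", 210.0),
    Token("sonorant", "high", 200.0),
    Token("voiced", "high", 192.0),
    Token("voiceless", "low", 199.0),
    Token("sonorant", "low", 200.0),
    Token("voiced", "low", 201.0),
]
diffs = perturbation_by_stratum(toy)
for band in ("high", "low"):
    print(band, diffs[band], shows_expected_order(diffs[band]))
```

In this toy data the high-frequency stratum shows the expected ordering and the low-frequency stratum does not, which mirrors the pattern the review attributes to the synthetic speech.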
The paper tests only two TTS architectures trained on a single-speaker corpus (LJ Speech), limiting generalizability to modern diffusion or flow-based models. The frequency-based proxy for training exposure is practical but noisy, although the direct seen/unseen comparison mitigates this. Most critically, the study lacks perceptual validation: it assumes that the absence of F0 perturbation degrades naturalness without demonstrating that human listeners actually notice these deviations.
The authors acknowledge this gap, noting that establishing causal links requires targeted experiments, but this leaves the practical impact of the findings uncertain. Additionally, the large-scale Experiment 2 uses the In-the-Wild dataset with unknown model training conditions, weakening claims about architectural generalization.
The paper accurately situates its findings within the phonetics literature, correctly noting that voiceless obstruents raise F0 while voiced obstruents show variable effects. The comparison between Tacotron 2 and FastSpeech 2 reveals architectural differences in error patterns, with neither model succeeding at generalization. However, the claim that these results hold for TTS architectures broadly rests on a single follow-up experiment whose training conditions are unclear, which leaves that claim weakly supported.
Reproduction is feasible because the authors use publicly available datasets (LJ Speech, COCA, In-the-Wild) and pretrained models with MFA alignment. However, the exact training hyperparameters needed to replicate the models from scratch are underspecified. The computational cost of the GAMM analysis with word-level factor smooths required subsampling to 1,000 tokens per condition. The authors report this transparently, but it limits granularity for the low-frequency group, where words often had only one or two tokens.
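For readers attempting a replication, the per-condition subsampling step might look like the sketch below. It is an assumption-laden illustration: the review does not say how the authors drew their samples or how a condition is defined, so the function `subsample_per_condition`, the fixed seed, and the (system, frequency band, onset class) condition key are all hypothetical.

```python
import random
from collections import defaultdict


def subsample_per_condition(tokens, key, cap=1000, seed=0):
    """Keep at most `cap` randomly chosen tokens for each condition.

    `key` maps a token to its condition label; conditions with `cap`
    or fewer tokens are kept whole.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    by_condition = defaultdict(list)
    for tok in tokens:
        by_condition[key(tok)].append(tok)

    sampled = []
    for cond in sorted(by_condition):  # sorted for a deterministic order
        group = by_condition[cond]
        if len(group) > cap:
            group = rng.sample(group, cap)  # sample without replacement
        sampled.extend(group)
    return sampled


# Hypothetical tokens: (system, frequency band, onset class, token id).
tokens = [("tacotron2", "high", "voiceless", i) for i in range(2500)]
tokens += [("tacotron2", "low", "voiced", i) for i in range(3)]

kept = subsample_per_condition(tokens, key=lambda t: t[:3])
print(len(kept))
```

A cap of this kind leaves sparse conditions untouched, so it does nothing to address low-frequency words that have only one or two tokens to begin with.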
This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models' ability to reproduce consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems' ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.