Assessing the Ability of Neural TTS Systems to Model Consonant-Induced F0 Perturbation

cs.CL cs.AI cs.SD Tianle Yang, Chengzhe Sun, Phil Rose, Cassandra L. Jacobs, Siwei Lyu · Mar 22, 2026

AI Review

Plain-language introduction

This paper investigates whether neural text-to-speech (TTS) systems capture consonant-induced F0 perturbation, a fine-grained phonetic effect in which voiceless obstruents raise and voiced obstruents lower fundamental frequency (F0) relative to sonorants. The authors propose a segmental-level prosodic probing framework that compares Tacotron 2 and FastSpeech 2 against natural speech, stratifying words by lexical frequency to test memorization versus abstraction. This matters because TTS evaluation often misses sub-phonemic articulatory detail that distinguishes human-like phonetic competence from surface pattern matching.

Critical review

Verdict

Bottom line

The paper presents compelling evidence that the examined TTS architectures rely on lexical memorization rather than abstract phonetic encoding, though the primary experiment is limited to two architectures and a single-speaker corpus. The finding that both the autoregressive and the non-autoregressive model fail to generalize perturbation effects to low-frequency words is robust and well supported, but broader claims about neural TTS systems would benefit from testing more diverse architectures, such as diffusion or flow-based models.

“Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding.”

paper · Abstract

What holds up

The frequency-stratified design effectively distinguishes memorization from generalization, showing that both models reproduce expected F0 perturbation patterns (voiceless > sonorant > voiced) for high-frequency words while failing for low-frequency items. The use of sonorants as a baseline aligns with established phonetic practice, and the GAMM statistical approach appropriately handles time-varying trajectories with proper autocorrelation modeling.

“For high-frequency words, repeated exposure allows the model to reproduce stable and separable onset-conditioned f0 patterns.”

paper · Section 6

“In the high-frequency condition, both the natural recordings (LJ Speech) and the synthetic outputs produced by the TTS models (FastSpeech 2 and Tacotron 2) exhibit similar f0 perturbation patterns.”

paper · Section 4
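The frequency-stratified comparison against a sonorant baseline can be illustrated with a small sketch. This is not the authors' pipeline (they fit GAMMs to full F0 trajectories); it only shows the underlying idea of measuring onset-conditioned F0 relative to sonorants within each frequency group. The field names (`freq_group`, `onset`, `f0`), both function names, and the idea of one summary F0 value per token are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def perturbation_by_group(tokens):
    """Mean F0 per onset type, expressed relative to the sonorant
    baseline of the same frequency group.

    tokens: iterable of dicts with hypothetical keys
        'freq_group' (e.g. 'high' or 'low'),
        'onset'      ('voiceless', 'voiced', or 'sonorant'),
        'f0'         (one summary F0 value for the post-onset vowel).
    Returns {freq_group: {onset: mean_f0 - sonorant_mean_f0}}.
    """
    values = defaultdict(list)
    for tok in tokens:
        values[(tok["freq_group"], tok["onset"])].append(tok["f0"])
    means = {key: mean(vals) for key, vals in values.items()}

    result = {}
    for (group, onset), m in means.items():
        baseline = means.get((group, "sonorant"))
        if baseline is None:
            continue  # no sonorant tokens in this group, so no baseline
        result.setdefault(group, {})[onset] = m - baseline
    return result


def shows_expected_ordering(diffs):
    """True if voiceless > sonorant (0 by construction) > voiced."""
    return diffs.get("voiceless", 0) > 0 > diffs.get("voiced", 0)
```

Applied to per-token measurements from natural and synthetic speech, a system that has learned the effect abstractly would be expected to pass `shows_expected_ordering` in both frequency groups; passing only in the high-frequency group would be the memorization signature the review describes.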

Main concerns

The paper tests only two TTS architectures trained on a single-speaker corpus (LJ Speech), which limits generalizability to modern diffusion or flow-based models. The frequency-based proxy for training exposure is practical but noisy, although the direct seen/unseen comparison mitigates this. Most critically, the study lacks perceptual validation: it assumes that the absence of F0 perturbation degrades naturalness without demonstrating that human listeners actually notice these deviations.

The authors acknowledge this gap, noting that establishing causal links requires targeted experiments, but this leaves the practical impact of the findings uncertain. Additionally, the large-scale Experiment 2 uses the In-the-Wild dataset with unknown model training conditions, weakening claims about architectural generalization.

“FastSpeech 2 fails to produce any systematic difference in f0 contours or height across the three onset types, suggesting a lack of learned segmental-prosodic distinctions in these unseen items.”

paper · Section 4

“Establishing a direct causal link between segmental-level deviations and listener judgments would require targeted perceptual experiments, such as controlled listening tests or systematic parameter manipulations.”

paper · Section 6

Evidence and comparison

The paper accurately situates its findings within the phonetics literature, correctly noting that voiceless obstruents raise F0 while voiced obstruents show variable effects. The comparison between Tacotron 2 and FastSpeech 2 reveals architectural differences in error patterns, with neither model succeeding at generalization. However, the claim that these results extend to TTS architectures broadly rests on a single follow-up experiment whose models have unknown training conditions, which weakens that broader claim.

“The slight difference between the generation results of the two models might be due to the potential role of explicit duration and pitch predictors in FastSpeech 2, which might smooth or oversimplify local prosodic dynamics.”

paper · Section 6

“While the overall onset-conditioned f0 patterns remain observable in the In-the-Wild dataset, the effects appear less obvious than those found in the single-speaker LJ Speech corpus.”

paper · Section 5

Reproducibility

Reproduction is feasible because the authors use publicly available datasets (LJ Speech, COCA, In-the-Wild) and pretrained models, with alignment from the Montreal Forced Aligner (MFA). However, the exact training hyperparameters needed to replicate the models from scratch are underspecified. Because GAMMs with word-level factor smooths are computationally expensive, the authors subsampled 1,000 tokens per condition. They report this transparently, but it limits granularity in the low-frequency group, where words often had only one or two tokens.

“All the speech corpus, lexical corpus, and speech models used in this study are publicly available online, and we have provided corresponding references in the experimental setting section for the purpose of reproduction.”

paper · Appendix A

“Incorporating all the words in a corpus as a factor-smooth term is computationally intractable and unnecessary. Each level of the factor requires estimating a separate smooth, which results in high memory consumption and increased model complexity.”

paper · Section 2.3
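The subsampling step mentioned above can be sketched as a seeded, capped draw per condition. The review only states the cap of 1,000 tokens per condition; how the paper defines a condition and how it draws the sample are not given here, so the `key` function, the fixed seed, and the function name below are assumptions for illustration.

```python
import random
from collections import defaultdict


def subsample_per_condition(tokens, key, cap=1000, seed=0):
    """Keep at most `cap` tokens per condition, drawn without replacement.

    tokens: iterable of token records (any type).
    key:    function mapping a token to its condition label, for example
            (freq_group, onset); the label definition is an assumption.
    cap:    maximum tokens kept per condition (1,000 as reported).
    seed:   fixed seed so the draw can be repeated exactly.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for tok in tokens:
        buckets[key(tok)].append(tok)

    sample = []
    for condition in sorted(buckets):  # sorted for a stable output order
        items = buckets[condition]
        if len(items) > cap:
            items = rng.sample(items, cap)
        sample.extend(items)
    return sample
```

Conditions with fewer tokens than the cap are kept whole, which is one way sparse low-frequency conditions would end up with very few tokens per word.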

Abstract

This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models' ability to reproduce consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems' ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.
