TiCo: Time-Controllable Training for Spoken Dialogue Models

cs.CL cs.AI eess.AS Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass · Mar 23, 2026
AI Review
Plain-language introduction

TiCo tackles a critical gap in spoken dialogue models: the inability to control response duration, which is essential for time-constrained scenarios like driving assistants or emergency healthcare. Unlike text length control, speech duration depends on complex factors including phonetics, prosody, and speaking rate. The paper proposes Spoken Time Markers (STMs)—special tokens like <15.0 seconds> inserted during generation—to enable real-time temporal awareness. Using a two-stage post-training framework (self-generated supervised fine-tuning followed by reinforcement learning with verifiable rewards), TiCo equips models to estimate elapsed time and adjust content dynamically to meet target durations.
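To make the marker format concrete, here is a minimal Python sketch of how Spoken Time Markers could be interleaved into a transcript once word-level end times are available (for example, from an ASR aligner). The function name `insert_time_markers`, the fixed `interval` placement rule, and the toy word timings are all assumptions for illustration; the paper's actual placement policy is not described in this review.

```python
def insert_time_markers(words, interval=5.0):
    """Interleave '<X.X seconds>' markers into a word sequence.

    words: list of (text, end_time_seconds) pairs, e.g. from an ASR aligner.
    A marker is emitted after the first word that ends at or past each
    interval boundary, and once more after the final word.
    The placement rule is an illustrative assumption, not the paper's.
    """
    pieces = []
    next_mark = interval
    last_end = 0.0
    for text, end in words:
        pieces.append(text)
        last_end = end
        if end >= next_mark:
            pieces.append(f"<{end:.1f} seconds>")
            # Skip every boundary this word has already passed.
            while next_mark <= end:
                next_mark += interval
    final = f"<{last_end:.1f} seconds>"
    if words and pieces[-1] != final:
        pieces.append(final)
    return " ".join(pieces)


# Toy timings, invented for the example.
words = [("Sure,", 0.4), ("here", 0.7), ("is", 0.9), ("a", 1.0),
         ("tip.", 1.6), ("Stay", 2.3), ("calm.", 3.1)]
marked = insert_time_markers(words, interval=1.5)
print(marked)
```

The last marker in the sequence is the model's estimate of total spoken duration, which is the quantity the duration reward in Stage 2 is computed from.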

Critical review
Verdict
Bottom line

TiCo presents an elegant and practical solution to duration control in spoken dialogue systems. The core mechanism of interleaving temporal markers into intermediate representations is well-motivated, and the two-stage training pipeline, which combines self-generated alignment data with GRPO using verifiable rewards, demonstrates strong empirical gains, reducing MAE from 13.01 s to 4.54 s on TiCo-Bench. However, the evaluation is limited to a single backbone architecture (Qwen2.5-Omni-7B), and the reliance on ASR-based alignment (Whisper medium) for training data construction ties marker quality to ASR timestamp accuracy, which could hinder reproduction in other domains.

“TiCo achieves the best overall performance with 4.54s MAE and 14.9% MAPE, substantially improving over its base model Qwen2.5-Omni-7B (13.01 s / 42.3%)”
Chang et al., Table 1 · Section 5.1
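For readers unfamiliar with the two headline metrics, the following Python sketch shows how MAE (in seconds) and MAPE (in percent) are conventionally computed from instructed and realized durations. The helper name `duration_errors` and the sample values are illustrative assumptions and are not taken from TiCo-Bench.

```python
def duration_errors(targets, realized):
    """Return (MAE in seconds, MAPE in percent) over paired durations."""
    if len(targets) != len(realized) or not targets:
        raise ValueError("need two non-empty lists of equal length")
    abs_errors = [abs(r - t) for t, r in zip(targets, realized)]
    mae = sum(abs_errors) / len(abs_errors)
    # MAPE normalizes each error by its instructed duration.
    mape = 100.0 * sum(e / t for e, t in zip(abs_errors, targets)) / len(targets)
    return mae, mape


# Made-up durations in seconds, for illustration only.
targets = [10.0, 20.0, 40.0]
realized = [12.0, 19.0, 41.6]
mae, mape = duration_errors(targets, realized)
print(f"MAE = {mae:.2f} s, MAPE = {mape:.1f}%")
```

Because MAPE divides by the instructed duration, a fixed error of a few seconds counts more heavily against short targets than long ones, which is why the review reports both numbers.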
What holds up

The Spoken Time Marker mechanism effectively bridges the semantic-to-acoustic gap by providing explicit temporal checkpoints during generation. The Stage 1 self-generation approach is particularly clever: it avoids requiring new question-answer supervision by using the model's own outputs aligned via ASR, which keeps the training data consistent with the model's output distribution. The Stage 2 RL formulation with the Gaussian reward function $\mathcal{R}_{\text{main}}^{(g)} = F(t_{\text{inst}} - t_{\text{last}}^{(g)})$ provides a direct, verifiable, and smoothly graded signal for duration accuracy. Furthermore, the auxiliary rewards (presence, monotonicity, repetition/copy penalties) demonstrate careful attention to RL failure modes like reward hacking and marker collapse.

“self-generation offers two advantages: (1) it removes the need for collecting paired question-answer supervision, and (2) the generated responses follow the model's own output distribution”
Chang et al., Sec. 3.1 · Section 3.1
“$\mathcal{R}_{\text{main}}^{(g)}=F\left(t_{\text{inst}}-t_{\text{last}}^{(g)}\right)$... We instantiate $F$ as a Gaussian function”
Chang et al., Sec. 3.2 · Section 3.2
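The reward above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: the width `sigma`, the zero reward for responses without markers, and the helper names (`marker_times`, `main_reward`, `is_monotonic`) are assumptions, since the review does not give the paper's exact values or the form of the auxiliary terms.

```python
import math
import re

# Matches markers such as "<15.0 seconds>" and captures the number.
MARKER = re.compile(r"<(\d+(?:\.\d+)?) seconds>")


def marker_times(text):
    """Extract all marker values, in order of appearance."""
    return [float(m) for m in MARKER.findall(text)]


def main_reward(text, t_inst, sigma=2.0):
    """Gaussian reward on t_inst - t_last; sigma is an assumed width."""
    times = marker_times(text)
    if not times:
        return 0.0  # assumption: no marker means no duration reward
    diff = t_inst - times[-1]
    return math.exp(-(diff ** 2) / (2 * sigma ** 2))


def is_monotonic(text):
    """True if marker values strictly increase (a monotonicity check)."""
    times = marker_times(text)
    return all(a < b for a, b in zip(times, times[1:]))


sample = "First point. <5.0 seconds> Second point. <15.0 seconds>"
print(main_reward(sample, t_inst=15.0), is_monotonic(sample))
```

The reward peaks at 1 when the last marker equals the instructed duration and decays smoothly as the gap grows, so near misses are still preferred over large ones within a GRPO group.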
Main concerns

The method assumes the existence of an intermediate textual representation (the 'Thinker' module), limiting applicability to SDM architectures that explicitly separate semantic planning from acoustic generation. While TiCo claims architecture-agnosticism, all experiments use only Qwen2.5-Omni-7B. The ASR-based alignment step (using Whisper medium) introduces a compounding error risk: timestamp inaccuracies directly corrupt STM supervision. Figure 5 reveals residual duration gaps (e.g., 41.6 s realized vs. a 40.0 s target), indicating that the speech generator's tempo may not perfectly match the Thinker's estimates. Additionally, the CHORD regularization, while stabilizing RL training, introduces sensitive hyperparameters ($\mu_{\text{peak}}=0.8$, $\mu_{\text{valley}}=0.3$, 500-step decay) that may require extensive tuning for different model scales.

“The close alignment indicates that the final time marker accurately estimates realized speech duration”
Chang et al., Fig. 5 · Figure 5 caption
“CHORD interleaves SFT updates with GRPO updates using a mixing coefficient $\mu$ that decays from $\mu_{\text{peak}}=0.8$ to $\mu_{\text{valley}}=0.3$ over 500 steps”
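As a rough illustration of the schedule in this quote, the Python sketch below decays $\mu$ from 0.8 to 0.3 over 500 steps. The linear shape of the decay and the weighted-sum combination in `mixed_loss` are assumptions made for the example; the review does not state the decay curve or exactly how the SFT and GRPO updates are combined.

```python
def chord_mu(step, mu_peak=0.8, mu_valley=0.3, decay_steps=500):
    """Mixing coefficient at a given step (linear decay is an assumption)."""
    if step <= 0:
        return mu_peak
    if step >= decay_steps:
        return mu_valley
    frac = step / decay_steps
    return mu_peak + (mu_valley - mu_peak) * frac


def mixed_loss(sft_loss, grpo_loss, mu):
    """Hypothetical weighted combination of the two objectives."""
    return mu * sft_loss + (1.0 - mu) * grpo_loss


for step in (0, 250, 500, 1000):
    print(step, round(chord_mu(step), 3))
```

Under this reading, early training leans on the SFT signal and later training leans on the RL signal, which is why the peak, valley, and decay length interact and may need retuning together.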
Evidence and comparison

The evidence supports the core claim that TiCo significantly improves duration controllability. The TiCo-Bench evaluation is comprehensive, comparing against cascaded pipelines (GPT-5.2 + IndexTTS-2), commercial APIs (GPT-audio, Kimi Audio), and open-source baselines. The generalization experiments (Section 5.2) provide convincing evidence of cross-modal transfer (training on speech queries, testing on text) and length extrapolation (training on $\leq$41s, testing up to 60s). However, the GPT-score evaluation relies on a proprietary model (GPT-5-mini), which may introduce evaluation bias, and the benchmark's 1,440 samples, while diverse in source (InstructS2S, UROBench, LIFEBench), represent a limited domain compared to general conversational speech.

“TiCo consistently outperforms all baselines across both datasets... and both duration settings”
Chang et al., Table 1 · Section 5.1
“TiCo maintains consistently low MAE and MAPE across instructed-duration bins... TiCo can extrapolate its time-control capability to durations up to 1 minute”
Chang et al., Sec. 5.2 · Section 5.2
Reproducibility

Reproducibility is moderately strong but has gaps. The paper provides detailed training hyperparameters in Appendix B: LoRA configurations ($r=8, \alpha=16/32$), learning rates ($5\times 10^{-5}$ for SFT, $5\times 10^{-6}$ for GRPO), batch sizes (effective 32), and the GRPO clipping parameter $\varepsilon=0.2$. The use of open-source tools (MS-SWIFT, Whisper-timestamped) aids reproducibility. However, the paper does not explicitly mention code release. Reproduction requires substantial compute (4 NVIDIA A6000 GPUs) and access to specific external components (IndexTTS-2 for cascaded baselines, the proprietary GPT-5-mini for evaluation). The reliance on Whisper medium for timestamp generation in Stage 1 creates a dependency on external ASR quality that could vary across datasets.

“Stage 1... LoRA (r=8, $\alpha$=16)... peak $5\times 10^{-5}$... Stage 2... GRPO with CHORD... learning rate $5\times 10^{-6}$”
“Word-level timestamps for constructing Spoken Time Markers are obtained using Whisper medium”
Chang et al., Sec. 4.2 · Section 4.2
Abstract

We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.
