SqueezeComposer: Temporal Speed-up is A Simple Trick for Long-form Music Composing

eess.AS cs.CL cs.SD Jianyi Chen, Rongxiu Zhong, Shilei Zhang, Kun Qian, Jinglei Liu, Yike Guo, Wei Xue · Mar 22, 2026
AI Review
Plain-language introduction

This paper proposes SqueezeComposer, a long-form music generation framework that tackles computational constraints by applying temporal speed-up (e.g., 2×, 4×, 8×) to compress audio sequences before generation. The core idea is to generate music in an accelerated domain using diffusion models, then restore it to normal speed, theoretically enabling models to produce 10+ minute compositions with fixed memory budgets. The approach is tested on continuation, completion, and singing accompaniment tasks.
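To make the fixed-memory argument concrete, the sketch below counts how many frames a model would have to handle for a clip at each speed-up factor. The frame rate of 50 frames per second and the function name `squeezed_frames` are assumptions for illustration only; this summary does not state the paper's actual representation or how it performs the speed-up.

```python
import math

def squeezed_frames(duration_s: float, frame_rate_hz: float, speedup: int) -> int:
    """Number of frames left after playing the audio `speedup` times faster."""
    if speedup < 1:
        raise ValueError("speedup must be >= 1")
    # Speeding up by a factor k shortens the clip to duration / k seconds,
    # so the frame count shrinks by the same factor.
    return math.ceil(duration_s * frame_rate_hz / speedup)

# Hypothetical frame rate; 600 s is the longest duration cited in the review.
FRAME_RATE_HZ = 50
for k in (1, 2, 4, 8):
    print(f"{k}x: {squeezed_frames(600, FRAME_RATE_HZ, k)} frames")
```

Under these assumptions the sequence length falls linearly with the speed-up factor. Because self-attention cost grows quadratically with sequence length, the saving in attention layers would be larger than linear.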

Critical review
Verdict
Bottom line

The temporal speed-up trick is pragmatic and delivers large efficiency gains: on continuation, SqueezeComposer at 4× reaches a real-time factor (RTF) of ~0.08, compared to ~10.5 for PyramidCodec. However, the central hypothesis, that AI models genuinely "understand" accelerated audio as an abstraction, remains an untested assumption. The method trades reconstruction fidelity for computational scalability, particularly at the highest compression ratio (8×), where Mel distance rises from 1.21 for the BigVGAN baseline to 4.58.

“SqueezeComposer_x4 (continuation) ... RTF 0.078 ... PyramidCodec ... RTF 10.490”
paper · Table 3
“BigVGAN-Squeeze-8 ... Mel_dis 4.5814 ... BigVGAN ... Mel_dis 1.2115”
paper · Table 1
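Assuming RTF follows the usual definition (generation time divided by the duration of the generated audio, so values below 1 are faster than real time), the Table 3 figures quoted above can be turned into a relative speed-up and an implied wall-clock time. The helper names and the 60-second clip length are illustrative assumptions, not values from the paper.

```python
def speedup_ratio(rtf_baseline: float, rtf_method: float) -> float:
    """How many times faster the method is than the baseline."""
    return rtf_baseline / rtf_method

def generation_time_s(rtf: float, audio_duration_s: float) -> float:
    """Wall-clock seconds implied by an RTF, assuming RTF = time / duration."""
    return rtf * audio_duration_s

# RTF values as quoted from Table 3 (continuation task).
RTF_SQUEEZE_X4 = 0.078
RTF_PYRAMIDCODEC = 10.490

ratio = speedup_ratio(RTF_PYRAMIDCODEC, RTF_SQUEEZE_X4)
print(f"relative speed-up: {ratio:.1f}x")

# Hypothetical 60 s clip, chosen only to make the numbers tangible.
print(f"SqueezeComposer_x4: {generation_time_s(RTF_SQUEEZE_X4, 60.0):.2f} s")
print(f"PyramidCodec:       {generation_time_s(RTF_PYRAMIDCODEC, 60.0):.2f} s")
```

The ratio says nothing about where the gap comes from; it combines the effect of squeezing with any difference between the underlying model families.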
What holds up

The efficiency gains are substantial and well-documented. The framework successfully generates music up to 600 seconds (10 minutes) with an RTF of 0.037 using 8× squeezing, while competitors like MusicGen are limited to 15-second clips. The hierarchical paradigm—treating accelerated audio as coarse structure—intuitively aligns with music composition principles. The compatibility with off-the-shelf vocoders after fine-tuning (Table 2) demonstrates practical deployability without requiring full pipeline retraining.

“SqueezeComposer_x8 ... Duration 600s ... RTF 0.037 ... MusicGen ... Duration 15s”
paper · Table 4
“BigVGAN-Squeeze-4(⋆) ... Mel_dis 1.8367 ... effectively closing the gap introduced by acceleration”
paper · Table 2
Main concerns

The out-of-domain generalization is weak: on MUSDB18, SqueezeComposer achieves an FAD of 4.1104 versus SingSong's 0.9084 (lower is better), indicating significant distribution shift when faced with unfamiliar audio. The claim that models "understand" accelerated audio is merely asserted; no perceptual or representation-level experiments validate it. At 8× compression, objective metrics degrade substantially (STFT distance 0.89 vs 0.50), suggesting the "simple trick" has hard limits on fidelity. The speed comparison with PyramidCodec is also potentially misleading: PyramidCodec is autoregressive and therefore constrained by sequential inference, whereas SqueezeComposer uses parallel diffusion, so part of the RTF gap reflects the model family rather than the speed-up trick itself.

“SqueezeComposer ... FAD 4.1104 ... SingSong ... FAD 0.9084”
paper · Table 5
“BigVGAN-Squeeze-8 ... STFT_dis 0.8864 ... BigVGAN ... STFT_dis 0.5005”
paper · Table 1
“we hypothesize that AI models can understand and generate accelerated audio over time”
paper · Introduction
Evidence and comparison

The evidence strongly supports efficiency claims but offers mixed support for quality. While AudioBox-Aesthetics scores (CE, CU, PC, PQ) are competitive with PyramidCodec on in-domain data, the reconstruction distances in Table 1 reveal that squeezing introduces measurable distortion even with fine-tuning (Mel_dis 1.84 vs 1.21 at 4×). The baseline comparisons are fair for diffusion-based methods (AudioLDM, MusicLDM) but less so for autoregressive approaches where length limitations are architectural rather than methodological. The absence of ablation studies on intermediate speed-up ratios (e.g., comparing 2× vs 4× vs 8× on the same task) limits understanding of the speed-quality trade-off.

“SqueezeComposer_x4 ... CE 6.8499 ... PyramidCodec ... CE 6.6442”
paper · Table 3
“BigVGAN-Squeeze-4(⋆) ... Mel_dis 1.8367 ... Waveform_dis 0.1798”
paper · Table 2
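The size of the distortion is easier to judge as a relative change. The sketch below expresses the Mel distances quoted in this review as a fractional increase over the BigVGAN baseline. It assumes the Table 1 baseline is a fair reference for the Table 2 row, as the review's own "1.84 vs 1.21" comparison does; the function name `relative_increase` is illustrative.

```python
def relative_increase(value: float, baseline: float) -> float:
    """Fractional increase of `value` over `baseline` (0.5 means 50% higher)."""
    return (value - baseline) / baseline

# Mel distances as quoted in the review; lower is better.
BASELINE_MEL = 1.2115  # BigVGAN, Table 1
rows = {
    "BigVGAN-Squeeze-8 (Table 1)": 4.5814,
    "BigVGAN-Squeeze-4, fine-tuned per the review (Table 2)": 1.8367,
}

for name, mel in rows.items():
    print(f"{name}: {relative_increase(mel, BASELINE_MEL):+.0%} Mel distance")
```

Read this way, fine-tuning at 4× narrows the gap without closing it, while the 8× setting without fine-tuning leaves a distance several times the baseline.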
Reproducibility

Reproducibility is significantly hampered by the reliance on a private dataset of approximately 400,000 songs for training the accompaniment model, though Lakh MIDI and MUSDB18 are public. The paper does report some settings (6 cross-attention layers in the prior encoder, 8 DiT layers, 500k training steps on 8 H800 GPUs), but it omits other critical hyperparameters: channel dimensions, CNN prior architecture specifics, and training learning rates. No code repository is mentioned. The experimental protocol for diffusion sampling (number of steps, scheduler type) is not specified, making independent reproduction of the RTF metrics difficult. The fine-tuning procedure for the vocoders (dataset size, iterations) is also underspecified.

“Song Data: A private dataset of approximately 400,000 songs”
paper · Section 5.1
“The prior encoder uses 6 cross-attention layers, the DiT uses 8 layers, and the entire model is trained for 500k steps using 8 H800 GPUs”
paper · Section 4.2
Abstract

Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (speeded-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at https://SqueezeComposer.github.io/.
