Efficient Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution
Diffusion models generate high-quality images but require hundreds of denoising steps, making deployment on edge devices impractical. This paper proposes Coarse-to-Fine Diffusion Models that start with low-resolution denoising early in the process (when outputs are noisy anyway) before switching to high-resolution, plus a fast time-step search method that finds good sampling schedules in under 10 minutes instead of days.
The paper presents two sound ideas, progressive resolution denoising and efficient time-step search, that together achieve meaningful efficiency gains. However, the evaluation is limited to small-scale unconditional generation, and the claim of "near-lossless" performance with 80-90% computation reduction conflates multiple techniques (C2F, TRD, and reduced step counts), which obscures their individual contributions. The work is technically competent but incomplete: it lacks comparisons with modern alternatives such as consistency models, adversarial distillation, and latent diffusion approaches that address the same problem.
The core insight that early denoising steps produce indistinguishable coarse features is well-supported by visual evidence and PCA rank analysis across resolution transitions. The observation that "ranks initially decrease and then enhance during the denoising process" provides a principled way to select the high-resolution transition point without brute-force search. The TRD method's speed—under 10 minutes versus "more than one day" for evolutionary search—is a genuine practical improvement enabled by using L2 loss on small calibration sets rather than FID evaluation.
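To illustrate why scoring schedules with an L2 loss on a small calibration set is cheap, here is a minimal sketch. It is not the paper's TRD algorithm: the toy drift function, the Euler sampler, the two candidate schedules, and all names (`euler_sample`, `l2_score`, `toy_drift`) are assumptions made for illustration. Only the calibration set size of 16 and the use of an L2 criterion come from the paper as described above.

```python
import random

def euler_sample(x0, times, drift):
    """Integrate a scalar toy ODE along a decreasing list of time points."""
    x = x0
    for t_cur, t_next in zip(times, times[1:]):
        x = x + (t_next - t_cur) * drift(x, t_cur)
    return x

def l2_score(times, calib, reference, drift):
    """Mean squared error between a schedule's outputs and reference outputs."""
    errs = [(euler_sample(x0, times, drift) - ref) ** 2
            for x0, ref in zip(calib, reference)]
    return sum(errs) / len(errs)

def toy_drift(x, t):
    # Hypothetical stand-in for a denoiser; not a diffusion model.
    return x * t

def uniform(n):
    """n uniform steps from t=1 down to t=0."""
    return [1.0 - i / n for i in range(n + 1)]

def quadratic(n):
    """n steps from t=1 down to t=0, spaced more densely near t=0."""
    return [(1.0 - i / n) ** 2 for i in range(n + 1)]

rng = random.Random(0)
calib = [rng.gauss(0.0, 1.0) for _ in range(16)]  # calibration set of size 16

# Reference outputs from a fine schedule (assumed to be the "full" sampler).
reference = [euler_sample(x0, uniform(1000), toy_drift) for x0 in calib]

candidates = {"uniform_10": uniform(10), "quadratic_10": quadratic(10)}
scores = {name: l2_score(times, calib, reference, toy_drift)
          for name, times in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

Each candidate costs only 16 short sampling runs and a subtraction, whereas an FID-based criterion would need thousands of generated images per candidate. That cost difference is the plausible source of the reported speedup.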
The 80-90% computation reduction figure bundles C2F, step reduction via TRD, and the quadratic MACs savings from lower resolution inputs, making it impossible to assess C2F's standalone contribution. The evaluation is narrow: it covers only unconditional generation on CIFAR10 (32×32) and LSUN-Church (256×256), with no text-conditional results and no comparison to consistency models or adversarial distillation methods that achieve single-step generation. The FID improvements from TRD are modest in absolute terms: Fig. 6 shows overlapping error bars for many configurations. No ablation isolates TRD's benefit independently of C2F. The paper also lacks discussion of failure cases: when does coarse-to-fine fail, and are there class or content dependencies?
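The bundling concern can be made concrete with a back-of-the-envelope calculation. The sketch below assumes per-step cost scales with the square of the image side length (reasonable for convolutional layers, optimistic for attention layers), and every step count and resolution in it is hypothetical rather than taken from the paper.

```python
def relative_macs(schedule, full_steps, full_res):
    """Cost of a sampling schedule relative to a full-resolution baseline.

    schedule: list of (num_steps, resolution) pairs.
    Assumes per-step MACs are proportional to resolution ** 2.
    """
    baseline = full_steps * full_res ** 2
    cost = sum(steps * res ** 2 for steps, res in schedule)
    return cost / baseline

# Hypothetical baseline: 100 steps at 32x32.
FULL_STEPS, FULL_RES = 100, 32

# Step reduction alone: 20 steps, all at full resolution.
step_only = relative_macs([(20, 32)], FULL_STEPS, FULL_RES)

# Coarse-to-fine alone: same 100 steps, first half at 16x16.
c2f_only = relative_macs([(50, 16), (50, 32)], FULL_STEPS, FULL_RES)

# Both combined: 20 steps, first half at 16x16.
combined = relative_macs([(10, 16), (10, 32)], FULL_STEPS, FULL_RES)

print(f"step reduction only: {1 - step_only:.1%} saved")
print(f"coarse-to-fine only: {1 - c2f_only:.1%} saved")
print(f"combined:            {1 - combined:.1%} saved")
```

Under these assumed numbers the combined saving is 87.5%, but 80 points of it come from step reduction and coarse-to-fine on its own saves only 37.5%. A per-technique breakdown of this kind is what the paper would need to report for C2F's contribution to be assessed.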
The comparison to Diff-Pruning in Fig. 6 is favorable but insufficient—Diff-Pruning is a model compression method from 2022, not a step-reduction technique. The paper omits comparisons to more relevant contemporaries: consistency models (which enable single-step sampling), progressive distillation, or DiT-style architectures that scale efficiently. The claim that TRD "preserves image quality in fewer steps" is only validated against uniform step sequences, not learned schedulers or ODE solvers designed for few-step generation. The multi-resolution fine-tuning ablation in Table I shows their 1DM-label strategy achieves FID 4.91 versus 4.46 for the original model—a small but meaningful gap that is not discussed.
Critical resources for reproduction are missing: no code repository link, no pretrained model checkpoints, and no exact hyperparameter specifications for reproducible training. While Section IV mentions "fine-tune 100k and 375k iterations," learning rates, batch sizes, and optimizer settings are unspecified. The calibration set for TRD is stated as size 16 but its composition (random samples? class-balanced?) is not described. Multi-resolution fine-tuning requires architecture modifications (resolution labels added to time embeddings) that are only conceptually described. Without implementation details, independent reproduction would require substantial re-engineering.
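To show how much is left open, here is one possible reading of "resolution labels added to time embeddings", written in plain Python. The class name, the lookup-table design, the embedding dimension, and the initialization scale are all assumptions; the paper may combine the two signals differently (for example by concatenation or through an MLP).

```python
import math
import random

def sinusoidal_embedding(t, dim):
    """Standard sinusoidal time-step embedding; dim must be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

class ResolutionConditionedEmbedding:
    """Hypothetical sketch: add a per-resolution vector to the time embedding.

    In a real model the table entries would be trained during
    multi-resolution fine-tuning; here they are just randomly initialized.
    """

    def __init__(self, resolutions, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.table = {
            r: [rng.gauss(0.0, 0.02) for _ in range(dim)] for r in resolutions
        }

    def __call__(self, t, resolution):
        t_emb = sinusoidal_embedding(t, self.dim)
        r_emb = self.table[resolution]  # raises KeyError for unseen resolutions
        return [a + b for a, b in zip(t_emb, r_emb)]

embed = ResolutionConditionedEmbedding(resolutions=[16, 32], dim=8)
low = embed(t=10, resolution=16)
high = embed(t=10, resolution=32)
print(len(low), low != high)
```

Even this small sketch forces several choices (addition versus concatenation, table versus learned projection, initialization) that the paper does not specify, which is why the missing implementation details matter for reproduction.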
Recently, diffusion models (DMs) have made significant strides in high-quality image generation. However, the multi-step denoising process often results in considerable computational overhead, impeding deployment on resource-constrained edge devices. Existing methods mitigate this issue by compressing models and adjusting the time step sequence. However, they overlook input redundancy and require lengthy search times. In this paper, we propose Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution. Recognizing indistinguishable early-stage generated images, we introduce Coarse-to-Fine Denoising (C2F) to reduce computation during coarse feature generation. Furthermore, we design Time Step Sequence Redistribution (TRD) for efficient sampling trajectory adjustment, requiring less than 10 minutes for search. Experimental results demonstrate that the proposed methods achieve near-lossless performance with an 80% to 90% reduction in computation on CIFAR10 and LSUN-Church.