SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis
Remote sensing text-to-image generation suffers from a lack of domain-specific diffusion transformers and prohibitive costs for high-resolution training. Existing training-free resolution promotion methods apply static RoPE scaling that uniformly compresses the spatial spectrum, which is particularly harmful for RS imagery due to its characteristically denser high-frequency energy. This paper proposes SHARP, a spectrum-aware dynamic adaptation strategy that uses a rational decay scheduler $\kappa_{rs}(t)$ to apply strong positional extrapolation early in denoising (for layout formation) while progressively relaxing it later (for detail recovery). The approach enables robust multi-scale generation up to 2.5$\times$ extrapolation factors with negligible overhead, addressing a critical gap in large-scale RS synthesis.
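The review does not reproduce the exact form of $\kappa_{rs}(t)$, so the following is only a minimal sketch of the general idea: a rational decay that applies the full extrapolation factor at the start of denoising and relaxes it toward 1 by the end. The functional form, the `gamma` parameter, the `progress` convention (0 = first denoising step, 1 = last), and both function names are hypothetical and are not taken from the paper.

```python
def rational_decay(progress: float, gamma: float = 4.0) -> float:
    """Hypothetical rational weight: 1 at progress=0, 0 at progress=1.

    `gamma` (an assumed knob, not a paper hyperparameter) controls how
    quickly the weight drops after the early layout-formation steps.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    return (1.0 - progress) / (1.0 + gamma * progress)


def effective_scale(progress: float, target_scale: float, gamma: float = 4.0) -> float:
    """Blend between full extrapolation (target_scale) and none (1.0)."""
    return 1.0 + (target_scale - 1.0) * rational_decay(progress, gamma)


# Five evenly spaced denoising stages at a 2.5x extrapolation factor.
schedule = [effective_scale(i / 4, target_scale=2.5) for i in range(5)]
```

Under these assumptions the schedule starts at the full 2.5 factor and decreases monotonically to 1.0, with most of the drop occurring in the first half of denoising. A linear or cosine weight could be substituted for `rational_decay` to mimic the alternatives the ablations compare against.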
The paper presents a well-motivated and empirically validated solution to an important domain-specific problem. The core insight, that static RoPE extrapolation disproportionately suppresses the high-frequency content critical to RS realism, is supported by both spectral analysis and frequency-progressive denoising theory. The proposed rational scheduler is simple yet effective, and the comprehensive multi-scale evaluation demonstrates consistent gains over training-free baselines. However, the reliance on reference-free metrics without downstream task validation leaves open questions about the practical utility of the synthesized imagery for actual RS applications.
The spectral analysis section provides compelling empirical evidence that RS imagery contains systematically stronger medium- and high-frequency energy than natural images, validating the domain-specific motivation. The mathematical derivation of the frequency-progressive denoising behavior ($\rho(f,t)$ and $t_c(f)$) offers a principled foundation for the time-dependent scheduling strategy. The resolution-agnostic formulation is particularly elegant, using dimensionless ratios $r(d)=L_{\mathrm{target}}/\lambda_d$ and normalized timesteps to enable multi-scale generation from a single hyperparameter set.
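To make the dimensionless ratio concrete, the sketch below computes $r(d)=L_{\mathrm{target}}/\lambda_d$ using the standard RoPE wavelength $\lambda_d = 2\pi \cdot b^{2d/D}$. The base $b=10000$, the per-axis dimension $D=56$, and the conversion of a 2560-pixel side into 160 tokens (16 pixels per token) are illustrative assumptions, not values confirmed by the review.

```python
import math


def rope_wavelengths(dim: int, base: float = 10000.0) -> list[float]:
    """Wavelength, in token positions, of each RoPE frequency pair.

    Standard RoPE uses theta_d = base ** (-2d / dim), hence
    lambda_d = 2 * pi / theta_d.
    """
    return [2 * math.pi * base ** (2 * d / dim) for d in range(dim // 2)]


def wavelength_ratios(l_target: int, dim: int, base: float = 10000.0) -> list[float]:
    """Dimensionless ratio r(d) = L_target / lambda_d for every pair."""
    return [l_target / lam for lam in rope_wavelengths(dim, base)]


# Assumed setting: 160 tokens along one axis, 56 rotary dims on that axis.
ratios = wavelength_ratios(l_target=160, dim=56)

# Pairs that complete at least one full period within the target length.
full_period_pairs = [d for d, r in enumerate(ratios) if r >= 1.0]
```

Because $r(d)$ depends only on the ratio of target length to wavelength, the same thresholds on $r$ can be reused at any resolution, which is the property the review describes as resolution-agnostic.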
The evaluation relies exclusively on automated metrics (CLIP Score, Aesthetic Score, HPSv2), which may not capture RS-specific geometric fidelity requirements; the authors acknowledge that downstream task validation is missing. The rational scheduler $\kappa_{rs}(t)$, while effective, remains a hand-designed heuristic with limited theoretical justification for its specific functional form beyond empirical tuning. Additionally, the maximum tested extrapolation factor of 2.5$\times$ (2560$\times$2560) is modest compared to the 4$\times$ or 8$\times$ targets pursued by some natural-image resolution promotion work, leaving questions about scalability to very high resolutions where attention costs become prohibitive.
The evidence strongly supports the claim that dynamic scheduling outperforms static baselines (PI, NTK-aware, YaRN) across six resolutions, with margins widening as extrapolation factors increase. The ablation studies rigorously validate each component, showing that fine-tuning and SHARP provide complementary gains and that the rational form specifically outperforms linear and cosine alternatives. However, the paper does not compare against training-based resolution promotion methods or state-of-the-art RS super-resolution approaches, making it difficult to assess the absolute quality ceiling. The claim of 'resolution-agnostic' behavior is well supported by the rectangular resolution experiments and multi-scale consistency visualizations.
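For readers unfamiliar with the static baselines, the sketch below shows two of them in their commonly published forms: Position Interpolation (PI), which divides every rotary frequency by the scale factor, and the NTK-aware rule, which enlarges the RoPE base to $b \cdot s^{D/(D-2)}$. YaRN is omitted for brevity. The values `dim=56`, `base=10000`, and the function names are illustrative assumptions rather than details from the paper.

```python
def rope_freqs(dim: int, base: float = 10000.0) -> list[float]:
    """Standard RoPE angular frequencies theta_d = base ** (-2d / dim)."""
    return [base ** (-2 * d / dim) for d in range(dim // 2)]


def pi_freqs(dim: int, scale: float, base: float = 10000.0) -> list[float]:
    """Position Interpolation: dividing positions by `scale` is the same
    as dividing every frequency by `scale` (uniform compression)."""
    return [f / scale for f in rope_freqs(dim, base)]


def ntk_aware_freqs(dim: int, scale: float, base: float = 10000.0) -> list[float]:
    """NTK-aware scaling: enlarge the base so the highest frequency is
    unchanged and the lowest frequency is divided by `scale`."""
    new_base = base * scale ** (dim / (dim - 2))
    return rope_freqs(dim, new_base)


original = rope_freqs(56)
pi = pi_freqs(56, scale=2.5)
ntk = ntk_aware_freqs(56, scale=2.5)
```

Both rules are fixed for the whole denoising trajectory: the scale factor enters once and never changes with the timestep, which is the "static" behavior the dynamic schedule is contrasted with.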
Reproducibility is generally strong: the code and model weights are publicly available, the dataset construction pipeline from GeoChat using Qwen-VL is clearly described, and all hyperparameters ($\alpha_s=3$, $\alpha=1$, $\beta=32$) are disclosed. The training configuration (AdamW, lr=$1\times 10^{-5}$, 10K steps, batch size 64) is specified. However, the exact prompt set used for evaluation is generated by GPT-5.4 and may not be fully reproducible if the model version or sampling parameters change. While the computational overhead is negligible (<1.5%), independent reproduction requires significant GPU resources for the RS-FLUX fine-tuning (8$\times$A6000 GPUs).
Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule $\kappa_{rs}(t)$ into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at https://github.com/bxuanz/SHARP.