Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers

CVPR 2026

1Dept. of Electrical and Computer Engineering, 2INMC & IPAI
Seoul National University, Republic of Korea

Abstract

Diffusion transformers (DiTs) offer excellent scalability for high-fidelity generation, but their computational overhead poses a great challenge for practical deployment. Existing acceleration methods primarily exploit the temporal dimension, whereas spatial acceleration remains underexplored. In this work, we investigate spatial acceleration for DiTs via latent upsampling. We find that naïve latent upsampling for spatial acceleration introduces artifacts, primarily due to aliasing in high-frequency edge regions and mismatches arising from noise-timestep discrepancies. Based on these findings and analyses, we propose a training-free spatial acceleration framework, dubbed Region-Adaptive Latent Upsampling (RALU), that mitigates these artifacts through mixed-resolution latent upsampling. RALU achieves artifact-free, efficient acceleration by upsampling artifact-prone edge regions early and matching noise timesteps across latent resolutions, leading to up to 7.0\(\times\) speedup on FLUX.1-dev and 3.0\(\times\) on Stable Diffusion 3 with negligible quality degradation. Furthermore, RALU is complementary to existing temporal acceleration methods and timestep-distilled models, yielding up to 15.9\(\times\) speedup.

Visual Fidelity at a 7\(\times\) Speedup


Generated 1024\(\times\)1024 images using acceleration methods on FLUX.1-dev at a 7\(\times\) speedup. While temporal acceleration methods struggle under aggressive speedups and Bottleneck Sampling introduces artifacts, our RALU accelerates generation while avoiding artifacts and maintaining high image quality.



Challenges in Spatial Acceleration for DiTs


(a) An example of aliasing artifacts generated using FLUX.1-dev with 9 low-resolution steps, 2\(\times\) upsampling, and 9 full-resolution steps. (b) Edge energy and aliasing-artifact ratio vs. upsampling timestep, averaged over 100 images.




(a) An example of mismatching artifacts generated using FLUX.1-dev with early upsampling \(t_{up}=0.3\) and noise injection. (b) ImageReward score and mismatching-artifact ratio vs. JSD, averaged over 100 images.



Region-Adaptive Latent Upsampling (RALU)


Overview of the proposed RALU framework. RALU consists of three stages at different resolutions: (1) low-resolution sampling for early denoising, (2) mixed-resolution sampling that upsamples edge-region latents, and (3) full-resolution refinement that upsamples all remaining latents. (a) We select the top-\(r\) fraction of patches with the strongest edge signals from the decoded image and upsample them early. (b) We add correlated noise to the upsampled latents and design a corresponding timestep schedule.
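The edge-region selection in (a) can be sketched as follows. This is a minimal illustration assuming a simple gradient-based edge-energy measure over non-overlapping patches; the paper's exact edge detector, patch size, and selection ratio \(r\) may differ.

```python
import numpy as np

def select_edge_patches(image, patch=16, r=0.3):
    """Select the top-r fraction of patches by edge energy.

    Illustrative sketch: `patch`, the finite-difference edge measure, and
    the default `r` are assumptions, not the paper's exact choices.
    """
    gray = image.mean(axis=-1) if image.ndim == 3 else image
    # Finite-difference gradients as a simple edge-energy proxy.
    gy, gx = np.gradient(gray.astype(np.float64))
    energy = gx**2 + gy**2
    ph, pw = gray.shape[0] // patch, gray.shape[1] // patch
    # Sum edge energy within each non-overlapping patch.
    patch_energy = (
        energy[: ph * patch, : pw * patch]
        .reshape(ph, patch, pw, patch)
        .sum(axis=(1, 3))
    )
    # Mark the top-r fraction of patches for early upsampling.
    k = max(1, int(round(r * ph * pw)))
    idx = np.argsort(patch_energy.ravel())[::-1][:k]
    mask = np.zeros(ph * pw, dtype=bool)
    mask[idx] = True
    return mask.reshape(ph, pw)
```

The returned boolean mask marks which patches receive early upsampling; all remaining patches stay at low resolution until the full-resolution refinement stage.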



Resolving artifacts from naïve latent upsampling. Aliasing artifacts are avoided by (B) early upsampling, while mismatching artifacts are mitigated by (C) noise and timestep matching.
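The noise and timestep matching in (C) can be sketched under a rectified-flow schedule \(x_t = (1-t)\,x_0 + t\,\epsilon\), as used by FLUX and SD3. The blending coefficients and the correlated-noise construction below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def renoise_upsampled(z_low, t_src, t_dst, scale=2, rng=None):
    """Upsample a latent and inject correlated noise to match timestep t_dst.

    Hypothetical sketch assuming a rectified-flow schedule
    x_t = (1 - t) * x0 + t * eps; the paper's exact coefficients and
    noise-correlation scheme may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    assert 0.0 <= t_src <= t_dst < 1.0
    # Nearest-neighbor upsampling (repeat along both spatial axes).
    z_up = z_low.repeat(scale, axis=-2).repeat(scale, axis=-1)
    # Spatially correlated noise: low-resolution Gaussian noise upsampled
    # the same way, so its correlation matches the upsampled latent.
    eps = (
        rng.standard_normal(z_low.shape)
        .repeat(scale, axis=-2)
        .repeat(scale, axis=-1)
    )
    # Rescale the signal to the target timestep, then top up with noise so
    # the per-element noise std equals t_dst (assuming independent noise).
    a = (1.0 - t_dst) / (1.0 - t_src)
    b = np.sqrt(t_dst**2 - (a * t_src) ** 2)
    return a * z_up + b * eps
```

Denoising then resumes from \(t_{dst}\) on the upsampled latent, which is why a matching timestep schedule must accompany the noise injection.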




Experiments


Quantitative comparisons of RALU with the baselines on (Top) FLUX.1-dev (FLUX) and (Bottom) Stable Diffusion 3 (SD3). Performance is evaluated with CLIP-IQA and NIQE for image quality, T2I-CompBench and GenEval for image-text alignment, and ImageReward for both. The number in parentheses next to FLUX indicates the total number of inference steps. ↑ / ↓ denotes that a higher / lower metric is favorable. Speedup (Speed.) is calculated relative to the base model FLOPs (floating-point operations). T and S denote the temporal and spatial acceleration, respectively. Additional computational metrics are reported in the supplementary material.




(Top) Quantitative results of integrating temporal acceleration methods into RALU under a 5\(\times\) speedup setting on FLUX. (Bottom) Quantitative results of adapting RALU on timestep-distilled models (FLUX.1-schnell, SD3.5L-Turbo). Speedups are measured relative to FLUX.1-dev and Stable Diffusion 3.5 Large, respectively. D denotes the timestep-distilled model. The \(W\) value of TaylorSeer denotes the number of warm-up steps.




Qualitative comparison of images generated by baseline methods and RALU on FLUX and SD3 under various speedups. For FLUX, we compare at 5\(\times\) and 7\(\times\) speedups; for SD3, at 2\(\times\) and 3\(\times\). NFE (number of function evaluations) refers to the number of inference steps. Zoomed-in regions on the right highlight that RALU preserves fine-grained details and avoids artifacts more effectively than the other baselines, even under high speedups. More results are provided in the supplementary material. Best viewed in zoom.

BibTeX

@article{jeong2025upsample,
      title={Upsample what matters: Region-adaptive latent sampling for accelerated diffusion transformers},
      author={Jeong, Wongi and Lee, Kyungryeol and Seo, Hoigi and Chun, Se Young},
      journal={arXiv preprint arXiv:2507.08422},
      year={2025}
    }