Training-free, Perceptually Consistent Low-Resolution Previews with High-Resolution Image for Efficient Workflows of Diffusion Models

CVPR 2026

1Dept. of Electrical and Computer Engineering, 2INMC & IPAI
Seoul National University, Republic of Korea

The arXiv preprint is not yet publicly available.

Abstract

Image generative models have become indispensable tools for producing exquisite high-resolution (HR) images, serving everyone from general users to professional designers. However, reaching a desired outcome often requires generating a large number of HR images with different prompts and seeds, incurring high computational cost for both users and service providers. Generating low-resolution (LR) images first could alleviate this burden, but it is not straightforward to generate LR images that are perceptually consistent with their HR counterparts. Here, we consider the task of generating high-fidelity LR images, called Previews, that preserve perceptual similarity to their HR counterparts, enabling an efficient workflow in which users identify promising candidates before generating the final HR image. We propose the commutator-zero condition to ensure LR-HR perceptual consistency for flow matching models, leading to a training-free solution based on downsampling matrix selection and commutator-zero guidance. Extensive experiments show that our method generates LR images with up to 33% less computation while maintaining HR perceptual consistency. When combined with existing acceleration techniques, our method achieves up to a 3\(\times\) speedup. Moreover, our formulation can be extended to image manipulations, such as warping and translation, demonstrating its generalizability.

Motivation

Intro

Motivation of Preview Generation. Most users of generative AI produce diverse candidate images through repeated trials with multiple seeds (or prompts) to obtain a desired result. Our goal is to accelerate this workflow by generating, in a training-free manner, low-resolution images that share the same content as their high-resolution counterparts but can be produced much faster.



Proposed Method


Comparison between SR-upsampled and directly generated HR images. Using FLUX.1-dev, we (a) generated a low-resolution (LR, 256\(\times\)256) image followed by 4\(\times\) super-resolution (SR) to obtain a high-resolution (HR, 1024\(\times\)1024) image, and (b) directly generated an HR (1024\(\times\)1024) image. As shown in the close-up view, (a) fails to recover eye-region and fine fur details lost at the LR stage, whereas (b) preserves both global structure and fine details with high fidelity.


Compliance of the trajectory.

To assess the feasibility of producing an LR image identical to the downsampled version of its HR counterpart, we define a flow ODE over the downsampled trajectory \( x_t^\downarrow \) as follows:

\[ d x_t^\downarrow = \mathbf{D} \, v_\theta(x_t, t)\, dt, \]

where \( \mathbf{D} \in \mathbb{R}^{\frac{hw}{s^2} \times hw} \) is a downsampling matrix with height \( h \), width \( w \), and scale factor \( s \). Let \( x_1^\downarrow \) and \( x_1 \) denote the final LR and HR images obtained from the trajectories \( \{x_t^\downarrow\} \) and \( \{x_t\} \), respectively. For the Preview generation task to be valid, the compliance condition \( x_1^\downarrow = \mathbf{D} x_1 \) should hold. Although Zhang et al.~\cite{zhang2024flow} noted that this condition is not strictly satisfied in learned flow matching models, they reported strong empirical performance under this assumption. Similarly, we observe that assuming compliance for the downsampling operator yields consistent and effective results.
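As a minimal numerical sanity check (NumPy, 1-D toy latent; the linear field and all sizes here are our illustrative choices, not the authors' implementation), Euler-integrating the downsampled flow ODE alongside the HR trajectory confirms that compliance holds by construction when the LR trajectory is driven by \( \mathbf{D} v_\theta(x_t, t) \):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 8, 2                        # 1-D toy latent of length 8, scale factor 2
m = n // s

# Nearest-neighbor downsampling: keep the first sample of each length-s block.
D = np.zeros((m, n))
for i in range(m):
    D[i, i * s] = 1.0

# Toy linear stand-in for v_theta (purely illustrative).
A = 0.1 * rng.standard_normal((n, n))

def v(x, t):
    return A @ x

# Euler-integrate the HR trajectory x_t and the downsampled trajectory
# d x^down = D v_theta(x_t, t) dt from t = 0 to t = 1.
x = rng.standard_normal(n)
x_down = D @ x
dt = 0.05
for k in range(20):
    vel = v(x, k * dt)
    x_down = x_down + dt * (D @ vel)   # LR trajectory driven by the HR velocity
    x = x + dt * vel

# Compliance x_1^down = D x_1 holds by construction under this ODE.
print(np.allclose(x_down, D @ x))      # True
```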


Commutativity of flow matching.

We aim to quickly synthesize the downsampled image that preserves the content of the original high-resolution image. To this end, we accelerate sampling by feeding the learned flow matching model \( v_\theta \) with the downsampled latent \( \mathbf{D}x_t \) instead of the high-resolution latent \( x_t \). However, this substitution raises a critical issue: Does the following commutator condition hold?

\[ [\mathbf{D}, v_\theta](x_t, t) \triangleq \mathbf{D} v_\theta(x_t, t) - v_\theta(\mathbf{D}x_t, t) \stackrel{?}{=} \mathbf{0}. \]

In general, this commutator-zero condition does not hold. Thus, we propose a method that minimizes the norm \( \|[\mathbf{D}, v_\theta](x_t, t)\| \) with respect to the controllable elements \( \mathbf{D} \) and \( x_t \).
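A toy check makes the condition concrete (NumPy, 1-D analogue; the velocity `v` is an invented stand-in for \( v_\theta \), not the actual network). For a nearest-neighbor selection matrix and a velocity with local spatial mixing, the commutator is generally nonzero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 8, 2                        # 1-D toy latent of length 8, scale factor 2
m = n // s

# Nearest-neighbor downsampling matrix.
D = np.zeros((m, n))
for i in range(m):
    D[i, i * s] = 1.0

def v(x, t):
    # Toy velocity usable at any resolution; the local mixing from np.roll
    # is what makes D and v fail to commute, mimicking a real v_theta.
    return np.tanh(x + 0.5 * np.roll(x, 1))

t = 0.5
x = rng.standard_normal(n)
comm = D @ v(x, t) - v(D @ x, t)   # the commutator [D, v](x, t)
print(np.linalg.norm(comm))        # generally > 0: the condition fails
```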


We found that an appropriate choice of the downsampling matrix \( \mathbf{D} \) alone reduces \( \|[\mathbf{D}, v_\theta](x_t, t)\| \), improving alignment with HR images. However, non-binary \( \mathbf{D} \) may introduce noise correlation, and optimizing \( \mathbf{D} \) directly is computationally expensive.

For each spatial block of size \( s \times s \), we define a block-wise downsampling matrix:

\[ \mathbf{D}_{s\times s}:\mathbb{R}^{(s\times s)\times d} \rightarrow \mathbb{R}^{d}. \]

Aggregating the block-wise matrices over the spatial domain forms the global operator:

\[ \mathbf{D}_k \triangleq \Bigg( \bigoplus_{i=1}^{h/s} \;\bigoplus_{j=1}^{w/s} \mathbf{D}_{s\times s, k}^{(i,j)} \Bigg) \Pi, \quad k\in \{1,\dots,s^2\}, \]

where \( \Pi \) is a permutation matrix that rearranges the flattened latent into \( s\times s \) blocks.

The candidate set consists of selection matrices with pairwise disjoint supports:

\[ \mathcal{D}_{\text{down}} \triangleq \{\mathbf{D}_1,\dots,\mathbf{D}_{s^2}\}, \quad \mathbf{D}_i \odot \mathbf{D}_j = \mathbf{0} \;(i\neq j), \]

where \( \odot \) denotes the element-wise product.

We select the optimal matrix via:

\[ \mathbf{D}^\ast = \arg\min_{i=1,\dots,s^2} \|[\mathbf{D}_i, v_\theta](x_t, t)\|. \]
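The selection procedure can be sketched in a 1-D analogue with \( s \) candidates in place of \( s^2 \) (NumPy; the stand-in velocity and all sizes are illustrative assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 8, 2                        # 1-D toy latent of length 8, scale factor 2
m = n // s
t = 0.5

def v(x, t):
    # Illustrative stand-in for v_theta (not the real network); the np.roll
    # mixing makes the commutator norms differ across candidates.
    return np.tanh(x + 0.5 * np.roll(x, 1))

# 1-D analogue of the candidate set: D_k keeps the k-th sample of every
# length-s block, so the candidates have pairwise disjoint supports.
candidates = []
for k in range(s):
    Dk = np.zeros((m, n))
    for i in range(m):
        Dk[i, i * s + k] = 1.0
    candidates.append(Dk)

x = rng.standard_normal(n)

# Pick the candidate that minimizes the commutator norm, then downsample.
norms = [np.linalg.norm(Dk @ v(x, t) - v(Dk @ x, t)) for Dk in candidates]
D_star = candidates[int(np.argmin(norms))]
x_down = D_star @ x                # the downsampled latent x_t^down
```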

The downsampled latent is defined as:

\[ x_t^\downarrow \triangleq \mathbf{D}^\ast x_t. \]


Instead of using gradient backpropagation to update \( x_t \), we adopt a fixed-point iteration–like update rule:

\[ x_t^{\downarrow,k+1} = x_t^{\downarrow,k} + \alpha \left( \mathbf{D}^\ast v_\theta(x_t,t) - v_\theta(x_t^{\downarrow,k}, t) \right). \]

However, computing \( v_\theta(x_t,t) \) at every step is expensive. We therefore exploit the near-constancy of the velocity along a rectified-flow trajectory:

\[ v_\theta(x_t, t) \approx v_\theta(x_{t+\Delta t}, t+\Delta t). \]

Using this approximation, we reuse the previously computed velocity at timestep \( t_D \):

\[ x_t^{\downarrow,k+1} = x_t^{\downarrow,k} + \alpha \left( \mathbf{D}^\ast v_\theta(x_{t_D}, t_D) - v_\theta(x_t^{\downarrow,k}, t) \right). \]

This enables efficient updates using only \( v_\theta(x_t^{\downarrow,k}, t) \), which is substantially cheaper than evaluating \( v_\theta(x_t, t) \).
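The updates above can be sketched as follows (NumPy, 1-D toy; `alpha`, the linear stand-in velocity, and all sizes are our illustrative choices, not the actual method). One HR velocity evaluated at \( t_D \) is reused as the guidance target, and the loop needs only cheap LR evaluations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 8, 2                        # 1-D toy latent of length 8, scale factor 2
m = n // s
alpha, n_iters = 0.3, 50           # guidance step size / iterations (our choices)
t_D, t = 0.5, 0.6

def v(x, t):
    # Simple linear stand-in for v_theta with local mixing; it still does not
    # commute with D, and it lets the fixed-point iteration provably contract.
    return 0.5 * x + 0.25 * np.roll(x, 1)

# Selected downsampling matrix D* (nearest-neighbor selection).
D_star = np.zeros((m, n))
for i in range(m):
    D_star[i, i * s] = 1.0

x_tD = rng.standard_normal(n)

# One expensive HR evaluation at t_D, reused as D* v_theta(x_{t_D}, t_D).
target = D_star @ v(x_tD, t_D)

# Commutator-zero guidance: only cheap LR evaluations inside the loop.
x_down = D_star @ x_tD
res0 = np.linalg.norm(target - v(x_down, t))
for _ in range(n_iters):
    x_down = x_down + alpha * (target - v(x_down, t))
res = np.linalg.norm(target - v(x_down, t))
print(res < res0)                  # True: the residual shrinks
```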


Overall framework

Overall framework. (Left, Top) Overview of our proposed framework. Sampling is first performed in the high-resolution (HR) space up to timestep \(t_D\), after which downsampling is applied using the downsampling matrix selected in (Right, Top). To maintain alignment with HR sampling, commutator-zero guidance is applied as shown in (Right, Bottom). Finally, as illustrated in (Left, Bottom), our method produces a low-resolution image with a lower DreamSim score, indicating better LR-HR consistency.



Experiments


Quantitative comparison on FLUX.1-dev and SD3.5-L. Using the FLUX.1-dev and Stable Diffusion 3.5-Large (SD3.5-L) models with \( \text{NFE}=30 \), we generate HR (\(1024\times1024\)) reference images. We compare three variants: (i) reduced-NFE generation, (ii) LR (\(512\times512\)) generation with the same NFE, and (iii) a naïve baseline applying nearest-neighbor downsampling at timestep \( t_D \). Each method is evaluated for computational efficiency, image quality, and consistency with the reference. Numbers in parentheses denote the NFE used; if omitted, \( \text{NFE}=30 \) is used as in the reference. The best performance is shown in bold and the second best is underlined.



Qualitative comparison of our proposed method. While other simple alternatives often result in changes to composition, object size, or even color tone, our proposed approach synthesizes low-resolution images faster while preserving the composition and color fidelity of the original image. The prompts used for image generation are provided in the supplementary materials.



Generalization of commutator-zero guidance. We show that commutator-zero guidance can be extended to other operations. For warping, a large kernel (\(128\times128\)) with correlation correction produces distortions, which our method effectively suppresses. For translation, the naïve approach causes noticeable differences and unintended objects, whereas ours preserves the image content.


We also conducted video synthesis experiments with HunyuanVideo and found that the reduced-timestep baseline (NFE 50→30) produces increasingly inconsistent compositions as the number of frames grows, while our method generates perceptually consistent LR videos with a 1.75\(\times\) speedup (1,661 s for the original, 1,025 s for the reduced baseline, and 951 s for ours on 120-frame generation).