The arXiv preprint is not yet publicly available.
Image generative models have become indispensable tools for producing exquisite high-resolution (HR) images for everyone, from general users to professional designers. However, reaching a desired outcome often requires generating many HR images with different prompts and seeds, resulting in high computational cost for both users and service providers. Generating low-resolution (LR) images first could alleviate this burden, but it is not straightforward to generate LR images that are perceptually consistent with their HR counterparts. Here, we consider the task of generating high-fidelity LR images, called Previews, that preserve perceptual similarity to their HR counterparts, enabling an efficient workflow in which users identify promising candidates before generating the final HR image. We propose the commutator-zero condition to ensure LR-HR perceptual consistency for flow matching models, leading to a training-free solution with downsampling matrix selection and commutator-zero guidance. Extensive experiments show that our method generates LR images with up to 33% computation reduction while maintaining HR perceptual consistency. When combined with existing acceleration techniques, our method achieves up to 3\(\times\) speedup. Moreover, our formulation extends to image manipulations such as warping and translation, demonstrating its generalizability.
Why not super-resolution?
Compliance and commutativity
Compliance of the trajectory.
To assess the feasibility of producing an LR image identical to the downsampled version of its HR counterpart, we define a flow ODE over the downsampled trajectory \( x_t^\downarrow \) as follows:
\[ d x_t^\downarrow = \mathbf{D} \, v_\theta(x_t, t)\, dt, \]
where \( \mathbf{D} \in \mathbb{R}^{\frac{hw}{s^2} \times hw} \) is a downsampling matrix for images of height \( h \) and width \( w \) with scale factor \( s \). Let \( x_1^\downarrow \) and \( x_1 \) denote the final LR and HR images obtained from the trajectories \( \{x_t^\downarrow\} \) and \( \{x_t\} \), respectively. For the Preview generation task to be valid, the compliance condition \( x_1^\downarrow = \mathbf{D} x_1 \) should hold. Although Zhang et al.~\cite{zhang2024flow} noted that this condition is not strictly satisfied in learned flow matching models, they reported strong empirical performance under this assumption. Similarly, we observe that assuming compliance for the downsampling operator yields consistent and effective results.
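As a sanity check on the definition above: when the LR trajectory is initialized at \( \mathbf{D}x_0 \) and driven by the projected HR velocity \( \mathbf{D}v_\theta(x_t, t) \), it tracks \( \mathbf{D}x_t \) exactly, by linearity of \( \mathbf{D} \). The Euler-integration sketch below illustrates this with a hypothetical toy field standing in for the learned \( v_\theta \); the practical difficulty, addressed next, is that this still requires evaluating the HR trajectory.

```python
import numpy as np

# Toy illustration of compliance: an LR trajectory initialized at D x_0 and
# driven by the *projected* HR velocity D v(x_t, t) tracks D x_t exactly,
# by linearity of D. "v" is a hypothetical toy field, not the actual model.

def v(x, t):
    # Hypothetical contractive field with a weak global-mean coupling.
    return (1.0 - t) * (x + 0.1 * np.tanh(x.mean()))

def nearest_D(h, w, s):
    # Binary nearest-neighbor downsampling matrix D in R^{(hw/s^2) x hw},
    # keeping the top-left pixel of each s x s block.
    D = np.zeros((h * w // s**2, h * w))
    r = 0
    for i in range(0, h, s):
        for j in range(0, w, s):
            D[r, i * w + j] = 1.0
            r += 1
    return D

h = w = 8
s = 2
D = nearest_D(h, w, s)
x = np.random.default_rng(0).standard_normal(h * w)  # x_0 (flattened HR latent)
x_down = D @ x                                       # x_0^down = D x_0

dt = 1.0 / 30
for k in range(30):
    t = k * dt
    vel = v(x, t)
    x_down = x_down + dt * (D @ vel)  # Euler step on the projected velocity
    x = x + dt * vel                  # Euler step on the HR trajectory

print(np.allclose(x_down, D @ x))  # True: x_1^down = D x_1 by linearity
```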
Commutativity of flow matching.
We aim to quickly synthesize the downsampled image that preserves the content of the original high-resolution image. To this end, we accelerate sampling by feeding the learned flow matching model \( v_\theta \) with the downsampled latent \( \mathbf{D}x_t \) instead of the high-resolution latent \( x_t \). However, this substitution raises a critical issue: Does the following commutator condition hold?
\[ [\mathbf{D}, v_\theta](x_t, t) \triangleq \mathbf{D} v_\theta(x_t, t) - v_\theta(\mathbf{D}x_t, t) \stackrel{?}{=} \mathbf{0}. \]
In general, this commutator-zero condition does not hold. Thus, we propose a method that minimizes the norm \( \|[\mathbf{D}, v_\theta](x_t, t)\| \) with respect to the controllable elements \( \mathbf{D} \) and \( x_t \).
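To see why the condition fails in general: any spatial mixing in \( v_\theta \) (convolutions, attention) breaks commutation with \( \mathbf{D} \), since the LR input does not carry the pixels that the HR computation mixes in. A minimal numerical check, with a hypothetical resolution-agnostic toy field in place of \( v_\theta \):

```python
import numpy as np

# Minimal check that [D, v](x, t) = D v(x, t) - v(D x, t) is nonzero once
# v mixes spatial information. "v" is a hypothetical resolution-agnostic
# toy field standing in for the learned v_theta.

def v(x, t):
    # The identity part commutes with a binary selection matrix D;
    # the global-mean coupling (a crude form of spatial mixing) does not.
    return (1.0 - t) * (x + 0.1 * np.tanh(x.mean()))

def nearest_D(h, w, s):
    # Binary nearest-neighbor downsampling keeping the top-left pixel per block.
    D = np.zeros((h * w // s**2, h * w))
    r = 0
    for i in range(0, h, s):
        for j in range(0, w, s):
            D[r, i * w + j] = 1.0
            r += 1
    return D

h = w = 8
s = 2
D = nearest_D(h, w, s)
x = np.linspace(-1.0, 1.0, h * w)  # deterministic test latent

comm = D @ v(x, 0.5) - v(D @ x, 0.5)
print(np.linalg.norm(comm) > 1e-6)  # True: the commutator is not zero
```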
Downsampling Matrix Selection
We found that an appropriate choice of the downsampling matrix \( \mathbf{D} \) alone reduces \( \|[\mathbf{D}, v_\theta](x_t, t)\| \), improving alignment with HR images. However, non-binary \( \mathbf{D} \) may introduce noise correlation, and optimizing \( \mathbf{D} \) directly is computationally expensive.
For each spatial block of size \( s \times s \), we define a block-wise downsampling matrix:
\[ \mathbf{D}_{s\times s}:\mathbb{R}^{(s\times s)\times d} \rightarrow \mathbb{R}^{d}. \]
Aggregating block-wise matrices over the spatial domain forms the global operator:
\[ \mathbf{D}_k \triangleq \Bigg( \bigoplus_{i=1}^{h/s} \;\bigoplus_{j=1}^{w/s} \mathbf{D}_{s\times s, k}^{(i,j)} \Bigg) \Pi, \quad k\in \{1,\dots,s^2\}, \]
where \( \Pi \) is a permutation matrix that groups pixels into \( s\times s \) blocks.
The candidate set is defined as:
\[ \mathcal{D}_{\text{down}} \triangleq \{\mathbf{D}_1,\dots,\mathbf{D}_{s^2}\}, \quad \mathbf{D}_i \odot \mathbf{D}_j = \mathbf{0} \;(i\neq j). \]
We select the optimal matrix via:
\[ \mathbf{D}^\ast = \arg\min_{i=1,\dots,s^2} \|[\mathbf{D}_i, v_\theta](x_t, t)\|. \]
The downsampled latent is defined as:
\[ x_t^\downarrow \triangleq \mathbf{D}^\ast x_t. \]
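The selection step can be sketched as follows, assuming \( s^2 \) phase-shifted binary (nearest-neighbor) candidates and the same kind of hypothetical toy field in place of \( v_\theta \):

```python
import numpy as np

# Sketch of downsampling-matrix selection: among s^2 phase-shifted binary
# downsamplers D_1..D_{s^2} (disjoint supports, so D_i * D_j = 0 for i != j),
# pick the one with the smallest commutator norm ||D_i v(x, t) - v(D_i x, t)||.
# "v" is a hypothetical toy field standing in for the learned v_theta.

def v(x, t):
    return (1.0 - t) * (x + 0.1 * np.tanh(x.mean()))

def phase_D(h, w, s, oy, ox):
    # Binary downsampler keeping pixel (oy, ox) inside each s x s block.
    D = np.zeros((h * w // s**2, h * w))
    r = 0
    for i in range(0, h, s):
        for j in range(0, w, s):
            D[r, (i + oy) * w + (j + ox)] = 1.0
            r += 1
    return D

def select_D(x, t, h, w, s):
    candidates = [phase_D(h, w, s, oy, ox)
                  for oy in range(s) for ox in range(s)]
    norms = [np.linalg.norm(D @ v(x, t) - v(D @ x, t)) for D in candidates]
    return candidates[int(np.argmin(norms))], norms

h = w = 8
s = 2
x = np.linspace(-1.0, 1.0, h * w)
D_star, norms = select_D(x, 0.5, h, w, s)
x_down = D_star @ x  # the downsampled latent x_t^down
```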
Commutator-Zero Guidance
Instead of using gradient backpropagation to update \( x_t \), we adopt a fixed-point iteration–like update rule:
\[ x_t^{\downarrow,k+1} = x_t^{\downarrow,k} + \alpha \left( \mathbf{D}^\ast v_\theta(x_t,t) - v_\theta(x_t^{\downarrow,k}, t) \right). \]
However, computing \( v_\theta(x_t,t) \) at every step is expensive. We therefore leverage the near-straightness of rectified-flow trajectories, along which the velocity is approximately constant:
\[ v_\theta(x_t, t) \approx v_\theta(x_{t+\Delta t}, t+\Delta t). \]
Using this approximation, we reuse the previously computed velocity at timestep \( t_D \):
\[ x_t^{\downarrow,k+1} = x_t^{\downarrow,k} + \alpha \left( \mathbf{D}^\ast v_\theta(x_{t_D}, t_D) - v_\theta(x_t^{\downarrow,k}, t) \right). \]
This enables efficient updates using only \( v_\theta(x_t^{\downarrow,k}, t) \), which is substantially cheaper than evaluating \( v_\theta(x_t, t) \).
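The update loop can be sketched as below; the toy field, step size \( \alpha \), and iteration count are illustrative assumptions, and the cached target stands in for \( \mathbf{D}^\ast v_\theta(x_{t_D}, t_D) \):

```python
import numpy as np

# Sketch of commutator-zero guidance: fixed-point-like updates that move the
# LR latent toward consistency with a cached projected HR velocity.
# "v" is a hypothetical toy field standing in for v_theta; alpha and n_iters
# are illustrative choices, not the paper's settings.

def v(x, t):
    return (1.0 - t) * (x + 0.1 * np.tanh(x.mean()))

def guidance(x_down, target, t, alpha=0.5, n_iters=3):
    # target stands in for D* v(x_{t_D}, t_D), the projected HR velocity
    # cached at an earlier timestep t_D and reused via the rectified-flow
    # approximation v(x_t, t) ~ v(x_{t+dt}, t+dt).
    for _ in range(n_iters):
        x_down = x_down + alpha * (target - v(x_down, t))
    return x_down

rng = np.random.default_rng(3)
x0 = rng.standard_normal(16)                         # LR latent x_t^{down,0}
target = v(x0, 0.4) + 0.1 * rng.standard_normal(16)  # stand-in cached velocity

before = np.linalg.norm(target - v(x0, 0.5))
x1 = guidance(x0, target, 0.5)
after = np.linalg.norm(target - v(x1, 0.5))
print(after < before)  # True: the velocity-mismatch residual shrinks
```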
Overall Framework
Quantitative Results
Quantitative comparison on FLUX.1-dev and SD3.5-L. Using the FLUX.1-dev and Stable Diffusion 3.5-Large (SD3.5-L) models with \( \text{NFE}=30 \), we generate HR (\(1024\times1024\)) reference images. We compare three variants: (i) reduced-NFE generation, (ii) LR (\(512\times512\)) generation with the same NFE, and (iii) a naïve baseline applying nearest downsampling at timestep \( t_D \). Each method is evaluated for computational efficiency, image quality, and consistency with the reference. Numbers in parentheses denote the NFE used; if omitted, \( \text{NFE}=30 \) is applied as in the reference. We denote the best performance in bold and second-best performance with underline.
Qualitative Results
Qualitative comparison of our proposed method. While other simple alternatives often result in changes to composition, object size, or even color tone, our proposed approach synthesizes low-resolution images faster while preserving the composition and color fidelity of the original image. The prompts used for image generation are provided in the supplementary materials.
Generalization of Commutator-zero Guidance
Generalization of commutator-zero guidance. We show that commutator-zero guidance extends to other operations. For warping, a large kernel (\(128\times128\)) with correlation correction produces distortions, while our method effectively suppresses these artifacts. For translation, the naïve approach causes noticeable differences and unintended objects, whereas ours preserves the image content.
Video Previews
We conducted video synthesis experiments with HunyuanVideo and found that the reduced-timestep baseline (NFE 50→30) produces inconsistent compositions as the number of frames increases, while our method generates perceptually consistent LR videos with a 1.75\(\times\) speedup (1,661 sec (Original) / 1,025 sec (Reduced) / 951 sec (Ours) for 120-frame generation).