Skrr: Skip and Re-use Text Encoder Layers for
Memory Efficient Text-to-Image Generation

ICML 2025

1Dept. of Electrical and Computer Engineering, Seoul National University, Republic of Korea 2School of Electrical and Computer Engineering, Cornell Tech, USA 3INMC & IPAI, Seoul National University, Republic of Korea

Abstract

Large-scale text encoders in text-to-image (T2I) diffusion models have demonstrated exceptional performance in generating high-quality images from textual prompts. Unlike denoising modules that rely on multiple iterative steps, text encoders require only a single forward pass to produce text embeddings. However, despite their minimal contribution to total inference time and floating-point operations (FLOPs), text encoders demand significantly higher memory usage, up to eight times more than denoising modules. To address this inefficiency, we propose Skip and Re-use layers (Skrr), a simple yet effective pruning strategy specifically designed for text encoders in T2I diffusion models. Skrr exploits the inherent redundancy in transformer blocks by selectively skipping or reusing certain layers in a manner tailored for T2I tasks, thereby reducing memory consumption without compromising performance. Extensive experiments demonstrate that Skrr maintains image quality comparable to the original model even under high sparsity levels, outperforming existing blockwise pruning methods. Furthermore, Skrr achieves state-of-the-art memory efficiency while preserving performance across multiple evaluation metrics, including the FID, CLIP, DreamSim, and GenEval scores.

Introduction


(a) FLOPs distribution during image generation in Stable Diffusion 3 (SD3). (b) Parameter distribution across modules in SD3. The text encoders contribute less than 0.5% of the overall FLOPs but account for over 70% of the total model parameters. For the VAE, only the decoder was considered.


Method


Visualization of the overall framework of Skrr. (a) shows the \textit{Skip} phase, which repeatedly assesses each sub-block by measuring the output discrepancy (Disc.) between the dense and skipped models on a calibration dataset (Calib. data). To account for interactions between blocks, it keeps the top-k candidates with the smallest discrepancies and refines the selection with beam search. (b) presents the \textit{Re-use} phase, which evaluates whether recycling a remaining block in place of a skipped sub-block yields a smaller output discrepancy. If so, hidden states are fed back through the chosen layers. This two-phase approach reduces model size with minimal T2I performance loss.
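The Re-use decision described above can be sketched as a simple per-block comparison. This is a hypothetical illustration, not the paper's implementation: `disc_skip` and `disc_reuse` stand in for discrepancies that would be measured on calibration data, and all names are assumptions.

```python
def reuse_decision(skipped, remaining, disc_skip, disc_reuse):
    """Toy sketch of the Re-use phase: for each skipped sub-block,
    compare the discrepancy of plain skipping against re-using each
    remaining block, and keep whichever option is smaller.

    skipped           : indices of sub-blocks removed in the Skip phase
    remaining         : indices of sub-blocks still in the model
    disc_skip[i]      : discrepancy when block i is simply skipped (hypothetical)
    disc_reuse[(i,j)] : discrepancy when remaining block j is re-used
                        in place of skipped block i (hypothetical)
    """
    plan = {}
    for i in skipped:
        best_j, best_d = None, disc_skip[i]
        for j in remaining:
            d = disc_reuse.get((i, j), float("inf"))
            if d < best_d:
                best_j, best_d = j, d
        plan[i] = best_j  # None means the plain skip is kept
    return plan
```

With measured discrepancies in hand, the decision is a pure lookup: a block is re-used only when doing so strictly reduces the output discrepancy relative to skipping.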


Algorithm


Before delving into the specifics, we define two key discrepancy metrics: \( D_{f_c} \), derived from \( \text{Metric}_2 \), and \( D_{f_\varnothing} \), representing \( \text{Metric}_2 \) with null inputs. The algorithm is inspired by beam search: it iterates over each unskipped sub-block while maintaining the \( k \) beams with the smallest sum of \( D_{f_c} \) and \( D_{f_\varnothing} \). The process repeats from these \( k \) beams, iteratively updating them toward smaller \( D \). After traversing all blocks, the algorithm produces a list of skip indices \( \mathcal{S}^* \) that captures inter-block interactions. Finally, blocks are pruned sequentially according to \( \mathcal{S}^* \) until the desired sparsity is reached.
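The beam-search selection above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `run_model` is a hypothetical callable returning the combined discrepancy \( D_{f_c} + D_{f_\varnothing} \) for a given skip set, and the calibration data is treated as opaque.

```python
def beam_search_skip(blocks, run_model, calib_data, num_skip, beam_width=2):
    """Sketch of selecting sub-blocks to skip via beam search.

    blocks     : indices of sub-blocks eligible for skipping
    run_model  : run_model(calib_data, skip_set) -> total discrepancy D
                 (hypothetical; stands in for D_fc + D_f0 on calibration data)
    calib_data : calibration dataset
    num_skip   : number of sub-blocks to remove
    beam_width : k, number of candidate skip sets kept per round
    """
    beams = [(frozenset(), 0.0)]  # (skip set, its discrepancy)
    for _ in range(num_skip):
        candidates = {}
        for skip_set, _ in beams:
            for b in blocks:
                if b in skip_set:
                    continue
                new_set = skip_set | {b}
                if new_set not in candidates:
                    candidates[new_set] = run_model(calib_data, new_set)
        # keep the k skip sets with the smallest total discrepancy
        beams = sorted(candidates.items(), key=lambda kv: kv[1])[:beam_width]
    best_set, best_disc = beams[0]
    return sorted(best_set), best_disc
```

Because each round expands every surviving beam before re-ranking, a block that looks suboptimal in isolation can still be chosen when it combines well with earlier skips, which is the inter-block interaction the paper's \( \mathcal{S}^* \) is meant to capture.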


Results

Quantitative Results


Quantitative comparisons of Skrr with baselines. We compared Skrr against ShortGPT, LaCo, and FinerCut under three different sparsity scenarios on PixArt-\( \Sigma \). The results show that Skrr reliably maintains image fidelity and performs comparably to the dense model across all sparsity levels. Unlike ShortGPT and LaCo, FinerCut and Skrr prune at the sub-block level, which prevents exact sparsity alignment; sparsity levels were matched as closely as possible for fair evaluation, where sparsity reflects the fraction of the dense encoder's parameters removed. (\( \uparrow / \downarrow \) denotes that a higher / lower metric is favorable.)

Qualitative Results


Comparison of images generated with baseline and Skrr-compressed text encoders across PixArt-\( \Sigma \), Stable Diffusion 3 (SD3), and FLUX.1-dev. At low sparsity (level 1 — 24.3% for ShortGPT and LaCo, 26.3% for FinerCut, and 27.0% for Skrr), all methods perform comparably to the dense models, but Skrr outperforms the baselines at higher sparsity (level 2 — 32.4% for ShortGPT and LaCo, 32.2% for FinerCut, and 32.4% for Skrr; level 3 — 40.5% for ShortGPT and LaCo, 41.7% for FinerCut, and 41.9% for Skrr), maintaining alignment with the dense model and preserving prompt details such as "glasses", "colorful apron", and "paint-splattered hands", where the baseline methods fail.

Ablation Study


Ablation study on Re-use. Without Re-use, Skip alone leads to images that often misalign with the prompt, while Re-use ensures more faithful adherence to the prompt.