
Latent Diffusion Models: High-Resolution Image Synthesis

How Latent Diffusion Models made high-resolution image generation practical by moving diffusion to a compressed latent space — the architecture behind Stable Diffusion.

Robin Rombach, Andreas Blattmann +3 · 15 min read · Original Paper · generative-models · diffusion · latent-space +4

TL;DR

Latent Diffusion Models solve the biggest problem with diffusion-based image generation: computational cost. Standard diffusion models like DDPM operate directly on pixel space, processing 786,432 values (512×512×3) at every single denoising step. LDMs fix this by first compressing images into a compact latent space using a pretrained VAE — reducing the representation from 512×512×3 down to 64×64×4, a 48× compression. The diffusion process then runs entirely in this latent space, achieving the same perceptual quality at a fraction of the compute. Add cross-attention layers to condition on text embeddings, and you get the architecture behind Stable Diffusion — the model that made high-quality text-to-image generation accessible to everyone.

The Problem: Diffusion Is Expensive

Diffusion models produce remarkable image quality by learning to reverse a gradual noising process. Starting from pure Gaussian noise, a neural network (typically a U-Net) iteratively predicts and removes noise over many steps, eventually producing a clean image. The training objective is elegant:

$$\mathcal{L}_{DM} = \mathbb{E}_{x,\, \varepsilon \sim \mathcal{N}(0,1),\, t} \left[ \| \varepsilon - \varepsilon_\theta(x_t, t) \|_2^2 \right]$$

But there’s a catch. When diffusion operates directly in pixel space, every denoising step processes the full-resolution image. For a 512×512×3 image, that’s 786,432 values through the U-Net at each of the 50–1000 denoising steps. Training requires hundreds of GPU-days on high-end hardware. Generating a single image takes minutes. And scaling to higher resolutions is quadratically expensive — doubling resolution quadruples compute.

This computational burden meant that, before LDMs, high-resolution diffusion was practical only for well-resourced research labs. The question was: can we preserve diffusion’s quality while dramatically reducing its computational requirements?

The Solution: Move to Latent Space

The key insight of LDMs is a separation of concerns. Image generation involves two distinct phases: perceptual compression (learning a compact representation that captures visual structure) and semantic generation (learning the distribution of meaningful images). Pixel-space diffusion conflates these two phases — the model must simultaneously learn what images look like at a low level and how to generate semantically meaningful content.

LDMs separate these phases by introducing a two-stage approach:

  1. Stage 1: Train a VAE (Variational Autoencoder) to compress images into a compact latent space. The encoder maps 512×512×3 images to 64×64×4 latent representations, and the decoder reconstructs them back. This is trained once and frozen.

  2. Stage 2: Train a diffusion model to operate entirely within this latent space. The U-Net processes 64×64×4 tensors instead of 512×512×3 tensors — a 48× reduction in dimensionality.
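As a shape-level sketch of how the two stages fit together (the `encode` and `decode` functions here are hypothetical stand-ins for the frozen VAE, not a real implementation):

```python
import numpy as np

F = 8          # spatial downsampling factor (512 -> 64)
C_LATENT = 4   # number of latent channels produced by the VAE

def encode(image):
    """Hypothetical VAE encoder: (H, W, 3) -> (H/F, W/F, C_LATENT)."""
    h, w, _ = image.shape
    return np.zeros((h // F, w // F, C_LATENT))

def decode(latent):
    """Hypothetical VAE decoder: inverse spatial mapping back to pixels."""
    h, w, _ = latent.shape
    return np.zeros((h * F, w * F, 3))

image = np.zeros((512, 512, 3))
z = encode(image)                       # Stage 1: frozen VAE compresses the image
assert z.shape == (64, 64, 4)           # Stage 2: diffusion sees only this tensor
assert decode(z).shape == (512, 512, 3)
print(image.size // z.size)             # 48x fewer values per denoising step
```

The decisive point is the assert in the middle: the U-Net never touches a 512×512×3 tensor during training or sampling, only the 64×64×4 latent.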

The diffusion training objective becomes:

$$\mathcal{L}_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, \varepsilon \sim \mathcal{N}(0,1),\, t} \left[ \| \varepsilon - \varepsilon_\theta(z_t, t) \|_2^2 \right]$$

where $z_t$ is the noised latent representation and $\mathcal{E}(x)$ is the VAE encoder output.
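A minimal numpy sketch of one training step under this objective, with a toy `eps_model` standing in for the U-Net (both the model and the noise-schedule value here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_model(z_t, t):
    """Toy stand-in for the U-Net noise predictor eps_theta."""
    return 0.1 * z_t  # the real model is a conv/attention network

def ldm_training_step(z0, t, alpha_bar_t):
    """One step of the latent diffusion objective L_LDM."""
    eps = rng.standard_normal(z0.shape)             # eps ~ N(0, I)
    # forward noising: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1 - alpha_bar_t) * eps
    loss = np.mean((eps - eps_model(z_t, t)) ** 2)  # || eps - eps_theta ||^2
    return loss

z0 = rng.standard_normal((64, 64, 4))  # z0 = E(x), the VAE latent of an image
loss = ldm_training_step(z0, t=500, alpha_bar_t=0.3)
print(loss)  # scalar MSE, minimized over random (eps, t) pairs during training
```

Note the inputs: the model only ever sees 64×64×4 latents, never the original pixels.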

Why Latent Space Works

The critical question is whether 48× compression loses too much information. The answer lies in what the VAE learns to discard. Natural images contain enormous amounts of high-frequency detail — pixel-level noise, imperceptible texture variations, compression artifacts — that humans don’t perceive. A well-trained VAE with perceptual and adversarial losses learns to encode exactly the information that matters for visual perception, discarding the rest.

The numbers tell the story. Pixel-space diffusion processes 786,432 values per step; latent-space diffusion processes 16,384. Across 50 denoising steps, that is the difference between ~39.3 million and ~819 thousand values pushed through the network per image, roughly a 48× reduction. In practice, the savings are even larger because the U-Net's computational cost scales super-linearly with spatial resolution due to self-attention layers.
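These counts are simple arithmetic to verify:

```python
pixel_vals = 512 * 512 * 3   # values per denoising step in pixel space
latent_vals = 64 * 64 * 4    # values per denoising step in latent space
steps = 50

print(pixel_vals)                  # 786432
print(latent_vals)                 # 16384
print(pixel_vals // latent_vals)   # 48  (the compression factor)
print(pixel_vals * steps)          # 39321600  (~39.3 million per image)
print(latent_vals * steps)         # 819200    (~819 thousand per image)
```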

The VAE: Perceptual Compression

The VAE is trained with a combination of losses that ensure high-fidelity reconstruction while learning a well-structured latent space:

$$\mathcal{L}_{VAE} = \| x - D(E(x)) \|_2^2 + \lambda_{KL} \cdot D_{KL}\big(q(z|x) \,\|\, p(z)\big) + \lambda_{perc} \cdot \mathcal{L}_{perc} + \lambda_{adv} \cdot \mathcal{L}_{adv}$$

The reconstruction loss $\| x - D(E(x)) \|_2^2$ ensures the decoder can reconstruct the input from the latent. The KL divergence term regularizes the latent distribution to be close to a standard Gaussian $\mathcal{N}(0, 1)$, which is important because the diffusion process starts from Gaussian noise and needs a well-behaved latent space to denoise into. The perceptual loss ($\mathcal{L}_{perc}$) compares deep features rather than raw pixels, ensuring that reconstructions are perceptually similar even if they differ at the pixel level. The adversarial loss ($\mathcal{L}_{adv}$) adds a discriminator to encourage sharp, realistic reconstructions.
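A sketch of how the four terms combine, assuming the perceptual and adversarial terms arrive as precomputed scalars (in practice they come from LPIPS-style deep features and a discriminator), with illustrative weights apart from the small KL weight the paper uses:

```python
import numpy as np

def vae_loss(x, x_rec, mu, logvar, perc, adv,
             lam_kl=1e-6, lam_perc=1.0, lam_adv=0.5):
    """Weighted sum of the four VAE training terms.

    lam_kl ~ 1e-6 follows the paper; lam_perc and lam_adv are
    illustrative placeholders, not the paper's exact values.
    """
    rec = np.mean((x - x_rec) ** 2)  # pixel-space reconstruction error
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian,
    # averaged per element here for simplicity
    kl = -0.5 * np.mean(1 + logvar - mu ** 2 - np.exp(logvar))
    return rec + lam_kl * kl + lam_perc * perc + lam_adv * adv

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))
mu, logvar = rng.standard_normal((4, 4, 4)), np.zeros((4, 4, 4))
print(vae_loss(x, x + 0.01, mu, logvar, perc=0.2, adv=0.1))
```

With a perfect reconstruction and a latent already at $\mathcal{N}(0,1)$, every term vanishes and the loss is zero, which is a quick sanity check for the formula.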

The paper experiments with different downsampling factors f and finds that f = 8 (spatial downsampling from 512 to 64) provides the best trade-off. Lower compression (f = 4) preserves more detail but reduces computational savings. Higher compression (f = 16 or f = 32) saves more compute but loses perceptually important information, degrading generation quality.

The KL regularization weight is kept small ($\lambda_{KL} \approx 10^{-6}$) to avoid excessive smoothing of the latent space, which would compromise reconstruction quality. This "KL-regularized" variant (used in Stable Diffusion) strikes a balance between latent space structure and reconstruction fidelity.

Conditioning with Cross-Attention

To enable text-to-image generation, LDMs introduce cross-attention layers into the U-Net architecture. A frozen text encoder (CLIP or BERT) processes the text prompt into a sequence of token embeddings $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$, where $M$ is the number of tokens and $d_\tau$ is the embedding dimension.

Inside each U-Net block, cross-attention layers allow every spatial position in the latent to attend to all text token embeddings:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right) V$$

where the queries come from the latent features and the keys/values come from the text embeddings:

$$Q = W_Q \cdot \varphi_i(z_t), \quad K = W_K \cdot \tau_\theta(y), \quad V = W_V \cdot \tau_\theta(y)$$

Here $\varphi_i(z_t)$ is the intermediate representation of the U-Net at layer $i$, flattened to a sequence of spatial tokens. Each spatial position produces a query that attends to all text tokens, learning which parts of the prompt are relevant for generating content at that location.
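A minimal numpy sketch of this cross-attention step. The dimensions (77 text tokens, 320 latent channels) echo typical Stable Diffusion values, but the projection matrices here are random placeholders rather than learned weights:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phi, tau, W_q, W_k, W_v):
    """phi: (N, d_model) flattened spatial tokens; tau: (M, d_tau) text tokens."""
    Q = phi @ W_q                         # queries come from the latent features
    K = tau @ W_k                         # keys come from the text embeddings
    V = tau @ W_v                         # values come from the text embeddings
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (N, M): each position vs. each token
    return attn @ V                       # (N, d): text-informed features

rng = np.random.default_rng(0)
N, M, d_model, d_tau, d = 64 * 64, 77, 320, 768, 40
phi = rng.standard_normal((N, d_model))   # flattened 64x64 latent features
tau = rng.standard_normal((M, d_tau))     # e.g. 77 CLIP token embeddings
out = cross_attention(
    phi, tau,
    rng.standard_normal((d_model, d)) / np.sqrt(d_model),
    rng.standard_normal((d_tau, d)) / np.sqrt(d_tau),
    rng.standard_normal((d_tau, d)) / np.sqrt(d_tau),
)
assert out.shape == (N, d)  # one text-conditioned vector per spatial position
```

The key asymmetry is visible in the shapes: the attention matrix is (4096 spatial positions × 77 text tokens), so every pixel of the latent independently decides which words matter to it.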

This mechanism is remarkably flexible. Semantic tokens like “car” or “road” develop focused attention patterns that target specific spatial regions, while function words like “a” or “on” produce diffuse attention. The model learns these spatial correspondences entirely from image-caption pairs — no explicit spatial supervision is provided.

Cross-attention is also the mechanism that makes LDMs general-purpose conditional generators. The same architecture can condition on text, semantic maps, bounding boxes, or any other modality by simply changing the conditioning encoder τ_θ.

Classifier-Free Guidance

At inference time, LDMs use classifier-free guidance (CFG) to control the trade-off between sample diversity and fidelity to the text prompt. During training, the text condition is randomly dropped (replaced with a null token) with probability puncond (typically 10–20%). This trains the model to generate both conditionally and unconditionally.

At inference, each denoising step runs the model twice: once with the text prompt ($\varepsilon_\theta(z_t, t, y)$) and once without ($\varepsilon_\theta(z_t, t, \varnothing)$). The final noise prediction extrapolates beyond the conditional prediction:

$$\tilde{\varepsilon}_\theta(z_t, t, y) = \varepsilon_\theta(z_t, t, \varnothing) + w \cdot \big(\varepsilon_\theta(z_t, t, y) - \varepsilon_\theta(z_t, t, \varnothing)\big)$$

The guidance scale w controls the strength. At w = 1, the output is identical to standard conditional sampling. As w increases, the model produces outputs that more strongly match the text prompt but with less diversity. The practical sweet spot is around w = 7.5 — strong enough for high-fidelity text alignment without introducing the oversaturation and artifacts that appear at extreme values like w = 20.
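The guidance rule itself is one line. A sketch with random stand-ins for the two noise predictions:

```python
import numpy as np

def cfg_prediction(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate past the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_cond = rng.standard_normal((64, 64, 4))    # eps_theta(z_t, t, y)
eps_uncond = rng.standard_normal((64, 64, 4))  # eps_theta(z_t, t, null)

# w = 1 recovers plain conditional sampling exactly
assert np.allclose(cfg_prediction(eps_cond, eps_uncond, 1.0), eps_cond)
# w = 7.5 (the common default) pushes further in the prompt's direction
guided = cfg_prediction(eps_cond, eps_uncond, 7.5)
```

Geometrically, the guided prediction moves along the line from the unconditional to the conditional prediction and then keeps going, which is why large $w$ over-sharpens prompt adherence at the cost of diversity.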

The cost of CFG is doubling the computation per denoising step (two forward passes). But because LDMs already operate in the compressed latent space, this overhead is far more manageable than it would be in pixel space.

How LDM Compares

Generative Model Comparison

How Latent Diffusion compares to pixel-space diffusion, GANs, and autoregressive methods across key dimensions.

| Dimension | Latent Diffusion (LDM) | Pixel Diffusion (DDPM) | GAN (StyleGAN) | Autoregressive (DALL-E 1) |
| --- | --- | --- | --- | --- |
| Compute efficiency | 48× compression | Full-res every step | Single forward pass | Sequential tokens |
| Image quality | SOTA FID scores | High fidelity | Sharp outputs | Discretization artifacts |
| Text control | Cross-attention | Classifier guidance | No native support | Token conditioning |
| Training stability | Stable convergence | Well-understood | Mode collapse risk | Standard LM training |
| Resolution scaling | Resolution-agnostic latent | Quadratic cost | Progressive growing | Token count scales |
LDM excels at
  • Compute efficiency — 48× compression means diffusion operates on 16K values instead of 786K per step
  • Text-guided generation — cross-attention provides fine-grained spatial control over what each text token generates
  • Resolution scaling — the latent space stays fixed at 64×64 regardless of output resolution, with the VAE handling up/downscaling
Trade-offs to consider
  • Two-stage training — the VAE must be trained separately before the diffusion model, adding pipeline complexity
  • VAE reconstruction ceiling — output quality is fundamentally bounded by how well the VAE decoder reconstructs from the latent
  • Multi-step inference — unlike GANs (single pass), LDMs require 20–50 denoising steps at generation time

Key Takeaways

  1. Separation of perceptual and semantic compression is the core insight. By training a VAE to handle perceptual compression separately, the diffusion model can focus entirely on learning the semantic distribution of images in a compact latent space. This decoupling reduces computational cost by ~48× without sacrificing perceptual quality.

  2. Cross-attention is a general-purpose conditioning mechanism. Rather than baking text conditioning into the architecture, LDMs use cross-attention to inject arbitrary conditioning signals. This design enables the same architecture to handle text, layout, depth maps, and other modalities simply by swapping the conditioning encoder.

  3. The VAE downsampling factor is a critical design choice. Too little compression (f = 4) wastes computation. Too much (f = 32) destroys perceptually important information. The f = 8 sweet spot (512 → 64 spatial) preserves structure while enabling practical diffusion.

  4. Classifier-free guidance trades diversity for fidelity. By training with random condition dropout and extrapolating at inference, CFG gives users a single knob to control how closely outputs match the text prompt. This is more flexible than classifier guidance (which requires a separately trained classifier) and produces higher-quality results.

  5. Two-stage training enables modular improvements. Because the VAE and diffusion model are trained independently, each component can be improved separately. A better VAE directly improves all downstream generation without retraining the diffusion model, and vice versa. This modularity has proven essential for the rapid iteration seen in the Stable Diffusion ecosystem.

Impact and Legacy

The LDM paper, published at CVPR 2022, is arguably the most impactful generative modeling paper of the decade. Its open-source release as Stable Diffusion in August 2022 democratized high-quality image generation overnight. For the first time, anyone with a consumer GPU could generate photorealistic images from text prompts in seconds.

The architecture has spawned an enormous ecosystem of improvements and applications:

  • Stable Diffusion XL (SDXL) scaled the approach with a larger U-Net and a two-stage pipeline (base + refiner), producing higher-quality outputs at 1024×1024 resolution while maintaining the latent-space efficiency.

  • Stable Diffusion 3 replaced the U-Net with a Diffusion Transformer (DiT) architecture, using the same latent-space principle but with transformer blocks instead of convolutional layers. This enabled better scaling and improved text rendering.

  • ControlNet demonstrated that the cross-attention conditioning mechanism could be extended with spatial control signals — edge maps, depth maps, pose skeletons — enabling precise compositional control over generated images.

  • IP-Adapter, LoRA, and Textual Inversion showed that the modular architecture allowed efficient fine-tuning and personalization without retraining the full model, enabling users to adapt Stable Diffusion to specific styles, subjects, and domains.

  • Latent Video Diffusion extended the approach to video generation, applying diffusion in a spatiotemporal latent space to produce coherent video sequences.

The fundamental insight — that diffusion should operate in a learned latent space rather than raw pixel space — has become the default paradigm for generative models. Nearly every subsequent text-to-image system, from DALL-E 3 to Imagen to Midjourney, adopts some form of latent-space generation. The LDM paper didn’t just introduce a better architecture; it established the design principles that define modern generative AI.

  • Attention Is All You Need — The transformer architecture underlying the text encoders and U-Net attention layers
  • CLIP — The contrastive vision-language model used as the text encoder in Stable Diffusion
  • Vision Transformer — The ViT architecture that inspired Diffusion Transformers (DiT)

If you found this paper review helpful, consider sharing it with others.
