
DDPM: Denoising Diffusion Probabilistic Models

How diffusion models learn to generate images by reversing a gradual noising process — the foundation of Stable Diffusion, DALL-E, and modern image generation.

Jonathan Ho, Ajay Jain, Pieter Abbeel · 15 min read · Original Paper · generative-models, diffusion, denoising +3

TL;DR

DDPM shows that you can generate high-quality images by learning to reverse a simple noising process. Start with a clean image, add Gaussian noise step by step until it becomes pure static, then train a neural network to undo each step. The result is a generative model that rivals GANs in sample quality while being dramatically more stable to train — no adversarial dynamics, no mode collapse, just a straightforward MSE loss on predicted noise.

The Core Idea: Noise and Denoise

The central insight of diffusion models is elegantly simple: if you can systematically destroy information, you can learn to reverse that destruction.

The forward process q takes a clean data sample \mathbf{x}_0 and gradually adds Gaussian noise over T timesteps, producing a sequence \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T where each step makes the image slightly noisier. By the final step, \mathbf{x}_T is indistinguishable from pure Gaussian noise — all information about the original image has been erased.

The reverse process p_θ learns to undo this destruction. A neural network (typically a U-Net) is trained to take a noisy image \mathbf{x}_t and predict the noise that was added, effectively learning to denoise one step at a time. At generation time, we start from pure noise \mathbf{x}_T ∼ 𝒩(\mathbf{0}, \mathbf{I}) and iteratively apply the learned denoiser, producing progressively cleaner images until we arrive at a realistic sample \mathbf{x}_0.

Each forward step is a simple Gaussian transition:

q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = 𝒩(\mathbf{x}_t;\ √(1-β_t)\,\mathbf{x}_{t-1},\ β_t\mathbf{I})

where β_t is a small noise variance that controls how much noise is added at step t. The signal is scaled by √(1-β_t) to keep the variance bounded, while fresh noise with variance β_t is injected.

The Forward Process: Destroying Information

The forward process is a fixed Markov chain — it has no learnable parameters. At each timestep t, the image is scaled down slightly and fresh Gaussian noise is added. The noise variance β_t follows a predetermined schedule, typically increasing linearly from β_1 = 10⁻⁴ to β_T = 0.02 over T = 1000 steps.

A critical property makes training efficient: because a composition of Gaussian steps is itself Gaussian, we can sample \mathbf{x}_t at any arbitrary timestep directly from \mathbf{x}_0 without running through all previous steps. Define α_t = 1 - β_t and \bar{α}_t = Π_{s=1}^{t} α_s. Then:

q(\mathbf{x}_t \mid \mathbf{x}_0) = 𝒩(\mathbf{x}_t;\ √(\bar{α}_t)\,\mathbf{x}_0,\ (1-\bar{α}_t)\mathbf{I})

This means we can write any noisy sample as a simple linear combination:

\mathbf{x}_t = √(\bar{α}_t)\,\mathbf{x}_0 + √(1-\bar{α}_t)\,\boldsymbol{ε}, \quad \boldsymbol{ε} ∼ 𝒩(\mathbf{0}, \mathbf{I})

The coefficient √(\bar{α}_t) controls how much of the original signal remains, while √(1-\bar{α}_t) controls the noise amplitude. As t increases, \bar{α}_t decreases toward zero, and the signal is progressively overwhelmed by noise. At t = T, \bar{α}_T ≈ 0 and the sample is essentially pure noise.
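As a concrete illustration, the closed-form jump can be implemented in a few lines of NumPy. This is a minimal sketch assuming the paper's linear schedule; the name `q_sample` is chosen here for illustration, not taken from any particular codebase.

```python
import numpy as np

# Linear schedule from the paper: beta_1 = 1e-4 up to beta_T = 0.02 over T = 1000 steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # abar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Draw x_t ~ q(x_t | x_0) in closed form, without iterating the chain."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
```

With this schedule, `alpha_bars[-1]` is on the order of 10⁻⁵, so the final sample really is almost pure noise.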

Noise Schedules: How Fast to Add Noise

The noise schedule — the sequence of β_t values across timesteps — determines how quickly information is destroyed during the forward process. This choice has a significant impact on both training efficiency and sample quality.

The linear schedule used in the original DDPM paper increases β_t linearly from 10⁻⁴ to 0.02. This works well but has a weakness: because \bar{α}_t drops too quickly in early timesteps, the model spends most of its capacity learning to denoise heavily corrupted images. The early, lightly-noised timesteps — where fine details matter most — are relatively underrepresented.

The cosine schedule, introduced by Nichol and Dhariwal in their Improved DDPM paper (2021), addresses this by designing \bar{α}_t directly as a cosine curve:

\bar{α}_t = f(t)/f(0), \quad f(t) = cos²\!\left(\frac{t/T + s}{1 + s} · \frac{π}{2}\right)

where s = 0.008 is a small offset that prevents β_t from being too small near t = 0. The cosine schedule preserves signal much longer in the early timesteps, giving the model more training signal at low noise levels where perceptual quality is determined.
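The formula above translates directly into code. A NumPy sketch (function names are illustrative; the β clipping at 0.999 follows the Improved DDPM paper):

```python
import numpy as np

def cosine_alpha_bar(T=1000, s=0.008):
    """abar_t as a squared-cosine curve (Nichol & Dhariwal, 2021), length T + 1."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]  # normalize so abar_0 = 1

def betas_from_alpha_bar(abar, max_beta=0.999):
    """Recover per-step variances: beta_t = 1 - abar_t / abar_{t-1}, clipped for stability."""
    return np.clip(1.0 - abar[1:] / abar[:-1], 0.0, max_beta)
```

Defining \bar{α}_t first and deriving β_t from it (rather than the other way around) is what makes the schedule's information-destruction rate easy to control.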

The Reverse Process: Learning to Denoise

The reverse process is where learning happens. Given a noisy image \mathbf{x}_t and the current timestep t, a U-Net architecture predicts the noise \boldsymbol{ε} that was added. The architecture uses sinusoidal timestep embeddings (similar to positional encodings in transformers) to condition the network on t, telling it how much noise to expect.

The U-Net is a natural fit for this task: its encoder-decoder structure with skip connections allows it to capture both global structure and fine-grained details. The encoder progressively downsamples the spatial dimensions to capture semantic content, while the decoder upsamples back to the original resolution using skip connections to preserve spatial detail. Self-attention layers at lower resolutions help the model reason about global image coherence.
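One common implementation of the sinusoidal timestep embedding mentioned above is the transformer-style construction below. This is a minimal NumPy sketch; the exact frequency spacing and cos/sin ordering vary between codebases.

```python
import math
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000):
    """Transformer-style sinusoidal embedding of a scalar timestep t -> (dim,) vector."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to ~1/max_period.
    freqs = np.exp(-math.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])
```

In a full model, this vector is typically passed through a small MLP and injected into each U-Net block, so every layer knows the current noise level.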

The simplified training objective from the paper is a straightforward MSE loss on the predicted noise:

L_{simple} = 𝔼_{t,\,\mathbf{x}_0,\,\boldsymbol{ε}}\left[\,‖\boldsymbol{ε} - \boldsymbol{ε}_θ(√(\bar{α}_t)\,\mathbf{x}_0 + √(1-\bar{α}_t)\,\boldsymbol{ε},\ t)‖²\,\right]

The expectation is over uniformly sampled timesteps t ∼ \text{Uniform}(1, T), training images \mathbf{x}_0 ∼ q(\mathbf{x}_0), and noise samples \boldsymbol{ε} ∼ 𝒩(\mathbf{0}, \mathbf{I}). This simplified objective drops the weighting term from the full variational lower bound but works better in practice — the paper found it produces higher sample quality despite being a looser bound.
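Putting the pieces together, a single Monte Carlo draw of the simplified objective looks roughly like this. A NumPy sketch, with an abstract `model(x_t, t)` callable standing in for the U-Net:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def simple_loss(model, x0):
    """One Monte Carlo draw of L_simple: noise x0 to a random t, score the prediction."""
    t = int(rng.integers(0, T))                  # t ~ Uniform (0-indexed here)
    eps = rng.standard_normal(x0.shape)          # eps ~ N(0, I)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - model(x_t, t)) ** 2)   # || eps - eps_theta(x_t, t) ||^2
```

Against a model that always predicts zeros, the loss settles near 1 — the per-dimension expected squared norm of the unpredicted Gaussian noise.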

There are three common ways to parameterize what the network predicts:

  • \boldsymbol{ε}-prediction (original DDPM): predict the noise that was added. The denoising update rule then uses this to compute \mathbf{x}_{t-1}.

  • \mathbf{x}_0-prediction: predict the clean image directly. The posterior mean formula then combines this with \mathbf{x}_t to find \mathbf{x}_{t-1}.

  • v-prediction (from progressive distillation): predict a velocity \mathbf{v} = √(\bar{α}_t)\,\boldsymbol{ε} - √(1-\bar{α}_t)\,\mathbf{x}_0, which is numerically more stable at both very low and very high noise levels.

All three parameterizations are mathematically equivalent but have different numerical properties during training.
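The equivalence is easy to verify: each parameterization is a linear function of the other two, given \mathbf{x}_t. A sketch of the conversions (`abar_t` denotes \bar{α}_t at the sampled timestep; function names are chosen here for illustration):

```python
import numpy as np

def x0_from_eps(x_t, eps, abar_t):
    """Invert x_t = sqrt(abar) x0 + sqrt(1-abar) eps for the clean image."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)

def v_from_eps_x0(eps, x0, abar_t):
    """Velocity target: v = sqrt(abar) eps - sqrt(1-abar) x0."""
    return np.sqrt(abar_t) * eps - np.sqrt(1.0 - abar_t) * x0

def eps_from_v(x_t, v, abar_t):
    """Invert the velocity parameterization: eps = sqrt(abar) v + sqrt(1-abar) x_t."""
    return np.sqrt(abar_t) * v + np.sqrt(1.0 - abar_t) * x_t
```

Note that `x0_from_eps` divides by √(\bar{α}_t), which is tiny at high noise levels — one concrete reason the parameterizations behave differently numerically even though they are algebraically interchangeable.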

Sampling: From Noise to Image

Once trained, generating an image follows a simple iterative procedure. Start with \mathbf{x}_T ∼ 𝒩(\mathbf{0}, \mathbf{I}) and for each timestep from T down to 1:

\mathbf{x}_{t-1} = \frac{1}{√(α_t)}\left(\mathbf{x}_t - \frac{1-α_t}{√(1-\bar{α}_t)}\,\boldsymbol{ε}_θ(\mathbf{x}_t, t)\right) + σ_t\,\mathbf{z}

where \mathbf{z} ∼ 𝒩(\mathbf{0}, \mathbf{I}) for t > 1 and \mathbf{z} = \mathbf{0} for t = 1. The noise term σ_t\,\mathbf{z} adds stochasticity to the sampling process, which helps with diversity.
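The full ancestral sampling loop can be sketched as follows (NumPy, using σ_t² = β_t, one of the two variance choices discussed in the paper; `model` stands in for the trained noise predictor):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def ddpm_sample(model, shape):
    """Ancestral sampling: start from pure noise and apply the update rule T times."""
    x = rng.standard_normal(shape)                      # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        eps = model(x, t)
        x = (x - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                                       # add sigma_t z, except at the last step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

The loop makes the cost structure obvious: one full network evaluation per timestep, T times, with no way to parallelize across steps.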

The major drawback is speed: DDPM requires all T = 1000 sequential denoising steps. This makes generation roughly 1000 times slower than a GAN or VAE, which produce samples in a single forward pass. Two key follow-up works addressed this:

DDIM (Denoising Diffusion Implicit Models, Song et al. 2021) showed that the reverse process can be reformulated as solving an ordinary differential equation (ODE). This allows deterministic sampling and — crucially — enables skipping timesteps. With only 50 steps, DDIM achieves quality comparable to 1000-step DDPM.

DPM-Solver (Lu et al. 2022) applies higher-order numerical ODE solvers (analogous to Runge-Kutta methods) to the diffusion ODE. By using second and third-order solvers, it achieves high-quality samples in as few as 10–20 steps — a 50–100x speedup over the original DDPM.
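To make the timestep-skipping idea concrete, a single deterministic DDIM update (the η = 0 case) can be written as a two-line function: predict \mathbf{x}_0, then re-noise it to the target level. Because `abar_prev` need not belong to the adjacent timestep, the same function supports a 50-step schedule that simply picks 50 noise levels. An illustrative sketch, not the DDIM authors' code:

```python
import numpy as np

def ddim_step(x_t, eps, abar_t, abar_prev):
    """One deterministic DDIM step (eta = 0): predict x0, then jump to noise level abar_prev."""
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps
```

If the noise prediction is exact, the step lands exactly on the correctly noised image at the new level, which is why large jumps remain accurate when the model is good.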

How DDPM Compares

Generative Model Comparison

How DDPM compares to other generative model families across key dimensions.

| Model | Training | Coverage | Quality | Speed | Likelihood | Mechanism |
| --- | --- | --- | --- | --- | --- | --- |
| DDPM | Stable MSE loss | Full distribution | FID 3.17 | 1000 steps | Tractable ELBO | Iterative denoising via learned reverse Markov chain |
| GAN | Mode collapse risk | Mode dropping | Sharp outputs | 1 forward pass | Implicit density | Adversarial min-max game between generator and discriminator |
| VAE | Stable ELBO | Posterior collapse | Often blurry | 1 forward pass | ELBO bound | Encoder-decoder with latent Gaussian prior |
| Normalizing Flow | Exact likelihood | Bijective mapping | Architecture limits | Depends on depth | Exact log p(x) | Invertible transforms with change-of-variables formula |

Use DDPM when…
  • You need high-quality samples with full mode coverage — no mode collapse or dropping
  • Training stability is critical — simple MSE loss, no adversarial dynamics
  • You want a tractable likelihood bound for model comparison or density estimation
Consider alternatives when…
  • Sampling speed is paramount — GANs and VAEs generate in a single forward pass
  • You need real-time generation — 1000 sequential denoising steps is prohibitive
  • You require exact likelihoods rather than bounds — normalizing flows are better suited

Key Results

Unconditional Image Generation

On CIFAR-10, DDPM achieves an FID score of 3.17, which at the time of publication was competitive with the best GAN results while being dramatically simpler to train. On 256×256 LSUN datasets, the model produces high-quality bedroom and church images with an FID of 4.90 on LSUN Bedroom.

Connection to Variational Inference

DDPM is formally a hierarchical variational autoencoder with T latent variables. The forward process defines the approximate posterior q(\mathbf{x}_{1:T} \mid \mathbf{x}_0), and the reverse process defines the generative model p_θ(\mathbf{x}_{0:T}). The training objective is derived from the evidence lower bound (ELBO):

\log p(\mathbf{x}_0) ≥ 𝔼_q\!\left[\log \frac{p_θ(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right]

The simplified MSE loss used in practice is a reweighted version of this bound that drops the per-timestep weighting factors. While this makes it a looser bound, it empirically leads to better sample quality, suggesting that equal weighting across timesteps provides a more useful training signal.

Log-Likelihood

DDPM achieves a negative log-likelihood of 3.70 bits/dim on CIFAR-10, which is competitive with other likelihood-based models. The Improved DDPM paper later showed that switching to a learned variance (predicting σ_t in addition to \boldsymbol{ε}) and using the cosine schedule improves this to 2.94 bits/dim — a state-of-the-art result among likelihood-based models.

Key Takeaways

  1. Diffusion as iterative refinement — generating images by gradually denoising from pure noise is a fundamentally different paradigm from adversarial training (GANs) or single-step decoding (VAEs), and it produces remarkably stable training dynamics.

  2. The simplified objective works best — predicting the noise \boldsymbol{ε} with a simple MSE loss outperforms the theoretically motivated variational bound, showing that practical simplicity often beats mathematical elegance.

  3. The forward process enables direct training — the reparameterization trick lets us jump to any timestep in O(1), making training efficient despite the 1000-step Markov chain.

  4. Speed is the main weakness — 1000 sequential denoising steps makes generation slow, but this has been largely solved by DDIM and DPM-Solver, reducing the gap to 20–50 steps.

  5. Full mode coverage by construction — unlike GANs which can drop modes, diffusion models are trained on a proper likelihood objective and sample from the full data distribution.

Impact and Legacy

DDPM is the foundational paper behind the modern diffusion model revolution. While Sohl-Dickstein et al. (2015) introduced the theoretical framework, it was Ho, Jain, and Abbeel who demonstrated that diffusion models could generate images competitive with GANs — the dominant paradigm at the time.

The impact has been transformative. Stable Diffusion, DALL-E 2, Imagen, and Midjourney all build directly on the DDPM framework. Latent Diffusion Models (Rombach et al. 2022) applied the diffusion process in a compressed latent space rather than pixel space, enabling high-resolution generation with practical compute budgets. Classifier-free guidance (Ho & Salimans 2022) provided a way to control the quality-diversity trade-off, and text conditioning via CLIP or T5 embeddings enabled the text-to-image revolution.

The connection to score-based models deserves mention: Song and Ermon's concurrent work on score matching showed that DDPM's denoising objective is equivalent to learning the score function ∇_{\mathbf{x}} \log p(\mathbf{x}). Song et al.'s subsequent unification via stochastic differential equations (Score SDE) showed that both perspectives are instances of the same continuous-time framework, providing a deeper theoretical foundation.

From image generation to video synthesis, audio generation, molecular design, and robotic planning — diffusion models have become the default generative framework across machine learning. DDPM’s contribution was showing that the simplest possible training objective, applied to the simplest possible noise-and-denoise pipeline, was enough to produce state-of-the-art results.

  • Attention Is All You Need — the transformer architecture used in modern diffusion model backbones like DiT
  • Vision Transformer — ViT architecture that inspired the Diffusion Transformer (DiT)
  • VICReg — another approach to learning without labels, using variance-invariance-covariance regularization
  • DINO — self-supervised vision transformers that learn representations through self-distillation
