TL;DR
DDPM shows that you can generate high-quality images by learning to reverse a simple noising process. Start with a clean image, add Gaussian noise step by step until it becomes pure static, then train a neural network to undo each step. The result is a generative model that rivals GANs in sample quality while being dramatically more stable to train — no adversarial dynamics, no mode collapse, just a straightforward MSE loss on predicted noise.
The Core Idea: Noise and Denoise
The central insight of diffusion models is elegantly simple: if you can systematically destroy information, you can learn to reverse that destruction.
The forward process q takes a clean data sample \mathbf{x}_0 and gradually adds Gaussian noise over T timesteps, producing a sequence \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T where each step makes the image slightly noisier. By the final step, \mathbf{x}_T is indistinguishable from pure Gaussian noise — all information about the original image has been erased.
The reverse process p_θ learns to undo this destruction. A neural network (typically a U-Net) is trained to take a noisy image \mathbf{x}_t and predict the noise that was added, effectively learning to denoise one step at a time. At generation time, we start from pure noise \mathbf{x}_T ∼ 𝒩(\mathbf{0}, \mathbf{I}) and iteratively apply the learned denoiser, producing progressively cleaner images until we arrive at a realistic sample \mathbf{x}_0.
Each forward step is a simple Gaussian transition:

q(\mathbf{x}_t | \mathbf{x}_{t-1}) = 𝒩(\mathbf{x}_t; √(1-β_t) \mathbf{x}_{t-1}, β_t \mathbf{I})

where β_t is a small noise variance that controls how much noise is added at step t. The signal is scaled by √(1-β_t) to keep the variance bounded, while fresh noise with variance β_t is injected.
The Forward Process: Destroying Information
The forward process is a fixed Markov chain — it has no learnable parameters. At each timestep t, the image is scaled down slightly and fresh Gaussian noise is added. The noise variance β_t follows a predetermined schedule, typically increasing linearly from β_1 = 10^{-4} to β_T = 0.02 over T = 1000 steps.
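These schedule quantities are cheap to precompute once. A minimal NumPy sketch (variable names are illustrative, not from the paper's code):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule: beta_1 ... beta_T
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha-bar_t = product of alpha_s up to t

# By the final step almost no signal remains: alpha-bar_T is roughly 4e-5,
# so x_T is essentially pure noise.
print(alpha_bars[0], alpha_bars[-1])
```

Note how the cumulative product collapses toward zero even though each individual β_t is small.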
A critical property makes training efficient: we can sample \mathbf{x}_t at any arbitrary timestep directly from \mathbf{x}_0 without running through all previous steps. Define α_t = 1 - β_t and \bar{α}_t = Π_{s=1}^{t} α_s. Then:

q(\mathbf{x}_t | \mathbf{x}_0) = 𝒩(\mathbf{x}_t; √(\bar{α}_t) \mathbf{x}_0, (1-\bar{α}_t) \mathbf{I})

This means that, via the reparameterization trick, we can write any noisy sample as a simple linear combination:

\mathbf{x}_t = √(\bar{α}_t) \mathbf{x}_0 + √(1-\bar{α}_t) \boldsymbol{ε}, where \boldsymbol{ε} ∼ 𝒩(\mathbf{0}, \mathbf{I})

The coefficient √(\bar{α}_t) controls how much of the original signal remains, while √(1-\bar{α}_t) controls the noise amplitude. As t increases, \bar{α}_t decreases toward zero, and the signal is progressively overwhelmed by noise. At t = T, \bar{α}_T ≈ 0 and the sample is essentially pure noise.
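The closed-form jump to any timestep is a one-liner in practice. A NumPy sketch (the function name is made up for illustration):

```python
import numpy as np

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot, without iterating earlier steps."""
    eps = rng.standard_normal(x0.shape)     # fresh Gaussian noise
    ab = alpha_bars[t]                      # alpha-bar at (0-indexed) step t
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)
rng = np.random.default_rng(1)
x0 = rng.standard_normal(10000)             # toy unit-variance "data"
xt, _ = q_sample(x0, 999, alpha_bars, rng)  # near t = T this is almost pure noise
```

Since \bar{α}_T ≈ 0, the sample at the last step has roughly unit variance and carries almost no trace of x0.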
Noise Schedules: How Fast to Add Noise
The noise schedule — the sequence of β_t values across timesteps — determines how quickly information is destroyed during the forward process. This choice has a significant impact on both training efficiency and sample quality.
The linear schedule used in the original DDPM paper increases β_t linearly from 10^{-4} to 0.02. This works well but has a weakness: \bar{α}_t falls toward zero well before the end of the chain, so a large fraction of timesteps contain nearly pure noise and the model spends most of its capacity learning to denoise heavily corrupted images. The lightly-noised timesteps — where fine details matter most — are relatively underrepresented.
The cosine schedule, introduced by Nichol and Dhariwal in their Improved DDPM paper (2021), addresses this by designing \bar{α}_t directly as a cosine curve:

\bar{α}_t = f(t)/f(0), where f(t) = cos²( ((t/T + s)/(1 + s)) · (π/2) )

where s = 0.008 is a small offset that prevents β_t from being too small near t = 0. The cosine schedule preserves signal much longer in the early timesteps, giving the model more training signal at low noise levels where perceptual quality is determined.
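A NumPy sketch of the cosine schedule, following the formula above (including the β clipping at 0.999 used in the Improved DDPM paper), compared against the linear schedule:

```python
import numpy as np

def cosine_alpha_bar(T=1000, s=0.008):
    """Cosine schedule: alpha-bar_t follows a squared-cosine curve in t/T."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]   # recover per-step variances
    return alpha_bar[1:], np.clip(betas, 0, 0.999)

ab_cos, betas_cos = cosine_alpha_bar()
ab_lin = np.cumprod(1 - np.linspace(1e-4, 0.02, 1000))
# The cosine schedule keeps noticeably more signal at early and middle timesteps:
print(ab_cos[100], ab_lin[100])
print(ab_cos[500], ab_lin[500])
```

Comparing \bar{α}_t at matching timesteps makes the difference concrete: the cosine curve stays high where the linear schedule has already destroyed most of the signal.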
The Reverse Process: Learning to Denoise
The reverse process is where learning happens. Given a noisy image \mathbf{x}_t and the current timestep t, a U-Net architecture predicts the noise \boldsymbol{ε} that was added. The architecture uses sinusoidal timestep embeddings (similar to positional encodings in transformers) to condition the network on t, telling it how much noise to expect.
The U-Net is a natural fit for this task: its encoder-decoder structure with skip connections allows it to capture both global structure and fine-grained details. The encoder progressively downsamples the spatial dimensions to capture semantic content, while the decoder upsamples back to the original resolution using skip connections to preserve spatial detail. Self-attention layers at lower resolutions help the model reason about global image coherence.
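The timestep conditioning mentioned above is easy to sketch. A minimal sinusoidal embedding in NumPy (the dimension and function name are illustrative):

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000):
    """Sinusoidal embedding of timestep t, as in transformer positional encodings."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to 1/max_period
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(500)
print(emb.shape)  # (128,)
```

In a real U-Net this vector is typically passed through a small MLP and added to the feature maps of each residual block.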
The simplified training objective from the paper is a straightforward MSE loss on the predicted noise:

L_{simple} = 𝔼_{t, \mathbf{x}_0, \boldsymbol{ε}} [ ‖\boldsymbol{ε} - \boldsymbol{ε}_θ(√(\bar{α}_t) \mathbf{x}_0 + √(1-\bar{α}_t) \boldsymbol{ε}, t)‖² ]

The expectation is over uniformly sampled timesteps t ∼ \text{Uniform}(1, T), training images \mathbf{x}_0 ∼ q(\mathbf{x}_0), and noise samples \boldsymbol{ε} ∼ 𝒩(\mathbf{0}, \mathbf{I}). This simplified objective drops the weighting term from the full variational lower bound but works better in practice — the paper found it produces higher sample quality despite being a looser bound.
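One evaluation of this objective can be sketched in a few lines. Here `dummy_model` is a stand-in for the trained U-Net ε_θ (in real training this would be a network and the loss would be backpropagated):

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

def dummy_model(xt, t):
    # Stand-in for the U-Net eps_theta(x_t, t); always predicts zero noise.
    return np.zeros_like(xt)

def training_loss(x0):
    """One Monte Carlo sample of the simplified DDPM objective."""
    t = rng.integers(0, 1000)                   # t ~ Uniform over timesteps
    eps = rng.standard_normal(x0.shape)         # eps ~ N(0, I)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps
    return np.mean((eps - dummy_model(xt, t)) ** 2)

loss = training_loss(rng.standard_normal((32, 32)))
```

With a zero-predicting model the loss is just the mean squared norm of the sampled noise, close to 1 — a useful sanity check when wiring up a real model.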
There are three common ways to parameterize what the network predicts:
- \boldsymbol{ε}-prediction (original DDPM): predict the noise that was added. The denoising update rule then uses this to compute \mathbf{x}_{t-1}.
- \mathbf{x}_0-prediction: predict the clean image directly. The posterior mean formula then combines this with \mathbf{x}_t to find \mathbf{x}_{t-1}.
- v-prediction (from progressive distillation): predict a velocity \mathbf{v} = √(\bar{α}_t)\boldsymbol{ε} - √(1-\bar{α}_t)\mathbf{x}_0, which is numerically more stable at both very low and very high noise levels.
All three parameterizations are mathematically equivalent but have different numerical properties during training.
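The equivalence follows from the identity \mathbf{x}_t = √(\bar{α}_t)\mathbf{x}_0 + √(1-\bar{α}_t)\boldsymbol{ε}: given \mathbf{x}_t, any one target determines the other two. A NumPy sketch of the v-prediction conversions (helper names are made up here):

```python
import numpy as np

def to_eps(v, xt, ab):
    """Recover the noise target from a v-prediction (ab = alpha-bar_t)."""
    a, b = np.sqrt(ab), np.sqrt(1 - ab)
    return a * v + b * xt          # uses a^2 + b^2 = 1

def to_x0(v, xt, ab):
    """Recover the clean-image target from a v-prediction."""
    a, b = np.sqrt(ab), np.sqrt(1 - ab)
    return a * xt - b * v

# Round-trip check on a random example:
rng = np.random.default_rng(0)
x0, eps, ab = rng.standard_normal(4), rng.standard_normal(4), 0.7
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps
v = np.sqrt(ab) * eps - np.sqrt(1 - ab) * x0
print(np.allclose(to_eps(v, xt, ab), eps), np.allclose(to_x0(v, xt, ab), x0))
```

Both recoveries are exact algebraic identities, which is why the choice of target changes training dynamics but not the model family.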
Sampling: From Noise to Image
Once trained, generating an image follows a simple iterative procedure. Start with \mathbf{x}_T ∼ 𝒩(\mathbf{0}, \mathbf{I}) and for each timestep from T down to 1:

\mathbf{x}_{t-1} = (1/√(α_t)) ( \mathbf{x}_t - (β_t/√(1-\bar{α}_t)) \boldsymbol{ε}_θ(\mathbf{x}_t, t) ) + σ_t \mathbf{z}

where \mathbf{z} ∼ 𝒩(\mathbf{0}, \mathbf{I}) for t > 1 and \mathbf{z} = \mathbf{0} for t = 1, and the paper's default sets σ_t² = β_t. The noise term σ_t \mathbf{z} adds stochasticity to the sampling process, which helps with diversity.
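The loop above can be sketched directly. Here `dummy_model` again stands in for the trained U-Net ε_θ; with a real model, `x` would converge to a plausible image:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def dummy_model(xt, t):
    # Stand-in for the trained U-Net eps_theta(x_t, t).
    return np.zeros_like(xt)

def sample(shape=(8,)):
    """Ancestral DDPM sampling: T sequential denoising steps from pure noise."""
    x = rng.standard_normal(shape)                        # x_T ~ N(0, I)
    for t in reversed(range(T)):                          # t = T-1 ... 0 (0-indexed)
        eps = dummy_model(x, t)
        mean = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        z = rng.standard_normal(shape) if t > 0 else 0.0  # no noise at the final step
        x = mean + np.sqrt(betas[t]) * z                  # sigma_t^2 = beta_t
    return x

out = sample()
```

The key structural point is the strict sequential dependence: each iteration needs the previous one's output, which is exactly why DDPM sampling is slow.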
The major drawback is speed: DDPM requires all T = 1000 sequential denoising steps. This makes generation roughly 1000 times slower than a GAN or VAE, which produce samples in a single forward pass. Two key follow-up works addressed this:
DDIM (Denoising Diffusion Implicit Models, Song et al. 2021) showed that the reverse process can be reformulated as solving an ordinary differential equation (ODE). This allows deterministic sampling and — crucially — enables skipping timesteps. With only 50 steps, DDIM achieves quality comparable to 1000-step DDPM.
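A single deterministic DDIM update (the η = 0 case) is simple enough to sketch; the helper name is illustrative, and `eps_pred` is assumed to come from a trained model:

```python
import numpy as np

def ddim_step(xt, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update (eta = 0); ab_* are alpha-bar values.
    Because the step is written in terms of alpha-bar, ab_prev may belong to
    a much earlier timestep, which is what allows skipping steps."""
    x0_hat = (xt - np.sqrt(1 - ab_t) * eps_pred) / np.sqrt(ab_t)  # predicted clean image
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1 - ab_prev) * eps_pred

# Sanity check: with the *true* noise, the step lands exactly on the
# corresponding sample at the earlier noise level.
rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(4), rng.standard_normal(4)
ab_t, ab_prev = 0.5, 0.9
xt = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps
x_prev = ddim_step(xt, eps, ab_t, ab_prev)
print(np.allclose(x_prev, np.sqrt(ab_prev) * x0 + np.sqrt(1 - ab_prev) * eps))
```

In a 50-step DDIM sampler, this update is simply applied along a coarse subsequence of the original 1000 timesteps.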
DPM-Solver (Lu et al. 2022) applies higher-order numerical ODE solvers (analogous to Runge-Kutta methods) to the diffusion ODE. By using second and third-order solvers, it achieves high-quality samples in as few as 10–20 steps — a 50–100x speedup over the original DDPM.
How DDPM Compares
Generative Model Comparison
How DDPM compares to other generative model families across key dimensions.
| Method | Training Stability | Mode Coverage | Sample Quality | Speed | Likelihood |
|---|---|---|---|---|---|
| DDPM | Stable MSE loss | Full distribution | FID 3.17 | 1000 steps | Tractable ELBO |
| GAN | Mode collapse risk | Mode dropping | Sharp outputs | 1 forward pass | Implicit density |
| VAE | Stable ELBO | Posterior collapse | Often blurry | 1 forward pass | ELBO bound |
| Normalizing Flow | Exact likelihood | Bijective mapping | Architecture limits | Depends on depth | Exact log p(x) |
- DDPM: iterative denoising via a learned reverse Markov chain
- GAN: adversarial min-max game between generator and discriminator
- VAE: encoder-decoder with a latent Gaussian prior
- Normalizing Flow: invertible transforms with the change-of-variables formula
Use DDPM when…
- You need high-quality samples with full mode coverage — no mode collapse or dropping
- Training stability is critical — simple MSE loss, no adversarial dynamics
- You want a tractable likelihood bound for model comparison or density estimation
Consider alternatives when…
- Sampling speed is paramount — GANs and VAEs generate in a single forward pass
- You need real-time generation — 1000 sequential denoising steps is prohibitive
- You require exact likelihoods rather than bounds — normalizing flows are better suited
Key Results
Unconditional Image Generation
On CIFAR-10, DDPM achieves an FID score of 3.17, which at the time of publication was competitive with the best GAN results while being dramatically simpler to train. On 256×256 LSUN datasets, the model produces high-quality bedroom and church images with an FID of 4.90 on LSUN Bedroom.
Connection to Variational Inference
DDPM is formally a hierarchical variational autoencoder with T latent variables. The forward process defines the approximate posterior q(\mathbf{x}_{1:T} | \mathbf{x}_0), and the reverse process defines the generative model p_θ(\mathbf{x}_{0:T}). The training objective is derived from the evidence lower bound (ELBO):

𝔼[-log p_θ(\mathbf{x}_0)] ≤ 𝔼_q[ -log ( p_θ(\mathbf{x}_{0:T}) / q(\mathbf{x}_{1:T} | \mathbf{x}_0) ) ]
The simplified MSE loss used in practice is a reweighted version of this bound that drops the per-timestep weighting factors. While this makes it a looser bound, it empirically leads to better sample quality, suggesting that equal weighting across timesteps provides a more useful training signal.
Log-Likelihood
DDPM achieves a negative log-likelihood of 3.70 bits/dim on CIFAR-10, which is competitive with other likelihood-based models. The Improved DDPM paper later showed that switching to a learned variance (predicting σ_t in addition to \boldsymbol{ε}) and using the cosine schedule improves this to 2.94 bits/dim — among the best results reported for likelihood-based models at the time.
Key Takeaways
-
Diffusion as iterative refinement — generating images by gradually denoising from pure noise is a fundamentally different paradigm from adversarial training (GANs) or single-step decoding (VAEs), and it produces remarkably stable training dynamics.
-
The simplified objective works best — predicting the noise \boldsymbol{ε} with a simple MSE loss outperforms the theoretically motivated variational bound, showing that practical simplicity often beats mathematical elegance.
-
The forward process enables direct training — the reparameterization trick lets us jump to any timestep in O(1), making training efficient despite the 1000-step Markov chain.
-
Speed is the main weakness — 1000 sequential denoising steps makes generation slow, but this has been largely solved by DDIM and DPM-Solver, reducing the gap to 20–50 steps.
-
Full mode coverage by construction — unlike GANs which can drop modes, diffusion models are trained on a proper likelihood objective and sample from the full data distribution.
Impact and Legacy
DDPM is the foundational paper behind the modern diffusion model revolution. While Sohl-Dickstein et al. (2015) introduced the theoretical framework, it was Ho, Jain, and Abbeel who demonstrated that diffusion models could generate images competitive with GANs — the dominant paradigm at the time.
The impact has been transformative. Stable Diffusion, DALL-E 2, Imagen, and Midjourney all build directly on the DDPM framework. Latent Diffusion Models (Rombach et al. 2022) applied the diffusion process in a compressed latent space rather than pixel space, enabling high-resolution generation with practical compute budgets. Classifier-free guidance (Ho & Salimans 2022) provided a way to control the quality-diversity trade-off, and text conditioning via CLIP or T5 embeddings enabled the text-to-image revolution.
The connection to score-based models deserves mention: Song and Ermon’s concurrent work on score matching showed that DDPM’s denoising objective is equivalent to learning the score function ∇_{\mathbf{x}} log p(\mathbf{x}). Song et al.’s subsequent unification via stochastic differential equations (Score SDE) showed that both perspectives are instances of the same continuous-time framework, providing a deeper theoretical foundation.
From image generation to video synthesis, audio generation, molecular design, and robotic planning — diffusion models have become the default generative framework across machine learning. DDPM’s contribution was showing that the simplest possible training objective, applied to the simplest possible noise-and-denoise pipeline, was enough to produce state-of-the-art results.
Related Reading
- Attention Is All You Need — the transformer architecture used in modern diffusion model backbones like DiT
- Vision Transformer — ViT architecture that inspired the Diffusion Transformer (DiT)
- VICReg — another approach to learning without labels, using variance-invariance-covariance regularization
- DINO — self-supervised vision transformers that learn representations through self-distillation
