TL;DR
- A VAE replaces the deterministic bottleneck of a standard autoencoder with a probabilistic one: the encoder outputs the mean μ and standard deviation σ of a Gaussian, a latent code z is sampled from it, and the decoder reconstructs the input from z.
- Training maximizes the Evidence Lower Bound (ELBO): a reconstruction term that pushes the decoder to reproduce the input, minus a KL divergence that regularizes the latent distribution toward 𝒩(0, I).
- The reparameterization trick, rewriting z = μ + σ ⊙ ε with ε ∼ 𝒩(0, I), is the key that lets gradients flow through the sampling step so the model can be trained end-to-end.
- The KL penalty produces a smooth, continuous latent space: you can interpolate between two codes and the decoder produces semantically smooth outputs, which a vanilla autoencoder does not guarantee.
Encoder, latent, decoder
A classical autoencoder learns a deterministic mapping from input to code to reconstruction. The problem: the latent space has no guaranteed structure, so sampling an arbitrary point usually produces garbage. The VAE fixes this by making the encoder probabilistic.
Given an input x, the encoder outputs two vectors, a mean μ(x) and a log-variance log σ²(x), that define a Gaussian q_φ(z|x) = 𝒩(z; μ, σ² I). A latent code z is sampled from this distribution and passed to the decoder, which reconstructs x̂ ≈ x.
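The encoder, sampling, and decoder steps just described can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the single linear layers, the parameter names (W_mu, W_lv, W_dec and their biases), and the random untrained weights are all hypothetical, and a real VAE would use nonlinear networks in a deep learning framework.

```python
import math
import random

def linear(W, b, v):
    """Affine map W @ v + b for plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def vae_forward(x, params, rng):
    """One stochastic pass: x -> (mu, logvar) -> z -> x_hat."""
    mu = linear(params["W_mu"], params["b_mu"], x)      # encoder mean
    logvar = linear(params["W_lv"], params["b_lv"], x)  # encoder log-variance
    # Sample z ~ N(mu, sigma^2), written as mu + sigma * eps with eps ~ N(0, 1).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]
    x_hat = linear(params["W_dec"], params["b_dec"], z)  # decoder reconstruction
    return mu, logvar, z, x_hat

# Hypothetical, untrained weights: 3-dimensional input, 2-dimensional latent.
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

params = {
    "W_mu": rand_matrix(2, 3), "b_mu": [0.0, 0.0],
    "W_lv": rand_matrix(2, 3), "b_lv": [0.0, 0.0],
    "W_dec": rand_matrix(3, 2), "b_dec": [0.0, 0.0, 0.0],
}
mu, logvar, z, x_hat = vae_forward([0.2, -0.1, 0.7], params, rng)
print(len(mu), len(logvar), len(z), len(x_hat))  # 2 2 2 3
```

Predicting the log-variance rather than σ directly is a common choice because it leaves the network output unconstrained while keeping σ = exp(0.5 · log σ²) positive.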
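The training objective is the ELBO (Evidence Lower Bound):

$$
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
$$

Here θ denotes the decoder parameters and φ the encoder parameters.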
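The reconstruction term rewards the decoder for accurately recovering x. The KL term penalises the approximate posterior q_φ(z|x) for drifting away from the prior p(z) = 𝒩(0, I). For a diagonal Gaussian encoder the KL has a closed form, summed over the d latent dimensions:

$$
D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)
$$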
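As a rough sketch, the closed-form KL between a diagonal Gaussian and 𝒩(0, I), and the resulting per-example loss (the negative ELBO), might be computed as follows. The function names are illustrative, and the reconstruction term assumes a Gaussian decoder with unit variance so that it reduces to half the squared error; other likelihoods (for example Bernoulli for binary pixels) would change that term.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))

def negative_elbo(x, x_hat, mu, logvar):
    """Per-example loss = reconstruction error + KL penalty.

    Assumption: Gaussian decoder with unit variance, so the negative
    log-likelihood is half the squared error (additive constants dropped).
    """
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + kl_to_standard_normal(mu, logvar)

# A posterior equal to the prior incurs no KL penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Shifting one mean by 1 while keeping unit variance costs 0.5 nats.
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

Minimizing this loss is equivalent to maximizing the ELBO under the stated decoder assumption.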
The reparameterization trick
Sampling z ∼ 𝒩(μ, σ²) is a stochastic operation: the gradient cannot pass through a random draw. This is a problem because we need gradients to flow back to the encoder parameters φ through the sampled z.
The reparameterization trick sidesteps this by separating the randomness from the learnable parameters:

$$
z = \mu + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)
$$
Now z is a deterministic function of μ, σ, and an exogenous noise ε. The stochasticity lives in ε, which has no learnable parameters, so gradients can flow freely through μ and σ via the chain rule. Without this trick, training the inference network end-to-end would require high-variance score-function estimators.
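A minimal sketch of the trick in plain Python, assuming the encoder emits a log-variance per dimension. The helper name reparameterize and the example values are illustrative. In a framework with automatic differentiation the same expression is used, and the gradients noted in the comments are computed for you.

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); returns (z, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    sigma = [math.exp(0.5 * lv) for lv in logvar]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return z, eps

rng = random.Random(0)
mu, logvar = [1.0, -2.0], [0.0, math.log(0.25)]  # sigma = [1.0, 0.5]
z, eps = reparameterize(mu, logvar, rng)

# With eps held fixed, z is an ordinary differentiable function of mu and sigma:
# dz/dmu = 1 and dz/dsigma = eps, which is what backpropagation relies on.
grad_mu = [1.0 for _ in mu]
grad_sigma = list(eps)

# Sanity check: many draws of the second coordinate should have
# roughly the requested mean (-2.0) and standard deviation (0.5).
samples = [reparameterize(mu, logvar, rng)[0][1] for _ in range(20000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(round(mean, 1), round(std, 1))
```

The draws are distributed exactly as 𝒩(μ, σ²); only the way the randomness is expressed has changed.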
A structured latent space
Because the KL term pulls every encoder distribution toward 𝒩(0, I), the aggregate posterior across all training inputs is pushed to cover the unit Gaussian rather than scatter into isolated clusters. This has a crucial consequence: the latent space tends to be smooth and continuous. Nearby codes decode to similar outputs, and straight-line interpolation between two codes typically traces a sensible trajectory through the data manifold.
In a standard autoencoder the encoder can place codes anywhere without penalty. Adjacent codes may map to completely unrelated outputs, so random sampling or interpolation often fails. The KL regularization in the VAE is what discourages this fragmentation.
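The straight-line interpolation mentioned above can be sketched as follows. The function name interpolate and the two example codes are hypothetical; in practice z_a and z_b would be encoder means for two inputs, and each intermediate code would be passed to a trained decoder.

```python
def interpolate(z_a, z_b, steps):
    """Return `steps` evenly spaced latent codes on the line from z_a to z_b."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 (z_a) to 1.0 (z_b)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path

z_a, z_b = [0.0, 1.0], [2.0, -1.0]
path = interpolate(z_a, z_b, steps=5)
for z in path:
    print(z)  # each z would be fed to the decoder to produce one frame
```

Linear interpolation is the simplest choice; it is shown here only because it matches the description in the text.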
Why it mattered
VAEs were among the first deep generative models with a principled, differentiable inference procedure. Earlier approaches like restricted Boltzmann machines relied on MCMC-based approximations during training; GANs (proposed the same year) avoid inference entirely at the cost of training instability and no explicit likelihood. The VAE gave practitioners both a stable training objective (the ELBO) and an inference network that runs in a single forward pass at test time.
The latent space idea has proven foundational. Latent Diffusion Models (Stable Diffusion) use a VAE to compress images to a compact latent before running the expensive diffusion process — the VAE is precisely the stage that makes high-resolution diffusion practical on consumer hardware. The conceptual framework of encoding inputs into a structured Gaussian latent and decoding from that latent has become a standard building block across generative modelling, representation learning, and controllable synthesis.
Related Reading
- Latent Diffusion — uses a VAE as its first stage to compress images into a latent space where diffusion is computationally feasible
- DDPM — denoising diffusion probabilistic models, the other major generative paradigm that competes with and complements VAEs
- GAN — generative adversarial networks, proposed the same year, offering an adversarial alternative without explicit likelihoods
