A variational autoencoder learns to compress data into a continuous, probabilistic latent space — one smooth enough that you can walk between points and decode plausible new samples the whole way. Unlike a plain autoencoder, a VAE encodes each input to a distribution, and a regularizer keeps that space organized.
Interactive VAE latent space
The explorer below is a real VAE — a 2-D-latent model trained in PyTorch on the 1,797 real sklearn 8×8 handwritten digits. Its decoder weights run live in your browser, so every glyph is the trained network turning a latent z back into pixels. Roam the manifold, interpolate between digits, or sample the prior.
Probabilistic encoding
Instead of a single point, the encoder maps each input to a Gaussian over the latent space:
q_φ(z|x) = 𝒩(z; μ_φ(x), diag(σ_φ²(x)))
The mean μ_φ(x) is where the digit lands; the variance σ_φ²(x) is how much wiggle room it claims. In the explorer those are the dot and the dashed ring on the highlighted prototype.
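As a rough illustration of what "encoding to a distribution" looks like in code, here is a minimal sketch of an encoder with two output heads. The class name `Encoder`, the layer sizes, and the random input batch are assumptions made for the example (64 inputs to match the 8×8 digits, a 2-D latent to match the explorer); the article does not specify the trained model's architecture.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """Hypothetical encoder: one shared trunk, two heads (mean and log-variance)."""

    def __init__(self, in_dim=64, hidden=32, latent_dim=2):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)      # where the input lands
        self.logvar_head = nn.Linear(hidden, latent_dim)  # log of the variance it claims

    def forward(self, x):
        h = self.trunk(x)
        return self.mu_head(h), self.logvar_head(h)

torch.manual_seed(0)
encoder = Encoder()
x = torch.rand(5, 64)                 # stand-in batch of 5 flattened 8x8 images
mu, logvar = encoder(x)               # each of shape (5, 2)
sigma = torch.exp(0.5 * logvar)       # always positive, whatever the head outputs
```

Predicting the log-variance rather than the variance itself is a common choice because the network output can then be any real number while σ stays positive.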
The reparameterization trick
Sampling is random, and you cannot backpropagate through a sampling step. The fix is to move the randomness into an external noise variable ε, so that z becomes a deterministic function of μ and σ and the gradient can flow through both:
z = μ + σ ⊙ ε, with ε ~ 𝒩(0, I)
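```python
import torch
import torch.nn.functional as F

# Method of the VAE module; encode/decode are defined elsewhere on the class.
# beta is taken as an argument here (1.0 gives the plain VAE objective).
def forward(self, x, beta=1.0):
    mu, logvar = self.encode(x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterize
    recon = self.decode(z)                                    # pixel probabilities in [0, 1]
    bce = F.binary_cross_entropy(recon, x, reduction="sum")   # reconstruction
    kl = -0.5 * torch.sum(1 + logvar - mu**2 - logvar.exp())  # KL to N(0, I)
    return bce + beta * kl  # negative ELBO when beta = 1; this is what gets minimized
```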
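To make the gradient-flow claim concrete, the following minimal sketch checks that gradients reach both μ and log σ² once the noise is drawn outside the graph. The values of `mu` and `logvar` and the toy loss are arbitrary illustrations, not taken from the trained model.

```python
import torch

torch.manual_seed(0)

# Hypothetical encoder outputs for a single input, marked as trainable.
mu = torch.tensor([0.5, -1.0], requires_grad=True)
logvar = torch.tensor([0.0, -2.0], requires_grad=True)

# The noise is sampled independently of the parameters; z is then a
# deterministic, differentiable function of (mu, logvar) given eps.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * logvar) * eps

# Any differentiable loss on z now sends gradients back to mu and logvar.
loss = (z ** 2).sum()
loss.backward()

# Gradients worked out by hand with the chain rule:
#   d loss / d mu     = 2 z
#   d loss / d logvar = 2 z * (0.5 * sigma * eps)
sigma = torch.exp(0.5 * logvar.detach())
expected_mu_grad = 2 * z.detach()
expected_logvar_grad = 2 * z.detach() * 0.5 * sigma * eps
```

Both `mu.grad` and `logvar.grad` are populated after `backward()` and agree with the hand-derived expressions, which is what lets the encoder be trained by ordinary backpropagation.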
The loss: reconstruction vs. KL
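VAEs maximize a lower bound on the log-likelihood (the ELBO), which splits into a term that wants faithful reconstructions and a term that regularizes the latent space toward a standard-normal prior. Writing p_θ(x|z) for the decoder's likelihood:
log p_θ(x) ≥ 𝔼_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) ‖ p(z))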
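Between a diagonal Gaussian posterior and the standard-normal prior the KL has a closed form, which is why it is cheap to compute every step. Summing over latent dimensions j:
D_KL(q_φ(z|x) ‖ 𝒩(0, I)) = −½ Σ_j (1 + log σ_j² − μ_j² − σ_j²)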
That KL pressure is what pulls every digit's distribution toward the origin until the classes overlap into one continuous blob — the organized space you see in the explorer.
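The closed-form KL to the standard-normal prior can be sanity-checked against a sampling estimate. The sketch below uses only the standard library; the function names and the example values of `mu` and `logvar` are made up for illustration.

```python
import math
import random

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def kl_monte_carlo(mu, logvar, n_samples=100_000, seed=0):
    """Sampling estimate of the same KL: the average of log q(z) - log p(z) for z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        for m, lv in zip(mu, logvar):
            sigma = math.exp(0.5 * lv)
            z = rng.gauss(m, sigma)
            log_q = -0.5 * (math.log(2 * math.pi) + lv + ((z - m) / sigma) ** 2)
            log_p = -0.5 * (math.log(2 * math.pi) + z * z)
            total += log_q - log_p
    return total / n_samples

# Illustrative posterior parameters for a 2-D latent.
mu, logvar = [1.0, -0.5], [math.log(0.25), 0.0]
exact = kl_to_standard_normal(mu, logvar)     # one pass over the dimensions
estimate = kl_monte_carlo(mu, logvar)         # 100,000 samples for a noisy answer
```

The two numbers agree to within sampling noise, but the closed form needs a handful of arithmetic operations per dimension and has no variance, which is why training code uses it directly.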
Why the space is useful
- Continuity. Because the KL term penalizes isolated islands and pushes neighboring posteriors to overlap, nearby latents decode to similar digits, so the straight line between two encodings is a smooth morph (the Interpolate lens).
- Generation. The prior is p(z) = 𝒩(0, I), so you can sample z from it and decode brand-new digits the model never saw (the Generate lens).
- Structure. Directions in a well-trained latent space often line up with interpretable features — stroke width, slant, style.
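The interpolation and generation uses listed above can be sketched in a few lines. The `decoder` here is a randomly initialized stand-in with assumed layer sizes (2-D latent, 64 output pixels), so its outputs are noise rather than digits; with trained weights loaded, the same two functions would produce the morphs and samples described. The latent points `z_a` and `z_b` are likewise hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in decoder: 2-D latent -> 64 pixels (8x8). Randomly initialized here;
# a real run would load the trained weights instead.
decoder = nn.Sequential(
    nn.Linear(2, 32),
    nn.ReLU(),
    nn.Linear(32, 64),
    nn.Sigmoid(),  # pixel intensities in [0, 1]
)

def interpolate(z_a, z_b, steps=8):
    """Decode evenly spaced points on the straight line from z_a to z_b."""
    t = torch.linspace(0.0, 1.0, steps).unsqueeze(1)  # shape (steps, 1)
    z = (1 - t) * z_a + t * z_b                       # shape (steps, latent_dim)
    with torch.no_grad():
        return decoder(z).view(steps, 8, 8)

def generate(n=4, latent_dim=2):
    """Sample z from the prior N(0, I) and decode each sample."""
    z = torch.randn(n, latent_dim)
    with torch.no_grad():
        return decoder(z).view(n, 8, 8)

# Hypothetical latent means of two encoded digits.
z_a = torch.tensor([-1.5, 0.5])
z_b = torch.tensor([1.0, -1.0])
morph = interpolate(z_a, z_b)   # 8 frames, from z_a's decoding to z_b's
samples = generate()            # 4 images decoded from prior samples
```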
Practical notes
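β-VAE. Scaling the KL term by a factor β > 1 trades reconstruction sharpness for a more disentangled, regular space (β = 1 recovers the plain VAE objective):
ℒ_β = 𝔼_{q_φ(z|x)}[log p_θ(x|z)] − β · D_KL(q_φ(z|x) ‖ p(z))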
Posterior collapse. If the KL term dominates (or the decoder is too strong), the encoder gives up and returns the prior for every input — q_φ(z|x) ≈ p(z). KL warm-up and free-bits keep it at bay.
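Both remedies are small changes to how the KL term enters the loss. The sketch below shows one common form of each, assuming a linear warm-up schedule and a per-dimension free-bits floor; the function names, the 1,000-step ramp, and the 0.1 threshold are illustrative choices, not values from the article.

```python
def kl_warmup_beta(step, warmup_steps=1000, beta_max=1.0):
    """KL warm-up: ramp the KL weight linearly from 0 to beta_max over warmup_steps."""
    return beta_max * min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, free_bits=0.1):
    """Free bits: clamp each latent dimension's KL from below.

    A dimension whose KL is already under the floor contributes a constant,
    so the optimizer gains nothing by pushing it further toward the prior.
    """
    return sum(max(kl, free_bits) for kl in kl_per_dim)

# Early in training the KL term is nearly switched off...
beta_early = kl_warmup_beta(step=100)     # 0.1
# ...and it reaches full strength once the ramp is finished.
beta_late = kl_warmup_beta(step=5000)     # 1.0

# One nearly collapsed dimension (0.02 nats) and one active dimension (0.5 nats).
kl_term = free_bits_kl([0.02, 0.5])       # 0.1 + 0.5
```

In a training loop the weight from `kl_warmup_beta` would multiply the KL term of the loss at each step, giving the encoder time to find useful codes before the regularizer reaches full strength.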
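Blurry samples. A VAE decoder is trained with a per-pixel likelihood (Bernoulli, via the binary cross-entropy in the loss above; Gaussian in many other setups) that treats pixels as independent given z, so it averages over plausible outputs. A 2-D latent like the one above is also a brutal bottleneck. The softness in the generated digits is that effect, not a bug. VAE-GANs and richer likelihoods sharpen it.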
References
- Kingma & Welling. "Auto-Encoding Variational Bayes" (the original VAE)
- Higgins et al. "β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework"
- Rezende et al. "Stochastic Backpropagation and Approximate Inference in Deep Generative Models"
