VAE Latent Space: Understanding Variational Autoencoders

A variational autoencoder learns to compress data into a continuous, probabilistic latent space — one smooth enough that you can walk between points and decode plausible new samples the whole way. Unlike a plain autoencoder, a VAE encodes each input to a distribution, and a regularizer keeps that space organized.

Interactive VAE latent space

The explorer below is a real VAE — a 2-D-latent model trained in PyTorch on the 1,797 real sklearn 8×8 handwritten digits. Its decoder weights run live in your browser, so every glyph is the trained network turning a latent z back into pixels. Roam the manifold, interpolate between digits, or sample the prior.

Probabilistic encoding

Instead of a single point, the encoder maps each input to a Gaussian over the latent space:

q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x),\; \sigma_\phi^2(x)\big)

The mean μ_φ(x) is where the digit lands; the variance σ_φ²(x) is how much wiggle room it claims. In the explorer those are the dot and the dashed ring on the highlighted prototype.

The reparameterization trick

Sampling is random, and you cannot backpropagate through a random node. The fix is to move the randomness into an external variable so the gradient can flow through μ and σ:

z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

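import torch
import torch.nn.functional as F

def forward(self, x, beta=1.0):
    # beta: KL weight (1.0 gives the standard VAE objective)
    mu, logvar = self.encode(x)                   # parameters of q(z|x)
    std = torch.exp(0.5 * logvar)                 # sigma = exp(logvar / 2)
    z = mu + std * torch.randn_like(std)          # reparameterize: z = mu + sigma * eps
    recon = self.decode(z)                        # must lie in [0, 1] for BCE
    bce = F.binary_cross_entropy(recon, x, reduction="sum")    # reconstruction
    kl = -0.5 * torch.sum(1 + logvar - mu**2 - logvar.exp())   # KL to N(0, I)
    return bce + beta * kl                        # negative ELBO when beta = 1 (minimized)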
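
To see why moving the randomness into ε matters, here is a minimal sketch in plain Python (no PyTorch, so the chain rule is written out by hand). It estimates the gradient of E[z²] with respect to μ and σ by differentiating through z = μ + σ·ε. The test function f(z) = z², the function name, and the sample count are assumptions chosen only because the exact answer (2μ and 2σ) is easy to verify.

```python
import random

def reparam_grad_estimate(mu, sigma, n_samples=200_000, seed=0):
    """Estimate d/dmu and d/dsigma of E[z^2] with z = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)      # randomness lives outside mu and sigma
        z = mu + sigma * eps           # deterministic in (mu, sigma) once eps is drawn
        df_dz = 2.0 * z                # derivative of the test function f(z) = z^2
        grad_mu += df_dz * 1.0         # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps      # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = reparam_grad_estimate(mu=1.5, sigma=0.8)
# Analytic values: E[z^2] = mu^2 + sigma^2, so the gradients are 2*mu and 2*sigma.
print(g_mu, g_sigma)
```

An autograd framework does the same bookkeeping automatically: once z is written as a deterministic function of μ, σ, and an external ε, the gradient has an ordinary path to follow.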

The loss: reconstruction vs. KL

VAEs maximize a lower bound on the log-likelihood (the ELBO), which splits into a term that wants faithful reconstructions and a term that regularizes the latent space toward a standard-normal prior:

\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]}_{\text{reconstruction}} \;-\; \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{regularization}}

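Between a diagonal Gaussian and the standard-normal prior the KL has a closed form, summed over latent dimensions j, which is why it is cheap to compute every step:

D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, I)\big) = \tfrac{1}{2}\sum_j \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)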
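
The closed form is short enough to write out directly. The sketch below is plain Python and takes the log-variance as input, the same parameterization as the forward pass earlier; the function name is an assumption for illustration.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mu, logvar)
    )

print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0: posterior equals the prior
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5: mean shifted by 1 in one dimension
```

The KL is zero only when the posterior matches the prior exactly, and it grows as the mean drifts from the origin or the variance departs from 1.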

That KL pressure is what pulls every digit's distribution toward the origin until the classes overlap into one continuous blob — the organized space you see in the explorer.
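
The lower-bound claim can be checked numerically on a toy model where the true log-likelihood is known. The sketch below assumes a 1-D model that is not the digit VAE: z ~ N(0, 1) and x | z ~ N(z, 1), which gives x ~ N(0, 2) and a true posterior N(x/2, 1/2). Both terms of the ELBO are then available in closed form.

```python
import math

def log_marginal(x):
    """log p(x) for the toy model z ~ N(0, 1), x | z ~ N(z, 1), so x ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0

def elbo(x, m, s2):
    """ELBO for q(z | x) = N(m, s2) under the same toy model."""
    recon = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)  # E_q[log p(x|z)]
    kl = 0.5 * (m * m + s2 - math.log(s2) - 1.0)                        # KL(q || N(0, 1))
    return recon - kl

x = 1.0
print(log_marginal(x))           # exact log-likelihood
print(elbo(x, m=0.5, s2=0.5))    # q is the true posterior N(x/2, 1/2): the bound is tight
print(elbo(x, m=0.0, s2=1.0))    # q is the prior: strictly lower
```

The gap between the ELBO and log p(x) is the KL from q to the true posterior, so the bound is tight only when the encoder matches that posterior.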

Why the space is useful

Continuity. Because the KL term penalizes isolated islands, nearby latents decode to similar digits, so the straight line between two encodings is a smooth morph (the Interpolate lens).
Generation. The prior is p(z) = 𝒩(0, I), so you can sample z from it and decode brand-new digits the model never saw (the Generate lens).
Structure. Directions in a well-trained latent space often line up with interpretable features — stroke width, slant, style.
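
The Interpolate and Generate lenses both reduce to producing latents and handing them to the decoder. Here is a minimal sketch of the latent side only, in plain Python; the function names are assumptions, and the trained decoder is left out (a hypothetical decode(z) would turn each latent into an 8×8 image).

```python
import random

def interpolate(z_a, z_b, steps):
    """Return `steps` latents evenly spaced on the straight line from z_a to z_b."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path

def sample_prior(n, latent_dim=2, seed=None):
    """Draw n latents from the standard-normal prior N(0, I)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(n)]

path = interpolate([-1.0, 0.5], [2.0, -1.5], steps=4)   # morph between two encodings
samples = sample_prior(3, seed=0)                        # brand-new latents
# Each latent in `path` or `samples` would then go through the trained decoder.
```

Linear interpolation is the simplest choice and fits a 2-D latent well; other paths are sometimes used in higher-dimensional latent spaces.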

Practical notes

β-VAE. Scaling the KL term by a weight β > 1 trades reconstruction sharpness for a more disentangled, regular space (β = 1 recovers the standard VAE):

\mathcal{L}_\beta = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \beta \cdot D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)

Posterior collapse. If the KL term dominates (or the decoder is too strong), the encoder gives up and returns the prior for every input — q_φ(z|x) ≈ p(z). KL warm-up and free-bits keep it at bay.
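
Both remedies are small changes to how the KL term enters the loss. The sketch below shows one common form of each, as assumptions rather than the explorer's actual training recipe: a linear warm-up of the KL weight, and a free-bits floor applied per latent dimension. The function names, the ramp shape, and the 0.5-nat floor are illustrative choices.

```python
def kl_warmup_beta(step, warmup_steps, beta_max=1.0):
    """Linearly ramp the KL weight from 0 to beta_max over warmup_steps."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, free_bits=0.5):
    """Sum per-dimension KL values, charging each dimension at least `free_bits` nats."""
    return sum(max(kl, free_bits) for kl in kl_per_dim)

print(kl_warmup_beta(0, 1000))       # 0.0: start with reconstruction only
print(kl_warmup_beta(500, 1000))     # 0.5: halfway through the ramp
print(kl_warmup_beta(5000, 1000))    # 1.0: full KL weight after warm-up
print(free_bits_kl([0.1, 2.0]))      # 2.5: the 0.1 term is clamped up to 0.5
```

Warm-up lets the encoder learn to put information into z before the KL penalty arrives. With free bits, a dimension whose KL is already under the floor contributes a constant, so the loss no longer pushes it further toward the prior.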

Blurry samples. VAEs are usually trained with a simple per-pixel likelihood (Bernoulli in the code above, often Gaussian elsewhere), and the best prediction under such a likelihood is an average over plausible outputs. A 2-D latent like the one above is also a brutal bottleneck. The softness in the generated digits is that effect, not a bug. VAE-GANs and richer likelihoods sharpen it.
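
The averaging effect shows up even for a single pixel. In the sketch below, a toy setup rather than the explorer's model, one pixel is on in half of the plausible digits and off in the rest, and a grid search finds the single prediction that minimizes each loss. The helper names and grid size are assumptions.

```python
import math

def best_constant(targets, loss, grid_size=1001):
    """Grid-search the single prediction in (0, 1) that minimizes the total loss."""
    candidates = [(i + 0.5) / grid_size for i in range(grid_size)]
    return min(candidates, key=lambda p: sum(loss(p, t) for t in targets))

def squared_error(p, t):
    """Gaussian likelihood with fixed variance, up to constants."""
    return (p - t) ** 2

def bernoulli_nll(p, t):
    """Binary cross-entropy for one pixel."""
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

# One pixel that is on in half of the plausible digits and off in the rest.
targets = [0.0, 1.0, 0.0, 1.0]
print(best_constant(targets, squared_error))   # close to 0.5, a gray pixel
print(best_constant(targets, bernoulli_nll))   # also close to 0.5
```

Neither loss rewards committing to black or white, so the decoder settles on gray wherever the latent leaves the pixel ambiguous.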

References

Kingma & Welling. "Auto-Encoding Variational Bayes" (the original VAE)
Higgins et al. "β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework"
Rezende et al. "Stochastic Backpropagation and Approximate Inference in Deep Generative Models"
