VAE Latent Space: Understanding Variational Autoencoders

Summary
Explore VAE latent space in deep learning. Learn variational autoencoder encoding, decoding, interpolation, and the reparameterization trick.

A variational autoencoder learns to compress data into a continuous, probabilistic latent space — one smooth enough that you can walk between points and decode plausible new samples the whole way. Unlike a plain autoencoder, a VAE encodes each input to a distribution, and a regularizer keeps that space organized.

Interactive VAE latent space

The explorer below is a real VAE — a 2-D-latent model trained in PyTorch on the 1,797 real sklearn 8×8 handwritten digits. Its decoder weights run live in your browser, so every glyph is the trained network turning a latent z back into pixels. Roam the manifold, interpolate between digits, or sample the prior.

Probabilistic encoding

Instead of a single point, the encoder maps each input to a Gaussian over the latent space:

q_φ(z \mid x) = 𝒩\big(μ_φ(x),\; σ_φ^2(x)\big)

The mean μ_φ(x) is where the digit lands; the variance σ_φ^2(x) is how much wiggle room it claims. In the explorer those are the dot and the dashed ring on the highlighted prototype.
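As a rough illustration of what "encoding to a distribution" means in code, here is a minimal sketch of an encoder with two output heads. The class name `Encoder`, the layer sizes, and the random stand-in batch are assumptions made for this example; the explorer's actual architecture is not given in this article.

```python
import torch
from torch import nn


class Encoder(nn.Module):
    """Hypothetical encoder: 64 pixels in, a 2-D Gaussian (mu, logvar) out."""

    def __init__(self, in_dim=64, hidden=32, latent=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent)      # mean of q(z|x)
        self.logvar_head = nn.Linear(hidden, latent)  # log sigma^2 of q(z|x)

    def forward(self, x):
        h = self.body(x)
        return self.mu_head(h), self.logvar_head(h)


torch.manual_seed(0)
encoder = Encoder()
x = torch.rand(5, 64)            # stand-in batch of five flattened 8x8 images
mu, logvar = encoder(x)          # one mean and one log-variance per latent dim
sigma = torch.exp(0.5 * logvar)  # standard deviation, always positive
```

Predicting the log-variance rather than the variance is a common design choice: the network output can be any real number, and exponentiating it guarantees a positive σ.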

The reparameterization trick

Sampling is random, and you cannot backpropagate through a random node. The fix is to move the randomness into an external variable so the gradient can flow through μ and σ:

z = μ + σ ⊙ ε, \qquad ε ∼ 𝒩(0, I)

    import torch
    import torch.nn.functional as F

    def forward(self, x, beta=1.0):
        # beta weights the KL term; beta=1.0 is the standard VAE objective
        mu, logvar = self.encode(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterize
        recon = self.decode(z)
        bce = F.binary_cross_entropy(recon, x, reduction="sum")   # reconstruction
        kl = -0.5 * torch.sum(1 + logvar - mu**2 - logvar.exp())  # KL to N(0, I)
        return bce + beta * kl  # negative ELBO (beta-weighted), to be minimized
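To see that gradients really do reach μ and σ once the noise is external, here is a small standalone check. The specific tensor values are arbitrary choices for the example, not taken from the trained model.

```python
import torch

torch.manual_seed(0)
mu = torch.tensor([0.5, -1.0], requires_grad=True)
logvar = torch.tensor([0.0, -2.0], requires_grad=True)

eps = torch.randn_like(mu)       # noise drawn outside the differentiable path
sigma = torch.exp(0.5 * logvar)
z = mu + sigma * eps             # deterministic in mu and logvar once eps is fixed

z.sum().backward()
# Expected by hand: dz/dmu = 1 and dz/dlogvar = 0.5 * sigma * eps
print(mu.grad, logvar.grad)
```

Both gradients are populated, which would not be possible if z had been drawn directly from a sampler with no differentiable path back to the parameters.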

The loss: reconstruction vs. KL

VAEs maximize a lower bound on the log-likelihood (the ELBO), which splits into a term that wants faithful reconstructions and a term that regularizes the latent space toward a standard-normal prior:

ℒ = \underbrace{𝔼_{q_φ(z|x)}[\log p_θ(x|z)]}_{\text{reconstruction}} \;-\; \underbrace{D_{KL}\big(q_φ(z|x)\,\|\,p(z)\big)}_{\text{regularization}}

For a diagonal Gaussian posterior and a standard-normal prior, the KL has a closed form that sums over the latent dimensions j, which is why it is cheap to compute every step:

D_{KL}\big(𝒩(μ, σ^2)\,\|\,𝒩(0, I)\big) = \tfrac{1}{2}\sum_j \big(μ_j^2 + σ_j^2 - \log σ_j^2 - 1\big)
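A few hand-checkable values make the formula concrete. The sketch below uses only the standard library; the function names are made up for this example. It also shows that the log-variance form used in the `forward` method earlier is the same quantity.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m**2 + s**2 - math.log(s**2) - 1.0 for m, s in zip(mu, sigma))


def kl_from_logvar(mu, logvar):
    """The same quantity written in terms of log-variance."""
    return -0.5 * sum(1.0 + lv - m**2 - math.exp(lv) for m, lv in zip(mu, logvar))


kl_prior = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])    # posterior equals prior
kl_shifted = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])  # mean moved by 1 in one dim
kl_narrow = kl_to_standard_normal([0.0], [0.5])             # posterior narrower than prior
kl_narrow_lv = kl_from_logvar([0.0], [math.log(0.25)])      # same case via log-variance
```

The KL is zero only when the posterior matches the prior. Shifting one mean by 1 costs exactly 0.5 nats, and shrinking σ below 1 is penalized as well, which is the pressure that keeps encodings from collapsing to points.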

That KL pressure is what pulls every digit's distribution toward the origin until the classes overlap into one continuous blob — the organized space you see in the explorer.

Why the space is useful

  • Continuity. Because the KL term penalizes encodings that drift far from the prior, classes rarely end up as isolated islands. Nearby latents decode to similar digits, so the straight line between two encodings is a smooth morph (the Interpolate lens).
  • Generation. The prior is p(z) = 𝒩(0, I), so you can sample z from it and decode brand-new digits the model never saw (the Generate lens).
  • Structure. Directions in a well-trained latent space often line up with interpretable features — stroke width, slant, style.
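The first two properties can be sketched in a few lines. The decoder here is an untrained stand-in with random weights, and `z_a` and `z_b` are hypothetical encodings picked for the example, so the outputs only demonstrate shapes and mechanics, not real digits.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in decoder with random weights; a trained decoder would go here.
decoder = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 64), nn.Sigmoid())

z_a = torch.tensor([-1.5, 0.5])   # hypothetical encoding (mean) of one digit
z_b = torch.tensor([1.0, -1.0])   # hypothetical encoding (mean) of another

# Interpolate: evenly spaced points on the straight line from z_a to z_b.
t = torch.linspace(0.0, 1.0, steps=8).unsqueeze(1)  # shape (8, 1)
z_path = (1 - t) * z_a + t * z_b                     # shape (8, 2)

# Generate: draw fresh latents from the prior N(0, I).
z_new = torch.randn(4, 2)

with torch.no_grad():
    morph = decoder(z_path).reshape(-1, 8, 8)    # 8 frames of 8x8 pixels
    samples = decoder(z_new).reshape(-1, 8, 8)   # 4 sampled 8x8 images
```

With a trained decoder, `morph` would be the frames of the smooth transition between the two digits and `samples` would be new digits drawn from the prior.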

Practical notes

β-VAE. Scaling the KL term by a weight β > 1 trades reconstruction sharpness for a more disentangled, regular space; β = 1 recovers the standard VAE objective:

ℒ_β = 𝔼_{q_φ(z|x)}[\log p_θ(x|z)] - β · D_{KL}\big(q_φ(z|x)\,\|\,p(z)\big)

Posterior collapse. If the KL term dominates (or the decoder is too strong), the encoder gives up and returns the prior for every input — q_φ(z|x) ≈ p(z). KL warm-up and free-bits keep it at bay.
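Both remedies are simple to express. The following is a minimal sketch under assumed settings: a linear ramp for the warm-up, and a per-dimension floor for free bits. The function names, the 1000-step ramp, and the 0.1-nat floor are illustrative choices, not values from the explorer's training run.

```python
def kl_weight(step, warmup_steps=1000, beta_max=1.0):
    """Linear KL warm-up: ramp the KL weight from 0 up to beta_max."""
    return beta_max * min(1.0, step / warmup_steps)


def free_bits_kl(kl_per_dim, floor=0.1):
    """Free bits: charge each latent dimension at least `floor` nats.

    Below the floor the penalty is constant, so the optimizer gains
    nothing by squeezing that dimension's KL further toward zero.
    """
    return sum(max(kl, floor) for kl in kl_per_dim)


weights = [kl_weight(s) for s in (0, 500, 1000, 5000)]  # ramp, then hold at beta_max
penalty = free_bits_kl([0.02, 0.5])                     # first dim is clamped to the floor
```

Warm-up gives the encoder time to learn useful latents before the KL pressure arrives in full; free bits leaves each dimension a small budget of information that is never penalized.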

Blurry samples. VAEs usually optimize a factorized per-pixel likelihood (Gaussian, or Bernoulli as in the binary cross-entropy loss above), which averages over plausible outputs, and a 2-D latent like the one above is a brutal bottleneck. The softness in the generated digits is that effect, not a bug. VAE-GANs and richer likelihoods sharpen it.

References

  • Kingma & Welling. "Auto-Encoding Variational Bayes" (the original VAE)
  • Higgins et al. "β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework"
  • Rezende et al. "Stochastic Backpropagation and Approximate Inference in Deep Generative Models"
