High Bandwidth Memory (HBM)

Summary
How HBM works: 3D-stacked DRAM, TSVs, and silicon interposers explained with interactive visualizations — from the memory wall to HBM4 and the roofline model.

High Bandwidth Memory is DRAM turned on its side: instead of spreading memory chips across a circuit board and talking to them through a narrow bus, HBM stacks dies vertically and wires them straight down into the package next to the processor. If conventional DDR is a long highway to a distant warehouse, HBM builds the warehouse next door to the factory — and connects it with a road a thousand lanes wide.

That one packaging decision is why a modern AI accelerator moves terabytes per second while your desktop moves tens of gigabytes per second. This page covers why HBM exists, how it's physically built, what it means for performance, and what it costs.

The Memory Wall

Processors have outrun memory for forty years, but the gap turned existential with deep learning. Between the P100 (2016) and the B200 (2024), NVIDIA's dense FP16 throughput grew roughly 106×. Over the same eight years, the memory bandwidth feeding those FLOPS grew about 11× — and that's with HBM doing the heavy lifting. Without it, the gap would be a chasm.

A processor that can't be fed doesn't compute; it waits. There are only three levers for moving more bytes per second, and each comes with its own bill:

| Approach | Example | What it buys | What it costs |
| --- | --- | --- | --- |
| More channels | 12-channel DDR5 servers | ~10× one channel | Board area, pin count, controller complexity |
| Faster pins | GDDR7 at 32 Gb/s per pin | High per-pin rates | Signal integrity, power per bit |
| Wider bus, on-package | HBM | 1024–2048 bits wide | Advanced packaging cost and complexity |

HBM is the third lever pushed to its logical conclusion: if driving signals fast across a board is expensive, move the memory into the package and make the bus absurdly wide instead. A thousand slow, short, efficient wires beat sixty-four fast, long, power-hungry ones.

What HBM Looks Like

An HBM device is a small tower: four to sixteen DRAM dies stacked on top of a base logic die, sitting beside the GPU on a slice of silicon called an interposer.

Three structural pieces make the tower work:

  1. The DRAM dies store the bits — ordinary DRAM cells, reorganized into many independent channels (16 in HBM3, 32 in HBM4).
  2. The base die at the bottom of the stack handles I/O, testing, and routing between the stack and the outside world.
  3. The interposer is a passive silicon layer with wiring far finer than any circuit board — it's what makes a 1024-bit bus routable at all. The GPU and the HBM stacks all sit on it, millimeters apart.

How It's Built: TSVs and the Interposer

Stacking dies is easy; wiring them vertically is not. The enabling technology is the Through-Silicon Via (TSV): a copper column drilled straight through a thinned die, letting signals pass vertically through the silicon itself instead of detouring out to wire bonds.

The numbers that matter:

| Parameter | Typical value | Why it matters |
| --- | --- | --- |
| TSV diameter | 5–10 µm | Thousands fit per die without eating storage area |
| Die thickness | ~50 µm after thinning | Short vias = low resistance and latency |
| TSVs per HBM stack | Several thousand | Data, command, power, all vertical |
| Interface width | 1024-bit (2048 in HBM4) | The whole point: bandwidth via width, not clock |
| Micro-bump pitch | ~25 µm | Die-to-die connection density |

The manufacturing sequence — etch, insulate, fill with copper, polish, thin the wafer, bump, bond — has to succeed across every die in the stack. One bad TSV in one die can scrap the whole tower, which is a large part of why HBM costs what it does.
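The "one bad die scraps the tower" effect compounds quickly, and a few lines of arithmetic make that visible. What follows is a minimal sketch, assuming every die in the stack must be good and that dies fail independently with the same probability. The function name `stack_yield` and the 99% per-die figure in the note below are illustrative assumptions, not vendor data, and real parts use TSV redundancy and repair that this model ignores. It is plain host-side C++, so it builds with nvcc or any C++ compiler.

```cpp
#include <cassert>
#include <cmath>

// Probability that a whole stack is good, assuming every die must be good
// and that die failures are independent with the same per-die yield.
// Both assumptions are simplifications made for illustration only.
double stack_yield(double per_die_yield, int dies_in_stack) {
    assert(per_die_yield >= 0.0 && per_die_yield <= 1.0);
    assert(dies_in_stack > 0);
    return std::pow(per_die_yield, dies_in_stack);
}
```

With a hypothetical 99% per-die yield, a 9-die stack (eight DRAM dies plus the base die) comes out near 91%, and a 17-die stack near 84%. Under these assumptions, taller stacks pay for every extra layer in scrapped towers.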

HBM Generations

Each generation has roughly doubled per-stack bandwidth using one of two strategies: push the pins faster, or (in HBM4's case) double the interface width.

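| Generation | First products | Pin speed | Interface | Bandwidth / stack | Max capacity / stack |
| --- | --- | --- | --- | --- | --- |
| HBM | 2015 | 1.0 Gb/s | 1024-bit | 128 GB/s | 4 GB |
| HBM2 | 2016 | 2.0 Gb/s | 1024-bit | 256 GB/s | 8 GB |
| HBM2E | 2020 | 3.6 Gb/s | 1024-bit | 461 GB/s | 16 GB |
| HBM3 | 2022 | 6.4 Gb/s | 1024-bit | 819 GB/s | 24 GB |
| HBM3E | 2024 | 9.6 Gb/s | 1024-bit | ~1.2 TB/s | 36 GB |
| HBM4 | 2026 | 8+ Gb/s | 2048-bit | 2+ TB/s | 64 GB |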

Per-stack figures are JEDEC peak rates; shipping products configure pin speeds and stack counts to their own thermal and cost targets, so device bandwidth comes from the product datasheet, not this table. An H100 SXM carries five stacks of HBM3 for 3.35 TB/s; a B200 carries eight stacks of HBM3E for 8 TB/s — both running their stacks below the JEDEC peak. HBM4 is the exception so far: parts for 2026 accelerators are binned well past the 8 Gb/s base rate.
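The per-stack figures follow from one multiplication: pin speed times interface width, divided by eight to convert bits to bytes. The sketch below shows that formula and its inverse, which recovers the effective pin rate a shipping device is running at. The function names are hypothetical, and the inverse assumes the datasheet bandwidth is split evenly across identical 1024-bit stacks. It is plain host-side C++ and builds with nvcc or any C++ compiler.

```cpp
#include <cassert>

// Peak per-stack bandwidth in GB/s: each data pin moves `pin_rate_gbps`
// gigabits per second across `bus_width_bits` pins; divide by 8 for bytes.
double peak_stack_bandwidth_gbs(double pin_rate_gbps, int bus_width_bits) {
    assert(pin_rate_gbps > 0.0 && bus_width_bits > 0);
    return pin_rate_gbps * bus_width_bits / 8.0;
}

// Effective per-pin rate (Gb/s) implied by a device's total bandwidth,
// assuming that bandwidth is shared evenly by identical stacks.
double implied_pin_rate_gbps(double device_bandwidth_gbs, int stacks,
                             int bus_width_bits) {
    assert(device_bandwidth_gbs > 0.0 && stacks > 0 && bus_width_bits > 0);
    return device_bandwidth_gbs / stacks * 8.0 / bus_width_bits;
}
```

Feeding in 6.4 Gb/s and 1024 bits gives 819.2 GB/s, the HBM3 figure. Going the other way, 3350 GB/s over five stacks implies about 5.2 Gb/s per pin for the H100, and 8000 GB/s over eight stacks implies about 7.8 Gb/s for the B200, both under their generation's peak pin speed.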

HBM in the GPU Memory Hierarchy

For all its bandwidth, HBM is still the bottom of the on-package hierarchy: the slowest, largest tier before data leaves the device entirely. Everything above it (registers, shared memory and L1, and the L2 cache) exists to avoid touching it.

This framing matters for programmers: HBM bandwidth is a budget, and the hierarchy is how you stay under it. Every byte loaded from HBM should be reused as many times as possible from registers and shared memory before the kernel reaches for the next one.

Memory-Bound or Compute-Bound? The Roofline

Whether HBM bandwidth limits your workload comes down to one number: arithmetic intensity — how many floating-point operations you perform per byte fetched from memory. The roofline model turns it into a picture: performance rises linearly with intensity (the slanted "memory roof") until it hits peak compute (the flat roof). The corner where they meet is the ridge point.
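The roofline reduces to a `min` of two terms, which can be sketched as below. The function names are hypothetical, the units (GFLOP/s, GB/s, FLOP per byte) are a choice made for this example, and the device numbers in the note are made up to keep the arithmetic round. It is plain host-side C++ and builds with nvcc or any C++ compiler.

```cpp
#include <algorithm>
#include <cassert>

// Roofline model: attainable throughput is capped by whichever roof is
// lower, the memory roof (bandwidth * intensity) or the flat compute roof.
// Units: peak in GFLOP/s, bandwidth in GB/s, intensity in FLOP per byte.
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double intensity_flop_per_byte) {
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0);
    assert(intensity_flop_per_byte >= 0.0);
    return std::min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte);
}

// Ridge point: the intensity at which the two roofs meet.
double ridge_point(double peak_gflops, double bandwidth_gbs) {
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0);
    return peak_gflops / bandwidth_gbs;
}
```

For a hypothetical device with 1000 GFLOP/s of compute and 100 GB/s of bandwidth, the ridge sits at 10 FLOP/B. A kernel at 2 FLOP/B is capped at 200 GFLOP/s by memory, while one at 50 FLOP/B reaches the full 1000 GFLOP/s.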

A large GEMM lives far right of the ridge, comfortably compute-bound; this is what tensor cores were built for. LLM decoding at batch size 1 sits at roughly 1 FLOP per byte: each generated token reads every weight once and does almost nothing with it (one multiply-add, so two FLOPs, per two-byte FP16 weight). On an H100, whose ridge point is near 300 FLOP/B, that workload reaches well under 1% of peak compute. Token generation speed is HBM bandwidth, full stop, which is why memory, not FLOPS, is the headline spec of every inference accelerator.
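The decoding numbers can be checked with two short functions. This is a rough sketch under stated assumptions: every generated token streams all weights from HBM exactly once, and KV-cache and activation traffic are ignored. The function names and the 7-billion-parameter FP16 model used in the note are hypothetical; the 3.35 TB/s bandwidth and the ridge point near 300 FLOP/B are the H100 figures quoted in this article. It is plain host-side C++ and builds with nvcc or any C++ compiler.

```cpp
#include <algorithm>
#include <cassert>

// Fraction of peak compute reachable at a given arithmetic intensity:
// below the ridge point it scales linearly, above it saturates at 1.0.
double fraction_of_peak(double intensity_flop_per_byte,
                        double ridge_flop_per_byte) {
    assert(intensity_flop_per_byte >= 0.0 && ridge_flop_per_byte > 0.0);
    return std::min(1.0, intensity_flop_per_byte / ridge_flop_per_byte);
}

// Upper bound on batch-1 decode speed, assuming each token reads every
// weight from HBM once and nothing else (a deliberate simplification).
double max_tokens_per_second(double bandwidth_gbs, double params_billions,
                             double bytes_per_param) {
    assert(bandwidth_gbs > 0.0 && params_billions > 0.0);
    assert(bytes_per_param > 0.0);
    double weight_gigabytes = params_billions * bytes_per_param;
    return bandwidth_gbs / weight_gigabytes;
}
```

At 1 FLOP/B against a ridge of 300 FLOP/B, the fraction of peak is about 0.33%. For the hypothetical 7-billion-parameter model at two bytes per weight (14 GB), 3350 GB/s caps decoding at roughly 239 tokens per second regardless of how many FLOPS the chip has.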

Programming for HBM

You don't address HBM differently from any other GPU memory — cudaMalloc hands it to you. Using its bandwidth well comes down to two habits:

  • Coalesce accesses. Threads in a warp should touch consecutive addresses, so the hardware merges them into full-width HBM transactions instead of scattered partial ones.
  • Tile and reuse. Stage blocks of data in shared memory and reuse each loaded value many times, raising your kernel's arithmetic intensity toward the ridge point.
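The cost of uncoalesced access can be estimated without a GPU by counting how many memory segments a single warp touches. The host-side model below is a simplified sketch: the 32-thread warp matches NVIDIA hardware, but the 32-byte segment size is an assumption for illustration (the real granularity depends on the architecture), the base address is assumed to be aligned, and the function names are hypothetical. It is plain C++ and builds with nvcc or any C++ compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

constexpr int WARP_SIZE = 32;
constexpr std::size_t ELEMENT_BYTES = 4;   // one float per thread
constexpr std::size_t SEGMENT_BYTES = 32;  // assumed transaction granularity

// Count the distinct segments one warp touches when lane i reads the
// element at index i * stride_elements, starting from an aligned base.
int segments_touched(std::size_t stride_elements) {
    std::set<std::size_t> segments;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        std::size_t byte_address =
            static_cast<std::size_t>(lane) * stride_elements * ELEMENT_BYTES;
        segments.insert(byte_address / SEGMENT_BYTES);
    }
    return static_cast<int>(segments.size());
}

// Useful bytes divided by bytes actually moved for one warp-wide load.
double load_efficiency(std::size_t stride_elements) {
    assert(stride_elements >= 1);
    double useful = static_cast<double>(WARP_SIZE * ELEMENT_BYTES);
    double moved = static_cast<double>(
        segments_touched(stride_elements) * SEGMENT_BYTES);
    return useful / moved;
}
```

In this model a stride of 1 touches 4 segments and every fetched byte is used. A stride of 32 touches 32 segments to deliver the same 128 useful bytes, so only one byte in eight that crosses the memory interface is wanted.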

The classic tiled matrix multiply shows both — coalesced collaborative loads, then heavy reuse from shared memory:

    __global__ void tiled_gemm(
        const float* __restrict__ A,
        const float* __restrict__ B,
        float* __restrict__ C,
        int M, int N, int K
    ) {
        // Expects a launch with blockDim = (TILE, TILE).
        const int TILE = 32;
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < K; t += TILE) {
            // Coalesced loads: adjacent threads read adjacent addresses,
            // so each warp triggers full-width HBM transactions.
            // Out-of-range elements are padded with zeros.
            As[threadIdx.y][threadIdx.x] = (row < M && t + threadIdx.x < K)
                ? A[row * K + t + threadIdx.x] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (col < N && t + threadIdx.y < K)
                ? B[(t + threadIdx.y) * N + col] : 0.0f;
            __syncthreads();

            // Each value loaded from HBM is reused TILE times from shared
            // memory; this reuse is what moves the kernel up the roofline.
            #pragma unroll
            for (int k = 0; k < TILE; k++) {
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            }
            __syncthreads();
        }

        if (row < M && col < N) C[row * N + col] = sum;
    }

In practice you'd reach for cuBLAS or CUTLASS rather than hand-rolling this — but every fast GEMM library is doing exactly this dance, several levels deeper.
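How much the tiling actually helps can be worked out by counting FLOPs and bytes for one tile step of the kernel above. The sketch below does that count; it considers only the global-memory loads of A and B, ignores the final store of C and any cache effects, and the function name is hypothetical. It is plain host-side C++ and builds with nvcc or any C++ compiler.

```cpp
#include <cassert>

// Arithmetic intensity (FLOP per byte read from global memory) of one
// tile step in a TILE x TILE shared-memory GEMM, counting only the loads
// of A and B.
double tiled_gemm_intensity(int tile, int bytes_per_element) {
    assert(tile > 0 && bytes_per_element > 0);
    // tile*tile threads each do `tile` multiply-adds, i.e. 2*tile FLOPs.
    double flops = 2.0 * tile * tile * tile;
    // The block loads one tile of A and one tile of B per step.
    double bytes = 2.0 * tile * tile * bytes_per_element;
    return flops / bytes;  // simplifies to tile / bytes_per_element
}
```

For 32×32 tiles of 4-byte floats this gives 8 FLOP/B. An untiled kernel that reads two floats from memory for every multiply-add would sit at 0.25 FLOP/B, so the tile buys a factor of 32, matching the "reused TILE times" comment in the kernel. That is still a long way from a ridge point in the hundreds, which is one reason production libraries add further levels of tiling in registers.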

The Price of Bandwidth: Thermal and Cost

HBM's density comes with two bills.

Heat. A stack of eight to sixteen DRAM dies shares one escape path for heat — up through the top of the stack into the same cooler as a 700 W+ GPU sitting millimeters away. DRAM also dislikes heat in a particular way: leakage increases with temperature, forcing more frequent refresh, which eats the very bandwidth you paid for. Stacks throttle in the ~95 °C range, and thermal design — not signaling — is a primary constraint on how tall future stacks can grow.

Money. Stacked dies, thousands of TSVs, interposers, and the yield risk of scrapping a whole tower over one bad die make HBM by far the most expensive mainstream memory per gigabyte:

| Memory | Relative cost / GB | Bandwidth per unit | Typical home |
| --- | --- | --- | --- |
| DDR5 | ~1× | 38–67 GB/s per channel | CPUs, servers |
| GDDR7 | ~3–4× | ~112–128 GB/s per chip | Gaming and graphics cards |
| HBM3E | ~15–25× | ~1.2 TB/s per stack | AI accelerators, HPC |

Costs are approximate, relative, and volatile — HBM pricing moves with AI demand cycles. The economics only close when bandwidth is the product, which is exactly the situation for training and serving large models.

HBM4 and What's Next

As of mid-2026 the picture is unusually concrete:

  • Shipped: The JEDEC HBM4 standard (JESD270-4, April 2025) doubles the interface to 2048 bits, restoring the original HBM move — width over clock speed. First HBM4-equipped accelerators (NVIDIA's Vera Rubin generation) are in production with all three memory vendors supplying stacks, binned above the 8 Gb/s base rate.
  • Announced: 16-high stacks at 64 GB each, and HBM4E pushing per-stack bandwidth further through 2027.
  • Research: Processing-in-memory (PIM) — putting compute inside the stack so some operations never cross the interposer at all. Samsung and SK hynix have demonstrated PIM variants for years; whether it escapes the lab at scale remains the open question.

The trajectory is consistent: every generation, the answer to the memory wall is more parallelism in packaging — wider buses, taller stacks, shorter distances.
