High Bandwidth Memory is DRAM turned on its side: instead of spreading memory chips across a circuit board and talking to them through a narrow bus, HBM stacks dies vertically and wires them straight down into the package next to the processor. If conventional DDR is a long highway to a distant warehouse, HBM builds the warehouse next door to the factory — and connects it with a road a thousand lanes wide.
That one packaging decision is why a modern AI accelerator moves terabytes per second while your desktop moves tens of gigabytes. This page covers why HBM exists, how it's physically built, what it means for performance, and what it costs.
The Memory Wall
Processors have outrun memory for forty years, but the gap turned existential with deep learning. Between the P100 (2016) and the B200 (2024), NVIDIA's dense FP16 throughput grew roughly 106×. Over the same eight years, the memory bandwidth feeding those FLOPS grew about 11× — and that's with HBM doing the heavy lifting. Without it, the gap would be a chasm.
A processor that can't be fed doesn't compute; it waits. There are only three levers for moving more bytes per second, and each was pushed to its limit before HBM:
| Approach | Example | What it buys | What it costs |
|---|---|---|---|
| More channels | 12-channel DDR5 servers | Up to 12× one channel | Board area, pin count, controller complexity |
| Faster pins | GDDR7 at 32 Gb/s per pin | High per-pin rates | Signal integrity, power per bit |
| Wider bus, on-package | HBM | 1024–2048 bits wide | Advanced packaging cost and complexity |
HBM is the third lever pushed to its logical conclusion: if driving signals fast across a board is expensive, move the memory into the package and make the bus absurdly wide instead. A thousand slow, short, efficient wires beat sixty-four fast, long, power-hungry ones.
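The three levers reduce to one multiplication: bus width times per-pin rate. The host-side C++ sketch below makes that explicit. The function names and the three operating points (a 64-bit DDR5-6400 channel, a 32-bit GDDR7 chip, a 1024-bit HBM3 stack) are illustrative assumptions, not datasheet values.

```cpp
#include <cassert>

// Peak bandwidth in GB/s for a bus of `bus_bits` data wires, each moving
// `gbit_per_pin` gigabits per second. Dividing by 8 converts bits to bytes.
double peak_gb_per_s(int bus_bits, double gbit_per_pin) {
    assert(bus_bits > 0 && gbit_per_pin > 0.0);
    return bus_bits * gbit_per_pin / 8.0;
}

// Illustrative operating points (assumed for this sketch):
double ddr5_channel() { return peak_gb_per_s(64, 6.4); }    // 51.2 GB/s
double gddr7_chip()   { return peak_gb_per_s(32, 32.0); }   // 128 GB/s
double hbm3_stack()   { return peak_gb_per_s(1024, 6.4); }  // 819.2 GB/s
```

The HBM3 stack runs its pins at the same rate as the DDR5 channel in this example and still delivers sixteen times the bandwidth, purely from width.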
What HBM Looks Like
An HBM device is a small tower: four to sixteen DRAM dies stacked on top of a base logic die, sitting beside the GPU on a slice of silicon called an interposer.
Three structural pieces make the tower work:
- The DRAM dies store the bits — ordinary DRAM cells, reorganized into many independent channels (16 in HBM3, 32 in HBM4).
- The base die at the bottom of the stack handles I/O, testing, and routing between the stack and the outside world.
- The interposer is a passive silicon layer with wiring far finer than any circuit board — it's what makes a 1024-bit bus routable at all. The GPU and the HBM stacks all sit on it, millimeters apart.
How It's Built: TSVs and the Interposer
Stacking dies is easy; wiring them vertically is not. The enabling technology is the Through-Silicon Via (TSV): a copper column running straight through a thinned die, letting signals pass vertically through the silicon itself instead of detouring out to wire bonds.
The numbers that matter:
| Parameter | Typical value | Why it matters |
|---|---|---|
| TSV diameter | 5–10 µm | Thousands fit per die without eating storage area |
| Die thickness | ~50 µm after thinning | Short vias = low resistance and latency |
| TSVs per HBM stack | Several thousand | Data, command, power, all vertical |
| Interface width | 1024-bit (2048 in HBM4) | The whole point — bandwidth via width, not clock |
| Micro-bump pitch | ~25 µm | Die-to-die connection density |
The manufacturing sequence — etch, insulate, fill with copper, polish, thin the wafer, bump, bond — has to succeed across every die in the stack. One bad TSV in one die can scrap the whole tower, which is a large part of why HBM costs what it does.
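Why one bad layer is so costly can be sketched with compound yield. The model below assumes every layer (die plus its bonding step) succeeds independently with the same probability. That is a simplification, and the 99% figure used in the note that follows is hypothetical, not a vendor number.

```cpp
#include <cassert>
#include <cmath>

// Probability that every layer of a stack is good, assuming each layer
// succeeds independently with probability `per_layer_yield`.
double stack_yield(double per_layer_yield, int layers) {
    assert(per_layer_yield >= 0.0 && per_layer_yield <= 1.0 && layers > 0);
    return std::pow(per_layer_yield, layers);
}
```

With a hypothetical 99% per-layer yield, stack_yield(0.99, 9) (eight DRAM dies plus the base die) comes to about 0.91, and stack_yield(0.99, 17) to about 0.84. Taller stacks lose more finished towers even when every individual step looks reliable.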
HBM Generations
Each generation has raised per-stack bandwidth by roughly 1.5–2×, using one of two strategies: push the pins faster (every generation through HBM3E) or, in HBM4's case, double the interface width.
| Generation | First products | Pin speed | Interface | Bandwidth / stack | Max capacity / stack |
|---|---|---|---|---|---|
| HBM | 2015 | 1.0 Gb/s | 1024-bit | 128 GB/s | 4 GB |
| HBM2 | 2016 | 2.0 Gb/s | 1024-bit | 256 GB/s | 8 GB |
| HBM2E | 2020 | 3.6 Gb/s | 1024-bit | 461 GB/s | 16 GB |
| HBM3 | 2022 | 6.4 Gb/s | 1024-bit | 819 GB/s | 24 GB |
| HBM3E | 2024 | 9.6 Gb/s | 1024-bit | ~1.2 TB/s | 36 GB |
| HBM4 | 2026 | 8+ Gb/s | 2048-bit | 2+ TB/s | 64 GB |
Per-stack figures are JEDEC peak rates; shipping products configure pin speeds and stack counts to their own thermal and cost targets, so device bandwidth comes from the product datasheet, not this table. An H100 SXM carries five stacks of HBM3 for 3.35 TB/s; a B200 carries eight stacks of HBM3E for 8 TB/s — both running their stacks below the JEDEC peak. HBM4 is the exception so far: parts for 2026 accelerators are binned well past the 8 Gb/s base rate.
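The gap between the table and shipping products can be checked by working backwards from a device's total bandwidth to the per-pin rate it implies. This is a host-side C++ sketch. The function name is made up for illustration, and it assumes decimal units (1 TB = 1000 GB) and that all stacks run at the same rate.

```cpp
#include <cassert>

// Average per-pin data rate (Gb/s) implied by a device's total bandwidth.
//   device_tb_per_s : datasheet bandwidth in TB/s
//   stacks          : number of active HBM stacks
//   bus_bits        : interface width of each stack
double implied_pin_gbps(double device_tb_per_s, int stacks, int bus_bits) {
    assert(device_tb_per_s > 0.0 && stacks > 0 && bus_bits > 0);
    double gb_per_s_per_stack = device_tb_per_s * 1000.0 / stacks;
    return gb_per_s_per_stack * 8.0 / bus_bits;  // bytes back to bits, per wire
}
```

For the H100 SXM figures above, implied_pin_gbps(3.35, 5, 1024) is about 5.2 Gb/s against HBM3's 6.4 Gb/s peak. For the B200, implied_pin_gbps(8.0, 8, 1024) is about 7.8 Gb/s against HBM3E's 9.6 Gb/s.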
HBM in the GPU Memory Hierarchy
For all its bandwidth, HBM is still the bottom of the on-package hierarchy: the slowest, largest tier before data leaves the device entirely. Everything above it (registers, shared memory and L1, the L2 cache) exists to avoid touching it.
This framing matters for programmers: HBM bandwidth is a budget, and the hierarchy is how you stay under it. Every byte loaded from HBM should be reused as many times as possible from registers and shared memory before the kernel reaches for the next one.
Memory-Bound or Compute-Bound? The Roofline
Whether HBM bandwidth limits your workload comes down to one number: arithmetic intensity — how many floating-point operations you perform per byte fetched from memory. The roofline model turns it into a picture: performance rises linearly with intensity (the slanted "memory roof") until it hits peak compute (the flat roof). The corner where they meet is the ridge point.
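The roofline is a one-line formula: attainable performance is the smaller of peak compute and bandwidth times intensity. Here is a minimal host-side C++ sketch. The struct and method names are illustrative, and the H100 numbers in the note that follows (about 989 TFLOP/s dense FP16 tensor throughput, 3.35 TB/s) are assumed round figures rather than measured values.

```cpp
#include <algorithm>
#include <cassert>

struct Roofline {
    double peak_tflops;   // flat roof: peak compute in TFLOP/s
    double mem_tb_per_s;  // slope of the memory roof: bandwidth in TB/s

    // Arithmetic intensity (FLOP/byte) at which the two roofs meet.
    double ridge_point() const { return peak_tflops / mem_tb_per_s; }

    // Attainable TFLOP/s for a kernel with the given FLOP-per-byte intensity.
    double attainable_tflops(double flop_per_byte) const {
        assert(flop_per_byte >= 0.0);
        return std::min(peak_tflops, mem_tb_per_s * flop_per_byte);
    }
};
```

With Roofline h100{989.0, 3.35}, the ridge point comes out near 295 FLOP/B. A kernel at 1 FLOP/B is capped at 3.35 TFLOP/s, roughly 0.34% of peak, while anything past the ridge is capped at 989 TFLOP/s.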
Two workloads mark the extremes. A large GEMM lives far to the right of the ridge, comfortably compute-bound; this is what tensor cores were built for. LLM decoding at batch size 1 sits at roughly 1 FLOP per byte: each generated token reads every weight once and does almost nothing with it. On an H100, whose ridge point is near 300 FLOP/B, that workload reaches well under 1% of peak compute. At batch size 1, token generation speed is set by HBM bandwidth, which is why memory, not FLOPS, is the headline spec of inference accelerators.
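The bandwidth limit on decoding can be turned into a rough ceiling on tokens per second. The host-side C++ sketch below assumes every weight is read from HBM exactly once per token and ignores KV-cache reads, activations, and kernel overheads, so real throughput will be lower. The function name and the 7-billion-parameter model in the note are hypothetical.

```cpp
#include <cassert>

// Upper bound on batch-1 decode speed when all weights stream from HBM once
// per generated token.
//   params_billions : model size in billions of parameters
//   bytes_per_param : 2.0 for FP16/BF16, 1.0 for 8-bit weights
//   mem_tb_per_s    : device memory bandwidth in TB/s (1 TB = 1000 GB)
double max_tokens_per_s(double params_billions, double bytes_per_param,
                        double mem_tb_per_s) {
    assert(params_billions > 0.0 && bytes_per_param > 0.0 && mem_tb_per_s > 0.0);
    double weight_gb = params_billions * bytes_per_param;  // 1e9 params * bytes = GB
    return mem_tb_per_s * 1000.0 / weight_gb;
}
```

A hypothetical 7B-parameter model in FP16 on a 3.35 TB/s device gives max_tokens_per_s(7.0, 2.0, 3.35), about 239 tokens per second. Halving the bytes per weight doubles the ceiling, and adding FLOPS does not move it at all.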
Programming for HBM
You don't address HBM differently from any other GPU memory — cudaMalloc hands it to you. Using its bandwidth well comes down to two habits:
- Coalesce accesses. Threads in a warp should touch consecutive addresses, so the hardware merges them into full-width HBM transactions instead of scattered partial ones.
- Tile and reuse. Stage blocks of data in shared memory and reuse each loaded value many times, raising your kernel's arithmetic intensity toward the ridge point.
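The cost of uncoalesced access can be estimated on the host by counting how many memory segments one warp touches. This C++ sketch is a simplified model: it assumes a 32-thread warp, a base address aligned to the segment size, and an element size that divides the segment size. The 32-byte segment used in the note is an assumed transaction granularity, so check your architecture's documentation for the real one.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Number of distinct `segment_bytes`-sized segments touched when the 32
// threads of a warp each load one element, with lane i reading element
// index i * stride_elems from an aligned base address.
int segments_touched(std::size_t elem_bytes, std::size_t stride_elems,
                     std::size_t segment_bytes) {
    assert(elem_bytes > 0 && segment_bytes >= elem_bytes);
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < 32; ++lane) {
        std::size_t byte_addr = lane * stride_elems * elem_bytes;
        segments.insert(byte_addr / segment_bytes);
    }
    return static_cast<int>(segments.size());
}
```

For 4-byte floats and 32-byte segments, a stride of 1 touches 4 segments (128 bytes fetched, 128 used). A stride of 32, as in walking down a column of a row-major matrix with 32-element rows, touches 32 segments: 1024 bytes fetched to use 128.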
The classic tiled matrix multiply shows both — coalesced collaborative loads, then heavy reuse from shared memory:
```cuda
__global__ void tiled_gemm(
    const float* __restrict__ A,
    const float* __restrict__ B,
    float* __restrict__ C,
    int M, int N, int K
) {
    const int TILE = 32;
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Coalesced loads: adjacent threads read adjacent addresses,
        // so each warp triggers full-width HBM transactions.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (col < N && t + threadIdx.y < K) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Each value loaded from HBM is reused TILE times from shared
        // memory; this reuse is what moves the kernel up the roofline.
        #pragma unroll
        for (int k = 0; k < TILE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();
    }

    if (row < M && col < N) C[row * N + col] = sum;
}
```
In practice you'd reach for cuBLAS or CUTLASS rather than hand-rolling this — but every fast GEMM library is doing exactly this dance, several levels deeper.
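How much the tiling buys can be estimated with back-of-the-envelope arithmetic. The host-side C++ sketch below counts, for one thread block and one tile step, the FLOPs performed against the bytes loaded from global memory. It ignores the final store to C and any help from caches, and the function name is illustrative.

```cpp
#include <cassert>

// FLOP per byte loaded from global memory for one tile step of a tiled
// float GEMM: tile*tile threads each perform `tile` multiply-adds (2 FLOP
// each) after the block loads two tile x tile float tiles.
double tiled_intensity(int tile) {
    assert(tile > 0);
    double flops = 2.0 * tile * tile * tile;
    double bytes = 2.0 * tile * tile * sizeof(float);
    return flops / bytes;  // simplifies to tile / 4 for 4-byte floats
}
```

A tile size of 1 (no reuse) gives 0.25 FLOP/B, and a tile size of 32 gives 8 FLOP/B: a 32× improvement, yet still far left of a ridge point in the hundreds. That remaining gap is roughly what the extra levels of register and warp tiling in production libraries are there to close.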
The Price of Bandwidth: Thermal and Cost
HBM's density comes with two bills.
Heat. A stack of eight to sixteen DRAM dies shares one escape path for heat — up through the top of the stack into the same cooler as a 700 W+ GPU sitting millimeters away. DRAM also dislikes heat in a particular way: leakage increases with temperature, forcing more frequent refresh, which eats the very bandwidth you paid for. Stacks throttle in the ~95 °C range, and thermal design — not signaling — is a primary constraint on how tall future stacks can grow.
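The refresh penalty can be sketched with a simplified all-bank refresh model, where the memory is unavailable for a fixed time after each refresh command. The timing values in the note are purely illustrative placeholders, not HBM datasheet numbers, and real devices soften the cost with per-bank refresh.

```cpp
#include <cassert>

// Fraction of time spent refreshing instead of serving requests.
//   t_rfc_ns  : how long one refresh command blocks the banks
//   t_refi_ns : average interval between refresh commands
double refresh_overhead(double t_rfc_ns, double t_refi_ns) {
    assert(t_rfc_ns >= 0.0 && t_refi_ns > t_rfc_ns);
    return t_rfc_ns / t_refi_ns;
}
```

With placeholder values of 350 ns per refresh and a 3900 ns interval, refresh_overhead(350.0, 3900.0) is about 9%. If heat forces the interval down to 1950 ns, the overhead doubles to about 18% of the stack's time.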
Money. Stacked dies, thousands of TSVs, interposers, and the yield risk of scrapping a whole tower over one bad die make HBM by far the most expensive mainstream memory per gigabyte:
| Memory | Relative cost / GB | Bandwidth per unit | Typical home |
|---|---|---|---|
| DDR5 | ~1× | 38–67 GB/s per channel | CPUs, servers |
| GDDR7 | ~3–4× | ~112–128 GB/s per chip | Gaming and graphics cards |
| HBM3E | ~15–25× | ~1.2 TB/s per stack | AI accelerators, HPC |
Costs are approximate, relative, and volatile — HBM pricing moves with AI demand cycles. The economics only close when bandwidth is the product, which is exactly the situation for training and serving large models.
HBM4 and What's Next
As of mid-2026 the picture is unusually concrete:
- Shipped: The JEDEC HBM4 standard (JESD270-4, April 2025) doubles the interface to 2048 bits, restoring the original HBM move — width over clock speed. First HBM4-equipped accelerators (NVIDIA's Vera Rubin generation) are in production with all three memory vendors supplying stacks, binned above the 8 Gb/s base rate.
- Announced: 16-high stacks at 64 GB each, and HBM4E pushing per-stack bandwidth further through 2027.
- Research: Processing-in-memory (PIM) — putting compute inside the stack so some operations never cross the interposer at all. Samsung and SK hynix have demonstrated PIM variants for years; whether it escapes the lab at scale remains the open question.
The trajectory is consistent: every generation, the answer to the memory wall is more parallelism in packaging — wider buses, taller stacks, shorter distances.
Further Reading
- JEDEC JESD270-4 HBM4 Standard - the official HBM4 announcement and spec overview
- What Every Programmer Should Know About Memory - Ulrich Drepper's classic; the DRAM fundamentals underneath every HBM stack
- NVIDIA Hopper Architecture In-Depth - how a flagship GPU integrates HBM3, from the architects
