High Bandwidth Memory (HBM)

High Bandwidth Memory is DRAM turned on its side: instead of spreading memory chips across a circuit board and talking to them through a narrow bus, HBM stacks dies vertically and wires them straight down into the package next to the processor. If conventional DDR is a long highway to a distant warehouse, HBM builds the warehouse next door to the factory — and connects it with a road a thousand lanes wide.

That one packaging decision is why a modern AI accelerator moves terabytes per second while your desktop moves tens of gigabytes. This page covers why HBM exists, how it's physically built, what it means for performance, and what it costs.

The Memory Wall

Processors have outrun memory for forty years, but the gap turned existential with deep learning. Between the P100 (2016) and the B200 (2024), NVIDIA's dense FP16 throughput grew roughly 106×. Over the same eight years, the memory bandwidth feeding those FLOPS grew about 11× — and that's with HBM doing the heavy lifting. Without it, the gap would be a chasm.

A processor that can't be fed doesn't compute; it waits. There are only three levers for moving more bytes per second, and each was pushed to its limit before HBM:

Approach	Example	What it buys	What it costs
More channels	12-channel DDR5 servers	~10× one channel	Board area, pin count, controller complexity
Faster pins	GDDR7 at 32 Gb/s per pin	High per-pin rates	Signal integrity, power per bit
Wider bus, on-package	HBM	1024–2048 bits wide	Advanced packaging cost and complexity

HBM is the third lever pushed to its logical conclusion: if driving signals fast across a board is expensive, move the memory into the package and make the bus absurdly wide instead. A thousand slow, short, efficient wires beat sixty-four fast, long, power-hungry ones.

What HBM Looks Like

An HBM device is a small tower: four to sixteen DRAM dies stacked on top of a base logic die, sitting beside the GPU on a slice of silicon called an interposer. Explore the package below — click each part to see what it does, or flip to the GDDR comparison to see why proximity is the whole game:

Three structural pieces make the tower work:

The DRAM dies store the bits — ordinary DRAM cells, reorganized into many independent channels (16 in HBM3, 32 in HBM4).
The base die at the bottom of the stack handles I/O, testing, and routing between the stack and the outside world.
The interposer is a passive silicon layer with wiring far finer than any circuit board — it's what makes a 1024-bit bus routable at all. The GPU and the HBM stacks all sit on it, millimeters apart.

How It's Built: TSVs and the Interposer

Stacking dies is easy; wiring them vertically is not. The enabling technology is the Through-Silicon Via (TSV) — a copper column drilled straight through a thinned die, letting signals pass vertically through the silicon itself instead of detouring out to wire bonds. Step through how one is made:

The numbers that matter:

Parameter	Typical value	Why it matters
TSV diameter	5–10 µm	Thousands fit per die without eating storage area
Die thickness	~50 µm after thinning	Short vias = low resistance and latency
TSVs per HBM stack	Several thousand	Data, command, power, all vertical
Interface width	1024-bit (2048 in HBM4)	The whole point — bandwidth via width, not clock
Micro-bump pitch	~25 µm	Die-to-die connection density

The manufacturing sequence — etch, insulate, fill with copper, polish, thin the wafer, bump, bond — has to succeed across every die in the stack. One bad TSV in one die can scrap the whole tower, which is a large part of why HBM costs what it does.

HBM Generations

Each generation has roughly doubled per-stack bandwidth, alternating between two strategies: push the pins faster, or (in HBM4's case) double the interface width.

Generation	First products	Pin speed	Interface	Bandwidth / stack	Max capacity / stack
HBM	2015	1.0 Gb/s	1024-bit	128 GB/s	4 GB
HBM2	2016	2.0 Gb/s	1024-bit	256 GB/s	8 GB
HBM2E	2020	3.6 Gb/s	1024-bit	461 GB/s	16 GB
HBM3	2022	6.4 Gb/s	1024-bit	819 GB/s	24 GB
HBM3E	2024	9.6 Gb/s	1024-bit	~1.2 TB/s	36 GB
HBM4	2026	8+ Gb/s	2048-bit	2+ TB/s	64 GB

Per-stack figures are JEDEC peak rates; shipping products configure pin speeds and stack counts to their own thermal and cost targets, so device bandwidth comes from the product datasheet, not this table. An H100 SXM carries five stacks of HBM3 for 3.35 TB/s; a B200 carries eight stacks of HBM3E for 8 TB/s — both running their stacks below the JEDEC peak. HBM4 is the exception so far: parts for 2026 accelerators are binned well past the 8 Gb/s base rate.

HBM in the GPU Memory Hierarchy

For all its bandwidth, HBM is still the bottom of the on-package hierarchy — the slowest, largest tier before data leaves the device entirely. Everything above it exists to avoid touching it:

This framing matters for programmers: HBM bandwidth is a budget, and the hierarchy is how you stay under it. Every byte loaded from HBM should be reused as many times as possible from registers and shared memory before the kernel reaches for the next one.

Memory-Bound or Compute-Bound? The Roofline

Whether HBM bandwidth limits your workload comes down to one number: arithmetic intensity — how many floating-point operations you perform per byte fetched from memory. The roofline model turns it into a picture: performance rises linearly with intensity (the slanted "memory roof") until it hits peak compute (the flat roof). The corner where they meet is the ridge point.

Try the presets: a large GEMM lives far right of the ridge, comfortably compute-bound — this is what tensor cores were built for. LLM decoding at batch size 1 sits at roughly 1 FLOP per byte: each generated token reads every weight once and does almost nothing with it. On an H100, whose ridge point is near 300 FLOP/B, that workload reaches well under 1% of peak compute. Token generation speed is HBM bandwidth, full stop — which is why memory, not FLOPS, is the headline spec of every inference accelerator.

Programming for HBM

You don't address HBM differently from any other GPU memory — cudaMalloc hands it to you. Using its bandwidth well comes down to two habits:

Coalesce accesses. Threads in a warp should touch consecutive addresses, so the hardware merges them into full-width HBM transactions instead of scattered partial ones.
Tile and reuse. Stage blocks of data in shared memory and reuse each loaded value many times, raising your kernel's arithmetic intensity toward the ridge point.

The classic tiled matrix multiply shows both — coalesced collaborative loads, then heavy reuse from shared memory:

__global__ void tiled_gemm(
    const float* __restrict__ A,
    const float* __restrict__ B,
    float* __restrict__ C,
    int M, int N, int K
) {
    const int TILE = 32;
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Coalesced loads: adjacent threads read adjacent addresses,
        // so each warp triggers full-width HBM transactions.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (col < N && t + threadIdx.y < K) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Each value loaded from HBM is reused TILE times from shared
        // memory — this reuse is what moves the kernel up the roofline.
        #pragma unroll
        for (int k = 0; k < TILE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();
    }

    if (row < M && col < N) C[row * N + col] = sum;
}

In practice you'd reach for cuBLAS or CUTLASS rather than hand-rolling this — but every fast GEMM library is doing exactly this dance, several levels deeper.

The Price of Bandwidth: Thermal and Cost

HBM's density comes with two bills.

Heat. A stack of eight to sixteen DRAM dies shares one escape path for heat — up through the top of the stack into the same cooler as a 700 W+ GPU sitting millimeters away. DRAM also dislikes heat in a particular way: leakage increases with temperature, forcing more frequent refresh, which eats the very bandwidth you paid for. Stacks throttle in the ~95 °C range, and thermal design — not signaling — is a primary constraint on how tall future stacks can grow.

Money. Stacked dies, thousands of TSVs, interposers, and the yield risk of scrapping a whole tower over one bad die make HBM by far the most expensive mainstream memory per gigabyte:

Memory	Relative cost / GB	Bandwidth per unit	Typical home
DDR5	~1×	38–67 GB/s per channel	CPUs, servers
GDDR7	~3–4×	~112–128 GB/s per chip	Gaming and graphics cards
HBM3E	~15–25×	~1.2 TB/s per stack	AI accelerators, HPC

Costs are approximate, relative, and volatile — HBM pricing moves with AI demand cycles. The economics only close when bandwidth is the product, which is exactly the situation for training and serving large models.

HBM4 and What's Next

As of mid-2026 the picture is unusually concrete:

Shipped: The JEDEC HBM4 standard (JESD270-4, April 2025) doubles the interface to 2048 bits, restoring the original HBM move — width over clock speed. First HBM4-equipped accelerators (NVIDIA's Vera Rubin generation) are in production with all three memory vendors supplying stacks, binned above the 8 Gb/s base rate.
Announced: 16-high stacks at 64 GB each, and HBM4E pushing per-stack bandwidth further through 2027.
Research: Processing-in-memory (PIM) — putting compute inside the stack so some operations never cross the interposer at all. Samsung and SK hynix have demonstrated PIM variants for years; whether it escapes the lab at scale remains the open question.

The trajectory is consistent: every generation, the answer to the memory wall is more parallelism in packaging — wider buses, taller stacks, shorter distances.

Key Takeaways

Essential HBM Concepts

• Width over speed: 1024–2048 slow pins beat 64 fast ones

• 3D stacking: TSVs wire dies vertically through thinned silicon

• Interposer: on-package silicon wiring makes the wide bus routable

• Scaling: 128 GB/s (2015) to 2+ TB/s per stack (2026)

• Roofline: arithmetic intensity decides if HBM is your bottleneck

• LLM decode: ~1 FLOP/byte — inference speed is HBM bandwidth

• Constraints: heat and cost, not signaling, bound the design

• HBM4: doubles the interface to 2048 bits, shipping in 2026 accelerators

HBM is what happens when the memory wall meets advanced packaging: if you can't make memory faster, move it closer and make the connection wider. Every large model trained or served today runs at the speed of stacked DRAM — and the roofline model tells you exactly when that's the limit that matters.