TL;DR
- LLM serving is bottlenecked by the KV cache, whose size grows with every generated token and every concurrent request.
- Reserving a contiguous max-length KV region per request can leave roughly 60–80% of KV-cache memory reserved but unused, lost to fragmentation and over-reservation.
- PagedAttention stores the KV cache in fixed-size blocks mapped through a block table, much like pages in OS virtual memory. Allocation is on demand, and waste is confined to each sequence's last, partially filled block.
- Blocks can be shared across requests with common prefixes using copy-on-write. The memory saved lets more requests fit in each batch, which is where the vLLM engine gets its higher serving throughput.
The fragmentation problem
The KV cache stores the keys and values of every past token so the model doesn’t recompute them at each decoding step. Earlier serving systems reserve one contiguous buffer per request, sized to the maximum possible length. Most requests finish well short of that limit, which leaves large reserved-but-unused regions inside each buffer (internal fragmentation) and unusable gaps between buffers (external fragmentation).
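To make the waste concrete, here is a back-of-the-envelope sketch in Python. The model shape (40 layers, 40 heads, head size 128, 2-byte values) is an assumption chosen to resemble a 13B-parameter decoder, and the request lengths and 2048-token limit are invented for illustration, not measurements.

```python
def kv_bytes_per_token(num_layers: int, num_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one token: one key and one value vector
    per attention head, per layer."""
    return 2 * num_layers * num_heads * head_dim * dtype_bytes


def reserved_vs_used(actual_lens, max_len, bytes_per_token):
    """Compare memory reserved at max_len per request with memory
    actually filled by tokens. Returns (reserved, used, wasted_fraction)."""
    reserved = len(actual_lens) * max_len * bytes_per_token
    used = sum(actual_lens) * bytes_per_token
    return reserved, used, 1 - used / reserved


# Assumed model shape and workload (illustrative only).
per_token = kv_bytes_per_token(num_layers=40, num_heads=40, head_dim=128)
actual_lens = [300, 512, 120, 900]   # tokens each request really used
reserved, used, wasted = reserved_vs_used(actual_lens, max_len=2048,
                                          bytes_per_token=per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"wasted fraction of reservation: {wasted:.1%}")
```

Under these assumptions each token costs 800 KiB of cache, and about 78% of the reserved memory is never written. This counts only over-reservation inside each buffer; external fragmentation between buffers would add to it.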
Paging the KV cache
PagedAttention borrows the operating-system solution: split memory into fixed-size blocks, each holding the keys and values for a fixed number of tokens, and let a per-sequence block table map contiguous logical blocks to arbitrary physical blocks. Blocks are allocated on demand as the sequence grows, so nothing is reserved before it’s needed and the only unused slots are in a sequence’s last, partially filled block.
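The mapping can be sketched in a few lines of Python. This is a toy model of the bookkeeping only, not vLLM's implementation: `BlockAllocator`, `Sequence`, and the block size of 4 are hypothetical names and values chosen for illustration, and no real key or value tensors are stored.

```python
class BlockAllocator:
    """Hands out physical block ids from a free list (hypothetical helper)."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        return self.free_blocks.pop()

    def release(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one request's block table: logical block index -> physical id."""

    def __init__(self, allocator: BlockAllocator, block_size: int):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        # A new physical block is requested only when the last one is full.
        if self.num_tokens % self.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        slot = self.num_tokens
        self.num_tokens += 1
        return self.block_table[slot // self.block_size], slot % self.block_size

    def free(self) -> None:
        """Return every block to the allocator when the request finishes."""
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator, block_size=4)
slots = [seq.append_token() for _ in range(6)]

# Six tokens occupy two blocks; only the tail of the second block is unused.
unused_slots = len(seq.block_table) * seq.block_size - seq.num_tokens
print(seq.block_table, slots[-1], unused_slots)
```

Physical ids need not be adjacent or ordered; the block table alone gives the sequence its logical order. Unused slots per sequence are always fewer than one block, which is why fragmentation stays small.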
Sharing prefixes with copy-on-write
Because addressing goes through the block table, two requests with the same prompt prefix can point at the same physical blocks. Reference counting tracks the sharing; when a shared block must change, it is copied first (copy-on-write). Common system prompts and few-shot examples are then stored once, not once per request.
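A minimal sketch of reference-counted sharing with copy-on-write follows, again in Python. `SharedBlockPool` and its methods are hypothetical, and plain lists of token ids stand in for the key and value tensors a real engine would hold.

```python
class SharedBlockPool:
    """Toy block pool with reference counts (hypothetical, for illustration)."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}   # physical block id -> number of sequences using it
        self.data = {}        # physical block id -> token ids (stand-in for KV)

    def allocate(self, tokens=()) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV blocks")
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        self.data[block] = list(tokens)   # copy, never alias
        return block

    def fork(self, block_table):
        """Give a new sequence the same blocks: bump counts, copy no data."""
        for block in block_table:
            self.ref_count[block] += 1
        return list(block_table)

    def write(self, block_table, logical_idx: int, token: int) -> None:
        """Append a token to a logical block, copying it first if shared."""
        block = block_table[logical_idx]
        if self.ref_count[block] > 1:
            # Copy-on-write: leave the shared block untouched for other users.
            self.ref_count[block] -= 1
            block = self.allocate(self.data[block])
            block_table[logical_idx] = block
        self.data[block].append(token)


pool = SharedBlockPool(num_blocks=4)
seq_a = [pool.allocate([1, 2, 3, 4]), pool.allocate([5, 6])]  # shared prompt
seq_b = pool.fork(seq_a)          # second request with the same prefix

pool.write(seq_b, 1, 99)          # seq_b diverges: the last block is copied
pool.write(seq_a, 1, 7)           # seq_a is now sole owner: written in place

print(seq_a, seq_b)
print(pool.data[seq_a[1]], pool.data[seq_b[1]])
```

The full first block stays shared by both sequences for as long as both are alive; only the block that actually diverged was duplicated.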
Why it mattered
PagedAttention reframed KV-cache management as a memory-systems problem and imported decades of OS wisdom. The payoff was large: by nearly eliminating fragmentation and enabling sharing, vLLM fits more requests into each batch, and its authors report 2–4× higher serving throughput than prior systems such as FasterTransformer and Orca at the same level of latency. It is the core memory manager in vLLM, and block-based KV-cache management has since been adopted by other LLM-serving systems.
Related Reading
- FlashAttention — the training/prefill-side memory optimization that pairs with PagedAttention’s serving-side one
- Optimizing Transformer Inference — the broader inference-efficiency survey PagedAttention slots into
- Attention Is All You Need — the architecture whose KV cache PagedAttention manages
- Data Movement Is All You Need — the memory-bound view of transformers that explains why KV-cache management dominates serving cost
