Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon; Zhuohan Li; Siyuan Zhuang; Ying Sheng; Lianmin Zheng; Cody Hao Yu; Joseph E. Gonzalez; Hao Zhang; Ion Stoica

TL;DR

LLM serving is bottlenecked by the KV cache, whose size grows with every generated token and every concurrent request.
Reserving a contiguous max-length KV region per request can waste roughly 60–80% of KV-cache memory to fragmentation and over-reservation.
PagedAttention stores the KV cache in fixed-size blocks mapped through a block table, much like paging in OS virtual memory, which gives near-zero fragmentation and on-demand allocation.
Blocks can be shared across requests with common prefixes using copy-on-write, and the higher memory efficiency translates directly into higher serving throughput in the vLLM engine.

The fragmentation problem

The KV cache stores the keys and values of every past token so the model doesn’t recompute them. Classic serving reserves one contiguous buffer per request, sized to the maximum possible length. Most requests finish early, however, which leaves large reserved-but-unused regions inside each buffer (internal fragmentation) and unusable gaps between buffers (external fragmentation).
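To make the waste concrete, here is a rough sizing sketch. The model shape (40 layers, hidden size 5120, 2-byte FP16 values) is an assumption chosen to resemble a 13B-parameter model, and the 2048-token limit and the four request lengths are made up for illustration; none of these numbers come from the text above.

```python
def kv_bytes_per_token(num_layers, hidden_size, dtype_bytes):
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * hidden_size * dtype_bytes


def reserved_vs_used(actual_lens, max_len, per_token):
    """Compare contiguous max-length reservation with what the requests actually fill."""
    reserved = len(actual_lens) * max_len * per_token
    used = sum(actual_lens) * per_token
    return reserved, used


# Assumed model shape and workload (illustrative only).
per_token = kv_bytes_per_token(num_layers=40, hidden_size=5120, dtype_bytes=2)
reserved, used = reserved_vs_used([300, 120, 900, 450], max_len=2048, per_token=per_token)
wasted_fraction = 1 - used / reserved

print(f"{per_token} bytes per token, {wasted_fraction:.0%} of the reservation unused")
```

With these assumed lengths, most of each reserved buffer is never written, which is the internal fragmentation described above. External fragmentation comes on top of that and is not modeled here.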

Paging the KV cache

PagedAttention borrows the operating-system solution: split memory into fixed-size blocks and let a block table map each sequence’s contiguous logical blocks to arbitrary physical blocks. Blocks are allocated on demand as the sequence grows, so nothing is reserved before it’s needed.
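The mapping described above can be sketched as a small allocator plus a per-sequence block table. This is a toy model under stated assumptions, not vLLM's implementation: the class names `BlockAllocator` and `Sequence`, the block size of 4, and the pool of 8 blocks are all hypothetical.

```python
from typing import List, Tuple


class BlockAllocator:
    """Hands out physical block ids from a free list (hypothetical helper)."""

    def __init__(self, num_blocks: int):
        self.free: List[int] = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()


class Sequence:
    """Holds one request's block table: logical block i -> physical block id."""

    def __init__(self, allocator: BlockAllocator, block_size: int):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table: List[int] = []
        self.num_tokens = 0

    def append_token(self) -> Tuple[int, int]:
        """Reserve a slot for one new token; return (physical block, offset)."""
        offset = self.num_tokens % self.block_size
        if offset == 0:
            # The last block is full (or none exists yet): allocate on demand.
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
        return self.block_table[-1], offset

    def locate(self, token_index: int) -> Tuple[int, int]:
        """Translate a logical token position into (physical block, offset)."""
        logical_block, offset = divmod(token_index, self.block_size)
        return self.block_table[logical_block], offset


allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator, block_size=4)
for _ in range(6):
    seq.append_token()

print(seq.block_table, seq.locate(5))
```

Six tokens with a block size of 4 occupy two physical blocks, and those blocks need not be adjacent. In this model the only unused slots are in the sequence's last block, so waste is bounded by one block per sequence.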

Sharing prefixes with copy-on-write

Because addressing goes through the block table, two requests with the same prompt prefix can point at the same physical blocks. Reference counting tracks the sharing; when a request needs to write to a block that others still reference, the block is copied first and the write goes to the private copy (copy-on-write). Common system prompts and few-shot examples are then stored once, not once per request.
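A minimal sketch of reference counting with copy-on-write follows. It is illustrative only: `SharedBlockPool` and its methods are hypothetical names, and plain Python lists of token ids stand in for the key and value tensors a real system would store.

```python
class SharedBlockPool:
    """Physical blocks with reference counts (hypothetical helper)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref_count = {}
        self.data = {}  # block id -> list of token ids, standing in for KV tensors

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.ref_count[block] = 1
        self.data[block] = []
        return block

    def share(self, block):
        """Let another block table point at the same physical block."""
        self.ref_count[block] += 1
        return block

    def writable(self, block):
        """Return a block that is safe to modify, copying first if it is shared."""
        if self.ref_count[block] == 1:
            return block
        copy = self.allocate()
        self.data[copy] = list(self.data[block])
        self.ref_count[block] -= 1  # this writer no longer references the original
        return copy


pool = SharedBlockPool(num_blocks=4)
prompt_block = pool.allocate()
pool.data[prompt_block] = [101, 102]  # shared prompt tokens

table_a = [prompt_block]
table_b = [pool.share(prompt_block)]  # request B reuses A's physical block

# Request B writes a new token into its last block, which triggers the copy.
table_b[-1] = pool.writable(table_b[-1])
pool.data[table_b[-1]].append(999)
```

After the write, request A still sees the original prompt block unchanged, while request B owns a private copy that holds its extra token. Full blocks that are never written again stay shared for the lifetime of both requests.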

Why it mattered

PagedAttention reframed KV-cache management as a memory-systems problem and applied long-established OS techniques, paging and virtual memory, to it. The payoff was large: by nearly eliminating fragmentation and enabling sharing, vLLM improved serving throughput by 2–4× over prior systems such as FasterTransformer and Orca at the same level of latency. It is now the default memory manager in vLLM and has been adopted across the LLM-serving stack.

Related Reading

FlashAttention: the training/prefill-side memory optimization that pairs with PagedAttention’s serving-side one
Optimizing Transformer Inference: the broader inference-efficiency survey PagedAttention slots into
Attention Is All You Need: the architecture whose KV cache PagedAttention manages
Data Movement Is All You Need: the memory-bound view of transformers that explains why KV-cache management dominates serving cost
