Efficient Memory Management for Large Language Model Serving with PagedAttention

How PagedAttention (the memory manager behind vLLM) applies OS-style virtual-memory paging to the KV cache — fixed-size blocks, a block table, and copy-on-write prefix sharing — to eliminate fragmentation and dramatically raise LLM serving throughput.

TL;DR

  • LLM serving is bottlenecked by the KV cache, whose size grows with every generated token and every concurrent request.
  • Reserving a contiguous max-length KV region per request wastes 60–80% of KV-cache memory to fragmentation and over-reservation.
  • PagedAttention stores the KV cache in fixed-size blocks mapped through a block table, much like OS virtual memory: blocks are allocated on demand, and waste is confined to the unfilled tail of each sequence's last block.
  • Blocks can be shared across requests (common prefixes) with copy-on-write, and the higher memory efficiency translates directly into higher serving throughput (the vLLM engine).

The fragmentation problem

The KV cache stores the keys and values of every past token so the model doesn’t recompute them. Classic serving reserves one contiguous buffer per request sized to the maximum possible length — but most requests finish early, leaving huge reserved-but-unused regions (internal fragmentation) and unusable gaps between requests (external fragmentation).
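To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch. The model dimensions (40 layers, hidden size 5120, 2-byte FP16 values) and the request lengths are illustrative assumptions, and the helper names are hypothetical rather than taken from any library.

```python
def kv_bytes_per_token(num_layers, hidden_size, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: each layer stores one key vector and one value vector.
    return 2 * num_layers * hidden_size * bytes_per_elem


def reserved_waste_fraction(actual_lens, max_len):
    """Fraction of reserved slots never used when every request
    gets a contiguous buffer sized for max_len tokens."""
    used = sum(actual_lens)
    reserved = max_len * len(actual_lens)
    return 1 - used / reserved


# Assumed 13B-class model shape: 40 layers, hidden size 5120, FP16.
per_token = kv_bytes_per_token(num_layers=40, hidden_size=5120)
print(per_token / 1024, "KiB per token")  # 800.0 KiB per token

# Four hypothetical requests that finish well before a 2048-token cap.
waste = reserved_waste_fraction([512, 256, 1024, 256], max_len=2048)
print(f"{waste:.0%} of reserved slots unused")  # 75% of reserved slots unused
```

With these assumed numbers, three quarters of the reserved memory holds nothing, which is the same order of waste the review describes.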

Paging the KV cache

PagedAttention borrows the operating-system solution: split memory into fixed-size blocks and let a block table map each sequence’s contiguous logical blocks to arbitrary physical blocks. Blocks are allocated on demand as the sequence grows, so nothing is reserved before it’s needed.
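The mapping described above can be sketched in a few lines. This is a toy model under stated assumptions, not vLLM's actual implementation: `BlockAllocator` and `Sequence` are hypothetical names, the block size of 4 is chosen only to keep the example small, and the sketch tracks block IDs only, with no tensors involved.

```python
from collections import deque


class BlockAllocator:
    """Hands out physical block IDs from a free list."""

    def __init__(self, num_blocks):
        self.free = deque(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.popleft()


class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    def __init__(self, allocator, block_size):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table = []  # index = logical block, value = physical block
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % self.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        """Translate a token position to (physical block, offset)."""
        if not 0 <= token_idx < self.num_tokens:
            raise IndexError(token_idx)
        logical, offset = divmod(token_idx, self.block_size)
        return self.block_table[logical], offset

    def wasted_slots(self):
        # Only the tail of the last block can be empty.
        return len(self.block_table) * self.block_size - self.num_tokens


allocator = BlockAllocator(num_blocks=8)
a = Sequence(allocator, block_size=4)
b = Sequence(allocator, block_size=4)

# Interleave growth so the two sequences compete for blocks.
for _ in range(5):
    a.append_token()
    b.append_token()

print(a.block_table)       # [0, 2]: logically adjacent, physically scattered
print(a.physical_slot(4))  # (2, 0)
print(a.wasted_slots())    # 3, always less than block_size
```

Because the lookup goes through `block_table`, a sequence never needs physically adjacent memory, and the unused space per sequence is bounded by one block.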

Sharing prefixes with copy-on-write

Because addressing goes through the block table, two requests with the same prompt prefix can point at the same physical blocks. Reference counting tracks the sharing; when a shared block must change, it is copied first (copy-on-write). Common system prompts and few-shot examples are then stored once, not once per request.
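A minimal sketch of reference-counted sharing with copy-on-write follows. It is illustrative only: `SharedBlockPool`, `fork`, and `append_token` are hypothetical names, and plain Python lists of tokens stand in for the key/value tensors a real engine would store.

```python
from collections import deque


class SharedBlockPool:
    """Physical blocks with reference counts; lists stand in for KV data."""

    def __init__(self, num_blocks):
        self.free = deque(range(num_blocks))
        self.ref_count = {}
        self.data = {}

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.popleft()
        self.ref_count[block] = 1
        self.data[block] = []
        return block

    def share(self, block):
        # A second sequence now points at the same physical block.
        self.ref_count[block] += 1
        return block

    def writable(self, block):
        """Return a block that is safe to modify, copying if it is shared."""
        if self.ref_count[block] == 1:
            return block
        copy = self.allocate()
        self.data[copy] = list(self.data[block])
        self.ref_count[block] -= 1  # the writer drops its reference
        return copy


def fork(pool, block_table):
    """Create a new sequence sharing every block of an existing one."""
    return [pool.share(b) for b in block_table]


def append_token(pool, block_table, token, block_size):
    if not block_table or len(pool.data[block_table[-1]]) == block_size:
        block_table.append(pool.allocate())
    else:
        # Writing into a partially filled block: copy first if it is shared.
        block_table[-1] = pool.writable(block_table[-1])
    pool.data[block_table[-1]].append(token)


pool = SharedBlockPool(num_blocks=8)
parent = []
for tok in "abcdef":  # a six-token prompt with block_size=4
    append_token(pool, parent, tok, block_size=4)

child = fork(pool, parent)  # same prompt, no copying yet
append_token(pool, child, "x", block_size=4)

print(parent, child)       # [0, 1] [0, 2]
print(pool.ref_count[0])   # 2: the full prefix block is still shared
print(pool.data[1], pool.data[2])  # ['e', 'f'] ['e', 'f', 'x']
```

Only the partially filled last block is duplicated; the full prefix block stays shared, which is where the memory saving for common system prompts comes from in this sketch.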

Why it mattered

PagedAttention reframed KV-cache management as a memory-systems problem and imported decades of OS wisdom. The payoff was large: by nearly eliminating fragmentation and enabling sharing, vLLM raised serving throughput by 2–4× over prior systems such as FasterTransformer and Orca at the same level of latency, as reported in the paper. It is now the default memory manager in vLLM and has been adopted across the LLM-serving stack.
