Context Windows: The Memory Limits of LLMs

Summary
Interactive visualization of LLM context windows - sliding windows, expanding contexts, and attention patterns that define model memory limits.

Context Windows in Large Language Models

Context windows define the maximum amount of text an LLM can process at once - the model's "working memory." This fundamental constraint shapes how models understand and generate text, from simple queries to complex documents.

The Context Length Challenge

Memory Complexity

The quadratic complexity of self-attention is the primary bottleneck:

O(n² · d)

Where:

  • n = sequence length (context size)
  • d = model dimension

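For a 100K-token context, a single attention matrix (one head in one layer) costs:

  • Attention matrix: 100,000² = 10 billion elements
  • Memory required: ~40GB (float32, 4 bytes per element) if the matrix is materialized in full
  • Computation: 10 billion dot products per layer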
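
The arithmetic above can be reproduced with a short sketch. The function names are illustrative, and the calculation assumes one fully materialized n × n score matrix (a single head in a single layer) stored at 4 bytes per element; real implementations often avoid materializing it.

```python
def attention_matrix_elements(n_tokens: int) -> int:
    """Number of entries in one n x n attention score matrix."""
    return n_tokens * n_tokens


def attention_matrix_bytes(n_tokens: int, bytes_per_element: int = 4) -> int:
    """Bytes needed to materialize one score matrix (float32 = 4 bytes)."""
    return attention_matrix_elements(n_tokens) * bytes_per_element


# Compare a few context sizes; memory is reported in decimal gigabytes.
for n in (2_000, 8_000, 32_000, 100_000):
    gb = attention_matrix_bytes(n) / 1e9
    print(f"{n:>7} tokens: {attention_matrix_elements(n):>14,} elements, {gb:8.3f} GB")
```

Because the cost grows with n², going from 2K to 100K tokens (50× longer) multiplies the matrix size by 2,500.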

The Information Bottleneck

Beyond cost, the window is an information bottleneck: the model must fit everything it needs to reason about into one fixed-size sequence.
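
One common consequence is that an application has to decide what to drop when its input exceeds the window. The sketch below keeps the most recent messages that fit a token budget. It is a minimal illustration, not a recommended policy: `count_tokens` and `fit_to_window` are hypothetical names, and whitespace splitting stands in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting approximates a real tokenizer.
    return len(text.split())


def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose total token count fits the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a b c", "d e", "f g h i"]
print(fit_to_window(history, max_tokens=6))
```

Anything that falls outside the budget is invisible to the model, which is why the choice of what to keep matters as much as the window size itself.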

Extending the Context Window

A bigger window means more O(n²) attention, so practical long-context models combine several tricks rather than just scaling n:

  • Cheaper attention: sliding-window and sparse attention drop the cost from O(n²) toward O(n·w), where w is the window size, and FlashAttention keeps attention exact at O(n) memory; reusing past Keys/Values via the KV cache avoids recomputing them each step.
  • Positions that extrapolate: rotary embeddings (with interpolation / NTK / YaRN scaling) and ALiBi let a model run on sequences longer than it trained on.
  • Anchoring the start: attention sinks keep the first few tokens cached so streaming past the trained length stays stable.
  • Beyond any window: retrieval-augmented generation (RAG) fetches only the relevant chunks from an external store instead of growing the window at all.
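
To make the sliding-window and attention-sink ideas concrete, here is a minimal sketch that lists which key positions a causal query may attend to and counts the total number of scored pairs. The function names and the `n_sink` parameter are assumptions for illustration; production kernels implement the same pattern with masks or block-sparse layouts rather than Python lists.

```python
def visible_keys(query_pos: int, window: int, n_sink: int = 0) -> list[int]:
    """Key positions a causal query at query_pos may attend to.

    Keeps the last `window` positions (including query_pos itself) plus
    the first `n_sink` positions as always-visible attention sinks.
    """
    start = max(0, query_pos - window + 1)
    recent = range(start, query_pos + 1)
    sinks = range(min(n_sink, start))  # only sinks not already in the window
    return list(sinks) + list(recent)


def attended_pairs(n: int, window: int, n_sink: int = 0) -> int:
    """Total (query, key) pairs scored over a sequence of length n."""
    return sum(len(visible_keys(i, window, n_sink)) for i in range(n))


n = 1_000
full = attended_pairs(n, window=n)                  # plain causal attention
local = attended_pairs(n, window=64)                # sliding window only
anchored = attended_pairs(n, window=64, n_sink=4)   # window plus sink tokens
print(full, local, anchored)
```

With a fixed window the pair count is bounded by n·w, so it grows linearly in n, and the sink tokens add at most `n_sink` extra keys per query.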

Practical Implications

Effective Context Utilization

Not all of the context is used equally. Studies of long-context behavior show that models attend most to:

  • Beginning (primacy effect)
  • End (recency effect)
  • Semantically relevant sections

The Lost-in-the-Middle Problem

Performance degrades when the relevant information sits in the middle of a long context:

  1. Start: High attention (prompts, instructions)
  2. Middle: Low attention (often ignored)
  3. End: High attention (recent context)

Context Length vs Quality Trade-off

Context Size    Benefits               Drawbacks
2K              Fast, cheap            Limited applications
8K              Good for most tasks    May truncate documents
32K             Full documents         Slower, more expensive
100K+           Books, codebases       Very slow, costly

Measuring it: Needle in a Haystack

The standard probe for lost-in-the-middle is the needle-in-a-haystack test: plant a fact at many positions in a long context, query for it, and measure retrieval accuracy by position. Strong long-context models stay accurate everywhere; weaker ones miss the middle.
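
The procedure can be sketched as a small harness. Everything here is illustrative: `build_haystack` and `needle_accuracy` are hypothetical helpers, and `toy_model` is a stand-in that only "reads" the first and last 30% of its input so the script runs without a real LLM. In practice `ask_model` would wrap a call to the model under test.

```python
from typing import Callable


def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    index = round(depth * len(filler))
    return " ".join(filler[:index] + [needle] + filler[index:])


def needle_accuracy(
    ask_model: Callable[[str, str], str],
    filler: list[str],
    needle: str,
    question: str,
    expected: str,
    depths: list[float],
) -> dict[float, bool]:
    """Map each depth to whether the model's answer contained `expected`."""
    results = {}
    for depth in depths:
        context = build_haystack(filler, needle, depth)
        answer = ask_model(context, question)
        results[depth] = expected.lower() in answer.lower()
    return results


def toy_model(context: str, question: str) -> str:
    # Hypothetical stand-in for an LLM call: it only sees the first and last
    # 30% of the context, mimicking a model that loses the middle.
    cut = int(len(context) * 0.3)
    visible = context[:cut] + " " + context[-cut:]
    return "The code is 7421." if "7421" in visible else "I don't know."


filler = [f"Sentence number {i} is filler." for i in range(200)]
results = needle_accuracy(
    toy_model,
    filler,
    needle="The secret code is 7421.",
    question="What is the secret code?",
    expected="7421",
    depths=[0.0, 0.25, 0.5, 0.75, 1.0],
)
for depth, found in results.items():
    print(f"depth {depth:.2f}: {'hit' if found else 'miss'}")
```

With this toy model the needle is found at the start and end of the context and missed at the midpoint. A real evaluation would repeat the sweep across several context lengths and many trials per position.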
