Context Windows in Large Language Models
A context window is the maximum amount of text, measured in tokens, that an LLM can process at once: the model's "working memory." This fundamental constraint shapes how models understand and generate text, from simple queries to complex documents.
The Context Length Challenge
Memory Complexity
The quadratic complexity of self-attention is the primary bottleneck. For standard (dense) attention:
Compute: O(n² · d)
Attention-matrix memory: O(n²)
Where:
- n = sequence length (context size)
- d = model dimension
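For a 100K-token context, a single fully materialized attention matrix (one head, one layer) has:
- Attention matrix: 100,000² = 10 billion elements
- Memory required: ~40 GB in float32 (4 bytes per element)
- Computation: 10 billion query-key dot products per attention head in each layer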
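The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the full n × n score matrix is stored for one head in one layer; the function name `attention_matrix_bytes` is made up for illustration, and real implementations often avoid materializing this matrix at all.

```python
def attention_matrix_bytes(n_tokens: int, bytes_per_element: int = 4) -> int:
    """Bytes needed for one n x n attention score matrix (one head, one layer).

    Assumes the matrix is fully materialized; bytes_per_element=4 is float32.
    """
    return n_tokens * n_tokens * bytes_per_element


# Compare the context sizes discussed in this article.
for n in (2_000, 8_000, 32_000, 100_000):
    elements = n * n
    gigabytes = attention_matrix_bytes(n) / 1e9
    print(f"{n:>7,} tokens: {elements:>15,} elements, {gigabytes:8.3f} GB (float32)")
```

Growing the context 50x, from 2K to 100K tokens, multiplies the matrix size by 2,500x, which is what "quadratic" means in practice.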
The Information Bottleneck
Beyond cost, the window is an information bottleneck: the model must fit everything it needs to reason about into one fixed-size sequence.
Extending the Context Window
A bigger window means more O(n²) attention, so practical long-context models combine several tricks rather than just scaling n:
- Cheaper attention: sliding-window and sparse attention drop the cost from O(n²) toward O(n·w), where w is the number of tokens each position may attend to, and FlashAttention keeps attention exact while needing only O(n) memory. Reusing past Keys/Values via the KV cache avoids recomputing them at each generation step.
- Positions that extrapolate: rotary embeddings (with interpolation / NTK / YaRN scaling) and ALiBi let a model run on sequences longer than those it was trained on.
- Anchoring the start: attention sinks keep the first few tokens cached so that streaming past the trained length stays stable.
- Beyond any window: retrieval-augmented generation (RAG) fetches only the relevant chunks from an external store instead of growing the window at all.
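To make the O(n²) versus O(n·w) difference concrete, here is a small sketch of a causal sliding-window mask in plain Python. The helper names and the tiny sizes (n = 8, w = 3) are assumptions chosen for readability; production kernels never build the mask as nested lists.

```python
def sliding_window_mask(n: int, w: int) -> list[list[bool]]:
    """Causal sliding-window mask.

    mask[i][j] is True when query position i may attend to key position j,
    i.e. j is one of the w most recent positions up to and including i.
    """
    return [[0 <= i - j < w for j in range(n)] for i in range(n)]


def count_allowed(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs that need an attention score."""
    return sum(sum(row) for row in mask)


n, w = 8, 3
window = sliding_window_mask(n, w)
full_causal = sliding_window_mask(n, n)  # a window of size n is plain causal attention

for row in window:
    print("".join("x" if allowed else "." for allowed in row))

print("sliding window pairs:", count_allowed(window), "(bounded by n*w =", n * w, ")")
print("full causal pairs:   ", count_allowed(full_causal))
```

With a fixed w, the number of scored pairs grows linearly in n, whereas the full causal mask grows as n(n+1)/2.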
Practical Implications
Effective Context Utilization
Not all of the context is used equally. Long-context evaluations suggest that models rely most heavily on:
- Beginning (primacy effect)
- End (recency effect)
- Semantically relevant sections
The Lost-in-the-Middle Problem
Performance degrades for information placed in the middle of a long context:
- Start: High attention (prompts, instructions)
- Middle: Low attention (often ignored)
- End: High attention (recent context)
Context Length vs Quality Trade-off
| Context Size | Benefits | Drawbacks |
|---|---|---|
| 2K | Fast, cheap | Limited applications |
| 8K | Good for most tasks | May truncate documents |
| 32K | Full documents | Slower, more expensive |
| 100K+ | Books, codebases | Very slow, costly |
Measuring it: Needle in a Haystack
The standard probe for lost-in-the-middle is the needle-in-a-haystack test: plant a fact at many positions in a long context, query for it, and measure retrieval accuracy by position. Strong long-context models stay accurate everywhere; weaker ones miss the middle.
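The shape of such a test harness can be sketched as follows. Everything here is illustrative: `toy_model` is a hypothetical stand-in that only "reads" the first and last few sentences, so that the script runs without any real LLM; in an actual evaluation it would be replaced by a call to the model under test, and the filler text, needle, and depths would be chosen more carefully.

```python
NEEDLE = "The secret code is 7421."
FILLER = "The sky was a pale shade of grey that morning."


def build_haystack(n_sentences: int, depth: float) -> list[str]:
    """Filler sentences with the needle inserted at a relative depth.

    depth=0.0 puts the needle at the very start, depth=1.0 at the very end.
    """
    sentences = [FILLER] * n_sentences
    position = round(depth * n_sentences)
    return sentences[:position] + [NEEDLE] + sentences[position:]


def toy_model(context: list[str], edge: int = 20) -> str:
    """Hypothetical stand-in for an LLM that ignores the middle of its context."""
    visible = context[:edge] + context[-edge:]
    return "7421" if NEEDLE in visible else "unknown"


def run_probe(model, n_sentences: int = 200,
              depths: tuple = (0.0, 0.25, 0.5, 0.75, 1.0)) -> dict:
    """Map each needle depth to whether the model retrieved the fact."""
    results = {}
    for depth in depths:
        answer = model(build_haystack(n_sentences, depth))
        results[depth] = "7421" in answer
    return results


results = run_probe(toy_model)
for depth, found in results.items():
    print(f"depth {depth:.2f}: {'retrieved' if found else 'missed'}")
```

A real run typically sweeps both context length and needle depth and reports accuracy as a grid, which makes a weak middle region easy to spot.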
