The idea: attend locally, not globally
Full self-attention compares every token with every other token, which costs O(n²); this quadratic wall is what makes long sequences expensive. Sliding window attention removes the wall with one restriction: each token attends only to a fixed-size window of nearby tokens instead of the whole sequence.
With a window of w, every query touches at most w keys, so cost drops to O(n × w) — linear in sequence length when w is fixed.
This page assumes you know self-attention and why standard attention is quadratic.
Two flavours: causal vs symmetric
The "window" means different things depending on the model:
- Causal (decoder LMs, e.g. Mistral): token i attends to the w most recent tokens, itself included, positions [i-w+1, i]. It never looks ahead, so it works for autoregressive generation. Mistral 7B uses a causal window of 4096.
- Symmetric (encoders, e.g. Longformer): token i attends to w tokens on each side, positions [i-w, i+w], a band of width 2w+1.
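The two index ranges above can be written down directly. This is a minimal sketch: the function names `causal_window` and `symmetric_window` are hypothetical, and clipping the range at the sequence boundaries is an assumption about how the edges are handled.

```python
def causal_window(i: int, w: int) -> range:
    """Key positions for query i under a causal window: [i-w+1, i], clipped at 0."""
    return range(max(0, i - w + 1), i + 1)


def symmetric_window(i: int, w: int, n: int) -> range:
    """Key positions for query i under a symmetric window: [i-w, i+w], clipped to [0, n-1]."""
    return range(max(0, i - w), min(n - 1, i + w) + 1)


print(list(causal_window(5, 3)))         # [3, 4, 5]
print(list(symmetric_window(5, 2, 10)))  # [3, 4, 5, 6, 7]
print(list(causal_window(1, 3)))         # [0, 1]  (clipped at the start of the sequence)
```

Away from the edges, the causal range always holds w positions and the symmetric range 2w+1.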
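Either way, attention is computed only over the windowed keys and values. Writing W(i) for the set of positions token i may attend to, K_{W(i)} and V_{W(i)} for the rows of K and V at those positions, and d_k for the key dimension:

$$\text{Attention}(Q, K, V)_i = \text{softmax}\!\left(\frac{q_i K_{W(i)}^\top}{\sqrt{d_k}}\right) V_{W(i)}$$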
In practice the restriction is applied as a band-diagonal mask before softmax, so it parallelizes just like full attention.
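As an illustration of the mask-before-softmax step, here is a small pure-Python sketch. The names `band_mask` and `windowed_attention` are hypothetical, and the code loops over all n keys for clarity; it shows the masking logic, not the O(n × w) cost, which real kernels get by skipping masked blocks entirely.

```python
import math


def band_mask(n: int, w: int, causal: bool = True) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j."""
    if causal:
        return [[i - w + 1 <= j <= i for j in range(n)] for i in range(n)]
    return [[abs(i - j) <= w for j in range(n)] for i in range(n)]


def windowed_attention(q, k, v, mask):
    """Scaled dot-product attention with disallowed scores set to -inf before softmax."""
    d = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) if mask[i][j] else -math.inf
            for j, k_j in enumerate(k)
        ]
        m = max(scores)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in scores]  # exp(-inf) is 0.0, so masked keys drop out
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(wt * v_j[c] for wt, v_j in zip(weights, v)) for c in range(len(v[0]))])
    return out


n, w = 6, 2
x = [[float(i), 1.0] for i in range(n)]  # toy inputs reused as q, k, and v
out = windowed_attention(x, x, x, band_mask(n, w, causal=True))
print(out[0])  # [0.0, 1.0]: token 0 can only see itself
```

Because every row keeps its diagonal entry, each softmax has at least one finite score and never divides by zero.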
Why locality is enough: depth grows the receptive field
A single windowed layer only sees w neighbours — so how can such a model capture long-range structure? The same way a stack of small convolutions does: the receptive field grows with depth.
Layer 1 mixes information within a window of w. Layer 2 attends to tokens that have already absorbed their own windows, so its effective reach is about 2w. After L layers a token can be influenced by information up to roughly L × w positions away:

$$\text{receptive field} \approx L \times w$$
Mistral's 32 layers × 4096 window gives a theoretical reach of ~131K tokens — enough to span a 32K context. (It's an upper bound: signal still has to survive being re-mixed at every layer, so effective range is shorter than the theoretical one.)
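The growth with depth can be checked with a short sketch. Both function names are hypothetical. `simulated_reach` propagates the set of positions that can influence the last token through stacked causal windowed layers, assuming the [i-w+1, i] convention; under that convention each layer adds w-1 positions, which is why the text's L × w is a rough figure.

```python
def theoretical_reach(num_layers: int, w: int) -> int:
    """Upper bound used in the text: roughly num_layers * w positions."""
    return num_layers * w


def simulated_reach(n: int, num_layers: int, w: int) -> int:
    """Distance from the last token to the earliest position that can influence it
    after num_layers causal windowed layers (window [i-w+1, i], clipped at 0)."""
    influencers = {n - 1}
    for _ in range(num_layers):
        influencers = {j for i in influencers for j in range(max(0, i - w + 1), i + 1)}
    return (n - 1) - min(influencers)


print(theoretical_reach(32, 4096))  # 131072, the ~131K figure quoted for Mistral
print(simulated_reach(100, 3, 4))   # 9, i.e. 3 layers * (4 - 1) positions back
```

This only tracks which positions are reachable, not how much of their signal survives the repeated mixing.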
Cost compared to full attention
| | Full attention | Sliding window |
|---|---|---|
| Attention per token | all n keys | w keys |
| Time | O(n²) | O(n × w) |
| Memory (scores) | O(n²) | O(n × w) |
| Long-range context | direct | indirect, via depth |
For a 32K sequence with a 4K window, each query scores 4K keys instead of 32K, so the per-layer attention matrix shrinks by ~87.5% (only 1/8 of the scores remain). That saving is what lets long-context models run on the same hardware.
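The 87.5% figure follows from counting scores. A quick check, assuming "32K" and "4K" mean 32768 and 4096 and ignoring the shorter windows at the start of the sequence (`score_counts` is a hypothetical helper):

```python
def score_counts(n: int, w: int) -> tuple[int, int]:
    """Query-key scores per layer and head: full attention vs. w keys per query."""
    return n * n, n * w


n, w = 32 * 1024, 4 * 1024  # assumption: "32K" and "4K" read as 32768 and 4096
full, windowed = score_counts(n, w)
saving = 1 - windowed / full
print(f"full={full:,} windowed={windowed:,} saving={saving:.1%}")
# full=1,073,741,824 windowed=134,217,728 saving=87.5%
```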
Combining and extending it
Sliding windows are usually one ingredient, not the whole recipe:
- Global tokens (Longformer / BigBird): keep a few tokens that attend to, and are attended by, everything, so a CLS-like token or document metadata stays globally reachable while the rest stay local.
- Dilated windows: attend to every d-th position instead of consecutive ones, widening reach without adding keys (the attention analogue of dilated convolutions).
- Attention sinks: pin the first few tokens alongside the window to keep streaming generation stable.
- Constant-memory inference: pair the window with a rolling KV cache (Mistral) so only the last w keys are stored, and fuse the band mask into FlashAttention to avoid materializing it.
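The rolling KV cache in the last item can be sketched as a fixed-size buffer in which the entry for position i overwrites slot i mod w. This is an illustrative toy, not Mistral's actual implementation; the class name `RollingKVCache` and its methods are hypothetical, and keys and values are stand-in strings rather than tensors.

```python
class RollingKVCache:
    """Fixed-size cache: the entry for position i lives in slot i % w."""

    def __init__(self, w: int):
        self.w = w
        self.slots = [None] * w  # each filled slot holds (position, key, value)
        self.next_pos = 0

    def append(self, key, value) -> None:
        # Overwrites whatever was stored w positions ago.
        self.slots[self.next_pos % self.w] = (self.next_pos, key, value)
        self.next_pos += 1

    def window(self):
        """Cached entries in position order, oldest first."""
        return sorted((s for s in self.slots if s is not None), key=lambda s: s[0])


cache = RollingKVCache(w=3)
for t in range(5):
    cache.append(f"k{t}", f"v{t}")
print([pos for pos, _, _ in cache.window()])  # [2, 3, 4]
```

Memory stays at w entries however long generation runs, which is the constant-memory property the list item describes.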
It is one point on a spectrum of cheaper-than-quadratic attention — see also sparse attention patterns and the broader idea of the context window.
When it fits (and when it doesn't)
Sliding window attention shines when the signal is mostly local — code, long documents, audio/transcription — and where you can afford to let long-range information propagate through depth. Its weakness is the flip side: anything needing a direct long jump (retrieving a fact from 20K tokens ago in a single hop) is harder, since that information must survive being re-mixed layer by layer. Global tokens and attention sinks exist precisely to patch that gap.
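To make the patch concrete, here is a minimal sketch of a causal band mask that also keeps the first few positions visible to every later query, in the spirit of attention sinks. The function name and the `num_sinks` parameter are hypothetical. Longformer-style global tokens differ in that they also attend to everything themselves.

```python
def window_with_sinks_mask(n: int, w: int, num_sinks: int) -> list[list[bool]]:
    """Causal band mask, plus the first num_sinks positions visible to every later query."""
    return [
        [j <= i and (j >= i - w + 1 or j < num_sinks) for j in range(n)]
        for i in range(n)
    ]


mask = window_with_sinks_mask(n=8, w=3, num_sinks=1)
print([j for j, ok in enumerate(mask[7]) if ok])  # [0, 5, 6, 7]
```

Position 0 stays reachable in a single hop from any later token, while everything else outside the window is still reachable only through depth.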
Further reading
- Longformer: The Long-Document Transformer — Beltagy et al., 2020 (sliding + global + dilated windows)
- Mistral 7B — Jiang et al., 2023 (causal sliding window + rolling buffer cache)
