Sliding Window Attention

Summary: Sliding window attention for long sequences: local context windows cut attention cost from O(n²) to O(n × w), which is linear in sequence length for a fixed window. Used in Mistral and Longformer.

The idea: attend locally, not globally

Full self-attention compares every token with every other token, which costs O(n²) — the wall that makes long sequences expensive. Sliding window attention removes the wall with one restriction: each token only attends to a fixed-size window of nearby tokens instead of the whole sequence.

With a window of w, every query touches at most w keys, so cost drops to O(n × w) — linear in sequence length when w is fixed.

This page assumes you know self-attention and why standard attention is quadratic.

Two flavours: causal vs symmetric

The "window" means different things depending on the model:

Causal (decoder LMs such as Mistral): token i attends to itself and the previous w-1 tokens, positions [i-w+1, i], for w keys in total. It never looks ahead, so it works for autoregressive generation. Mistral 7B uses a causal window of 4096.
Symmetric (encoders such as Longformer): token i attends to w tokens on each side, positions [i-w, i+w], a band of width 2w+1.

Either way, attention is computed only over the windowed keys and values:

\text{Attention}(i) = \text{softmax}\!\left(\frac{Q_i\,K_{\text{window}(i)}^\top}{\sqrt{d}}\right) V_{\text{window}(i)}

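Conceptually the restriction is a band-diagonal mask applied before softmax, so it parallelizes just like full attention. Efficient kernels go one step further and skip the masked-out blocks instead of computing and discarding them, which is where the O(n × w) cost comes from.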
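The two window definitions and the masked softmax can be sketched in a few lines of NumPy. This is a minimal illustration, not any model's actual implementation: the function names `window_mask` and `sliding_window_attention` are made up for this example, and the code handles a single head with no batching.

```python
import numpy as np

def window_mask(n, w, causal=True):
    """Boolean (n, n) mask: True where query i may attend to key j."""
    i = np.arange(n)[:, None]  # query positions as a column
    j = np.arange(n)[None, :]  # key positions as a row
    if causal:
        # Causal window: keys in [i-w+1, i]
        return (j <= i) & (j > i - w)
    # Symmetric window: keys in [i-w, i+w]
    return np.abs(i - j) <= w

def sliding_window_attention(Q, K, V, w, causal=True):
    """Dense reference version: masks a full score matrix (single head)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = window_mask(n, w, causal)
    scores = np.where(mask, scores, -np.inf)      # masked keys get zero weight
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Usage: 6 tokens, head dimension 4, causal window of 3
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
out = sliding_window_attention(Q, K, V, w=3)
```

Note that this dense version still builds the full n × n score matrix, so it shows the semantics of the mask rather than the speed-up. A real kernel would only compute the scores inside the band.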

Why locality is enough: depth grows the receptive field

A single windowed layer only sees w neighbours — so how can such a model capture long-range structure? The same way a stack of small convolutions does: the receptive field grows with depth.

Layer 1 mixes information within a window of w. Layer 2 attends to tokens that already absorbed their windows, so its effective reach is 2w. After L layers a token can be influenced by information up to roughly L × w positions away:

\text{receptive field} \approx L \times w

Mistral's 32 layers × 4096 window gives a theoretical reach of ~131K tokens — enough to span a 32K context. (It's an upper bound: signal still has to survive being re-mixed at every layer, so effective range is shorter than the theoretical one.)
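The arithmetic can be checked with a short sketch. It assumes the causal window [i-w+1, i], where each layer lets information travel at most w-1 positions back, so the exact bound is L × (w-1), slightly under the L × w rule of thumb. Both function names are hypothetical helpers for this illustration.

```python
def theoretical_reach(window, num_layers):
    """Rule-of-thumb receptive field: roughly L x w positions."""
    return num_layers * window

def earliest_reachable(position, window, num_layers):
    """Earliest position whose information can reach `position`.

    Assumes a causal window [i-w+1, i]: each layer moves the frontier
    back by at most (window - 1) positions, clamped at position 0.
    """
    earliest = position
    for _ in range(num_layers):
        earliest = max(0, earliest - (window - 1))
    return earliest

# Figures from the text: 32 layers, window of 4096
reach = theoretical_reach(window=4096, num_layers=32)
frontier = earliest_reachable(position=200_000, window=4096, num_layers=32)
```

This only tracks which positions are connected at all. It says nothing about how much of the signal survives each re-mixing step, which is why the effective range is shorter.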

Cost compared to full attention

Property	Full attention	Sliding window
Keys attended per token	all n keys	w keys
Time	O(n²)	O(n × w)
Memory (scores)	O(n²)	O(n × w)
Long-range context	direct	indirect, via depth

For a 32K sequence with a 4K window, each query scores 4K keys instead of 32K, so the per-layer attention matrix shrinks to w/n = 1/8 of its full size, a reduction of ~87.5%. That saving is what lets long-context models run on the same hardware.
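The 87.5% figure follows directly from the score counts in the table. The sketch below uses the same simplified counting (n² scores for full attention, n × w for the window) and ignores constant factors such as the causal mask halving the full matrix. The helper names are illustrative only.

```python
def full_scores(n):
    """Attention scores per layer for full attention: n queries x n keys."""
    return n * n

def window_scores(n, w):
    """Attention scores per layer with a window: n queries x at most w keys."""
    return n * min(w, n)

def saving(n, w):
    """Fraction of score computations avoided by using the window."""
    return 1 - window_scores(n, w) / full_scores(n)

# 32K sequence with a 4K window
n, w = 32 * 1024, 4 * 1024
fraction_saved = saving(n, w)
```

Doubling n with w fixed doubles `window_scores` but quadruples `full_scores`, which is the linear versus quadratic gap in practice.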

Combining and extending it

Sliding windows are usually one ingredient, not the whole recipe:

Global tokens (Longformer / BigBird): keep a few tokens that attend to — and are attended by — everything, so a CLS-like token or document metadata stays globally reachable while the rest stay local.
Dilated windows: attend to every d-th position instead of consecutive ones, widening reach without adding keys — the attention analogue of dilated convolutions.
Attention sinks: pin the first few tokens alongside the window to keep streaming generation stable.
Constant-memory inference: pair the window with a rolling KV cache (Mistral) so only the last w keys and values are stored per layer, and fuse the band mask into FlashAttention to avoid materializing it.
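The rolling KV cache from the last item can be sketched as a fixed-size buffer in which token t overwrites slot t mod w. The class below is a hypothetical, framework-free illustration (the name `RollingKVCache` and its methods are assumptions, and keys and values are plain Python objects rather than tensors), not Mistral's actual code.

```python
class RollingKVCache:
    """Fixed-size cache holding the keys/values of the last `window` tokens."""

    def __init__(self, window):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window
        self.length = 0  # total number of tokens seen so far

    def append(self, key, value):
        # Token t lands in slot t mod window, overwriting the oldest entry
        slot = self.length % self.window
        self.keys[slot] = key
        self.values[slot] = value
        self.length += 1

    def contents(self):
        """Return cached (keys, values) ordered from oldest to newest."""
        n = min(self.length, self.window)
        start = self.length - n
        order = [(start + k) % self.window for k in range(n)]
        return [self.keys[s] for s in order], [self.values[s] for s in order]

# Usage: a window of 3 after generating 5 tokens
cache = RollingKVCache(window=3)
for t in range(5):
    cache.append(f"k{t}", f"v{t}")
keys, values = cache.contents()
```

Memory stays at w entries per layer no matter how long generation runs. Pinning attention-sink tokens would need a few extra slots that are never overwritten, which this sketch leaves out.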

It is one point on a spectrum of cheaper-than-quadratic attention — see also sparse attention patterns and the broader idea of the context window.

When it fits (and when it doesn't)

Sliding window attention shines when the signal is mostly local — code, long documents, audio/transcription — and where you can afford to let long-range information propagate through depth. Its weakness is the flip side: anything needing a direct long jump (retrieving a fact from 20K tokens ago in a single hop) is harder, since that information must survive being re-mixed layer by layer. Global tokens and attention sinks exist precisely to patch that gap.