Attention Sinks: Stable Streaming LLMs
Learn about attention sinks, where LLMs concentrate attention on initial tokens, and how preserving them enables streaming inference.
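To make the idea concrete, below is a minimal sketch of the cache-eviction policy this implies, assuming a per-layer KV cache stored as (seq_len, num_heads, head_dim) tensors. The function name evict_kv_cache and the parameters num_sink_tokens and window_size are illustrative, not from any particular library; they stand in for "always keep the first few sink tokens" plus "keep a sliding window of recent tokens".

```python
# Minimal sketch (assumed interface): keep the attention-sink prefix
# plus a sliding window of recent tokens, evicting everything between.
from typing import Tuple

import torch


def evict_kv_cache(
    keys: torch.Tensor,          # (seq_len, num_heads, head_dim)
    values: torch.Tensor,        # (seq_len, num_heads, head_dim)
    num_sink_tokens: int = 4,    # initial "sink" tokens, never evicted
    window_size: int = 1024,     # most recent tokens kept for local context
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Keep the first few (sink) tokens plus a window of recent tokens."""
    seq_len = keys.shape[0]
    if seq_len <= num_sink_tokens + window_size:
        # Cache is still small enough; nothing to evict yet.
        return keys, values
    # Concatenate the sink prefix with the recent window, dropping the middle.
    kept_keys = torch.cat([keys[:num_sink_tokens], keys[-window_size:]], dim=0)
    kept_values = torch.cat([values[:num_sink_tokens], values[-window_size:]], dim=0)
    return kept_keys, kept_values


if __name__ == "__main__":
    # Example: a cache that has grown to 2000 tokens gets trimmed to
    # 4 sink tokens + 1024 recent tokens = 1028 entries.
    k = torch.randn(2000, 8, 64)
    v = torch.randn(2000, 8, 64)
    k2, v2 = evict_kv_cache(k, v, num_sink_tokens=4, window_size=1024)
    print(k2.shape)  # torch.Size([1028, 8, 64])
```

In practice this policy would be applied to every layer's cache after each generated token, so memory stays bounded while the preserved sink tokens keep the attention distribution stable during long streaming sessions.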