CLS Token in Vision Transformers
Learn how the CLS token acts as a global information aggregator in Vision Transformers, enabling whole-image classification through attention mechanisms.
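As a minimal sketch of the idea (the class name, layer sizes, and use of PyTorch's built-in encoder are illustrative assumptions, not the article's implementation): a learnable CLS embedding is prepended to the patch sequence, attends to every patch in each encoder layer, and its final hidden state alone feeds the classification head.

```python
import torch
import torch.nn as nn

class ViTWithCLS(nn.Module):
    """Illustrative classifier: a learnable CLS token aggregates patch information."""
    def __init__(self, dim=768, depth=4, heads=12, num_classes=1000, num_patches=196):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))              # the learnable [CLS] embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patches):                                            # patches: (B, N, dim) embedded image patches
        cls = self.cls_token.expand(patches.shape[0], -1, -1)              # one CLS copy per image
        x = torch.cat([cls, patches], dim=1) + self.pos_embed              # CLS sits at position 0
        x = self.encoder(x)                                                # CLS attends to every patch in every layer
        return self.head(x[:, 0])                                          # classify from the CLS hidden state alone

logits = ViTWithCLS()(torch.randn(2, 196, 768))                            # -> (2, 1000)
```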
Explore machine learning concepts related to attention. Clear explanations and practical insights.
Explore how hierarchical attention enables Vision Transformers (ViT) to process images at multiple scales by merging patches into progressively coarser representations.
Explore how multi-head attention enables Vision Transformers (ViT) to attend to different representation subspaces in parallel, capturing diverse visual relationships.
Explore how positional embeddings enable Vision Transformers (ViT) to process sequential data by encoding relative positions.
Explore how self-attention enables Vision Transformers (ViT) to understand images by capturing global context, with CNN comparison.
Learn ALiBi, the position encoding method that adds linear biases to attention scores for exceptional length extrapolation in transformers.
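A rough sketch of the bias itself (the helper name and the power-of-two slope schedule are illustrative assumptions): each head gets a linear penalty that grows with query-key distance and is added directly to the attention logits.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear penalty that grows with query-key distance."""
    # Geometric slope schedule; this simple closed form assumes num_heads is a power of two.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)    # how far back each key is; 0 for future keys
    return -slopes[:, None, None] * distance                 # (heads, seq, seq), added to the QK^T logits

biased_logits = torch.randn(8, 128, 128) + alibi_bias(8, 128)
```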
Compare Multi-Head, Grouped-Query, and Multi-Query Attention mechanisms to understand their trade-offs and choose the optimal approach for your use case.
Learn about attention sinks, where LLMs concentrate attention on initial tokens, and how preserving them enables streaming inference.
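To illustrate the cache policy this implies for streaming (the function name and defaults here are hypothetical): keep a handful of initial "sink" positions plus a recent window, and evict everything in between.

```python
def streaming_cache_positions(cache_len: int, num_sinks: int = 4, window: int = 1024) -> list:
    """KV-cache positions retained for streaming inference: initial sink tokens + a recent window."""
    if cache_len <= num_sinks + window:
        return list(range(cache_len))                                     # nothing to evict yet
    return list(range(num_sinks)) + list(range(cache_len - window, cache_len))

print(streaming_cache_positions(10_000)[:6])                              # [0, 1, 2, 3, 8976, 8977]
```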
Understand cross-attention, the mechanism that enables transformers to align and fuse information from different sources, sequences, or modalities.
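The core difference from self-attention fits in a few lines (a hedged sketch; shapes are assumptions and the learned Q/K/V projections are omitted for brevity): queries come from one sequence while keys and values come from another, so the output re-expresses the first sequence in terms of the second.

```python
import torch

def cross_attention(x, context):
    """x: (B, T_q, d) supplies queries; context: (B, T_kv, d) supplies keys and values."""
    scores = x @ context.transpose(-2, -1) / x.shape[-1] ** 0.5   # (B, T_q, T_kv): query-to-context affinities
    return torch.softmax(scores, dim=-1) @ context                # fuse context information into the query sequence

out = cross_attention(torch.randn(2, 5, 64), torch.randn(2, 9, 64))   # -> (2, 5, 64)
```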
Learn how Grouped-Query Attention (GQA) balances Multi-Head quality with Multi-Query efficiency for faster LLM inference.
Explore linear complexity attention mechanisms including Performer, Linformer, and other efficient transformers that scale to very long sequences.
Learn how masked attention enables autoregressive generation and prevents information leakage in transformers and language models.
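A tiny sketch of the causal mask that enforces this (sizes are illustrative): each position may attend only to itself and earlier tokens, so future information cannot leak into the prediction.

```python
import torch

allowed = torch.tril(torch.ones(5, 5, dtype=torch.bool))            # True on and below the diagonal: past and present only
scores = torch.randn(5, 5).masked_fill(~allowed, float("-inf"))     # future positions get -inf before the softmax
weights = torch.softmax(scores, dim=-1)                             # masked positions receive exactly zero attention
```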
Learn Multi-Query Attention (MQA), the optimization that shares keys and values across attention heads for massive memory savings.
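A hedged sketch covering both this and the GQA entry above (function name and shapes are assumptions): a small number of key/value heads is shared across groups of query heads; `num_kv_heads=1` gives MQA, while setting it equal to the query-head count recovers standard multi-head attention.

```python
import torch

def grouped_query_attention(q, k, v, num_kv_heads):
    """q: (B, H, T, d) query heads; k, v: (B, num_kv_heads, T, d) shared key/value heads."""
    group = q.shape[1] // num_kv_heads                        # query heads per shared K/V head
    k = k.repeat_interleave(group, dim=1)                     # broadcast each K/V head across its group
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

out = grouped_query_attention(torch.randn(1, 8, 16, 64),      # 8 query heads
                              torch.randn(1, 2, 16, 64),      # only 2 K/V heads kept in the cache
                              torch.randn(1, 2, 16, 64),
                              num_kv_heads=2)
```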
Learn Rotary Position Embeddings (RoPE), the elegant position encoding using rotation matrices, powering LLaMA, Mistral, and modern LLMs.
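A compact sketch of the rotation (this uses the interleaved-pair convention; the function name, base, and shapes are illustrative): consecutive feature pairs of queries and keys are rotated by angles that grow linearly with position, so their dot products depend only on relative offsets.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive feature pairs of x (..., seq, dim) by position-dependent angles."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)            # one frequency per feature pair
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]    # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                                    # split into interleaved pairs
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

q = rope(torch.randn(16, 64))    # rotated queries; apply the same function to keys
```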
Master scaled dot-product attention, the fundamental transformer building block. Learn why scaling is crucial for stable training.
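The whole mechanism fits in a few lines (a sketch, not a library implementation; PyTorch also ships a fused `torch.nn.functional.scaled_dot_product_attention`): without the 1/sqrt(d_k) factor, the variance of the logits grows with the key dimension, the softmax saturates, and gradients vanish.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])   # scaling keeps logit variance ~1 regardless of d_k
    return torch.softmax(scores, dim=-1) @ v

out = scaled_dot_product_attention(torch.randn(2, 8, 64),
                                   torch.randn(2, 8, 64),
                                   torch.randn(2, 8, 64))
```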
Learn how sliding window attention handles long sequences: local context windows reduce complexity to O(n), as used in Mistral and Longformer.
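A small sketch of the mask that defines the window (function name and sizes are assumptions): each query attends only to itself and the previous `window - 1` positions, so total work grows linearly with sequence length.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where a query may attend: itself and the previous `window - 1` positions."""
    pos = torch.arange(seq_len)
    offset = pos[:, None] - pos[None, :]                  # how far back each key is from each query
    return (offset >= 0) & (offset < window)              # causal and local: O(seq_len * window) attended pairs

mask = sliding_window_mask(8, window=3)
```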
Explore sparse attention mechanisms that reduce quadratic complexity to linear or sub-quadratic, enabling efficient processing of long sequences.
Learn adaptive tiling in vision transformers: dynamically partition images based on visual complexity to reduce token counts by up to 80% while preserving detail where it matters.
Deep dive into how different prompt components influence model behavior across transformer layers, from surface patterns to abstract reasoning.
Interactive visualization of LLM context windows - sliding windows, expanding contexts, and attention patterns that define model memory limits.
Interactive Flash Attention visualization - the IO-aware algorithm achieving memory-efficient exact attention through tiling and kernel fusion.