Tokenization: Converting Text to Numbers

Summary: Interactive exploration of tokenization methods in LLMs - BPE, SentencePiece, and WordPiece. Understand how text becomes tokens that models can process.

Tokenization in Large Language Models

Tokenization is the foundational step in LLM processing - converting raw text into a sequence of integer token IDs that the model's embedding layer maps to vectors. The choice of tokenization method affects model performance, vocabulary size, sequence length, and the handling of rare words.

Why Tokenization Matters

The Vocabulary Size Dilemma

V_char ≈ 256 (byte-level) ≪ V_subword ≈ 50k < V_word ≈ 170k

Different granularities offer different trade-offs:

Character-level: Small vocabulary, long sequences
Subword-level: Balanced vocabulary and sequence length
Word-level: Large vocabulary, short sequences
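
The sequence-length side of this trade-off can be seen on a single sentence. The snippet below is a minimal sketch: the sample sentence is made up for illustration, and the word-level split is deliberately naive (whitespace only), which is an assumption rather than how any production tokenizer works.

```python
text = "Tokenization converts raw text into numbers"

# Character-level: one token per character, spaces included.
char_tokens = list(text)

# Word-level: naive whitespace split (illustrative assumption only).
word_tokens = text.split()

print(len(char_tokens), "character tokens")
print(len(word_tokens), "word tokens")
```

The character sequence is several times longer than the word sequence, while its vocabulary (the set of distinct characters) stays tiny. Subword tokenizers sit between these two extremes.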

The OOV Problem

Out-of-vocabulary (OOV) words are a critical challenge:

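Word-level tokenization fails on new/rare words, which collapse to a single unknown token
Subword methods decompose unseen words into known pieces, falling back to characters or bytes if needed
Character-level always works, but individual characters carry little meaning, so the model must learn word structure from much longer sequences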
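
The failure mode of word-level vocabularies can be sketched in a few lines. The vocabulary, the `<unk>` symbol, and the function name `word_level_encode` are all hypothetical and chosen only for illustration.

```python
UNK = "<unk>"  # hypothetical unknown-token symbol

def word_level_encode(text, vocab):
    """Keep each whitespace-separated word if it is in vocab, else emit UNK."""
    return [word if word in vocab else UNK for word in text.split()]

# Toy vocabulary (illustrative assumption).
word_vocab = {"the", "model", "reads", "text"}

known = word_level_encode("the model reads text", word_vocab)
unseen = word_level_encode("the model reads tokenizers", word_vocab)

print(known)   # ['the', 'model', 'reads', 'text']
print(unseen)  # ['the', 'model', 'reads', '<unk>']
```

Once "tokenizers" has been replaced by `<unk>`, the information about which word appeared is gone, which is the gap subword methods are designed to close.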

Tokenization Methods

Byte Pair Encoding (BPE)

BPE builds vocabulary through iterative merging:

Initialize with character-level tokens
Count all adjacent token pairs
Merge the most frequent pair
Add new token to vocabulary
Repeat until vocabulary size reached
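
The merge loop above can be sketched as a toy trainer. This is a minimal illustration, not a production implementation: the word-frequency table is made up, there is no end-of-word marker, training stops after a fixed number of merges instead of a target vocabulary size, and ties between equally frequent pairs are broken by comparing the pairs themselves (an assumption made here so the result is reproducible).

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    new_symbol = pair[0] + pair[1]
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(new_symbol)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Return the ordered merge list and the resulting vocabulary."""
    # Initialize with character-level tokens.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    vocab = {ch for word in word_freqs for ch in word}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        corpus = merge_pair(corpus, best)
        merges.append(best)
        vocab.add(best[0] + best[1])
    return merges, vocab

# Toy word frequencies (illustrative assumption).
merges, vocab = train_bpe(
    {"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3
)
print(merges)  # [('s', 't'), ('e', 'st'), ('o', 'w')]
```

The ordered merge list is what a BPE tokenizer stores: encoding new text replays the same merges in the same order.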

P(w) = ∏_{i=1}^{n} P(t_i | t_1, …, t_{i-1})

SentencePiece

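SentencePiece is a tokenization library rather than a separate algorithm: it trains BPE or unigram language-model vocabularies. Key differences from conventional BPE pipelines:

Operates on raw text as a sequence of Unicode characters (byte fallback is optional)
Encodes whitespace inside tokens using the ▁ marker
Language-agnostic (no pre-tokenization)
Reversible: the original text can be recovered from the tokens, up to normalization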
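
The whitespace marker is what makes detokenization a plain string operation. The sketch below imitates only that idea in pure Python and does not call the SentencePiece library. The function names are hypothetical, and the segmentation (one piece per marker) is a stand-in for a learned vocabulary.

```python
WS = "\u2581"  # the '▁' marker used to represent a space

def mark_whitespace(text):
    """Prefix the text with the marker and replace every space with it."""
    return WS + text.replace(" ", WS)

def split_on_marker(marked):
    """Toy segmentation: start a new piece at every marker (assumption)."""
    pieces = []
    for ch in marked:
        if ch == WS or not pieces:
            pieces.append(ch)
        else:
            pieces[-1] += ch
    return pieces

def detokenize(pieces):
    """Concatenate pieces, restore spaces, and drop the leading prefix."""
    return "".join(pieces).replace(WS, " ")[1:]

pieces = split_on_marker(mark_whitespace("Hello world!"))
restored = detokenize(pieces)
print(pieces)    # ['▁Hello', '▁world!']
print(restored)  # Hello world!
```

Because spaces live inside the pieces, no language-specific rules are needed to decide where to put them back.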

WordPiece

Used by BERT family models:

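Selects the merge that most increases the likelihood of the training data, rather than the most frequent pair
Uses ## prefix for subwords that continue a word
Requires pre-tokenization into words
Encodes each word by greedy longest-match-first lookup in the vocabulary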
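
Encoding with a WordPiece-style vocabulary can be sketched as greedy longest-match-first lookup. This is a minimal illustration: the vocabulary is a toy one invented here, the function name `wordpiece_encode` is hypothetical, and the input is assumed to be a single pre-tokenized word.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (illustrative assumption).
vocab = {"token", "##ization", "##ize", "un", "##known"}

print(wordpiece_encode("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_encode("unknown", vocab))       # ['un', '##known']
print(wordpiece_encode("tokenizers", vocab))    # ['[UNK]']
```

The last call shows the cost of the greedy strategy in this sketch: if any position cannot be matched, the entire word falls back to the unknown token.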

Tokenization in Practice

GPT Models (BPE)

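# GPT-2 tokenization example (assumes the Hugging Face transformers library)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.encode("Hello world!")
# Result: [15496, 995, 0]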

BERT Models (WordPiece)

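# BERT tokenization example (assumes Hugging Face transformers, "bert-base-uncased")
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.encode("Hello world!")  # encode() adds [CLS] and [SEP] itself
# Result: [101, 7592, 2088, 999, 102]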

T5 Models (SentencePiece)

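# T5 tokenization example (assumes Hugging Face transformers, "t5-small")
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")
pieces = tokenizer.tokenize("Hello world!")  # tokenize() returns pieces; encode() returns IDs
# Result: ['▁Hello', '▁world', '!']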
