Tokenization in Large Language Models
Tokenization is the first step in LLM processing: it converts raw text into a sequence of integer token IDs, which the model maps to embeddings before its self-attention layers process them. The choice of tokenization method affects vocabulary size, sequence length, and how well the model handles rare words.
Why Tokenization Matters
The Vocabulary Size Dilemma
Different granularities offer different trade-offs:
- Character-level: Small vocabulary, long sequences
- Subword-level: Balanced vocabulary and sequence length
- Word-level: Large vocabulary, short sequences
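As a rough illustration of the trade-off, the sketch below tokenizes one string at the three granularities and compares sequence lengths. The subword split is hand-picked for illustration and does not come from a trained tokenizer.

```python
text = "tokenization matters"

# Character-level: every character, including the space, is one token.
char_tokens = list(text)

# Word-level: split on whitespace.
word_tokens = text.split()

# Subword-level: a hand-picked segmentation (an assumption, not learned).
subword_tokens = ["token", "ization", " matters"]

for name, toks in [("char", char_tokens),
                   ("subword", subword_tokens),
                   ("word", word_tokens)]:
    print(f"{name:8s} {len(toks):2d} tokens")
```

Here the same text takes 20 character tokens, 3 subword tokens, or 2 word tokens. The vocabulary needed to cover a real corpus grows in the opposite direction.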
The OOV Problem
Out-of-vocabulary (OOV) words are a critical challenge:
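- Word-level tokenization maps new or rare words to a single unknown token, discarding their content
- Subword methods decompose an unseen word into known pieces, falling back to single characters or bytes when needed
- Character-level tokenization covers any word within its alphabet, but individual characters carry little meaning, so the model must learn word structure from longer sequences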
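A minimal sketch of the contrast, assuming a toy four-entry word vocabulary and a hypothetical `[UNK]` entry for unknown words. The character encoder uses Unicode code points purely for illustration.

```python
# Toy word-level vocabulary (hypothetical); "[UNK]" stands in for any unseen word.
vocab = {"the": 0, "cat": 1, "sat": 2, "[UNK]": 3}

def word_encode(text):
    """Word-level: unseen words collapse to the [UNK] id and cannot be recovered."""
    return [vocab.get(w, vocab["[UNK]"]) for w in text.split()]

def char_encode(text):
    """Character-level: every character has an id, so nothing is out of vocabulary."""
    return [ord(c) for c in text]

def char_decode(ids):
    return "".join(chr(i) for i in ids)

print(word_encode("the cat tokenized"))       # the last word becomes [UNK]
print(char_decode(char_encode("tokenized")))  # round-trips exactly
```

The word-level encoder returns the same id for every unseen word, so "tokenized" and any other unknown word are indistinguishable downstream. The character-level encoder keeps the information, at the cost of nine tokens for one word.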
Tokenization Methods
Byte Pair Encoding (BPE)
BPE builds vocabulary through iterative merging:
- Initialize with character-level tokens
- Count all adjacent token pairs
- Merge the most frequent pair
- Add new token to vocabulary
- Repeat until the target vocabulary size is reached
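The merge loop above can be sketched in a few lines. This is a simplified training routine, assuming the corpus is already split into words and that the stopping criterion is a fixed number of merges (`num_merges`) instead of a target vocabulary size. The function name `train_bpe` is hypothetical, and ties are broken by first occurrence.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (simplified sketch)."""
    # Each word starts as a tuple of characters; the count is its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair (first one seen wins a tie).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # learned merge rules, in order
```

On this toy corpus the first two merges build the symbol "low" from its characters, after which "lower" is represented as "low", "e", "r". Production implementations typically start from bytes and add pre-tokenization rules, which this sketch omits.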
SentencePiece
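Key differences from a standard BPE pipeline (SentencePiece is a library that can itself train BPE or unigram models):
- Treats text as a raw stream of Unicode characters, not pre-split words
- Encodes whitespace inside tokens (▁ prefix)
- Language-agnostic (no pre-tokenization)
- Reversible: the normalized text can be reconstructed from the tokens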
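To see why the whitespace marker makes decoding lossless, here is a plain-Python sketch of the idea. It does not call the SentencePiece library; the helper names are hypothetical, and the splitting step only cuts at markers where a trained model would cut further using its learned vocabulary.

```python
import re

MARK = "\u2581"  # the "▁" whitespace marker

def mark_whitespace(text):
    """Replace spaces with the marker, adding one at the start so the
    first word looks like every other word."""
    return MARK + text.replace(" ", MARK)

def split_on_marker(marked):
    """Illustrative segmentation: start a new piece at every marker."""
    return re.findall(MARK + "[^" + MARK + "]*", marked)

def detokenize(pieces):
    """Concatenate the pieces, turn markers back into spaces, and drop
    the space produced by the leading marker."""
    return "".join(pieces).replace(MARK, " ")[1:]

pieces = split_on_marker(mark_whitespace("Hello world!"))
print(pieces)
print(detokenize(pieces))
```

Because spaces live inside the pieces, decoding is plain concatenation with no language-specific rules about where to reinsert whitespace. The real library applies text normalization before this step by default, so the round trip recovers the normalized text.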
WordPiece
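Used by BERT-family models:
- Chooses merges that most increase the likelihood of the training data, not simply the most frequent pair
- Uses a ## prefix for subwords that continue a word
- Requires pre-tokenization (splitting on whitespace and punctuation)
- Encodes each word by greedy longest-match-first lookup against the vocabulary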
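The encoding side of WordPiece can be sketched as a greedy longest-match-first lookup. The vocabulary below is a toy set chosen for illustration, and `wordpiece_encode` is a hypothetical helper that covers only encoding, not the likelihood-based training.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Split one pre-tokenized word into WordPiece-style pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (an assumption for this example).
vocab = {"token", "##ization", "##ize", "##d"}

print(wordpiece_encode("tokenization", vocab))
print(wordpiece_encode("tokenized", vocab))
print(wordpiece_encode("xyz", vocab))
```

With this vocabulary, "tokenization" becomes "token" plus "##ization", "tokenized" becomes "token", "##ize", "##d", and "xyz" falls back to the unknown token because no piece matches its first character.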
Tokenization in Practice
GPT Models (BPE)
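# GPT tokenization example (assumes Hugging Face transformers and the "gpt2" checkpoint)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = tokenizer.encode("Hello world!")
# Result: [15496, 995, 0]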
BERT Models (WordPiece)
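# BERT tokenization example (assumes Hugging Face transformers and "bert-base-uncased")
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.encode("Hello world!")  # [CLS] and [SEP] are added automatically
# Result: [101, 7592, 2088, 999, 102]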
T5 Models (SentencePiece)
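# T5 tokenization example (assumes Hugging Face transformers and "t5-small")
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")
tokens = tokenizer.tokenize("Hello world!")  # tokenize() returns pieces; encode() returns IDs
# Result: ['▁Hello', '▁world', '!']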
