Tokenization: Converting Text to Numbers

Summary
Interactive exploration of tokenization methods in LLMs - BPE, SentencePiece, and WordPiece. Understand how text becomes tokens that models can process.

Tokenization in Large Language Models

Tokenization is the foundational step in LLM processing: it converts raw text into integer token IDs, which the model maps to embeddings before its self-attention layers process them. The choice of tokenization method affects model performance, vocabulary size, sequence length, and the handling of rare words.

Why Tokenization Matters

The Vocabulary Size Dilemma

$$V_{\text{char}} = 256 \;\ll\; V_{\text{subword}} \approx 50\text{k} \;\ll\; V_{\text{word}} \approx 170\text{k}$$

Different granularities offer different trade-offs:

  • Character-level: Small vocabulary, long sequences
  • Subword-level: Balanced vocabulary and sequence length
  • Word-level: Large vocabulary, short sequences
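
To make the trade-off concrete, the sketch below counts how many tokens one sentence produces at byte, character, and word granularity. The sample sentence and the naive whitespace split are illustrative assumptions, not how any particular model tokenizes.

```python
# Illustrative sentence; any text works here.
text = "Tokenization converts raw text into tokens"

byte_tokens = list(text.encode("utf-8"))  # byte-level: at most 256 distinct symbols
char_tokens = list(text)                  # character-level: one token per character
word_tokens = text.split()                # word-level: naive whitespace split

print(len(byte_tokens), "byte tokens")
print(len(char_tokens), "character tokens")
print(len(word_tokens), "word tokens")
```

The small-vocabulary schemes yield a sequence several times longer than the word-level split, which is the cost subword methods try to balance.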

The OOV Problem

Out-of-vocabulary (OOV) words are a critical challenge:

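  • Word-level tokenization fails on new or rare words, which collapse to a single unknown token
  • Subword methods can represent any word by decomposing it into smaller known pieces
  • Character-level tokenization always works, but individual tokens carry little semantic meaning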
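
A minimal sketch of the failure mode, assuming a toy four-entry vocabulary and an `[UNK]` placeholder (both hypothetical): the word-level lookup loses the unseen word entirely, while a byte-level fallback can always encode and recover it.

```python
# Toy word-level vocabulary (hypothetical), with a catch-all unknown token.
vocab = {"the": 0, "cat": 1, "sat": 2, "[UNK]": 3}

def encode_words(text, vocab):
    """Word-level lookup: any word missing from the vocabulary becomes [UNK]."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(word, unk_id) for word in text.lower().split()]

def encode_bytes(text):
    """Byte-level fallback: every string maps to IDs in the range 0..255."""
    return list(text.encode("utf-8"))

word_ids = encode_words("The cat tokenized", vocab)  # "tokenized" is out of vocabulary
byte_ids = encode_bytes("tokenized")

print(word_ids)
print(bytes(byte_ids).decode("utf-8"))  # the original word is recoverable
```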

Tokenization Methods

Byte Pair Encoding (BPE)

BPE builds vocabulary through iterative merging:

  1. Initialize with character-level tokens
  2. Count all adjacent token pairs
  3. Merge the most frequent pair
  4. Add new token to vocabulary
  5. Repeat until vocabulary size reached

$$P(w) = \prod_{i=1}^{n} P(t_i \mid t_{<i})$$
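
The five merge steps listed above can be sketched in a few lines of Python. This is a simplified illustration, not a production trainer: it assumes the corpus is already split into words, omits end-of-word markers, and breaks ties by whichever pair was counted first. The function name `train_bpe` and the toy corpus are hypothetical.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Step 1: start from character-level tokens, keeping word frequencies.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent token pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Step 3: pick the most frequent pair (ties: first pair counted wins).
        best = max(pairs, key=pairs.get)
        # Step 4: record the merge; the joined string is the new vocabulary entry.
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words  # Step 5: repeat on the updated corpus
    return merges

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, 3)
print(merges)
```

Here `num_merges` stands in for the target vocabulary size: each merge adds exactly one token, so the final vocabulary is the initial character set plus the learned merges.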

SentencePiece

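Key differences from conventional BPE pipelines:

  • Treats input as a raw stream of Unicode characters, whitespace included
  • Encodes whitespace inside tokens using the ▁ (U+2581) marker
  • Language-agnostic (no pre-tokenization into words required)
  • Reversible: detokenizing the pieces restores the normalized input text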
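
The whitespace marker is what makes detokenization a plain string operation. The sketch below imitates that convention in pure Python without calling the SentencePiece library; the helper names and the hard-coded pieces are assumptions for illustration, and the leading marker mirrors the library's default of prefixing the input with a dummy space.

```python
MARKER = "\u2581"  # the "▁" symbol that stands in for a space

def mark_whitespace(text):
    """Replace spaces with the marker and add one leading marker."""
    return MARKER + text.replace(" ", MARKER)

def detokenize(pieces):
    """Concatenate pieces, restore spaces, and drop the leading dummy space."""
    text = "".join(pieces).replace(MARKER, " ")
    return text[1:] if text.startswith(" ") else text

# Hypothetical segmentation of "Hello world!" into pieces.
pieces = ["\u2581Hello", "\u2581world", "!"]

print(mark_whitespace("Hello world!"))
print(detokenize(pieces))
```

Because the spaces live inside the pieces, no language-specific rules are needed to decide where to put them back.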

WordPiece

Used by BERT family models:

  • Chooses merges that most increase the likelihood of the training data, rather than the most frequent pair
  • Uses a ## prefix to mark subwords that continue a word
  • Requires pre-tokenization
  • Encodes each word deterministically by greedy longest-match-first lookup against the vocabulary
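
At encoding time, BERT-style WordPiece splits each pre-tokenized word by repeatedly taking the longest vocabulary entry that matches at the current position. The sketch below shows that greedy longest-match-first loop; the function name and the tiny vocabulary are hypothetical, and training the vocabulary itself is out of scope here.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (hypothetical).
vocab = {"token", "##ization", "##ize", "##s", "un", "##known"}

print(wordpiece_encode("tokenization", vocab))
print(wordpiece_encode("tokens", vocab))
print(wordpiece_encode("xyz", vocab))
```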

Tokenization in Practice

GPT Models (BPE)

# GPT tokenization example
tokens = tokenizer.encode("Hello world!")
# Result: [15496, 995, 0]

BERT Models (WordPiece)

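# BERT tokenization example
# Assuming a Hugging Face-style tokenizer: encode() adds [CLS] (101) and
# [SEP] (102) itself, so pass only the raw text.
tokens = tokenizer.encode("Hello world!")
# Result: [101, 7592, 2088, 999, 102]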

T5 Models (SentencePiece)

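# T5 tokenization example
# Assuming a Hugging Face-style tokenizer: tokenize() returns the string
# pieces shown below, whereas encode() would return integer IDs.
tokens = tokenizer.tokenize("Hello world!")
# Result: ['▁Hello', '▁world', '!']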
