TL;DR
- The Switch Transformer replaces the dense feed-forward layer with a sparse mixture of experts, routing each token to exactly one expert (top-1 gating).
- This decouples parameter count from compute: adding experts multiplies the number of parameters while per-token FLOPs stay roughly constant.
- A capacity factor bounds how many tokens each expert accepts; overflow tokens skip the expert for that layer and pass through the residual connection, and an auxiliary loss keeps the load balanced.
- The design scaled to models with over a trillion parameters while keeping per-token compute comparable to that of a much smaller dense model.
Sparse routing: one expert per token
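A dense transformer runs every token through the same feed-forward network (FFN). Switch instead keeps N expert FFNs and a small learned router. For each token, the router computes a softmax over the experts, sends the token to the single highest-probability expert (top-1 gating), and scales that expert’s output by the gate probability so the router itself receives a training signal.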
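The routing step can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the helper names (`make_expert`, `expert_ffn`, `switch_layer`), the tiny dimensions, and the random weights are all assumptions made for the example, and a real model would batch this on an accelerator.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def make_expert(d_model, d_ff, rng):
    """Hypothetical expert: two random weight matrices, biases omitted."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
    return w_in, w_out

def expert_ffn(expert, x):
    """Two-layer feed-forward network with a ReLU in between."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)

def switch_layer(x, router_weights, experts):
    """Route one token to its top-1 expert and scale by the gate value."""
    probs = softmax(matvec(router_weights, x))
    idx = max(range(len(probs)), key=lambda i: probs[i])  # top-1 choice
    out = expert_ffn(experts[idx], x)                      # only this expert runs
    return idx, probs[idx], [probs[idx] * v for v in out]

# Toy sizes chosen for illustration only.
rng = random.Random(0)
D_MODEL, D_FF, N_EXPERTS = 4, 8, 3
router_weights = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_EXPERTS)]
experts = [make_expert(D_MODEL, D_FF, rng) for _ in range(N_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
chosen, gate, output = switch_layer(token, router_weights, experts)
print(f"expert {chosen} selected with gate probability {gate:.3f}")
```

Note that the other experts are never evaluated for this token, which is where the compute saving comes from.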
Parameters without the compute
Because only one expert fires per token, the FLOPs a token costs barely grow as you add experts: the expert FFN it passes through stays the same size, and only the router’s small projection widens. You can pour parameters (and therefore knowledge capacity) into the model while keeping the per-token compute almost flat.
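A rough back-of-the-envelope count makes the decoupling concrete. The sketch below assumes a two-matrix FFN without biases, counts a multiply-add as 2 FLOPs, and uses illustrative layer sizes (768 and 3072) that are not taken from the text; the function names are hypothetical.

```python
def ffn_params(d_model, d_ff):
    """Weights in one expert FFN: an input and an output matrix, no biases."""
    return 2 * d_model * d_ff

def switch_layer_stats(d_model, d_ff, n_experts):
    """Return (total parameters, FLOPs per token) for one Switch FFN layer."""
    router_params = d_model * n_experts
    total_params = n_experts * ffn_params(d_model, d_ff) + router_params
    # Each token runs the router plus exactly one expert.
    flops_per_token = 2 * ffn_params(d_model, d_ff) + 2 * router_params
    return total_params, flops_per_token

for n in (1, 8, 64):
    params, flops = switch_layer_stats(d_model=768, d_ff=3072, n_experts=n)
    print(f"experts={n:>2}  params={params:>12,}  flops/token={flops:>10,}")
```

Under these assumptions, going from 1 to 64 experts multiplies the layer's parameters by roughly 64 while per-token FLOPs rise by about one percent, all of it from the wider router.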
Capacity and dropped tokens
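Sparse routing has a catch: real token distributions are uneven, and experts have fixed-size buffers. The capacity factor sets each expert’s buffer size: expert capacity = (tokens per batch / number of experts) × capacity factor. Tokens routed to an expert that is already full are dropped for that layer, meaning they skip the expert and pass through unchanged via the residual connection. Raising the factor reduces drops at the cost of extra memory and compute for the larger buffers.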
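The effect of the capacity factor is easy to see on a toy batch. The sketch below is an illustration under stated assumptions: capacity is rounded up, tokens claim slots in sequence order, and the function names and the sample assignments are made up for the example.

```python
import math

def expert_capacity(tokens_per_batch, n_experts, capacity_factor):
    """Slots per expert; rounding up is an assumption of this sketch."""
    return math.ceil(tokens_per_batch / n_experts * capacity_factor)

def dispatch(assignments, n_experts, capacity_factor):
    """Fill expert buffers in sequence order and report overflow.

    assignments[i] is the expert index the router picked for token i.
    Returns (capacity, kept token positions, dropped token positions).
    """
    capacity = expert_capacity(len(assignments), n_experts, capacity_factor)
    load = [0] * n_experts
    kept, dropped = [], []
    for pos, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(pos)
        else:
            dropped.append(pos)  # token bypasses the expert in this layer
    return capacity, kept, dropped

# 8 tokens, 4 experts, deliberately skewed toward expert 0.
assignments = [0, 0, 0, 0, 0, 1, 1, 2]
for factor in (1.0, 1.5, 2.5):
    capacity, kept, dropped = dispatch(assignments, 4, factor)
    print(f"factor={factor}: capacity={capacity}, dropped positions={dropped}")
```

With this skewed batch, a factor of 1.0 gives each expert two slots and drops three of the five tokens sent to expert 0; a factor of 2.5 drops none, but every expert, including the idle one, now reserves five slots.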
Why it mattered
Switch Transformers showed that sparsity is a practical scaling axis, not just a theoretical one. By making routing simple (top-1) and stable (capacity factor plus load-balancing loss), they helped turn mixture-of-experts into a production technique. Many modern large models, from open MoE LLMs to frontier systems, inherit the core recipe: scale parameters with experts, keep compute per token bounded.
Related Reading
- Attention Is All You Need — the dense transformer whose feed-forward layer Switch replaces with sparse experts
- LoRA — a complementary efficiency axis: cheap adaptation of a fixed model rather than sparse scaling of a huge one
- FlashAttention — efficiency on the attention side, orthogonal to MoE’s efficiency on the feed-forward side
- BERT — the dense encoder baseline that sparse-expert models scale beyond
