Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus; Barret Zoph; Noam Shazeer

TL;DR

The Switch Transformer replaces the dense feed-forward layer with a sparse mixture of experts, routing each token to exactly one expert (top-1 gating).
This decouples parameters from compute: adding experts multiplies the model’s capacity while per-token FLOPs stay roughly constant.
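A capacity factor bounds how many tokens each expert accepts; overflow tokens skip the expert and are carried forward by the residual connection, and an auxiliary loss keeps the load balanced.
The design scaled to trillion-parameter models (the largest, Switch-C, has roughly 1.6 trillion parameters) while keeping per-token compute comparable to a much smaller dense model.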

Sparse routing: one expert per token

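A dense transformer runs every token through the same feed-forward network. Switch instead keeps N expert FFNs and a small router. The router computes a softmax over the experts and sends each token to the single highest-probability expert (top-1 gating). The chosen expert’s output is scaled by that probability, which is what lets gradients reach the router.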
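The routing step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `route_top1` and `switch_layer` are hypothetical, the toy experts are simple functions standing in for full feed-forward networks, and the router logits are hard-coded where a real model would compute them with a learned linear layer.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(router_logits):
    """Return (expert_index, gate_probability) for one token."""
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

def switch_layer(token, router_logits, experts):
    """Run only the chosen expert and scale its output by the gate value."""
    expert, gate = route_top1(router_logits)
    return [gate * y for y in experts[expert](token)]

# Hypothetical toy experts standing in for full feed-forward networks.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
token = [1.0, 2.0]
router_logits = [0.1, 2.0, -1.0]  # assumed values; normally a learned projection

print(route_top1(router_logits))
print(switch_layer(token, router_logits, experts))
```

Only `experts[1]` is evaluated for this token; the other two experts cost nothing, which is the property the next section builds on.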

Parameters without the compute

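Because only one expert fires per token, the feed-forward FLOPs a token costs do not grow as you add experts. The only extra work is the router’s projection onto N experts, which is small next to an expert FFN. You can pour parameters (and therefore knowledge capacity) into the model while keeping per-token compute almost flat. What does grow is memory: every expert’s weights still have to be stored.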
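A back-of-the-envelope calculation makes the decoupling concrete. The sketch below counts weights and per-token multiply-adds for a single Switch feed-forward layer. The dimensions (`d_model=768`, `d_ff=3072`) are illustrative assumptions, biases are ignored, and the function names are hypothetical.

```python
def ffn_params(d_model, d_ff):
    """Weights in one two-matrix feed-forward network (biases ignored)."""
    return 2 * d_model * d_ff

def switch_layer_stats(d_model, d_ff, num_experts):
    """Return (parameters, per-token multiply-adds) for one Switch FFN layer."""
    router = d_model * num_experts  # router projection onto the experts
    params = num_experts * ffn_params(d_model, d_ff) + router
    # A token pays for the router plus exactly one expert.
    macs_per_token = ffn_params(d_model, d_ff) + router
    return params, macs_per_token

for n in (1, 8, 64):
    params, macs = switch_layer_stats(d_model=768, d_ff=3072, num_experts=n)
    print(f"experts={n:3d}  params={params:>12,}  macs/token={macs:,}")
```

Under these assumptions, going from 1 to 64 experts multiplies the layer's parameters by about 64 while per-token multiply-adds rise by roughly one percent, all of it from the router.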

Capacity and dropped tokens

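Sparse routing has a catch: real token distributions are uneven, and experts have fixed-size buffers. The capacity factor sets each expert’s buffer size: expert capacity = (tokens per batch / number of experts) × capacity factor. Tokens routed to an expert that is already full are dropped for that layer, meaning they skip the expert and are carried to the next layer unchanged by the residual connection. Raising the factor reduces drops at the cost of memory and compute, since the extra buffer slots are allocated and processed whether or not they are filled.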
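The trade-off is easy to see on a toy batch. The following sketch assumes tokens are assigned to buffers in batch order and that the capacity is rounded up to a whole number of slots; both choices, along with the names `expert_capacity` and `dispatch`, are assumptions made for illustration rather than details taken from the paper.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Buffer slots per expert: (tokens / experts) * capacity factor, rounded up."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def dispatch(assignments, num_experts, capacity):
    """Fill expert buffers in token order; return (buffers, dropped token ids)."""
    buffers = [[] for _ in range(num_experts)]
    dropped = []
    for token_id, expert in enumerate(assignments):
        if len(buffers[expert]) < capacity:
            buffers[expert].append(token_id)
        else:
            # Buffer full: the token skips this expert and relies on
            # the residual connection to reach the next layer.
            dropped.append(token_id)
    return buffers, dropped

# Hypothetical skewed routing: 8 tokens, 4 experts, expert 0 is over-subscribed.
assignments = [0, 0, 0, 0, 1, 2, 0, 3]
for cf in (1.0, 1.5, 2.5):
    cap = expert_capacity(len(assignments), 4, cf)
    _, dropped = dispatch(assignments, 4, cap)
    print(f"capacity_factor={cf}  capacity={cap}  dropped={dropped}")
```

With this batch, a capacity factor of 1.0 gives each expert 2 slots and drops tokens 2, 3, and 6; at 1.5 only tokens 3 and 6 are dropped; at 2.5 nothing is dropped, but three of the four buffers sit mostly empty.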

Why it mattered

The Switch Transformer showed that sparsity is a practical scaling axis, not just a theoretical one. By making routing simple (top-1) and stable (capacity factor plus load-balancing loss), it made mixture-of-experts practical to train at very large scale. Many later large models, including open MoE LLMs, follow the same core recipe: scale parameters with experts, keep compute per token bounded.

Attention Is All You Need — the dense transformer whose feed-forward layer Switch replaces with sparse experts
LoRA — a complementary efficiency axis: cheap adaptation of a fixed model rather than sparse scaling of a huge one
FlashAttention — efficiency on the attention side, orthogonal to MoE’s efficiency on the feed-forward side
BERT — the dense encoder baseline that sparse-expert models scale beyond