
Making Deep Learning Go Brrrr From First Principles


Horace He | 15 min read | Deep Learning · Optimization · Performance

TL;DR

Most deep learning practitioners optimize by guessing — try mixed precision, try a bigger batch size, hope something sticks. This article argues you should reason from first principles instead. Every GPU operation falls into one of three regimes: compute-bound, memory-bandwidth-bound, or overhead-bound. Identifying which regime you are in determines which optimizations actually help. The single most impactful technique is operator fusion, which eliminates redundant memory traffic by combining multiple operations into a single GPU kernel.

The Mental Model: GPU as Factory

The article builds its framework on a manufacturing analogy. The GPU is a factory with three components:

  • Compute units (workers): The arithmetic logic units and tensor cores that perform floating-point operations. An A100 GPU can execute 312 TFLOPS with tensor cores, or 19.5 TFLOPS for general-purpose math.

  • DRAM (warehouse): Global GPU memory (HBM) where tensors are stored. The A100 provides 1.5 TB/s of memory bandwidth — fast in absolute terms, but slow relative to compute throughput.

  • Overhead (administration): Everything that is not compute or memory access — Python interpreter time, PyTorch framework dispatch, CUDA kernel launch latency, and similar coordination costs.

The fundamental tension: GPU compute has been scaling faster than memory bandwidth for decades. The A100 can perform 312 trillion floating-point operations per second, but can only load about 400 billion 32-bit numbers per second from memory. This means the GPU needs to perform roughly 780 operations per element loaded just to keep the compute units fully utilized. Most deep learning operations fall far short of this ratio.
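The ratio above follows directly from the article's round numbers; a minimal sketch (variable names are mine):

```python
# Ops-per-element balance on an A100, using the article's round numbers:
# 312 TFLOPS of tensor-core compute vs ~400 billion FP32 elements loaded
# per second (1.5 TB/s of bandwidth / 4 bytes per FP32 element).
PEAK_FLOPS = 312e12
FP32_LOADS_PER_S = 400e9

ops_per_element = PEAK_FLOPS / FP32_LOADS_PER_S
print(f"{ops_per_element:.0f} ops per element loaded")  # 780
```

Any operation that does fewer than ~780 FLOPs per loaded element leaves tensor cores idle, waiting on memory.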

The Three Bottleneck Regimes

The article's central contribution is a clear taxonomy for diagnosing performance problems. Before optimizing anything, you need to determine which regime your workload falls into.

Compute-bound operations spend most of their time doing arithmetic. Large matrix multiplications are the canonical example — a matmul of two n × n matrices requires O(n³) operations but only O(n²) memory accesses. The arithmetic intensity (operations per byte transferred) is high enough that the compute units are the bottleneck. Optimizations here focus on using tensor cores, lower-precision formats (TF32, FP16, INT8), and maximizing hardware utilization.

Memory-bandwidth-bound operations spend most of their time moving data rather than computing on it. Pointwise operations like torch.cos(), activation functions, and normalization layers fall into this category. A unary elementwise operation performs exactly 1 FLOP per element but must read and write that element from/to global memory (8 bytes round-trip for FP32). The arithmetic intensity is 0.125 FLOPs/byte — orders of magnitude below the compute-bound threshold.
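The 0.125 figure is just the FLOP count over the traffic; a one-line helper (my naming) makes the accounting explicit:

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of global-memory traffic."""
    return flops / bytes_moved

# A unary FP32 elementwise op: 1 FLOP per element,
# 4-byte read + 4-byte write = 8 bytes of round-trip traffic.
print(arithmetic_intensity(flops=1, bytes_moved=8))  # 0.125
```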

This leads to a counterintuitive result: on an A100, a fused x.cos().cos() takes nearly the same wall-clock time as a single x.cos(), because both are bottlenecked by the same memory reads and writes. The second cosine is essentially free — the data is already in registers.

Overhead-bound operations are limited by neither compute nor memory bandwidth, but by the cost of launching and coordinating work. Python executes roughly 32 million operations per second. In the time Python performs a single operation, an A100 could complete approximately 10 million floating-point operations. For small tensors or models with many tiny operations, the time spent in the Python interpreter and PyTorch's dispatch machinery can dominate total runtime.

PyTorch partially mitigates this through asynchronous CUDA execution: while the GPU processes one kernel, the CPU can queue up subsequent kernels. As long as the CPU stays ahead of the GPU, overhead is hidden. But when individual kernels are very fast (small tensors, simple operations), the CPU cannot queue work quickly enough and becomes the bottleneck.
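This "CPU stays ahead" condition can be captured in a toy model (entirely my construction, not from the article): with asynchronous execution, each kernel effectively costs the larger of its GPU runtime and its CPU launch cost.

```python
def wall_clock_us(n_kernels, kernel_us, launch_us):
    """Toy async-execution model: the CPU enqueues each kernel at a cost of
    launch_us while the GPU runs them back to back. Whichever side is slower
    per kernel sets the pace, so total ~= n_kernels * max(kernel_us, launch_us)."""
    return n_kernels * max(kernel_us, launch_us)

# Large tensors: 100 us kernels fully hide a 10 us launch cost (GPU-bound).
print(wall_clock_us(1000, kernel_us=100, launch_us=10))  # 100000
# Tiny tensors: 2 us kernels leave the CPU as the bottleneck (overhead-bound).
print(wall_clock_us(1000, kernel_us=2, launch_us=10))    # 10000
```

In the second case the GPU is idle 80% of the time; shrinking the kernels further would not change the wall-clock time at all.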

Arithmetic Intensity and the Roofline Model

The article uses the roofline model to formalize the boundary between compute-bound and memory-bound regimes. The key metric is arithmetic intensity:

\text{Arithmetic Intensity} = \frac{\text{FLOPs performed}}{\text{Bytes transferred}}

For a given GPU, there is a crossover point where memory bandwidth and compute throughput are balanced. On an A100:

\text{Crossover} = \frac{312 \times 10^{12} \text{ FLOPS}}{1.5 \times 10^{12} \text{ B/s}} \approx 208 \text{ FLOPs/byte}

Operations below this threshold are memory-bound; operations above it are compute-bound. In practice, with FP32 elements (4 bytes each), read-write round trips, and the A100's 19.5 TFLOPS general-purpose FP32 throughput, roughly 100 arithmetic operations per element are needed before compute becomes the bottleneck.

The roofline model explains why profiling tools that report "achieved TFLOPS" can be misleading. A pointwise operation might show only 0.3 TFLOPS on an A100 — not because the GPU is underutilized in a fixable way, but because the operation is fundamentally memory-bound. No amount of kernel tuning will change the fact that the operation performs 1 FLOP per 8 bytes transferred.
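The roofline bound can be written as a two-line estimator (function name and the worked example are mine, using the A100 numbers from the article):

```python
def roofline_time_s(flops, bytes_moved, peak_flops=312e12, bandwidth=1.5e12):
    """Roofline estimate: an operation takes at least its compute time
    (flops / peak_flops) and at least its memory-transfer time
    (bytes_moved / bandwidth); the larger of the two is the bottleneck."""
    return max(flops / peak_flops, bytes_moved / bandwidth)

crossover = 312e12 / 1.5e12  # ~208 FLOPs/byte

# A pointwise op on 1e8 FP32 elements: 1e8 FLOPs, 8e8 bytes of round-trip traffic.
t = roofline_time_s(1e8, 8e8)
print(f"achieved {1e8 / t / 1e12:.2f} TFLOPS")  # achieved 0.19 TFLOPS
```

The estimator reproduces the paragraph's point: ~0.19 TFLOPS is not low utilization in any fixable sense; it is exactly what the memory roofline permits for this op.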

Operator Fusion: The Key Optimization

Fusion is the article's central optimization technique. The idea is to combine multiple operations into a single GPU kernel, eliminating intermediate reads and writes to global memory.

Vertical fusion (also called pointwise fusion) combines a chain of elementwise operations. Without fusion, x.cos().cos() requires four global memory accesses: read x, write cos(x), read cos(x), write cos(cos(x)). With fusion, the intermediate result stays in GPU registers, reducing memory traffic to two accesses: read x, write cos(cos(x)). Since both versions are memory-bound, this 2x reduction in memory traffic translates directly to a 2x speedup.
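The 2x follows directly from counting traffic; a small accounting sketch (my own model, assuming unary ops and FP32):

```python
FP32 = 4  # bytes per element

def traffic_bytes(n, n_ops, fused):
    """Global-memory traffic for a chain of n_ops unary pointwise ops on
    n FP32 elements. Unfused: every op reads its input tensor and writes its
    output tensor. Fused: one read of the input, one write of the final result;
    intermediates stay in registers."""
    if fused:
        return 2 * n * FP32
    return 2 * n_ops * n * FP32

n = 1_000_000
ratio = traffic_bytes(n, 2, fused=False) / traffic_bytes(n, 2, fused=True)
print(ratio)  # 2.0
```

For longer chains the savings grow linearly: fusing k unary ops cuts traffic by a factor of k, and since these ops are memory-bound, the wall-clock speedup follows the same factor.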

Epilogue fusion combines an elementwise operation with an adjacent reduction or matmul. For example, fusing a bias addition or activation function into the epilogue of a matrix multiplication avoids writing the matmul output to global memory, applying the activation, and reading/writing again. The activation is computed on the fly as each output tile is produced.

This principle extends to more sophisticated cases. FlashAttention (Dao et al., 2022) fuses the entire attention computation — QK^T multiplication, softmax, and value weighting — into a single kernel that tiles across the sequence dimension. By keeping intermediate attention scores in SRAM rather than writing them to HBM, FlashAttention reduces memory traffic from O(n²) to O(n) in sequence length, yielding 2–4x wall-clock speedups despite performing the same number of FLOPs.
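The quadratic-versus-linear traffic gap can be made concrete with a deliberately simplified model (my construction; it ignores the real kernel's per-tile K/V re-reads, which do not change the asymptotics):

```python
def attention_score_bytes(seq_len, dtype_bytes=4):
    """Materializing the n x n attention-score matrix in HBM:
    one write plus one read of the scores -> O(n^2) traffic."""
    return 2 * seq_len * seq_len * dtype_bytes

def flash_io_bytes(seq_len, head_dim=64, dtype_bytes=4):
    """Rough O(n) model of a fused attention kernel's HBM traffic:
    Q, K, V read once and the output written once; the score matrix
    never leaves on-chip SRAM."""
    return 4 * seq_len * head_dim * dtype_bytes

# Doubling the sequence length quadruples score traffic but only
# doubles the fused kernel's traffic.
print(attention_score_bytes(8192) / attention_score_bytes(4096))  # 4.0
print(flash_io_bytes(8192) / flash_io_bytes(4096))                # 2.0
```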

Where the FLOPs Actually Go

The article references profiling data from BERT to illustrate a critical disconnect between FLOP count and wall-clock time. In a standard BERT forward pass:

  • Tensor contractions (matmuls) account for 99.8% of total FLOPs
  • Normalization layers achieve roughly 250x fewer FLOPS than matmuls
  • Pointwise operations achieve roughly 700x fewer FLOPS than matmuls

Despite dominating the FLOP count, matmuls do not dominate wall-clock time proportionally, because normalization and pointwise operations — while computationally trivial — still require expensive memory round-trips. Each layer norm, GELU activation, or residual addition launches a separate kernel that reads from and writes to global memory. These memory-bound operations collectively consume a significant fraction of total runtime.

This is precisely the problem fusion addresses. Tools like torch.compile (the successor to TorchScript and the JIT compiler) analyze the computation graph, identify fusible subgraphs, and generate optimized CUDA kernels that eliminate intermediate memory traffic.

Recomputation as Optimization

The article surfaces a counterintuitive insight: recomputing values can be faster than storing and reloading them. In standard backpropagation, intermediate activations are saved during the forward pass for use in the backward pass. This increases memory traffic — every saved activation requires a write to global memory during the forward pass and a read during the backward pass.

With operator fusion, it can be cheaper to recompute an activation from its inputs (which are already being loaded anyway) than to perform the additional memory round-trip of saving and loading it. This is the principle behind activation checkpointing (also called gradient checkpointing or rematerialization): selectively recompute activations during the backward pass instead of storing them, trading a small amount of extra compute for a large reduction in memory traffic and memory consumption.
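A toy cost model (mine, not the article's) shows why the trade favors recomputation for memory-bound chains, using the A100 numbers from earlier:

```python
def backward_cost_s(activation_bytes, recompute_flops, *, store,
                    peak_flops=312e12, bandwidth=1.5e12):
    """Toy cost of handling one activation for the backward pass.
    store=True: pay a forward-pass write plus a backward-pass read of the
    activation. store=False: recompute it inside a fused backward kernel,
    from inputs already being loaded, paying only the extra FLOPs."""
    if store:
        return 2 * activation_bytes / bandwidth
    return recompute_flops / peak_flops

# 4e8-byte activation (1e8 FP32 elements), recomputed with 1 FLOP/element:
print(backward_cost_s(4e8, 1e8, store=True))   # ~5.3e-4 s of memory traffic
print(backward_cost_s(4e8, 1e8, store=False))  # ~3.2e-7 s of compute
```

Under this model recomputation is three orders of magnitude cheaper, precisely because the stored-activation path is priced in bandwidth while the recompute path is priced in (abundant) FLOPs.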

The key insight is that this trade-off is only favorable because the recomputed operations are memory-bound. Recomputing a chain of pointwise ops costs almost nothing in wall-clock time when fused, because the compute is free — you are already paying for the memory access.

Practical Implications

The article's framework has direct consequences for how practitioners should approach optimization:

  1. Profile before optimizing. Measure achieved FLOPS as a percentage of peak. If utilization is high (50%+ of peak), you are compute-bound and should look at precision reduction or algorithmic changes. If utilization is low, you are likely memory-bound or overhead-bound.

  2. Fusion is the highest-leverage optimization for most workloads. The majority of operations in a typical training loop are memory-bound pointwise ops. Fusing them eliminates redundant memory traffic without changing the mathematical result.

  3. Batch size affects the regime. Small batch sizes produce small matmuls with low arithmetic intensity, pushing them from compute-bound toward memory-bound. Large batch sizes increase arithmetic intensity and improve compute utilization, but require more memory.

  4. The gap between compute and bandwidth is widening. Each new GPU generation increases TFLOPS faster than TB/s. The A100's successor (H100) has 3x the FLOPS but only 2x the bandwidth. This means memory-bound operations become a larger fraction of total runtime with each hardware generation, making fusion increasingly important.
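Point 3 can be checked with a quick calculation (my sketch; it counts each matmul input read once and the output written once, in FP32):

```python
def matmul_intensity(b, k, n, dtype_bytes=4):
    """Arithmetic intensity of a (b x k) @ (k x n) matmul:
    2*b*k*n FLOPs over reading both inputs and writing the output."""
    flops = 2 * b * k * n
    bytes_moved = dtype_bytes * (b * k + k * n + b * n)
    return flops / bytes_moved

# Hidden size 4096: batch 1 is deeply memory-bound (~0.5 FLOPs/byte),
# while batch 1024 clears the A100's ~208 FLOPs/byte crossover.
print(matmul_intensity(1, 4096, 4096))     # ~0.5
print(matmul_intensity(1024, 4096, 4096))  # ~341
```

This is why batched inference and gradient accumulation exist: the weights must be streamed from HBM regardless, so amortizing that traffic over more rows is nearly free compute.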

Key Takeaways

  1. Diagnose before you treat — blindly applying optimizations wastes effort. Determine whether your workload is compute-bound, memory-bound, or overhead-bound before choosing a strategy.

  2. Memory bandwidth is the dominant bottleneck — most deep learning operations are memory-bound, not compute-bound. The GPU spends more time moving data than computing on it.

  3. Operator fusion is the single most effective technique — by eliminating intermediate memory traffic, fusion can yield 2–4x speedups on memory-bound subgraphs with zero change to numerical results.

  4. Arithmetic intensity determines the regime — the roofline model provides a quantitative framework for predicting whether an operation will benefit from compute optimization or memory optimization.

  5. Recomputation can beat caching — when fused operations are memory-bound, recomputing activations is cheaper than the memory round-trip of saving and reloading them.

Impact and Context

This article has become a standard reference in the ML systems community for reasoning about GPU performance. Its first-principles framework predates and motivates much of the work in modern compiler-driven optimization: torch.compile, Triton, and XLA all perform exactly the kind of operator fusion the article describes.

The insights connect directly to FlashAttention, which applies the fusion principle to the attention mechanism specifically, and to work on efficient transformer inference more broadly. The roofline model perspective also explains why techniques like quantization (INT8, FP8) are effective: they reduce bytes per element, shifting the arithmetic intensity curve and moving more operations from memory-bound into compute-bound territory where the hardware can be fully utilized.

For practitioners, the article's lasting contribution is a mental model. Rather than memorizing a checklist of optimizations, it teaches you to reason about where time is actually spent — and that reasoning remains valid across GPU generations, frameworks, and model architectures.

  • Data Movement Is All You Need — formal analysis of data movement bottlenecks in transformers, quantifying the same memory-bound phenomenon
  • Optimizing Transformer Inference — survey of pruning, quantization, and hardware-aware techniques that complement fusion-based optimization
  • Attention Is All You Need — the transformer architecture whose attention mechanism is the primary target of fusion optimizations like FlashAttention
