
Data Movement Is All You Need: Optimizing Transformers

Analysis of transformer performance bottlenecks caused by data movement. Learn optimization strategies for memory-bound operations on GPUs.

Andrei Ivanov, Nikoli Dryden, +3 · 15 min read | Original Paper | GPUs · Transformers · Deep Learning

TL;DR

Most transformer operations are not bottlenecked by arithmetic — they are bottlenecked by data movement. This paper profiles transformer training and inference end-to-end, categorizes every operation as compute-bound or memory-bound, and shows that data movement (moving tensors between HBM, caches, and registers) accounts for the majority of execution time. The authors then demonstrate that operator fusion and data layout optimizations can recover much of this wasted time, providing a principled framework for understanding where GPU cycles actually go in transformer workloads.

The Core Problem: Arithmetic Is Not the Bottleneck

Modern GPUs like the A100 can sustain 312 TFLOPS of FP16 arithmetic, but their memory bandwidth tops out at around 2 TB/s. This creates a fundamental asymmetry: for any operation with an arithmetic intensity below roughly 156 FLOPs per byte loaded, the GPU spends more time waiting for data than computing on it.

The paper quantifies this using the operational intensity metric, defined as the ratio of floating-point operations to bytes moved:

I = \frac{\text{FLOPs}}{\text{Bytes transferred}}

An operation is compute-bound when I exceeds the machine’s compute-to-bandwidth ratio (the “ridge point” on the roofline model), and memory-bound when it falls below. The key finding: the majority of transformer operations — layer normalization, dropout, softmax, GELU activations, residual additions, and bias terms — are elementwise or reduction operations with I ≈ O(1), placing them firmly in the memory-bound regime.

Only the large matrix multiplications in the linear projections and attention (the QKᵀ score computation and the attention-weighted value computation) have arithmetic intensity high enough to be compute-bound. Everything else is starved for bandwidth.
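To make the classification concrete, here is a small sketch that estimates operational intensity for a feed-forward GEMM versus layer normalization and compares each against the A100 ridge point. The per-element FLOP counts (e.g. ~8 FLOPs/element for layer norm) and tensor shapes are rough accounting assumptions for illustration, not figures from the paper.

```python
# Sketch: estimating operational intensity I = FLOPs / bytes for two
# transformer ops on an A100 (312 TFLOP/s FP16, ~2 TB/s HBM). The FLOP
# and byte counts are rough accounting assumptions, not measurements.

PEAK_FLOPS = 312e12             # A100 FP16 tensor-core peak
BANDWIDTH = 2.0e12              # bytes/s
RIDGE = PEAK_FLOPS / BANDWIDTH  # ~156 FLOPs/byte

def intensity_gemm(n, d, bytes_per_el=2):
    # [n, d] x [d, 4d] GEMM: 2*n*d*4d FLOPs; read A and B, write C once.
    flops = 2 * n * d * 4 * d
    bytes_moved = bytes_per_el * (n * d + d * 4 * d + n * 4 * d)
    return flops / bytes_moved

def intensity_layernorm(n, d, bytes_per_el=2):
    # ~8 FLOPs per element (mean, variance, normalize, scale/shift);
    # one read of the input plus one write of the output.
    flops = 8 * n * d
    bytes_moved = bytes_per_el * 2 * n * d
    return flops / bytes_moved

for name, I in [("GEMM", intensity_gemm(4096, 1024)),
                ("LayerNorm", intensity_layernorm(4096, 1024))]:
    regime = "compute-bound" if I > RIDGE else "memory-bound"
    print(f"{name}: I = {I:.1f} FLOPs/byte -> {regime}")
```

Even with generous assumptions, layer norm lands at I ≈ 2 while the GEMM lands in the hundreds; no amount of kernel tuning moves an operation across the ridge, only changing how much data it touches does.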

The Roofline Model Applied to Transformers

The authors frame their analysis using the roofline model, a standard tool from high-performance computing. The roofline plots achievable performance (FLOPS) against operational intensity (FLOPs/byte). Every operation falls into one of two regimes:

\text{Performance} = \min\left(\text{Peak FLOPS},\; I \times \text{Bandwidth}\right)

For a V100 GPU with 125 TFLOPS peak FP16 and 900 GB/s bandwidth, the ridge point is at I ≈ 139 FLOPs/byte. Operations below this threshold are bandwidth-limited regardless of how well the kernel is optimized. The paper plots each transformer operation on this roofline and shows that layer normalization has I ≈ 5, softmax has I ≈ 3, GELU has I ≈ 1, and dropout has I < 1. These operations run at a fraction of the GPU’s theoretical peak — not because the CUDA kernels are poorly written, but because the hardware physically cannot deliver data fast enough to keep the compute units busy.

In contrast, the large GEMM operations in the feed-forward layers (with dimensions [n, d] × [d, 4d]) achieve I ≈ O(d), which for typical hidden dimensions of 768–1024 places them well above the ridge point, allowing them to saturate the GPU’s compute capability.
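The roofline bound above can be evaluated directly. The sketch below uses the V100 figures quoted in this section; the per-operation intensity values are the approximate ones from the paper's analysis, and the GEMM intensity is an illustrative order-of-magnitude stand-in.

```python
# Sketch of the roofline bound: attainable FLOP/s = min(peak, I * BW),
# using the V100 numbers quoted above (125 TFLOP/s FP16, 900 GB/s).

PEAK = 125e12
BW = 900e9

def attainable_flops(I):
    """Roofline upper bound for an op with intensity I (FLOPs/byte)."""
    return min(PEAK, I * BW)

for op, I in [("dropout", 0.5), ("GELU", 1), ("softmax", 3),
              ("layernorm", 5), ("FFN GEMM", 680)]:
    frac = attainable_flops(I) / PEAK
    print(f"{op:10s} I = {I:6.1f}  bound = {frac:5.1%} of peak")
```

The memory-bound operations cap out at a few percent of peak: layer norm at I ≈ 5 cannot exceed 4.5 TFLOP/s on a V100 no matter how the kernel is written.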

Profiling Methodology

The authors instrument transformer training (BERT, GPT-2) and inference across multiple GPU architectures (V100, A100) using NVIDIA’s Nsight profiling tools. They decompose execution time into three categories:

  1. Compute-bound GEMM kernels — the matrix multiplications in Q = XW_Q, K = XW_K, V = XW_V, the attention scores A = QKᵀ / √(d_k), and the feed-forward network layers. These have high arithmetic intensity and achieve good hardware utilization.

  2. Memory-bound non-GEMM kernels — softmax, layer normalization, GELU, dropout, residual connections, and bias additions. Each of these reads its inputs from HBM, applies a cheap elementwise or reduction operation, and writes results back to HBM. The arithmetic is trivial; the cost is entirely in the memory round-trips.

  3. Communication overhead — in distributed training, all-reduce operations for gradient synchronization add latency that overlaps partially with computation but creates pipeline bubbles.

The breakdown reveals that non-GEMM (memory-bound) operations consume 40–70% of total execution time depending on model size and hardware. For smaller models like BERT-Base, the fraction is even higher because the GEMM dimensions are too small to saturate compute units.

Optimization Strategies

The paper proposes and evaluates three categories of optimization, all targeting the memory-bound operations.

Operator Fusion is the most impactful technique. Instead of launching separate GPU kernels for each elementwise operation (each requiring a full HBM read-write round-trip), fused kernels chain multiple operations together. For example, fusing the bias addition, GELU activation, and dropout into a single kernel eliminates two intermediate HBM round-trips. The data stays in registers or shared memory between operations. For a fused sequence of k elementwise operations on a tensor of n elements, the memory traffic drops from O(kn) to O(n).
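A NumPy sketch makes the traffic accounting explicit. NumPy cannot actually fuse kernels, so the point here is the count of tensor-sized HBM round-trips annotated in the comments, not the Python timings; the tanh-approximate GELU and the dropout-mask handling are standard formulations, not code from the paper.

```python
import numpy as np

# Sketch: unfused vs fused bias + GELU + dropout. What matters is the
# number of tensor-sized HBM round-trips, annotated per step; a real
# fused CUDA kernel keeps the intermediates in registers.

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def unfused(x, bias, mask, p=0.1):
    y = x + bias            # read x, write y        (round-trip 1)
    y = gelu(y)             # read y, write y        (round-trip 2)
    y = y * mask / (1 - p)  # read y, write y        (round-trip 3)
    return y                # ~3 reads + 3 writes of the full tensor

def fused(x, bias, mask, p=0.1):
    # Fused: read x once, write the result once (~1 read + 1 write).
    return gelu(x + bias) * mask / (1 - p)

x = np.random.randn(4, 8).astype(np.float32)
bias = np.random.randn(8).astype(np.float32)
mask = (np.random.rand(4, 8) > 0.1).astype(np.float32)
assert np.allclose(unfused(x, bias, mask), fused(x, bias, mask))
```

The results are bitwise-equivalent; only the memory traffic changes, from roughly 3 round-trips of the tensor to 1.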

Data Layout Optimization reorganizes tensor storage to maximize spatial locality. The default NCHW layout in PyTorch can cause strided memory access patterns in certain operations. Switching to NHWC or using memory-aligned layouts reduces cache misses and enables more efficient vectorized loads.
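The stride patterns behind this are easy to inspect. The sketch below builds the same logical tensor in NCHW and NHWC order and prints the byte strides; which layout is "good" depends on which axis an op reduces or vectorizes over, so the shapes here are just an illustrative example.

```python
import numpy as np

# Sketch: how layout changes stride patterns. The same logical [N, C, H, W]
# tensor stored in two orders; an op that walks over channels touches
# contiguous bytes in NHWC but 4 KB-strided bytes in NCHW.

nchw = np.zeros((8, 64, 32, 32), dtype=np.float32)          # [N, C, H, W]
nhwc = np.ascontiguousarray(nchw.transpose(0, 2, 3, 1))     # [N, H, W, C]

# Byte distance between neighbors along each axis:
print(nchw.strides)  # channel neighbors are 4096 bytes apart
print(nhwc.strides)  # channel neighbors are 4 bytes apart (last axis)
```

Vectorized loads and cache lines favor the layout whose innermost stride matches the op's innermost loop, which is why frameworks expose layout as a tunable rather than a fixed choice.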

Communication-Computation Overlap pipelines all-reduce gradient synchronization with backward-pass computation. By partitioning gradient tensors and scheduling communication for each partition as soon as its gradients are ready (rather than waiting for the full backward pass to complete), the authors hide communication latency behind useful compute.
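The scheduling idea can be sketched with a toy timeline: backward compute produces gradient buckets one at a time (last layer first), and each bucket's all-reduce is dispatched asynchronously as soon as it is ready. `fake_backward_layer` and `fake_allreduce` are stand-ins for real compute and NCCL communication; the sleeps are illustrative, not measurements.

```python
import concurrent.futures
import time

# Sketch: overlapping gradient all-reduce with backward compute.
# Each bucket's communication is launched the moment its gradients
# exist, instead of after the whole backward pass finishes.

def fake_backward_layer(i):
    time.sleep(0.01)          # stand-in for layer i's backward compute
    return f"grad_{i}"

def fake_allreduce(grad):
    time.sleep(0.01)          # stand-in for NCCL all-reduce of one bucket
    return f"synced_{grad}"

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
pending = []
for layer in reversed(range(4)):          # backward visits last layer first
    grad = fake_backward_layer(layer)     # compute (blocking)
    pending.append(pool.submit(fake_allreduce, grad))  # comm (async)

synced = [f.result() for f in pending]    # all comm hidden behind compute
print(synced)
```

In a real framework this is what gradient bucketing in PyTorch DDP does: communication for early buckets runs concurrently with backward compute for later layers, shrinking the pipeline bubble.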

Why This Gets Worse Over Time

A crucial observation in the paper is that hardware trends are making this problem worse, not better. GPU compute throughput has been growing faster than memory bandwidth across successive generations:

| GPU  | Peak FP16 FLOPS | HBM Bandwidth | Compute/BW Ratio (FLOPs/byte) |
|------|-----------------|---------------|-------------------------------|
| V100 | 125 TFLOPS      | 900 GB/s      | 139                           |
| A100 | 312 TFLOPS      | 2039 GB/s     | 153                           |
| H100 | 990 TFLOPS      | 3350 GB/s     | 296                           |

As the ratio increases, the ridge point on the roofline moves rightward, meaning more operations fall into the memory-bound regime. An operation that was marginally compute-bound on a V100 may become memory-bound on an H100. This trend means that the data movement problem identified in the paper becomes more acute with each hardware generation, not less — making the proposed optimizations increasingly relevant.
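The ratios in the table follow directly from the specs, and recomputing them shows the ridge point drifting rightward across generations:

```python
# Sketch: recomputing the compute-to-bandwidth ratios (roofline ridge
# points) from the table above. As the ratio rises, an op needs ever
# higher arithmetic intensity to remain compute-bound.

gpus = {
    "V100": (125e12, 900e9),
    "A100": (312e12, 2039e9),
    "H100": (990e12, 3350e9),
}
for name, (flops, bw) in gpus.items():
    print(f"{name}: ridge point ~ {flops / bw:.0f} FLOPs/byte")
```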

Key Results

The profiling results are quantitative. On BERT-Large training with a V100:

  • GEMM operations account for roughly 60% of FLOPs but only 30–40% of wall-clock time
  • Non-GEMM (memory-bound) operations account for less than 5% of FLOPs but 40–60% of wall-clock time
  • Kernel launch overhead and memory allocation consume an additional 5–10%

After applying operator fusion, the authors report 1.3–2.0x speedups on non-GEMM operations, translating to 1.1–1.3x end-to-end training speedups. These gains grow as hardware gets faster: on an A100, whose compute throughput outgrew its memory bandwidth relative to the V100, the memory bottleneck is more pronounced and fusion yields larger relative improvements.

For attention specifically, the softmax computation is a bottleneck because it requires a reduction across the sequence dimension (to compute the max for numerical stability and the normalization sum), followed by an elementwise exponentiation. This pattern — reduction followed by elementwise — is difficult to fuse naively but can benefit from online softmax algorithms that compute the result in a single pass.
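The single-pass idea can be sketched in a few lines, in the style of the online softmax of Milakov and Gibiansky: maintain a running max `m` and a running sum `s` of exponentials, rescaling `s` whenever the max grows. This is an illustrative NumPy version, not a fused kernel; a real kernel would also rescale outputs on the fly rather than doing the final normalization as a separate pass.

```python
import numpy as np

# Sketch: online (single-pass) softmax. Track the running max m and the
# running sum s of exp(x - m); when a new max appears, rescale s by
# exp(m_old - m_new) so all terms stay relative to the current max.

def online_softmax(x):
    m, s = -np.inf, 0.0
    for v in x:                      # single streaming pass over the data
        m_new = max(m, v)
        s = s * np.exp(m - m_new) + np.exp(v - m_new)
        m = m_new
    return np.exp(x - m) / s         # final normalization

x = np.array([1.0, 3.0, 2.0, 5.0])
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), ref)
```

The running-max trick preserves the numerical stability of the standard max-subtracted softmax while collapsing the max pass and the sum pass into one, which is exactly what makes the reduction-then-elementwise pattern fusable.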

Critical Analysis

Strengths:

  • The paper provides one of the first rigorous, operation-level breakdowns of where time goes in transformer training. Prior work tended to profile at the layer level, obscuring the dominance of memory-bound operations.
  • The roofline model framework gives practitioners a systematic way to reason about whether a given optimization will help — there is no point in algorithmic improvements for operations that are already compute-bound.
  • The findings generalize across model sizes and GPU architectures, making them broadly applicable.

Limitations:

  • The optimizations proposed are relatively standard systems techniques (fusion, layout optimization, communication overlap). The paper’s contribution is more in the analysis than in the solutions.
  • The profiling was conducted on V100 and A100 hardware. Newer architectures with different compute-to-bandwidth ratios (e.g., H100 with its higher memory bandwidth from HBM3) may shift the bottleneck balance, though the qualitative conclusion — that non-GEMM operations are memory-bound — is unlikely to change.
  • The paper does not address attention-specific optimizations like FlashAttention (Dao et al. 2022), which later demonstrated that even the “compute-bound” attention mechanism benefits from memory-aware tiling that avoids materializing the full N × N attention matrix in HBM.
  • Sparse attention patterns and mixture-of-experts architectures, which change the operational intensity profile, are not covered.
  • The paper focuses on training workloads but does not deeply analyze autoregressive inference, where the operational intensity is qualitatively different: each token generation step involves matrix-vector (not matrix-matrix) multiplications, making even the GEMM operations memory-bound.

Impact and Legacy

This paper helped establish a critical shift in how the ML systems community thinks about transformer optimization: from “reduce FLOPs” to “reduce data movement.” The insight that most transformer operations are memory-bound directly motivated subsequent work.

FlashAttention (Dao et al. 2022) is perhaps the most prominent descendant — it applies exactly the tiling-and-fusion philosophy advocated here, but extends it to the attention mechanism itself, achieving 2–4x wall-clock speedups by avoiding HBM materialization of attention matrices. FlashAttention tiles the Q, K, and V matrices into blocks that fit in SRAM, computes partial attention outputs within each tile, and accumulates results without ever writing the full N × N attention matrix to HBM. This is operator fusion taken to its logical conclusion for the attention operator.
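The tiling-plus-online-softmax idea can be sketched in NumPy; this is an illustrative reconstruction of the FlashAttention recurrence, not the real CUDA kernel, and the block size and shapes are arbitrary.

```python
import numpy as np

# Sketch of FlashAttention-style tiling: process K/V in blocks, keeping a
# running row max m, running softmax denominator l, and a rescaled output
# accumulator O, never materializing the full N x N score matrix.

def tiled_attention(Q, K, V, block=64):
    n, d = Q.shape
    O = np.zeros((n, d))
    m = np.full(n, -np.inf)          # running row max
    l = np.zeros(n)                  # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kb, Vb = K[j:j+block], V[j:j+block]
        S = Q @ Kb.T / np.sqrt(d)            # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)            # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        O = O * scale[:, None] + P @ Vb
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = ref / ref.sum(axis=1, keepdims=True) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```

Per query row, only one `block × d` slice of K and V is live at a time, which is what lets the real kernel keep the working set in SRAM.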

The broader trend toward fused CUDA kernels in libraries like NVIDIA’s Transformer Engine, xFormers, and Triton-based custom kernels all reflect the core message of this paper: optimize for bytes moved, not FLOPs computed. PyTorch’s torch.compile and NVIDIA’s TensorRT both implement kernel fusion passes that automate the kind of manual optimizations the paper describes.

The paper’s roofline-based analysis framework has become a standard tool for evaluating whether a proposed architectural change (e.g., grouped-query attention, multi-query attention) will translate to actual wall-clock improvements on real hardware. When researchers propose attention variants that reduce FLOPs but increase memory traffic (or vice versa), the roofline model provides the analytical framework to predict which will actually be faster on a given device.

For LLM inference specifically, the data movement analysis explains why autoregressive token generation is so slow: generating each token requires loading the entire model’s weights from HBM (a memory-bound operation), but performs only a single matrix-vector multiply per layer. This insight directly motivates techniques like KV caching, speculative decoding, and batching strategies that amortize weight loading across multiple tokens or sequences.
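A back-of-the-envelope calculation shows just how low the intensity of single-token decoding is. The 7B-parameter FP16 model below is a hypothetical example; the ~2 FLOPs per weight (one multiply-add) is a standard rough count.

```python
# Sketch: why batch-1 autoregressive decoding is memory-bound. Per token,
# every weight is read once from HBM but used in only one matrix-vector
# product. Hypothetical 7B-parameter FP16 model; rough accounting only.

params = 7e9
bytes_per_param = 2
flops_per_token = 2 * params           # ~2 FLOPs per weight (multiply-add)
bytes_per_token = params * bytes_per_param

I = flops_per_token / bytes_per_token
print(f"decode intensity ~ {I:.0f} FLOP/byte")   # far below any ridge point

# Batching b sequences reuses each weight load b times:
for b in (1, 8, 64):
    print(f"batch {b}: I ~ {I * b:.0f} FLOPs/byte")
```

At I ≈ 1, decoding sits two orders of magnitude below the ridge point of any GPU in the table above; batching multiplies the intensity by the batch size, which is why it is the primary lever for inference throughput.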

  • Attention Is All You Need — the original transformer architecture whose operations this paper profiles
  • Optimizing Transformer Inference — survey covering complementary optimization techniques including pruning, quantization, and knowledge distillation
  • Vision Transformer — ViT extends transformers to vision, introducing the same data movement challenges in a new domain
