TL;DR
Transformer models have become the dominant architecture across NLP and computer vision, but their inference cost — in latency, memory, and energy — is a major deployment bottleneck. This survey systematically covers five families of optimization techniques: knowledge distillation, pruning, quantization, efficient architecture design (including attention approximations), and hardware-level acceleration. It provides a taxonomy of methods within each family and discusses how they compose, giving practitioners a structured map of the optimization landscape as of mid-2023.
The Inference Cost Problem
The computational cost of transformer inference scales quadratically with sequence length due to self-attention and linearly with model width and depth. For a transformer with L layers, hidden dimension d, and sequence length n, the FLOPs per forward pass are approximately:

FLOPs ≈ L · (24nd² + 4n²d)

The first term covers the linear projections (Q, K, V, output, and two FFN layers), while the second term covers attention score computation and value aggregation. For large language models with d = 4096 and n = 2048, the linear projection term dominates. For long-context models where n grows to several times d, the quadratic attention term becomes the bottleneck.
Beyond FLOPs, inference is constrained by memory bandwidth (loading model weights from HBM for each token during autoregressive decoding), memory capacity (storing KV caches that grow linearly with sequence length), and latency (sequential token generation in autoregressive models cannot be parallelized). The survey organizes optimization techniques around reducing one or more of these costs.
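Taking this cost model at face value, both the FLOP count and the KV-cache footprint can be sketched in a few lines of Python. Constant-factor conventions vary between papers; this follows the common 2-FLOPs-per-multiply-add count with a 4d FFN intermediate width, and the model shape below is only illustrative:

```python
def transformer_flops(num_layers, d, n):
    """Rough FLOPs per forward pass over n tokens (multiply-add = 2 FLOPs).

    24*n*d^2 per layer: Q/K/V/output projections (8*n*d^2) plus the two
    FFN matmuls with the usual 4d intermediate width (16*n*d^2).
    4*n^2*d per layer: QK^T score computation plus value aggregation.
    """
    return num_layers * (24 * n * d**2 + 4 * n**2 * d)

def kv_cache_bytes(num_layers, d, n, batch=1, bytes_per_elem=2):
    """KV cache footprint: a K and a V tensor of shape [n, d] per layer."""
    return 2 * num_layers * n * d * batch * bytes_per_elem

# Illustrative 32-layer, d=4096 model at n=2048 (roughly 7B-parameter scale):
flops = transformer_flops(32, 4096, 2048)
cache = kv_cache_bytes(32, 4096, 2048)   # 1 GiB at fp16
```

At this shape the linear-projection term (24d per token-pair) exceeds the attention term (4n), matching the survey's observation that attention only dominates at long context lengths.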
Knowledge Distillation
Knowledge distillation (KD) trains a smaller “student” model to mimic a larger “teacher” model, compressing the model while retaining much of the teacher’s accuracy. The standard KD loss minimizes the KL divergence between teacher and student output distributions:

ℒ = α · ℒ_task + (1 − α) · T² · KL(p_T ∥ p_S)

where p_T and p_S are the output distributions of teacher and student, softened by temperature T, and α balances distillation against the task loss.
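A minimal sketch of this combined loss, following the common Hinton-style formulation (the temperature T, the T² gradient-scaling factor, and the direction of the KL term are conventions that vary across papers):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax: higher T flattens the distribution."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, T=2.0):
    """alpha * task cross-entropy + (1 - alpha) * T^2 * KL(p_T || p_S).

    The T^2 factor keeps gradient magnitudes comparable as T changes.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    task_ce = -math.log(softmax(student_logits)[label])
    return alpha * task_ce + (1 - alpha) * T**2 * kl
```

When the student already matches the teacher exactly, the KL term vanishes and only the task loss remains.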
The survey categorizes KD methods by what knowledge is transferred:
- Output-level distillation (e.g., DistilBERT): match the teacher’s logits or soft labels. DistilBERT reduces BERT’s parameters by 40% while retaining 97% of its performance.
- Attention-level distillation: match the teacher’s attention maps, forcing the student to learn similar attention patterns. This provides a stronger learning signal than output matching alone.
- Hidden-state distillation: match intermediate representations, layer-by-layer. This requires a mapping between teacher and student layers (since they may differ in depth).
- Task-agnostic vs. task-specific: task-agnostic distillation pre-trains a general student, while task-specific distillation fine-tunes on a specific downstream dataset with teacher guidance.
Pruning
Pruning removes redundant parameters or structures from a trained model. The survey organizes pruning along three axes: saliency criterion, sparsity pattern, and granularity.
Saliency criteria determine which parameters to remove. Zeroth-order methods use weight magnitude (remove the smallest weights). First-order methods use gradient information (remove weights whose removal changes the loss least, estimated via the Taylor expansion Δℒ ≈ gᵀ · δw, where g is the gradient and δw the weight perturbation). Second-order methods use the Hessian to estimate the impact of removal more precisely, at higher computational cost.
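The zeroth-order criterion is simple enough to sketch directly (a hypothetical `magnitude_prune` helper; ties at the threshold may prune slightly more than the requested fraction):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with smallest |w|."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # the k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0)

w = np.array([[0.1, -2.0], [0.3, 4.0]])
pruned = magnitude_prune(w, 0.5)   # drops the two smallest: 0.1 and 0.3
```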
Sparsity patterns range from unstructured (any individual weight can be pruned, producing irregular sparse matrices) to structured (entire attention heads, FFN neurons, or full layers are removed). Unstructured pruning achieves higher compression ratios — transformers can often tolerate 50–70% unstructured sparsity with minimal accuracy loss — but structured pruning yields direct speedups on standard hardware without sparse matrix support. Semi-structured patterns like NVIDIA’s 2:4 sparsity (exactly 2 zeros in every block of 4 elements) offer a hardware-friendly middle ground with native acceleration on Ampere and newer GPUs.
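The 2:4 pattern can be emulated in NumPy by zeroing the two smallest-magnitude entries in every contiguous group of four. This is only a sketch of the constraint itself; real deployments rely on libraries such as cuSPARSELt for the actual sparse kernels, and the last dimension is assumed to be a multiple of 4:

```python
import numpy as np

def prune_2_4(weights):
    """Enforce 2:4 semi-structured sparsity along the last axis."""
    groups = weights.reshape(-1, 4).copy()
    # positions of the two smallest-magnitude entries in each group of 4
    smallest = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, smallest, 0.0, axis=1)
    return groups.reshape(weights.shape)

w = np.array([1.0, -3.0, 0.5, 2.0, 4.0, 0.1, -0.2, 5.0])
sparse = prune_2_4(w)   # each group of 4 keeps its 2 largest-magnitude values
```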
Attention head pruning is a transformer-specific technique that removes entire attention heads determined to be redundant. Studies have shown that in BERT-Base (12 heads per layer, 12 layers), many heads can be removed with minimal impact, suggesting substantial redundancy in the standard multi-head attention design.
Quantization
Quantization reduces the numerical precision of weights and activations from 32-bit floating point to lower bitwidths. The mapping from a floating-point value x to a quantized integer x_q is:

x_q = clamp(round(x / s) + z, q_min, q_max)
where s is a scale factor and z is a zero-point offset. The survey covers several quantization strategies:
- Post-training quantization (PTQ) quantizes a pre-trained model without retraining. W8A8 (8-bit weights and activations) typically preserves accuracy well. More aggressive W4A16 (4-bit weights, 16-bit activations) reduces memory by 4x and accelerates memory-bound inference in autoregressive decoding.
- Quantization-aware training (QAT) simulates quantization during training via straight-through estimators, allowing the model to adapt to lower precision. QAT generally achieves better accuracy than PTQ at the same bitwidth, but requires full retraining.
- Mixed-precision quantization assigns different bitwidths to different layers or operations based on sensitivity analysis. Attention projections and output layers tend to be more sensitive to quantization than FFN layers, so mixed-precision schemes keep critical layers at higher precision.
A persistent challenge is activation outliers: certain hidden dimensions in transformer activations can have values 10–100x larger than the median, making uniform quantization destructive. Techniques like SmoothQuant (Xiao et al. 2023) address this by mathematically migrating the quantization difficulty from activations to weights, which have smoother distributions.
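A sketch of the asymmetric uniform scheme above, plus a toy demonstration of the outlier problem. This assumes simple min/max calibration at 8 bits; real PTQ pipelines use percentile clipping or per-channel scales:

```python
import numpy as np

def quantize(x, bits=8):
    """x_q = clamp(round(x / s) + z), with s and z from the min/max range."""
    qmax = 2**bits - 1
    s = (x.max() - x.min()) / qmax          # scale factor
    z = int(round(-x.min() / s))            # zero-point offset
    xq = np.clip(np.round(x / s) + z, 0, qmax)
    return xq.astype(np.int32), s, z

def dequantize(xq, s, z):
    return (xq - z) * s

# Well-behaved tensor: reconstruction error is bounded by the step size s.
x = np.linspace(-1.0, 1.0, 9)
xq, s, z = quantize(x)
err = np.abs(dequantize(xq, s, z) - x).max()

# A single 100x outlier stretches s, so typical values lose most precision.
a = np.array([0.1, -0.2, 0.05, 100.0])
aq, sa, za = quantize(a)
outlier_err = np.abs(dequantize(aq, sa, za)[:3] - a[:3]).max()
```

The outlier inflates the scale by roughly 50x, which is exactly the failure mode SmoothQuant mitigates by shifting difficulty onto the weights.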
Efficient Attention Approximations
Standard self-attention computes A = softmax(QKᵀ / √d_k)V, which requires O(n²) time and memory. The survey covers several families of efficient alternatives:
- Sparse attention (Longformer, BigBird) restricts each token to attend to a subset of positions — local windows, global tokens, and random connections — reducing cost to O(n · k) where k ≪ n.
- Linear attention (Performer, Random Feature Attention) replaces the softmax kernel with an approximate feature map φ such that softmax(QKᵀ) ≈ φ(Q)φ(K)ᵀ. Reassociating the product as φ(Q)(φ(K)ᵀV) makes the cost linear in n rather than quadratic.
- LSH attention (Reformer) uses locality-sensitive hashing to group similar queries and keys, computing attention only within each hash bucket.
- Multi-query and grouped-query attention (MQA, GQA) share key and value heads across multiple query heads, reducing the KV cache size by a factor of h/g, where h is the number of query heads and g is the number of KV groups.
The survey notes that these approximations involve quality-efficiency trade-offs. Sparse and linear attention methods often degrade performance on tasks requiring full-context reasoning, while MQA/GQA maintain quality better because they do not approximate the attention computation itself — they reduce memory overhead without changing the mathematical operation.
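The reordering trick behind linear attention is easy to demonstrate. The sketch below uses a placeholder positive feature map rather than a real softmax approximation (Performer, for instance, uses random features designed for that); the point is only that φ(K)ᵀV can be computed first, making the cost linear in n:

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Attention with softmax(QK^T) replaced by phi(Q) phi(K)^T.

    Reassociating as phi(Q) @ (phi(K)^T @ V) costs O(n * r * d) rather
    than O(n^2 * d), since phi(K)^T @ V is only an (r x d) matrix.
    """
    Qp, Kp = phi(Q), phi(K)              # (n, r) feature-mapped Q and K
    kv = Kp.T @ V                        # (r, d): computed once, linear in n
    normalizer = Qp @ Kp.sum(axis=0)     # row sums of phi(Q) phi(K)^T
    return (Qp @ kv) / normalizer[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 4))     # toy n=5, d=4
out = linear_attention(Q, K, V)
```

The reassociated form is numerically identical to materializing the full n × n kernel matrix, which the test below checks directly.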
Hardware-Level Optimizations
The final category covers optimizations at the hardware and systems level:
- Operator fusion combines multiple elementwise operations (layer norm, GELU, bias addition) into single GPU kernels, reducing HBM round-trips. FlashAttention is the most impactful example, fusing the entire attention computation (the QKᵀ matmul, softmax, and value aggregation) into a single tiled kernel that avoids materializing the n × n attention matrix.
- KV cache management techniques like PagedAttention (used in vLLM) apply virtual memory concepts to the KV cache, reducing memory fragmentation and enabling higher batch sizes during serving.
- Speculative decoding uses a small draft model to generate candidate tokens in parallel, which the large model then verifies in a single forward pass. This converts sequential token generation into partially parallel verification, reducing latency without changing output quality.
- Custom hardware accelerators (Groq, Cerebras, custom FPGA designs) exploit transformer-specific computation patterns for higher throughput and energy efficiency than general-purpose GPUs.
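The accept/reject logic of greedy speculative decoding can be sketched with toy stand-in models. Here `draft_next` and `target_next` are hypothetical callables returning each model's greedy next token; real implementations verify all k positions in a single batched forward pass and use probabilistic acceptance when sampling:

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One speculative round: the draft proposes k tokens, the target verifies.

    Returns the longest prefix of the proposal matching the target's greedy
    choices, plus one token from the target (the correction, or a bonus
    token). Under greedy decoding this matches running the target alone.
    """
    # 1. Cheap draft model proposes k tokens autoregressively.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # 2. Target verifies (in practice: one parallel pass over all k positions).
    ctx = list(prefix)
    accepted = []
    for tok in proposed:
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    # 3. Target contributes one token beyond the accepted prefix.
    accepted.append(target_next(ctx))
    return accepted

count_up = lambda ctx: ctx[-1] + 1   # toy "model": next token = last + 1
perfect = speculative_step([1], count_up, count_up, k=3)          # draft agrees
bad_draft = speculative_step([1], lambda ctx: 99, count_up, k=3)  # draft wrong
```

When the draft agrees with the target, k + 1 tokens are produced per target pass; when it diverges immediately, progress falls back to one token, which is why draft quality drives the speedup.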
Critical Analysis
Strengths:
- The taxonomy is well-organized and covers the field broadly. The five-category framework (distillation, pruning, quantization, architecture, hardware) provides a useful mental model for practitioners evaluating optimization strategies.
- The survey covers both NLP and vision transformers, noting where techniques differ between modalities. Pruning strategies that work for BERT do not always transfer to ViT, and the survey highlights these distinctions.
- The discussion of composability — how pruning and quantization can be applied together, or how distillation can target an already-pruned model — is practically valuable.
Limitations:
- The survey was published in mid-2023 and consequently misses several important developments: GGUF quantization formats, AWQ (Activation-aware Weight Quantization), and the rapid progress in 2-bit and 1.58-bit quantization (BitNet).
- The treatment of each technique is necessarily shallow given the survey’s breadth. Practitioners working on a specific optimization (e.g., PTQ for LLMs) will need to consult dedicated papers for implementation details.
- The paper lacks empirical comparisons across methods. It reports results from individual papers but does not run controlled experiments to compare techniques on the same models and hardware, making apples-to-apples assessment difficult.
- The focus is predominantly on single-device optimization. Multi-device serving strategies (tensor parallelism, pipeline parallelism, expert parallelism for MoE models) receive limited coverage despite being critical for LLM deployment at scale.
Impact and Legacy
This survey serves as a reference map for the transformer optimization landscape. Its primary value is in providing structure to a rapidly growing field — rather than introducing new techniques, it organizes existing work into a coherent taxonomy that helps researchers identify gaps and practitioners select appropriate methods.
The broader trend the survey documents — the shift from “make models bigger” to “make inference cheaper” — has only accelerated since publication. The techniques it covers now form the standard toolkit for deploying LLMs in production: quantization (GPTQ, AWQ, bitsandbytes), KV cache optimization (PagedAttention in vLLM and TGI), speculative decoding (Medusa, EAGLE), and operator fusion (FlashAttention, FlashDecoding). Understanding this taxonomy is essential background for anyone working on model serving.
Related Reading
- Attention Is All You Need — the original transformer architecture whose inference costs this survey addresses
- Data Movement Is All You Need — complementary analysis showing that data movement, not arithmetic, is the primary bottleneck in transformer execution
- Vision Transformer (ViT) — extends transformers to vision, introducing additional inference optimization challenges for dense prediction tasks
- DINO — self-supervised ViT training that produces models requiring the same inference optimization techniques surveyed here
