Overview
Modern deep learning models have grown too large to fit on a single GPU. Training models like GPT-4 or Llama 70B requires distributing computation across hundreds or thousands of GPUs. Distributed parallelism provides three orthogonal strategies for scaling training: Data Parallel (split the batch), Tensor Parallel (split within layers), and Pipeline Parallel (split the model into sequential stages).
Understanding when to use each strategy—and how to combine them in 3D parallelism—is essential for efficient large-scale training. This guide covers the fundamentals of each approach, their trade-offs, and practical implementation with PyTorch and DeepSpeed.
Key Concepts
Data Parallelism (DDP)
Replicate the full model on each GPU and split the training batch. All-reduce gradients after backward pass to keep models synchronized.
Tensor Parallelism (TP)
Split individual weight matrices across GPUs. Each GPU computes partial results that are combined via all-reduce or all-gather per layer.
Pipeline Parallelism (PP)
Partition the model into sequential stages. Each GPU processes a different set of layers, passing activations forward through the pipeline.
3D Parallelism
Combine DP, TP, and PP to scale to thousands of GPUs. TP within nodes (NVLink), PP across nodes, DP for throughput scaling.
ZeRO Optimization
Partition optimizer states, gradients, and parameters across data-parallel ranks to reduce memory footprint without changing the parallelism model.
Communication Collectives
All-reduce (average gradients), all-gather (collect distributed data), reduce-scatter (scatter reduced results) form the backbone of distributed training.
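As a quick illustration of these collectives, the sketch below assumes a process group has already been initialized (for example via torchrun) and uses only the core torch.distributed calls; the tensor contents are made up for the example.

```python
import torch
import torch.distributed as dist

def demo_collectives(local_rank: int) -> None:
    """Illustrative use of the three core collectives (assumes dist.init_process_group already ran)."""
    world = dist.get_world_size()
    device = torch.device("cuda", local_rank)

    # all-reduce: every rank ends up with the sum; divide by world size to average gradients
    grad = torch.ones(4, device=device) * dist.get_rank()
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= world

    # all-gather: every rank collects the shards held by all ranks
    shard = torch.full((4,), float(dist.get_rank()), device=device)
    gathered = [torch.empty_like(shard) for _ in range(world)]
    dist.all_gather(gathered, shard)

    # reduce-scatter: sum across ranks, then each rank keeps one shard of the reduced result
    inputs = [torch.ones(4, device=device) for _ in range(world)]
    output = torch.empty(4, device=device)
    dist.reduce_scatter(output, inputs, op=dist.ReduceOp.SUM)
```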
Three Dimensions of Parallelism
Data Parallel
Same model on all GPUs. Split batch, sync gradients. Simple and efficient when model fits in memory.
Tensor Parallel
Split weight matrices across GPUs. High communication overhead—needs NVLink within nodes.
Pipeline Parallel
Split model by layers. Point-to-point communication. Use micro-batching to reduce bubble time.
Data Parallelism: Split the Batch
Data parallelism is the simplest and most widely used strategy. Each GPU holds a complete copy of the model and processes a different portion of the training batch. After computing local gradients, GPUs synchronize via all-reduce to average gradients before updating weights.
DDP Training Flow
Load Data
Each GPU loads its portion of the batch
Synchronous Training
All GPUs must complete their batch before gradients are synchronized. Slowest GPU determines throughput.
Gradient Averaging
All-reduce computes the mean gradient across all replicas, ensuring all models stay synchronized.
When to Use Data Parallelism
- Model fits comfortably in single GPU memory
- Need to increase training throughput
- Simple setup with minimal code changes
PyTorch DistributedDataParallel Setup
```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group with NCCL backend
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Create model and wrap with DDP
model = MyModel().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# Training loop - gradients are automatically synchronized
for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
    loss = criterion(model(inputs), labels)
    loss.backward()  # All-reduce happens here
    optimizer.step()
    optimizer.zero_grad()
```
- Synchronous training: All GPUs must complete before the next iteration
- Gradient averaging: All-reduce divides by world size automatically
- Bucket fusion: Small gradients are grouped for efficient communication
- Find unused parameters: Set find_unused_parameters=True for dynamic graphs
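The loop above assumes the dataloader already gives each rank its own shard of the data. A common way to arrange that (sketched below; train_dataset and num_epochs are placeholder names, and the process group from the DDP snippet is assumed to be initialized) is DistributedSampler plus a torchrun launch:

```python
from torch.utils.data import DataLoader, DistributedSampler

# DistributedSampler partitions the dataset across ranks so each GPU sees a disjoint shard.
sampler = DistributedSampler(train_dataset, shuffle=True)   # train_dataset: your map-style Dataset
dataloader = DataLoader(train_dataset, batch_size=32, sampler=sampler, pin_memory=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)          # reshuffle consistently across ranks each epoch
    for inputs, labels in dataloader:
        ...                           # forward/backward as in the DDP loop above

# Launch one process per GPU, e.g.:
#   torchrun --nproc_per_node=8 train_ddp.py
```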
Tensor Parallelism: Split Within Layers
When a single layer is too large for one GPU, tensor parallelism splits weight matrices horizontally. Each GPU performs part of the matrix multiplication, then combines results via collective communication.
Tensor Parallel Matrix Splitting
Column Parallel (Split Output Features)
- Weight matrix split along columns (output dimension)
- Each GPU computes partial output features
- All-Gather concatenates results
- Used in: MLP first linear layer, attention QKV projection
Row Parallel (Split Input Features)
- Weight matrix split along rows (input dimension)
- Input must already be partitioned across GPUs
- All-Reduce sums partial outputs
- Used in: MLP second linear layer, attention output projection
Why Pair Column + Row Parallel?
In transformer MLPs, the first linear layer is column parallel (it splits the hidden dimension) and the second is row parallel (its input is already partitioned). Because the column-parallel output is sharded exactly as the row-parallel input expects, the intermediate All-Gather can be skipped, leaving a single All-Reduce per MLP block in the forward pass.
Column vs Row Parallel
In transformer MLPs, we pair column parallel (first linear) with row parallel (second linear):
- Column Parallel: Split output dimension, each GPU computes partial features, all-gather to combine
- Row Parallel: Split input dimension (already partitioned), each GPU computes partial output, all-reduce to sum
This pairing minimizes communication: the intermediate all-gather is eliminated entirely, so the forward pass needs only a single all-reduce per MLP block.
Column vs Row Parallel Deep Dive
In tensor parallelism, we split weight matrices across GPUs. But there are exactly two ways to slice a matrix: by columns or by rows. These aren't just different cuts—they have fundamentally different properties for how inputs must be distributed and how outputs must be combined.
Column Parallel: Split by Columns
In column-parallel mode, we slice the weight matrix vertically—each GPU gets a subset of the columns. Since each column of W produces one element of the output, each GPU computes a portion of the output features.
The Key Property
Input is replicated (same X on all GPUs), output is split (each GPU has part of Y). To get the full output, we must concatenate the partial results—this requires an all-gather operation.
Column-Parallel Mathematics
If W has shape [H_in, H_out] and we use TP = N GPUs, W is split along its columns into W_1, ..., W_N, each of shape [H_in, H_out/N].
Each GPU i computes: Y_i = X · W_i, with shape [B, S, H_out/N]
Final output requires concatenation: Y = concat(Y_1, ..., Y_N) along the last dimension (all-gather)
Row Parallel: Split by Rows
In row-parallel mode, we slice the weight matrix horizontally—each GPU gets a subset of the rows. Since each row of W corresponds to one input feature, the input X must also be split. Each GPU computes a partial sum that must be combined.
The Key Property
Input is split (each GPU gets part of X along the last dimension), and the output needs summation. Each GPU produces a partial result with the same shape—to get the final output, we must sum them via all-reduce.
Row-Parallel Mathematics
If W has shape [H_in, H_out] and we use TP = N GPUs, W is split along its rows into W_1, ..., W_N, each of shape [H_in/N, H_out].
Input X must also be split (along the last dimension): X = [X_1, ..., X_N], each X_i of shape [B, S, H_in/N]
Each GPU i computes a partial product: Z_i = X_i · W_i, with shape [B, S, H_out]
Final output requires summation: Y = Z_1 + Z_2 + ... + Z_N (all-reduce)
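To make the two slicing schemes concrete, the following single-process sketch simulates the N shards with tensor chunks (plain torch, no distributed setup; the shapes are arbitrary toy values) and checks that concatenation and summation both recover the full matmul:

```python
import torch

B, S, H_in, H_out, N = 2, 4, 8, 16, 4      # toy sizes; N plays the role of the TP degree
X = torch.randn(B, S, H_in)
W = torch.randn(H_in, H_out)
Y_ref = X @ W                              # un-parallelized reference result

# Column parallel: split W along the output dimension, replicate X.
W_cols = W.chunk(N, dim=1)                 # each chunk: [H_in, H_out/N]
Y_cols = [X @ W_i for W_i in W_cols]       # each partial: [B, S, H_out/N]
Y_col = torch.cat(Y_cols, dim=-1)          # "all-gather": concatenate partial features
assert torch.allclose(Y_col, Y_ref, atol=1e-5)

# Row parallel: split W along the input dimension, split X to match.
W_rows = W.chunk(N, dim=0)                 # each chunk: [H_in/N, H_out]
X_parts = X.chunk(N, dim=-1)               # each chunk: [B, S, H_in/N]
Y_rows = [X_i @ W_i for X_i, W_i in zip(X_parts, W_rows)]  # each partial: [B, S, H_out]
Y_row = torch.stack(Y_rows).sum(dim=0)     # "all-reduce": sum partial outputs
assert torch.allclose(Y_row, Y_ref, atol=1e-5)
```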
The Critical Difference
| Aspect | Column Parallel | Row Parallel |
|---|---|---|
| Weight split direction | Vertically (by columns) | Horizontally (by rows) |
| Input requirement | Replicated (same on all GPUs) | Split along last dimension |
| Partial output shape | [B, S, H_out/N] — different features | [B, S, H_out] — partial sums |
| Communication op | All-Gather (concatenate) | All-Reduce (sum) |
| Output state | Replicated | Replicated |
| Comm volume | ~M bytes | ~2M bytes (reduce-scatter + all-gather) |
The Perfect Pair: Why They're Complementary
Here's the magic insight that makes Megatron-style tensor parallelism work: column-parallel output is already split in exactly the format row-parallel needs as input!
- Column-parallel output state: Each GPU has [B, S, H/N] — split along last dim
- Row-parallel input requirement: Needs [B, S, H/N] — split along last dim
- Result: We can feed column-parallel output directly to row-parallel input with zero communication!
- Total comm: Just 1 all-reduce at the end (after row-parallel), not 2 separate operations
Communication Volume Analysis
ALL-GATHER (Column Parallel)
Each GPU sends: M/N bytes
Each GPU receives: M × (N-1)/N bytes
Total per GPU: ≈ M bytes
ALL-REDUCE (Row Parallel)
Reduce-scatter: M × (N-1)/N
All-gather: M × (N-1)/N
Total per GPU: ≈ 2M bytes
All-reduce costs roughly 2× the bandwidth of all-gather. This is why the Megatron pattern places row-parallel at the end of each block—where we need replicated output anyway (for residual connections).
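As a worked example with assumed sizes, take a bf16 activation of shape [B=8, S=2048, H=8192] on an 8-way TP group:

```python
# Worked example with hypothetical sizes: bf16 activation of shape [B, S, H], N-way TP.
B, S, H, bytes_per_el, N = 8, 2048, 8192, 2, 8
M = B * S * H * bytes_per_el                  # full tensor size: 256 MiB
all_gather_per_gpu = M * (N - 1) / N          # ~224 MiB, approaches M as N grows
all_reduce_per_gpu = 2 * M * (N - 1) / N      # ~448 MiB: reduce-scatter + all-gather, about 2M
print(f"M = {M / 2**20:.0f} MiB, all-gather ~ {all_gather_per_gpu / 2**20:.0f} MiB, "
      f"all-reduce ~ {all_reduce_per_gpu / 2**20:.0f} MiB per GPU per layer")
```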
Column Parallel: Keep input whole, split output features. Good for the "up projection" where we expand dimensions.
Row Parallel: Split input features, keep output whole. Good for the "down projection" where we reduce dimensions.
Together: They form a perfect pair. Column-parallel's split output is exactly what row-parallel needs as split input. Chain them to eliminate intermediate communication. This is the foundation of all modern tensor parallel implementations.
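Below is a minimal sketch of the paired pattern, assuming a tensor-parallel process group tp_group already exists; it ignores biases, dropout, and the backward-pass communication operators, so it is illustrative rather than a drop-in Megatron layer.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class PairedParallelMLP(nn.Module):
    """Column-parallel up-projection followed by row-parallel down-projection (one forward all-reduce)."""

    def __init__(self, hidden: int, ffn: int, tp_group: dist.ProcessGroup):
        super().__init__()
        tp = dist.get_world_size(tp_group)
        self.tp_group = tp_group
        # Column parallel: this rank holds ffn/tp of the output features.
        self.w_up = nn.Parameter(torch.randn(hidden, ffn // tp) * 0.02)
        # Row parallel: this rank holds ffn/tp of the input features.
        self.w_down = nn.Parameter(torch.randn(ffn // tp, hidden) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.nn.functional.gelu(x @ self.w_up)  # [B, S, ffn/tp], stays sharded: no communication
        y = h @ self.w_down                          # [B, S, hidden], partial sum on this rank
        dist.all_reduce(y, group=self.tp_group)      # single forward all-reduce restores the full output
        return y
```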
Tensor parallelism requires communication after every layer, not just after the backward pass. This demands high-bandwidth interconnects like NVLink (600+ GB/s per GPU). Using TP across nodes over InfiniBand (typically 200-400 Gb/s, i.e. roughly 25-50 GB/s per link) will severely bottleneck training.
Pipeline Parallelism: Split by Stages
Pipeline parallelism partitions the model into sequential stages, with each GPU responsible for a subset of layers. Micro-batching keeps GPUs busy by processing multiple smaller batches simultaneously.
Pipeline Schedules: GPipe vs 1F1B
GPipe
All forward passes for the m micro-batches run first, then all backward passes. Simple to schedule, but every stage must hold activations for all m micro-batches until the backward phase begins.
1F1B Memory Advantage
1F1B maintains at most p activations in memory (one per stage), while GPipe stores m activations until backward begins. With large micro-batch counts, 1F1B uses significantly less memory.
The Bubble Problem
In naive pipelining, GPUs sit idle waiting for activations from previous stages. This creates "bubbles" of wasted compute time. The bubble ratio depends on the schedule:
- Naive (m = 1): ~75% bubble time with 4 stages, since (p-1)/p = 3/4
- GPipe: bubble time of (p-1)/m relative to useful compute (p = stages, m = micro-batches), equivalently (p-1)/(m+p-1) of total time
- 1F1B: the same bubble fraction, (p-1)/(m+p-1), but with much lower activation memory
1F1B (One Forward One Backward) interleaves forward and backward passes, allowing earlier activation release. GPipe stores m activations per stage until backward begins. With p stages, 1F1B uses only p activations in flight regardless of micro-batch count.
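For intuition, here is the bubble arithmetic with assumed values p = 4 stages and m = 16 micro-batches:

```python
# Pipeline bubble for p stages and m micro-batches (hypothetical values).
p, m = 4, 16
naive_bubble = (p - 1) / p                 # no micro-batching (m = 1): 75% of the time is idle
gpipe_overhead = (p - 1) / m               # bubble relative to useful compute: 18.75%
bubble_fraction = (p - 1) / (m + p - 1)    # bubble as a fraction of total time: ~15.8%
print(naive_bubble, gpipe_overhead, bubble_fraction)
```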
3D Parallelism: Combining All Three
For training the largest models (GPT-3, Llama 70B+), we combine all three strategies. The key insight is to match parallelism type to interconnect bandwidth:
3D Parallelism Configuration
Data Parallel (DP): split the batch across replicas
Tensor Parallel (TP): split layers within a node
Pipeline Parallel (PP): split the model into sequential stages
TP Communication
Highest bandwidth needed. Use NVLink within node (600+ GB/s). All-reduce/all-gather per layer.
PP Communication
Point-to-point between stages. Moderate bandwidth needed. Can span nodes via InfiniBand.
DP Communication
Gradient sync once per iteration. Can span multiple nodes. All-reduce over network.
Example: Training Llama 70B on 64 GPUs
```bash
# Megatron-LM style configuration
TP=8   # 8-way tensor parallel within DGX node (NVLink)
PP=4   # 4-way pipeline parallel across nodes
DP=2   # 2-way data parallel for larger batches
# Total: 8 × 4 × 2 = 64 GPUs
```
Communication Hierarchy
| Dimension | Communication Pattern | Frequency | Best Interconnect |
|---|---|---|---|
| TP | All-reduce/gather per layer | Very high | NVLink (600+ GB/s) |
| PP | Point-to-point between stages | Medium | InfiniBand (200-400 Gb/s per link) |
| DP | All-reduce gradients per iteration | Low | Network (any bandwidth) |
Megatron-LM 3D Parallelism Configuration
```bash
# Training Llama 70B on 64 H100 GPUs (8 nodes × 8 GPUs)
# TP=8 within each node (NVLink), PP=4 across nodes;
# the data-parallel degree is inferred from the world size: 64 / (8 × 4) = 2
python train.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 4 \
    --micro-batch-size 1 \
    --global-batch-size 1024 \
    --num-layers 80 \
    --hidden-size 8192 \
    --num-attention-heads 64
# Total: 8 × 4 × 2 = 64 GPUs
```
ZeRO: Memory-Efficient Data Parallelism
ZeRO (Zero Redundancy Optimizer) reduces memory footprint by partitioning optimizer states, gradients, and parameters across data-parallel ranks instead of replicating them.
ZeRO Memory Optimization Stages
Standard DDP
Full replication on every GPU
ZeRO-1
Partition optimizer states
ZeRO-2
Partition optimizer + gradients
ZeRO-3
Partition everything
Communication Tradeoff
Each ZeRO stage reduces memory at the cost of more communication. ZeRO-1 has minimal overhead (optimizer states rarely accessed). ZeRO-3 requires all-gather for every forward pass, adding ~50% communication overhead but enabling training of models 8× larger.
DeepSpeed ZeRO-3 Config
{ "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu" }, "offload_param": { "device": "cpu" }, "overlap_comm": true, "contiguous_gradients": true } }
Memory Breakdown
For a model with Φ parameters using Adam optimizer with mixed precision:
| Component | Memory (per GPU) | Notes |
|---|---|---|
| FP16 Parameters | 2Φ | Model weights |
| FP16 Gradients | 2Φ | Computed during backward |
| FP32 Optimizer States | 12Φ | Adam: params + momentum + variance |
| Total | 16Φ | Standard DDP per GPU |
ZeRO Stages
- ZeRO-1: Partition optimizer states → 4Φ per GPU (4× memory reduction)
- ZeRO-2: + Partition gradients → 2.5Φ per GPU (8× memory reduction)
- ZeRO-3: + Partition parameters → ~1Φ per GPU (16× memory reduction)
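The arithmetic behind those numbers can be sketched in a few lines; this assumes the 2Φ + 2Φ + 12Φ breakdown from the table above and a data-parallel degree Nd, and the 7B-parameter example in the comment is illustrative only.

```python
def zero_bytes_per_param(stage: int, nd: int) -> float:
    """Approximate per-GPU bytes per parameter with mixed-precision Adam (2 + 2 + 12 = 16 bytes total)."""
    fp16_params, fp16_grads, fp32_optim = 2.0, 2.0, 12.0
    if stage >= 1:
        fp32_optim /= nd        # ZeRO-1: shard optimizer states across Nd ranks
    if stage >= 2:
        fp16_grads /= nd        # ZeRO-2: additionally shard gradients
    if stage >= 3:
        fp16_params /= nd       # ZeRO-3: additionally shard the parameters themselves
    return fp16_params + fp16_grads + fp32_optim

for stage in range(4):
    # For a hypothetical 7B-parameter model and Nd = 64: DDP ~112 GB, ZeRO-3 ~1.75 GB per GPU.
    print(stage, zero_bytes_per_param(stage, nd=64) * 7e9 / 1e9, "GB per GPU")
```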
DeepSpeed ZeRO-3 Configuration
{ "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto" }, "fp16": { "enabled": true, "loss_scale_window": 100 } }
PyTorch FSDP (Fully Sharded Data Parallel) implements ZeRO-3-style sharding natively in PyTorch. DeepSpeed ZeRO offers more configuration options and offloading capabilities, while FSDP provides tighter integration with the PyTorch ecosystem.
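For comparison, a minimal FSDP sketch (assuming the process group is already initialized as in the DDP example and reusing the hypothetical MyModel, criterion, and dataloader names; auto-wrap policies and mixed-precision settings are omitted):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# FSDP shards parameters, gradients, and optimizer state across ranks (ZeRO-3 style).
model = MyModel().cuda()
model = FSDP(model)                        # default strategy: full sharding
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # construct AFTER wrapping

for inputs, labels in dataloader:          # same loop shape as the DDP example above
    loss = criterion(model(inputs.cuda()), labels.cuda())
    loss.backward()                        # gradients are reduce-scattered to their owning shards
    optimizer.step()
    optimizer.zero_grad()
```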
Choosing the Right Strategy
Parallelism Decision Tree
- Does the model fit in single GPU memory? If yes, use DDP and scale throughput.
- If not, and the GPUs share a node with NVLink, use Tensor Parallel.
- If the model must span nodes, use Pipeline Parallel to split it by layers.
- If memory is the main constraint and you want to keep plain data parallelism, use ZeRO/FSDP.
Strategy Comparison
| Strategy | Memory Efficiency | Communication | Complexity | Best For |
|---|---|---|---|---|
| DDP | Low (full replication) | Low (once per iter) | Simple | Small models, throughput scaling |
| ZeRO-1/2 | Medium | Low-Medium | Moderate | Memory-constrained DDP |
| ZeRO-3/FSDP | High | High | Moderate | Very large models, simple API |
| TP | High | Very High | Complex | Large models within node |
| PP | High | Medium | Complex | Cross-node, layer-wise split |
| 3D Parallel | Highest | Optimized | Most Complex | Largest models (100B+) |
Real-World Applications
Fine-tuning LLMs
Fine-tuning a 7B model on consumer GPUs
Pre-training Foundation Models
Training models from scratch at massive scale
Multi-GPU Inference
Serving large models that exceed single GPU memory
Research with Limited Resources
Training large models on university clusters
Production Training Pipelines
Consistent, reproducible large-scale training
Mixture-of-Experts Training
Training sparse models with expert parallelism
Best Practices
- Start with DDP: Begin with data parallelism and only add complexity when needed. DDP is well-tested and has minimal overhead.
- Match Parallelism to Interconnect: Use TP within nodes (NVLink), PP across nodes (InfiniBand), and DP anywhere. Never use TP across nodes without NVSwitch.
- Profile Before Optimizing: Use NVIDIA Nsight Systems to identify whether you are compute-bound or communication-bound before adding parallelism.
- Tune Micro-batch Size: For pipeline parallelism, use at least 4× as many micro-batches as stages to minimize bubble overhead.
- Enable Gradient Checkpointing: Trade compute for memory by recomputing activations during backward pass. Essential for very deep models.
- Use Mixed Precision: FP16/BF16 training halves memory usage and doubles throughput on Tensor Cores. Always enable when possible.
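As a minimal mixed-precision sketch (fp16 with loss scaling via torch.cuda.amp; model, criterion, optimizer, and dataloader are assumed from the earlier DDP example, and bf16 training can skip the GradScaler):

```python
import torch

scaler = torch.cuda.amp.GradScaler()           # loss scaling for fp16 (not needed for bf16)

for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(), labels.cuda()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = criterion(model(inputs), labels)
    scaler.scale(loss).backward()              # scale loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                     # unscales gradients, then calls optimizer.step()
    scaler.update()
    optimizer.zero_grad()
```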
Common Pitfalls to Avoid
Tensor Parallel Across Nodes
Using TP across nodes without NVLink results in 10-30× slower layer computation due to InfiniBand latency
Incorrect World Size Calculation
3D parallelism requires world_size = DP × TP × PP; mismatched sizes cause deadlocks (a quick startup sanity check is sketched after this list)
Memory Fragmentation with ZeRO-3
Frequent all-gather and release of parameters can fragment GPU memory
Unbalanced Pipeline Stages
Stages with different computational costs create bubbles even with good scheduling
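Related to the world-size pitfall above, here is a minimal sanity check one might run at startup; it is only a sketch, and assumes the WORLD_SIZE environment variable set by torchrun plus whatever DP/TP/PP degrees your launcher passes in.

```python
import os

def check_parallel_config(dp: int, tp: int, pp: int) -> None:
    """Fail fast if the parallelism degrees do not multiply to the launched world size."""
    world_size = int(os.environ["WORLD_SIZE"])   # set by torchrun / the launcher
    expected = dp * tp * pp
    if world_size != expected:
        raise ValueError(
            f"world_size={world_size} but DP×TP×PP={dp}×{tp}×{pp}={expected}; "
            "mismatched sizes will hang collectives"
        )

check_parallel_config(dp=2, tp=8, pp=4)          # example degrees from the 64-GPU config above
```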
Advantages & Limitations
Advantages
- ✓Train models 100× larger than single GPU memory allows
- ✓Near-linear scaling efficiency with proper configuration
- ✓Flexibility to trade memory for compute and vice versa
- ✓Production-proven at scale (GPT-4, Llama, PaLM)
- ✓Well-supported by PyTorch, DeepSpeed, Megatron-LM
- ✓CPU/NVMe offloading extends effective memory further
Limitations
- ×Significant engineering complexity for 3D parallelism
- ×Requires high-bandwidth interconnects for efficiency
- ×Debugging distributed training failures is challenging
- ×Communication overhead limits scaling efficiency
- ×Requires careful hyperparameter tuning for each configuration
- ×Different parallelism strategies have different numerical behaviors
