Overview
Modern deep learning models have grown too large to fit on a single GPU. Training a GPT-4-class model or Llama 70B requires distributing computation across hundreds or thousands of GPUs. Distributed parallelism provides three orthogonal strategies to scale training: Data Parallel (split the batch), Tensor Parallel (split within layers), and Pipeline Parallel (split the model into sequential stages).
Understanding when to use each strategy—and how to combine them in 3D parallelism—is essential for efficient large-scale training. This guide covers the fundamentals of each approach, their trade-offs, and practical implementation with PyTorch and DeepSpeed.
Key Concepts
Data Parallelism (DDP)
Replicate the full model on each GPU and split the training batch. All-reduce gradients after backward pass to keep models synchronized.
Tensor Parallelism (TP)
Split individual weight matrices across GPUs. Each GPU computes partial results that are combined via all-reduce or all-gather per layer.
Pipeline Parallelism (PP)
Partition the model into sequential stages. Each GPU processes a different set of layers, passing activations forward through the pipeline.
3D Parallelism
Combine DP, TP, and PP to scale to thousands of GPUs. TP within nodes (NVLink), PP across nodes, DP for throughput scaling.
ZeRO Optimization
Partition optimizer states, gradients, and parameters across data-parallel ranks to reduce memory footprint without changing the parallelism model.
Communication Collectives
All-reduce (average gradients), all-gather (collect distributed data), reduce-scatter (scatter reduced results) form the backbone of distributed training.
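What each collective leaves on each rank can be illustrated with a plain-Python simulation (a conceptual sketch of the semantics, not actual NCCL calls):

```python
def all_reduce(rank_data):
    """Every rank ends up with the elementwise sum of all ranks' tensors."""
    total = [sum(vals) for vals in zip(*rank_data)]
    return [list(total) for _ in rank_data]

def all_gather(rank_data):
    """Every rank ends up with the concatenation of all ranks' shards."""
    gathered = [x for shard in rank_data for x in shard]
    return [list(gathered) for _ in rank_data]

def reduce_scatter(rank_data):
    """Each rank ends up with only its shard of the elementwise sum."""
    total = [sum(vals) for vals in zip(*rank_data)]
    shard = len(total) // len(rank_data)
    return [total[r * shard:(r + 1) * shard] for r in range(len(rank_data))]

# Two ranks, each holding a 4-element local gradient
grads = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(all_reduce(grads))      # both ranks hold [11, 22, 33, 44]
print(reduce_scatter(grads))  # rank 0: [11, 22], rank 1: [33, 44]
```

Note that all-reduce is equivalent to a reduce-scatter followed by an all-gather, which is why ZeRO can replace gradient all-reduce with reduce-scatter at no extra cost.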
Three Dimensions of Parallelism
Data Parallel
Same model on all GPUs. Split batch, sync gradients. Simple and efficient when model fits in memory.
Tensor Parallel
Split weight matrices across GPUs. High communication overhead—needs NVLink within nodes.
Pipeline Parallel
Split model by layers. Point-to-point communication. Use micro-batching to reduce bubble time.
Data Parallelism: Split the Batch
Data parallelism is the simplest and most widely used strategy. Each GPU holds a complete copy of the model and processes a different portion of the training batch. After computing local gradients, GPUs synchronize via all-reduce to average gradients before updating weights.
DDP Training Flow
When to Use Data Parallelism
- Model fits comfortably in single GPU memory
- Need to increase training throughput
- Simple setup with minimal code changes
PyTorch DistributedDataParallel Setup
```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group with NCCL backend
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Create model and wrap with DDP
model = MyModel().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# Training loop - gradients are automatically synchronized
for inputs, labels in dataloader:
    inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
    loss = criterion(model(inputs), labels)
    loss.backward()  # All-reduce happens here
    optimizer.step()
    optimizer.zero_grad()
```
- Synchronous training: All GPUs must complete before the next iteration
- Gradient averaging: All-reduce divides by world size automatically
- Bucket fusion: Small gradients are grouped for efficient communication
- Find unused parameters: Set `find_unused_parameters=True` for dynamic graphs
Tensor Parallelism: Split Within Layers
When a single layer is too large for one GPU, tensor parallelism splits weight matrices horizontally. Each GPU performs part of the matrix multiplication, then combines results via collective communication.
Tensor Parallel Matrix Splitting
Column vs Row Parallel
In transformer MLPs, we pair column parallel (first linear) with row parallel (second linear):
- Column Parallel: Split output dimension, each GPU computes partial features, all-gather to combine
- Row Parallel: Split input dimension (already partitioned), each GPU computes partial output, all-reduce to sum
This pairing minimizes communication: the column-parallel output feeds the row-parallel input directly, so each MLP block needs only a single all-reduce in the forward pass (and one in the backward pass) rather than a gather and a reduce after every linear.
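The algebra behind the pairing can be checked in a few lines (a toy sketch where two "GPUs" are just array slices; the elementwise activation between the two linears is omitted since it would apply locally to each column shard anyway):

```python
def matmul(X, Y):
    """Plain-Python matrix multiply, just for the sketch."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

x = [[1, 2], [3, 4]]                  # activations (batch=2, hidden=2)
A = [[1, 0, 2, 1], [0, 1, 1, 2]]      # first linear (column parallel)
B = [[1, 0], [0, 1], [2, 1], [1, 2]]  # second linear (row parallel)

# Column parallel: each "GPU" gets half of A's output columns
A0 = [row[:2] for row in A]
A1 = [row[2:] for row in A]
y0, y1 = matmul(x, A0), matmul(x, A1)  # each GPU holds half the features

# Row parallel: split B's rows to match -> partial outputs, then all-reduce
B0, B1 = B[:2], B[2:]
out = add(matmul(y0, B0), matmul(y1, B1))  # the "all-reduce" sums partials

assert out == matmul(matmul(x, A), B)  # matches the unsharded MLP
```

Because the column shards `y0`/`y1` line up exactly with the row shards `B0`/`B1`, no gather is needed between the two linears.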
Column vs Row Parallel Deep Dive
Tensor parallelism requires communication after every layer, not just after the backward pass. This demands high-bandwidth intra-node interconnects such as NVLink (600 GB/s on A100, 900 GB/s on H100). Running TP across nodes over InfiniBand (typically 25-50 GB/s per link) will severely bottleneck training.
Pipeline Parallelism: Split by Stages
Pipeline parallelism partitions the model into sequential stages, with each GPU responsible for a subset of layers. Micro-batching keeps GPUs busy by processing multiple smaller batches simultaneously.
Pipeline Schedules: GPipe vs 1F1B
The Bubble Problem
In naive pipelining, GPUs sit idle waiting for activations from previous stages. This creates "bubbles" of wasted compute time. The bubble ratio depends on the schedule:
- Naive: ~75% bubble time with 4 stages
- GPipe: `(p-1)/m` bubble overhead relative to useful compute (p = stages, m = micro-batches)
- 1F1B: `(p-1)/(m+p-1)` bubble as a fraction of total time, plus much lower memory usage
1F1B (One Forward One Backward) interleaves forward and backward passes, allowing earlier activation release. GPipe stores m activations per stage until backward begins. With p stages, 1F1B uses only p activations in flight regardless of micro-batch count.
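Assuming uniform per-stage, per-micro-batch cost, the bubble fraction can be computed directly; the sketch below shows why more micro-batches per stage shrink the bubble:

```python
def bubble_fraction(stages, micro_batches):
    """Pipeline bubble as a fraction of total time, assuming every stage
    takes the same time per micro-batch: (p - 1) / (m + p - 1)."""
    p, m = stages, micro_batches
    return (p - 1) / (m + p - 1)

# More micro-batches -> smaller bubble (p = 4 stages)
for m in (1, 4, 16, 64):
    print(f"m={m:3d}: bubble = {bubble_fraction(4, m):.0%}")
```

With m = 1 this reproduces the ~75% idle time of naive 4-stage pipelining; with m = 16 (4× the stage count) the bubble drops to roughly 16%.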
3D Parallelism: Combining All Three
For training the largest models (GPT-3, Llama 70B+), we combine all three strategies. The key insight is to match parallelism type to interconnect bandwidth:
3D Parallelism Configuration
Communication Hierarchy
| Dimension | Communication Pattern | Frequency | Best Interconnect |
|---|---|---|---|
| TP | All-reduce/gather per layer | Very high | NVLink (600+ GB/s) |
| PP | Point-to-point between stages | Medium | InfiniBand (200-400 Gb/s) |
| DP | All-reduce gradients per iteration | Low | Network (any bandwidth) |
Megatron-LM 3D Parallelism Configuration
```bash
# Training Llama 70B on 64 H100 GPUs (8 nodes × 8 GPUs)
# TP=8 within each node (NVLink), PP=4 across nodes, DP=2 for larger batches
# Total: 8 × 4 × 2 = 64 GPUs
python train.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 4 \
    --data-parallel-size 2 \
    --micro-batch-size 1 \
    --global-batch-size 1024 \
    --num-layers 80 \
    --hidden-size 8192 \
    --num-attention-heads 64
```
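The arithmetic behind this layout can be sanity-checked with a few lines of Python (a standalone helper for illustration, not part of the Megatron-LM API):

```python
def check_3d_config(world_size, tp, pp, dp, global_batch, micro_batch):
    """Validate a 3D parallel layout and return micro-batches per pipeline."""
    assert tp * pp * dp == world_size, "world_size must equal TP × PP × DP"
    # Each data-parallel replica processes global_batch / dp samples,
    # split into micro-batches that flow through the pipeline per step.
    samples_per_replica = global_batch // dp
    return samples_per_replica // micro_batch

# The configuration above: 64 GPUs = TP 8 × PP 4 × DP 2
m = check_3d_config(64, tp=8, pp=4, dp=2, global_batch=1024, micro_batch=1)
print(m)  # 512 micro-batches per pipeline -> negligible bubble with 4 stages
```

A mismatched product (the "Incorrect World Size Calculation" pitfall below) fails the assertion instead of deadlocking at runtime.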
ZeRO: Memory-Efficient Data Parallelism
ZeRO (Zero Redundancy Optimizer) reduces memory footprint by partitioning optimizer states, gradients, and parameters across data-parallel ranks instead of replicating them.
ZeRO Memory Optimization Stages
Standard DDP
Full replication on every GPU
ZeRO-1
Partition optimizer states
ZeRO-2
Partition optimizer + gradients
ZeRO-3
Partition everything
Communication Tradeoff
Each ZeRO stage reduces memory at the cost of more communication. ZeRO-1 has minimal overhead (optimizer states rarely accessed). ZeRO-3 requires all-gather for every forward pass, adding ~50% communication overhead but enabling training of models 8× larger.
DeepSpeed ZeRO-3 Config
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```
Memory Breakdown
For a model with Φ parameters using Adam optimizer with mixed precision:
| Component | Memory (per GPU) | Notes |
|---|---|---|
| FP16 Parameters | 2Φ | Model weights |
| FP16 Gradients | 2Φ | Computed during backward |
| FP32 Optimizer States | 12Φ | Adam: params + momentum + variance |
| Total | 16Φ | Standard DDP per GPU |
ZeRO Stages
- ZeRO-1: Partition optimizer states → 4Φ + 12Φ/N per GPU (up to 4× memory reduction)
- ZeRO-2: + Partition gradients → 2Φ + 14Φ/N per GPU (up to 8× memory reduction)
- ZeRO-3: + Partition parameters → 16Φ/N per GPU (reduction grows linearly with the number of ranks N)
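The memory arithmetic above can be written out as a small calculator (a rough sketch following the per-component breakdown in the table; real frameworks add activation memory, buffers, and fragmentation on top):

```python
def zero_memory_per_gpu(phi, n_ranks, stage=0):
    """Approximate per-GPU state memory for mixed-precision Adam.

    Components: fp16 params (2*phi bytes), fp16 grads (2*phi),
    fp32 optimizer states (12*phi: master params, momentum, variance).
    Each ZeRO stage shards one more component across the ranks.
    """
    params, grads, opt = 2 * phi, 2 * phi, 12 * phi
    if stage >= 1:
        opt /= n_ranks       # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= n_ranks     # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= n_ranks    # ZeRO-3: also shard parameters
    return params + grads + opt

# 70B parameters across 64 data-parallel ranks (GB, with 1 GB = 1e9 bytes)
for s in range(4):
    gb = zero_memory_per_gpu(70e9, 64, s) / 1e9
    print(f"stage {s}: {gb:6.1f} GB per GPU")
```

Stage 0 reproduces the 16Φ of standard DDP, and stage 3 with 64 ranks shrinks the 70B example from 1120 GB to under 20 GB per GPU.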
DeepSpeed ZeRO-3 Configuration
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto"
  },
  "fp16": {
    "enabled": true,
    "loss_scale_window": 100
  }
}
```
PyTorch FSDP (Fully Sharded Data Parallel) implements ZeRO-3 natively in PyTorch. DeepSpeed ZeRO offers more configuration options and offloading capabilities, while FSDP provides tighter integration with the PyTorch ecosystem.
Choosing the Right Strategy
Parallelism Decision Tree
Strategy Comparison
| Strategy | Memory Efficiency | Communication | Complexity | Best For |
|---|---|---|---|---|
| DDP | Low (full replication) | Low (once per iter) | Simple | Small models, throughput scaling |
| ZeRO-1/2 | Medium | Low-Medium | Moderate | Memory-constrained DDP |
| ZeRO-3/FSDP | High | High | Moderate | Very large models, simple API |
| TP | High | Very High | Complex | Large models within node |
| PP | High | Medium | Complex | Cross-node, layer-wise split |
| 3D Parallel | Highest | Optimized | Most Complex | Largest models (100B+) |
Real-World Applications
Fine-tuning LLMs
Fine-tuning a 7B model on consumer GPUs
Pre-training Foundation Models
Training models from scratch at massive scale
Multi-GPU Inference
Serving large models that exceed single GPU memory
Research with Limited Resources
Training large models on university clusters
Production Training Pipelines
Consistent, reproducible large-scale training
Mixture-of-Experts Training
Training sparse models with expert parallelism
Best Practices
- Start with DDP: Begin with data parallelism and only add complexity when needed. DDP is well-tested and has minimal overhead.
- Match Parallelism to Interconnect: Use TP within nodes (NVLink), PP across nodes (InfiniBand), and DP anywhere. Never use TP across nodes without NVSwitch.
- Profile Before Optimizing: Use NVIDIA Nsight Systems to identify whether you are compute-bound or communication-bound before adding parallelism.
- Tune Micro-batch Size: For pipeline parallelism, use at least 4× as many micro-batches as stages to minimize bubble overhead.
- Enable Gradient Checkpointing: Trade compute for memory by recomputing activations during backward pass. Essential for very deep models.
- Use Mixed Precision: FP16/BF16 training halves memory usage and doubles throughput on Tensor Cores. Always enable when possible.
Common Pitfalls to Avoid
Tensor Parallel Across Nodes
Using TP across nodes without NVLink results in 10-30× slower layer computation due to limited inter-node bandwidth and added latency
Incorrect World Size Calculation
3D parallelism requires world_size = DP × TP × PP; mismatched sizes cause deadlocks
Memory Fragmentation with ZeRO-3
Frequent all-gather and release of parameters can fragment GPU memory
Unbalanced Pipeline Stages
Stages with different computational costs create bubbles even with good scheduling
Advantages & Limitations
Advantages
- ✓Train models 100× larger than single GPU memory allows
- ✓Near-linear scaling efficiency with proper configuration
- ✓Flexibility to trade memory for compute and vice versa
- ✓Production-proven at scale (GPT-4, Llama, PaLM)
- ✓Well-supported by PyTorch, DeepSpeed, Megatron-LM
- ✓CPU/NVMe offloading extends effective memory further
Limitations
- ×Significant engineering complexity for 3D parallelism
- ×Requires high-bandwidth interconnects for efficiency
- ×Debugging distributed training failures is challenging
- ×Communication overhead limits scaling efficiency
- ×Requires careful hyperparameter tuning for each configuration
- ×Different parallelism strategies have different numerical behaviors
