
Distributed Parallelism in Deep Learning

GPU distributed parallelism: Data Parallel (DDP), Tensor Parallel, Pipeline Parallel, and ZeRO optimization for training large AI models.


Overview

Modern deep learning models have grown too large to fit on a single GPU. Training GPT-4 or Llama 70B requires distributing computation across hundreds or thousands of GPUs. Distributed parallelism provides three orthogonal strategies to scale training: Data Parallel (split the batch), Tensor Parallel (split the layers), and Pipeline Parallel (split the model stages).

Understanding when to use each strategy—and how to combine them in 3D parallelism—is essential for efficient large-scale training. This guide covers the fundamentals of each approach, their trade-offs, and practical implementation with PyTorch and DeepSpeed.

Key Concepts

Data Parallelism (DDP)

Replicate the full model on each GPU and split the training batch. All-reduce gradients after backward pass to keep models synchronized.

Tensor Parallelism (TP)

Split individual weight matrices across GPUs. Each GPU computes partial results that are combined via all-reduce or all-gather per layer.

Pipeline Parallelism (PP)

Partition the model into sequential stages. Each GPU processes a different set of layers, passing activations forward through the pipeline.

3D Parallelism

Combine DP, TP, and PP to scale to thousands of GPUs. TP within nodes (NVLink), PP across nodes, DP for throughput scaling.

ZeRO Optimization

Partition optimizer states, gradients, and parameters across data-parallel ranks to reduce memory footprint without changing the parallelism model.

Communication Collectives

All-reduce (average gradients), all-gather (collect distributed data), reduce-scatter (scatter reduced results) form the backbone of distributed training.
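These collectives are easiest to see as list operations. Below is a single-process Python sketch that simulates what each collective leaves on every rank; the function names are illustrative, and real training delegates all of this to torch.distributed with the NCCL backend.

```python
# Single-process sketch of what each collective leaves on every rank.
# Each "rank" is just a list entry here, so the semantics are easy to inspect.

def all_reduce(per_rank):
    """Every rank ends up with the elementwise sum of all ranks' data."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]

def all_gather(per_rank):
    """Every rank ends up with the concatenation of all ranks' shards."""
    gathered = [x for shard in per_rank for x in shard]
    return [list(gathered) for _ in per_rank]

def reduce_scatter(per_rank):
    """Sum elementwise, then each rank keeps only its shard of the result."""
    world = len(per_rank)
    total = [sum(vals) for vals in zip(*per_rank)]
    shard = len(total) // world
    return [total[r * shard:(r + 1) * shard] for r in range(world)]

# Four ranks, each holding a local 4-element gradient vector
grads = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0],
         [3.0, 4.0, 5.0, 6.0], [4.0, 5.0, 6.0, 7.0]]
print(all_reduce(grads)[0])     # [10.0, 14.0, 18.0, 22.0]
print(reduce_scatter(grads))    # [[10.0], [14.0], [18.0], [22.0]]
```

DDP's gradient averaging is this all-reduce followed by division by the world size; ZeRO-2 instead uses reduce-scatter so each rank keeps only the gradient shard it owns.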

Three Dimensions of Parallelism

[Diagram: the three strategies side by side. Data parallelism splits the batch (e.g. 32 samples become 8 per GPU) while each GPU keeps a full model copy, with an all-reduce of gradients after the backward pass; memory cost is the full model per GPU. Tensor parallelism splits each layer horizontally (a 4096×16384 linear becomes four 4096×4096 shards) with an all-reduce after every layer, so it needs high-bandwidth NVLink (~600 GB/s) within a node; memory is 1/N of the model per GPU. Pipeline parallelism splits the layers vertically into stages (e.g. layers 0-7, 8-15, 16-23, 24-31 across four GPUs), using micro-batches to keep GPUs busy while bubbles of idle time remain between stages; memory is 1/P of the layers per GPU.]

Data Parallel

Same model on all GPUs. Split batch, sync gradients. Simple and efficient when model fits in memory.

Tensor Parallel

Split weight matrices across GPUs. High communication overhead—needs NVLink within nodes.

Pipeline Parallel

Split model by layers. Point-to-point communication. Use micro-batching to reduce bubble time.

Data Parallelism: Split the Batch

Data parallelism is the simplest and most widely used strategy. Each GPU holds a complete copy of the model and processes a different portion of the training batch. After computing local gradients, GPUs synchronize via all-reduce to average gradients before updating weights.

DDP Training Flow

When to Use Data Parallelism

  • Model fits comfortably in single GPU memory
  • Need to increase training throughput
  • Simple setup with minimal code changes

PyTorch DistributedDataParallel Setup

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group with NCCL backend
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Create model and wrap with DDP
model = MyModel().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# Training loop - gradients are automatically synchronized
for batch in dataloader:
    inputs, labels = batch
    inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
    loss = criterion(model(inputs), labels)
    loss.backward()  # All-reduce happens here
    optimizer.step()
    optimizer.zero_grad()
```
Key DDP Behaviors
  • Synchronous training: All GPUs must complete before the next iteration
  • Gradient averaging: All-reduce divides by world size automatically
  • Bucket fusion: Small gradients are grouped for efficient communication
  • Find unused parameters: Set find_unused_parameters=True for dynamic graphs
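The reason gradient averaging recovers full-batch training: for a mean-reduced loss over equal-sized shards, the average of per-shard gradients equals the full-batch gradient. A pure-Python check with a toy one-parameter least-squares model (names illustrative):

```python
# For a mean-reduced loss over equal-sized shards, the full-batch gradient
# equals the average of per-shard gradients -- exactly what DDP's
# all-reduce-and-divide computes. Toy model: loss(w) = mean((w*x - y)^2).

def grad(w, batch):
    """d/dw of mean((w*x - y)^2) over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
full_batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 6.0)]
shards = [full_batch[:2], full_batch[2:]]       # "GPU 0" and "GPU 1"

local_grads = [grad(w, shard) for shard in shards]
averaged = sum(local_grads) / len(local_grads)  # the all-reduce + divide step

assert abs(averaged - grad(w, full_batch)) < 1e-12
```

Note this identity requires equal shard sizes, which is why DistributedSampler pads the dataset so every rank sees the same number of samples.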

Tensor Parallelism: Split Within Layers

When a single layer is too large for one GPU, tensor parallelism splits weight matrices horizontally. Each GPU performs part of the matrix multiplication, then combines results via collective communication.

Tensor Parallel Matrix Splitting

Column vs Row Parallel

In transformer MLPs, we pair column parallel (first linear) with row parallel (second linear):

  1. Column Parallel: Split output dimension, each GPU computes partial features, all-gather to combine
  2. Row Parallel: Split input dimension (already partitioned), each GPU computes partial output, all-reduce to sum

This pairing minimizes communication by doing one all-gather and one all-reduce per MLP block instead of two of each.
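A numeric sketch of this pairing, with plain Python lists standing in for GPU shards on two simulated GPUs. The nonlinearity between the two linears is omitted; in practice it is applied locally on the column-parallel partial outputs, which is why no communication is needed between the two matmuls.

```python
# Numeric sketch of the column/row pairing on two "GPUs",
# using plain Python lists in place of GPU tensors.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = [[1.0, 2.0], [3.0, 4.0]]        # input activations [batch=2, d=2]
W1 = [[1.0, 2.0, 3.0, 4.0],         # first linear  [2 x 4]
      [5.0, 6.0, 7.0, 8.0]]
W2 = [[1.0, 0.0],                   # second linear [4 x 2]
      [0.0, 1.0],
      [1.0, 1.0],
      [2.0, 0.0]]

# Column parallel: each GPU holds half of W1's output columns
W1_a = [row[:2] for row in W1]      # GPU 0: columns 0-1
W1_b = [row[2:] for row in W1]      # GPU 1: columns 2-3
H_a, H_b = matmul(X, W1_a), matmul(X, W1_b)   # partial hidden features

# Row parallel: each GPU holds the matching half of W2's input rows
W2_a, W2_b = W2[:2], W2[2:]
Y_a, Y_b = matmul(H_a, W2_a), matmul(H_b, W2_b)

# All-reduce: summing the partial outputs equals the unsplit computation
Y = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(Y_a, Y_b)]
assert Y == matmul(matmul(X, W1), W2)
```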

Column vs Row Parallel Deep Dive

High Bandwidth Required

Tensor parallelism requires communication after every layer, not just after the backward pass. This demands high-bandwidth interconnects like NVLink (600+ GB/s). Using TP across nodes over InfiniBand (200-400 Gb/s, roughly 25-50 GB/s) will severely bottleneck training.
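A back-of-the-envelope way to see the gap: a ring all-reduce moves roughly 2(n-1)/n of the tensor per GPU, so per-layer communication time scales with message size over link bandwidth. The helper below is illustrative only (bandwidth term, no latency or overlap modeling), and the numbers are assumptions, not measurements.

```python
# Bandwidth-only estimate for one ring all-reduce (no latency term).
# Illustrative numbers; real behavior also depends on latency and overlap.

def allreduce_seconds(tensor_bytes, n_gpus, bandwidth_gbps):
    volume = 2 * (n_gpus - 1) / n_gpus * tensor_bytes   # bytes moved per GPU
    return volume / (bandwidth_gbps * 1e9)

msg = 4096 * 4096 * 2                        # a 4096x4096 fp16 slice: 32 MiB
nvlink = allreduce_seconds(msg, 8, 600)      # within a node (NVLink)
infiniband = allreduce_seconds(msg, 8, 50)   # across nodes (~400 Gb/s NDR)
print(round(infiniband / nvlink))            # 12 -- order-of-magnitude slowdown
```

Since this cost is paid per layer, the slowdown compounds across the whole forward and backward pass.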

Pipeline Parallelism: Split by Stages

Pipeline parallelism partitions the model into sequential stages, with each GPU responsible for a subset of layers. Micro-batching keeps GPUs busy by processing multiple smaller batches simultaneously.

Pipeline Schedules: GPipe vs 1F1B

The Bubble Problem

In naive pipelining, GPUs sit idle waiting for activations from previous stages. This creates "bubbles" of wasted compute time. The bubble ratio depends on the schedule:

  • Naive: ~75% bubble time with 4 stages
  • GPipe: (p-1)/m bubble ratio (p=stages, m=micro-batches)
  • 1F1B: (p-1)/(m+p-1) bubble ratio, plus lower memory usage
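These ratios can be checked directly; the helpers below just encode the formulas above (GPipe's ratio is stated relative to ideal compute time, 1F1B's relative to total pipeline time).

```python
# Bubble ratios from the formulas above: p = pipeline stages, m = micro-batches.

def gpipe_bubble(p, m):
    """Bubble overhead relative to ideal compute time."""
    return (p - 1) / m

def one_f_one_b_bubble(p, m):
    """Bubble fraction of total pipeline time."""
    return (p - 1) / (m + p - 1)

# Naive pipelining is the m=1 case: with 4 stages, 75% of slots are idle.
assert one_f_one_b_bubble(4, 1) == 0.75

# More micro-batches shrink the bubble (4 stages, 16 micro-batches):
print(gpipe_bubble(4, 16))                  # 0.1875
print(round(one_f_one_b_bubble(4, 16), 3))  # 0.158
```

This is also why the rule of thumb of at least 4x as many micro-batches as stages works: at m = 4p the bubble fraction falls below 20%.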
1F1B Memory Advantage

1F1B (One Forward One Backward) interleaves forward and backward passes, allowing earlier activation release. GPipe stores m activations per stage until backward begins. With p stages, 1F1B uses only p activations in flight regardless of micro-batch count.

3D Parallelism: Combining All Three

For training the largest models (GPT-3, Llama 70B+), we combine all three strategies. The key insight is to match parallelism type to interconnect bandwidth:

3D Parallelism Configuration

Communication Hierarchy

| Dimension | Communication Pattern | Frequency | Best Interconnect |
|-----------|----------------------|-----------|-------------------|
| TP | All-reduce/gather per layer | Very high | NVLink (600+ GB/s) |
| PP | Point-to-point between stages | Medium | InfiniBand (200+ Gb/s) |
| DP | All-reduce gradients per iteration | Low | Network (any bandwidth) |
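A sketch of the rank layout this hierarchy implies, assuming the common convention that TP ranks are contiguous so each TP group stays on one node's NVLink. The function name and layout are illustrative, not Megatron-LM's actual API.

```python
# Illustrative rank layout: TP ranks contiguous within a node,
# PP groups spanning nodes, DP outermost. Not a real framework API.

def parallel_groups(tp, pp, dp):
    world_size = tp * pp * dp
    tp_groups, pp_groups = [], []
    for d in range(dp):
        for p in range(pp):
            base = (d * pp + p) * tp
            tp_groups.append(list(range(base, base + tp)))  # NVLink-local
    for d in range(dp):
        for t in range(tp):
            pp_groups.append([(d * pp + p) * tp + t for p in range(pp)])
    return world_size, tp_groups, pp_groups

world, tp_groups, pp_groups = parallel_groups(tp=8, pp=4, dp=2)
assert world == 64                      # must equal the launched process count
assert tp_groups[0] == list(range(8))   # ranks 0-7 share one node's NVLink
assert pp_groups[0] == [0, 8, 16, 24]   # pipeline-stage peers for rank 0
```

Launching any other number of processes than tp × pp × dp is exactly the world-size mismatch described in the pitfalls section below.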

Megatron-LM 3D Parallelism Configuration

```shell
# Training Llama 70B on 64 H100 GPUs (8 nodes x 8 GPUs).
# TP=8 stays within each node (NVLink); PP=4 spans nodes.
# Data parallelism is not set by a flag: Megatron-LM derives
# DP = world_size / (TP x PP) = 64 / (8 x 4) = 2.
python train.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 4 \
    --micro-batch-size 1 \
    --global-batch-size 1024 \
    --num-layers 80 \
    --hidden-size 8192 \
    --num-attention-heads 64
```

ZeRO: Memory-Efficient Data Parallelism

ZeRO (Zero Redundancy Optimizer) reduces memory footprint by partitioning optimizer states, gradients, and parameters across data-parallel ranks instead of replicating them.

ZeRO Memory Optimization Stages

[Diagram: per-GPU memory for a model with Φ parameters, partitioned 4 ways. Standard DDP replicates everything: optimizer states (12Φ) + gradients (2Φ) + parameters (2Φ) = 16Φ per GPU. ZeRO-1 partitions the optimizer states (12Φ/4 = 3Φ): 7Φ per GPU. ZeRO-2 also partitions gradients: 5.5Φ per GPU. ZeRO-3 also partitions parameters: 4Φ per GPU.]
Standard DDP

Full replication on every GPU

ZeRO-1

Partition optimizer states

ZeRO-2

Partition optimizer + gradients

ZeRO-3

Partition everything

Communication Tradeoff

Each ZeRO stage reduces memory at the cost of more communication. ZeRO-1 has minimal overhead (optimizer states rarely accessed). ZeRO-3 requires all-gather for every forward pass, adding ~50% communication overhead but enabling training of models 8× larger.

DeepSpeed ZeRO-3 Config
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

Memory Breakdown

For a model with Φ parameters using Adam optimizer with mixed precision:

| Component | Memory (per GPU) | Notes |
|-----------|------------------|-------|
| FP16 Parameters | 2Φ | Model weights |
| FP16 Gradients | 2Φ | Computed during backward |
| FP32 Optimizer States | 12Φ | Adam: params + momentum + variance |
| Total | 16Φ | Standard DDP per GPU |

ZeRO Stages

  • ZeRO-1: Partition optimizer states → 4Φ per GPU (4× memory reduction)
  • ZeRO-2: + Partition gradients → 2.5Φ per GPU (8× memory reduction)
  • ZeRO-3: + Partition parameters → ~1Φ per GPU (16× memory reduction)
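These figures follow mechanically from partitioning the 2Φ (fp16 params) + 2Φ (fp16 grads) + 12Φ (fp32 Adam states) breakdown across n data-parallel ranks. A small calculator (illustrative, not a DeepSpeed API) makes the arithmetic explicit:

```python
# Per-GPU memory, in multiples of the parameter count Φ, for each ZeRO stage;
# n = data-parallel degree. Follows the 2Φ + 2Φ + 12Φ mixed-precision breakdown.

def zero_memory(stage, n):
    params, grads, optim = 2.0, 2.0, 12.0   # fp16 params, fp16 grads, fp32 Adam
    if stage >= 1:
        optim /= n      # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= n      # ZeRO-2: also partition gradients
    if stage >= 3:
        params /= n     # ZeRO-3: also partition parameters
    return params + grads + optim

# n=4 reproduces the partitioned figures shown earlier: 16Φ -> 7Φ -> 5.5Φ -> 4Φ
assert [zero_memory(s, n=4) for s in range(4)] == [16.0, 7.0, 5.5, 4.0]

# At large data-parallel degree, ZeRO-3 approaches 16Φ/n:
print(zero_memory(3, n=64))   # 0.25
```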

DeepSpeed ZeRO-3 Configuration

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto"
  },
  "fp16": {
    "enabled": true,
    "loss_scale_window": 100
  }
}
```
ZeRO vs FSDP

PyTorch FSDP (Fully Sharded Data Parallel) implements ZeRO-3 natively in PyTorch. DeepSpeed ZeRO offers more configuration options and offloading capabilities, while FSDP provides tighter integration with PyTorch ecosystem.

Choosing the Right Strategy

Parallelism Decision Tree

Strategy Comparison

| Strategy | Memory Efficiency | Communication | Complexity | Best For |
|----------|-------------------|---------------|------------|----------|
| DDP | Low (full replication) | Low (once per iter) | Simple | Small models, throughput scaling |
| ZeRO-1/2 | Medium | Low-Medium | Moderate | Memory-constrained DDP |
| ZeRO-3/FSDP | High | High | Moderate | Very large models, simple API |
| TP | High | Very High | Complex | Large models within node |
| PP | High | Medium | Complex | Cross-node, layer-wise split |
| 3D Parallel | Highest | Optimized | Most Complex | Largest models (100B+) |

Real-World Applications

Fine-tuning LLMs

Fine-tuning a 7B model on consumer GPUs

ZeRO-3 + gradient checkpointing on 4×A100 40GB

Pre-training Foundation Models

Training models from scratch at massive scale

3D Parallelism: TP=8, PP=8, DP=16 on 1024 H100s

Multi-GPU Inference

Serving large models that exceed single GPU memory

Tensor parallelism for Llama 70B across 8 GPUs

Research with Limited Resources

Training large models on university clusters

ZeRO-2 with CPU offloading on 8 V100s

Production Training Pipelines

Consistent, reproducible large-scale training

DDP with gradient accumulation for batch size flexibility

Mixture-of-Experts Training

Training sparse models with expert parallelism

Expert Parallelism + DP for MoE layers

Best Practices

  • Start with DDP: Begin with data parallelism and only add complexity when needed. DDP is well-tested and has minimal overhead.
  • Match Parallelism to Interconnect: Use TP within nodes (NVLink), PP across nodes (InfiniBand), and DP anywhere. Never use TP across nodes without NVSwitch.
  • Profile Before Optimizing: Use NVIDIA Nsight Systems to identify whether you are compute-bound or communication-bound before adding parallelism.
  • Tune Micro-batch Size: For pipeline parallelism, use at least 4× as many micro-batches as stages to minimize bubble overhead.
  • Enable Gradient Checkpointing: Trade compute for memory by recomputing activations during backward pass. Essential for very deep models.
  • Use Mixed Precision: FP16/BF16 training halves memory usage and doubles throughput on Tensor Cores. Always enable when possible.

Common Pitfalls to Avoid


Tensor Parallel Across Nodes

Using TP across nodes without NVLink results in 10-30× slower layer computation due to InfiniBand's lower bandwidth and higher latency

Solution: Keep TP within a single DGX node and use PP for cross-node distribution

Incorrect World Size Calculation

3D parallelism requires world_size = DP × TP × PP; mismatched sizes cause deadlocks

Solution: Verify process group setup: TP groups within nodes, PP groups across nodes, DP groups spanning everything

Memory Fragmentation with ZeRO-3

Frequent all-gather and release of parameters can fragment GPU memory

Solution: Use contiguous_gradients=true and tune prefetch bucket sizes

Unbalanced Pipeline Stages

Stages with different computational costs create bubbles even with good scheduling

Solution: Profile layer times and balance parameters per stage, accounting for attention vs FFN differences

Advantages & Limitations

Advantages

  • Train models 100× larger than single GPU memory allows
  • Near-linear scaling efficiency with proper configuration
  • Flexibility to trade memory for compute and vice versa
  • Production-proven at scale (GPT-4, Llama, PaLM)
  • Well-supported by PyTorch, DeepSpeed, Megatron-LM
  • CPU/NVMe offloading extends effective memory further

Limitations

  • Significant engineering complexity for 3D parallelism
  • Requires high-bandwidth interconnects for efficiency
  • Debugging distributed training failures is challenging
  • Communication overhead limits scaling efficiency
  • Requires careful hyperparameter tuning for each configuration
  • Different parallelism strategies have different numerical behaviors
