Distributed Parallelism in Deep Learning

GPU distributed parallelism: Data Parallel (DDP), Tensor Parallel, Pipeline Parallel, and ZeRO optimization for training large AI models.


Overview

Modern deep learning models have grown too large to fit on a single GPU. Training GPT-4 or Llama 70B requires distributing computation across hundreds or thousands of GPUs. Distributed parallelism provides three orthogonal strategies to scale training: Data Parallel (split the batch), Tensor Parallel (split the layers), and Pipeline Parallel (split the model stages).

Understanding when to use each strategy—and how to combine them in 3D parallelism—is essential for efficient large-scale training. This guide covers the fundamentals of each approach, their trade-offs, and practical implementation with PyTorch and DeepSpeed.

Key Concepts

Data Parallelism (DDP)

Replicate the full model on each GPU and split the training batch. All-reduce gradients after backward pass to keep models synchronized.

Tensor Parallelism (TP)

Split individual weight matrices across GPUs. Each GPU computes partial results that are combined via all-reduce or all-gather per layer.

Pipeline Parallelism (PP)

Partition the model into sequential stages. Each GPU processes a different set of layers, passing activations forward through the pipeline.

3D Parallelism

Combine DP, TP, and PP to scale to thousands of GPUs. TP within nodes (NVLink), PP across nodes, DP for throughput scaling.

ZeRO Optimization

Partition optimizer states, gradients, and parameters across data-parallel ranks to reduce memory footprint without changing the parallelism model.

Communication Collectives

All-reduce (average gradients), all-gather (collect distributed data), reduce-scatter (scatter reduced results) form the backbone of distributed training.
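These collectives map directly onto torch.distributed calls. Below is a minimal sketch (not from the original article) of all three on a single multi-GPU node; it assumes the script is launched with torchrun so that rank and world-size environment variables are set.

```python
# Sketch: the three collectives via torch.distributed (NCCL), single node.
import torch
import torch.distributed as dist

def demo_collectives() -> None:
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)                  # one GPU per process (single node)
    device = torch.device("cuda", rank)

    # All-reduce: every rank ends up with the sum; divide by world for the mean.
    grad = torch.full((4,), float(rank), device=device)
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= world

    # All-gather: every rank collects one shard from every other rank.
    shard = torch.full((4,), float(rank), device=device)
    gathered = [torch.empty_like(shard) for _ in range(world)]
    dist.all_gather(gathered, shard)

    # Reduce-scatter: each rank receives one chunk of the summed result.
    chunks = [torch.ones(4, device=device) for _ in range(world)]
    out = torch.empty(4, device=device)
    dist.reduce_scatter(out, chunks, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    demo_collectives()   # launch with: torchrun --nproc_per_node=<gpus> this_file.py
```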

Three Dimensions of Parallelism

[Figure: three-panel overview. Data parallelism splits the batch (each GPU holds a full model copy; gradients are all-reduced after backward). Tensor parallelism splits each layer's weight matrices across GPUs (all-reduce per layer; needs NVLink-class bandwidth; memory is roughly 1/N per GPU). Pipeline parallelism splits the layers into stages (micro-batches keep GPUs busy; bubbles are idle time between stages; memory is roughly 1/P of the layers per GPU).]

Data Parallel

Same model on all GPUs. Split batch, sync gradients. Simple and efficient when model fits in memory.

Tensor Parallel

Split weight matrices across GPUs. High communication overhead—needs NVLink within nodes.

Pipeline Parallel

Split model by layers. Point-to-point communication. Use micro-batching to reduce bubble time.

Data Parallelism: Split the Batch

Data parallelism is the simplest and most widely used strategy. Each GPU holds a complete copy of the model and processes a different portion of the training batch. After computing local gradients, GPUs synchronize via all-reduce to average gradients before updating weights.

DDP Training Flow

[Figure: four GPUs (ranks 0-3), each holding a full model copy and running its slice of the batch (samples 0-8, 8-16, 16-24, 24-32) through forward, backward, and gradient computation.]

  • Load data: each GPU loads its portion of the batch.
  • Synchronous training: all GPUs must complete their batch before gradients are synchronized; the slowest GPU determines throughput.
  • Gradient averaging: all-reduce computes the mean gradient across all replicas, ensuring all models stay synchronized.

When to Use Data Parallelism

  • Model fits comfortably in single GPU memory
  • Need to increase training throughput
  • Simple setup with minimal code changes

PyTorch DistributedDataParallel Setup

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group with NCCL backend
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Create model and wrap with DDP
model = MyModel().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# Training loop - gradients are automatically synchronized
for batch in dataloader:
    inputs, labels = batch
    inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
    loss = criterion(model(inputs), labels)
    loss.backward()        # All-reduce happens here
    optimizer.step()
    optimizer.zero_grad()
```
Key DDP Behaviors
  • Synchronous training: All GPUs must complete before the next iteration
  • Gradient averaging: All-reduce divides by world size automatically
  • Bucket fusion: Small gradients are grouped for efficient communication
  • Find unused parameters: Set find_unused_parameters=True for dynamic graphs
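The snippet above leaves out how each rank receives its slice of the batch. A minimal sketch of the usual approach with DistributedSampler follows; the toy dataset and the train.py launch command are placeholders, and it assumes the process group from the DDP setup above is already initialized.

```python
# Sketch: shard the dataset across ranks so each GPU sees a disjoint slice.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))  # toy data

sampler = DistributedSampler(dataset, shuffle=True)        # one shard per rank
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)    # reshuffle so shards differ between epochs
    for inputs, labels in loader:
        pass                    # forward/backward exactly as in the DDP loop above

# Typical launch, one process per GPU on a 4-GPU node:
#   torchrun --nproc_per_node=4 train.py
```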

Tensor Parallelism: Split Within Layers

When a single layer is too large for one GPU, tensor parallelism splits weight matrices horizontally. Each GPU performs part of the matrix multiplication, then combines results via collective communication.

Tensor Parallel Matrix Splitting

[Figure: column-parallel split of a [4096, 16384] linear layer across two GPUs. The input X [batch, 4096] is identical on both GPUs; each holds a [4096, 8192] weight shard and computes a partial output Yᵢ = X @ Aᵢ of shape [batch, 8192]; an all-gather concatenates [Y₀ | Y₁] into the full [batch, 16384] output.]
Column Parallel (Split Output Features)
  • Weight matrix split along columns (output dimension)
  • Each GPU computes partial output features
  • All-Gather concatenates results
  • Used in: MLP first linear layer, attention QKV projection
Row Parallel (Split Input Features)
  • Weight matrix split along rows (input dimension)
  • Input must already be partitioned across GPUs
  • All-Reduce sums partial outputs
  • Used in: MLP second linear layer, attention output projection
Why Pair Column + Row Parallel?

In transformer MLPs, the first linear layer uses column parallelism (splitting the hidden dimension) and the second uses row parallelism (whose input is already partitioned). Because the column-parallel output is already split exactly the way the row-parallel layer expects, no communication is needed between the two layers; the whole MLP block requires only a single all-reduce at the end of the forward pass.

Column vs Row Parallel

In transformer MLPs, we pair column parallelism (first linear) with row parallelism (second linear):

  1. Column Parallel: split the output dimension; each GPU computes a partial slice of the features. On its own, recombining them would require an all-gather.
  2. Row Parallel: split the input dimension (already partitioned by the previous layer); each GPU computes a partial sum, combined with an all-reduce.

Chained together, the intermediate all-gather disappears: the column-parallel output feeds the row-parallel layer directly, leaving a single all-reduce per MLP block in the forward pass.

Column vs Row Parallel Deep Dive

In tensor parallelism, we split weight matrices across GPUs. But there are exactly two ways to slice a matrix: by columns or by rows. These aren't just different cuts—they have fundamentally different properties for how inputs must be distributed and how outputs must be combined.

Column Parallel: Split by Columns

In column-parallel mode, we slice the weight matrix vertically—each GPU gets a subset of the columns. Since each column of W produces one element of the output, each GPU computes a portion of the output features.


The Key Property

Input is replicated (same X on all GPUs), output is split (each GPU has part of Y). To get the full output, we must concatenate the partial results—this requires an all-gather operation.

[Figure: column-parallel walkthrough with TP = 4. Step 1: replicate X [B, S, H] on all GPUs. Step 2: partition W by columns into W₀…W₃. Step 3: each GPU computes Yᵢ = X · Wᵢ. Step 4: each GPU holds one quarter of the output features. Step 5: an all-gather concatenates [Y₀ | Y₁ | Y₂ | Y₃] along the last dimension, so every GPU ends up with the complete output Y.]

Column-Parallel Mathematics

If W has shape [H_in, H_out] and we use TP = N GPUs:

W = [W₀ | W₁ | ... | W_N-1] where each Wᵢ has shape [H_in, H_out/N]

Each GPU i computes:

Yᵢ = X · Wᵢ → shape [B, S, H_out/N]

Final output requires concatenation:

Y = AllGather([Y₀, Y₁, ..., Y_N-1]) → shape [B, S, H_out]
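The column-parallel algebra can be checked in a few lines on a single process; torch.chunk plays the role of the per-GPU shards and concatenation stands in for the all-gather. The shapes below are illustrative.

```python
# Single-process check: splitting W by columns and concatenating the partial
# outputs ("all-gather") reproduces the full matmul.
import torch

B, S, H_in, H_out, N = 2, 16, 64, 256, 4
X = torch.randn(B, S, H_in)                  # replicated input
W = torch.randn(H_in, H_out)

W_shards = torch.chunk(W, N, dim=1)          # each [H_in, H_out/N]
Y_partials = [X @ Wi for Wi in W_shards]     # each [B, S, H_out/N]
Y = torch.cat(Y_partials, dim=-1)            # concat = all-gather

assert torch.allclose(Y, X @ W, atol=1e-5)
```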

Row Parallel: Split by Rows

In row-parallel mode, we slice the weight matrix horizontally—each GPU gets a subset of the rows. Since each row of W corresponds to one input feature, the input X must also be split. Each GPU computes a partial sum that must be combined.


The Key Property

Input is split (each GPU gets part of X along the last dimension), and the output needs summation. Each GPU produces a partial result with the same shape; to get the final output, we must sum them via all-reduce.

[Figure: row-parallel walkthrough with TP = 4. Step 1: split X along its last dimension into X₀…X₃. Step 2: partition W by rows into W₀…W₃. Step 3: each GPU computes Yᵢ = Xᵢ · Wᵢ, one term of Y = X₀·W₀ + X₁·W₁ + X₂·W₂ + X₃·W₃. Step 4: all partial outputs share the shape [B, S, H_out] but hold partial sums. Step 5: an all-reduce (sum) gives every GPU the complete Y.]

Row-Parallel Mathematics

If W has shape [H_in, H_out] and we use TP = N GPUs:

W = [W₀; W₁; ...; W_N-1] where each Wᵢ has shape [H_in/N, H_out]

Input X must also be split (along last dimension):

X = [X₀ | X₁ | ... | X_N-1] where each Xᵢ has shape [B, S, H_in/N]

Each GPU i computes a partial product:

Yᵢ = Xᵢ · Wᵢ → shape [B, S, H_out]

Final output requires summation:

Y = AllReduce(Y₀ + Y₁ + ... + Y_N-1) → shape [B, S, H_out]
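The same kind of single-process check works for row parallelism; here summing the partial products plays the role of the all-reduce. Shapes are again illustrative.

```python
# Single-process check: splitting X and W along the contracted dimension and
# summing the partial products ("all-reduce") reproduces the full matmul.
import torch

B, S, H_in, H_out, N = 2, 16, 256, 64, 4
X = torch.randn(B, S, H_in)
W = torch.randn(H_in, H_out)

X_shards = torch.chunk(X, N, dim=-1)         # each [B, S, H_in/N]
W_shards = torch.chunk(W, N, dim=0)          # each [H_in/N, H_out]
Y = sum(Xi @ Wi for Xi, Wi in zip(X_shards, W_shards))  # sum = all-reduce

assert torch.allclose(Y, X @ W, atol=1e-4)
```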

The Critical Difference

| Aspect | Column Parallel | Row Parallel |
|---|---|---|
| Weight split direction | Vertically (by columns) | Horizontally (by rows) |
| Input requirement | Replicated (same on all GPUs) | Split along last dimension |
| Partial output shape | [B, S, H_out/N], different features | [B, S, H_out], partial sums |
| Communication op | All-Gather (concatenate) | All-Reduce (sum) |
| Output state | Replicated | Replicated |
| Comm volume | ~M bytes | ~2M bytes (reduce-scatter + all-gather) |

The Perfect Pair: Why They're Complementary

Here's the magic insight that makes Megatron-style tensor parallelism work: column-parallel output is already split in exactly the format row-parallel needs as input!

[Figure: Megatron-style MLP block. The replicated input X [B, S, H] passes through the column-parallel FC1 (W₁: [H, 4H]) and GeLU, the hidden state stays split with no communication, then flows into the row-parallel FC2 (W₂: [4H, H]). The single all-reduce that sums the row-parallel partial outputs is the only communication in the entire MLP forward pass, yielding the replicated output Y [B, S, H].]
Why This Pairing is Brilliant
  • Column-parallel output state: Each GPU has [B, S, H/N] — split along last dim
  • Row-parallel input requirement: Needs [B, S, H/N] — split along last dim
  • Result: We can feed column-parallel output directly to row-parallel input with zero communication!
  • Total comm: Just 1 all-reduce at the end (after row-parallel), not 2 separate operations
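Putting the two halves together, here is a single-process sketch of the column-then-row MLP pattern described above. The dimensions and the per-shard loop are illustrative; the final sum is the only step that would be an all-reduce on real GPUs.

```python
# Sketch: Megatron-style MLP, column-parallel FC1 -> GeLU -> row-parallel FC2,
# with no communication in between and a single "all-reduce" (sum) at the end.
import torch
import torch.nn.functional as F

B, S, H, N = 2, 16, 64, 4
X  = torch.randn(B, S, H)                       # replicated input
W1 = torch.randn(H, 4 * H)                      # FC1 (up projection)
W2 = torch.randn(4 * H, H)                      # FC2 (down projection)

W1_cols = torch.chunk(W1, N, dim=1)             # column-parallel shards of FC1
W2_rows = torch.chunk(W2, N, dim=0)             # row-parallel shards of FC2

# Per-"GPU" work: FC1's split output is exactly the split input FC2 expects.
partials = [F.gelu(X @ W1_cols[i]) @ W2_rows[i] for i in range(N)]
Y = sum(partials)                               # the one all-reduce

assert torch.allclose(Y, F.gelu(X @ W1) @ W2, atol=1e-3)
```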

Communication Volume Analysis

ALL-GATHER (Column Parallel)

Each GPU sends: M/N bytes

Each GPU receives: M × (N-1)/N bytes

Total per GPU: ≈ M bytes

ALL-REDUCE (Row Parallel)

Reduce-scatter: M × (N-1)/N

All-gather: M × (N-1)/N

Total per GPU: ≈ 2M bytes

PRACTICAL IMPLICATION

All-reduce costs roughly twice the bandwidth of all-gather. This is why the Megatron pattern places the row-parallel layer at the end of each block, where we need replicated output anyway (for residual connections).
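For a feel of the absolute numbers, the formulas above can be plugged into a short helper; the batch, sequence length, and hidden size below are illustrative, and M is the size of the full activation tensor in bytes.

```python
# Rough per-GPU communication volume for all-gather vs all-reduce of a
# [batch, seq, hidden] activation in fp16 (2 bytes), ignoring latency/overlap.
def comm_volume_gb(batch, seq, hidden, tp, dtype_bytes=2):
    M = batch * seq * hidden * dtype_bytes
    all_gather = M * (tp - 1) / tp          # ≈ M bytes
    all_reduce = 2 * M * (tp - 1) / tp      # reduce-scatter + all-gather ≈ 2M bytes
    return all_gather / 1e9, all_reduce / 1e9

ag, ar = comm_volume_gb(batch=8, seq=4096, hidden=8192, tp=8)
print(f"all-gather ≈ {ag:.2f} GB/GPU, all-reduce ≈ {ar:.2f} GB/GPU")
```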

Column vs Row: Two Sides of the Same Coin

Column Parallel: Keep input whole, split output features. Good for the "up projection" where we expand dimensions.

Row Parallel: Split input features, keep output whole. Good for the "down projection" where we reduce dimensions.

Together: They form a perfect pair. Column-parallel's split output is exactly what row-parallel needs as split input. Chain them to eliminate intermediate communication. This is the foundation of all modern tensor parallel implementations.

High Bandwidth Required

Tensor parallelism requires communication after every layer, not just after the backward pass. This demands high-bandwidth interconnects like NVLink (600+ GB/s). Using TP across nodes with only InfiniBand (100-400 GB/s) will severely bottleneck training.

Pipeline Parallelism: Split by Stages

Pipeline parallelism partitions the model into sequential stages, with each GPU responsible for a subset of layers. Micro-batching keeps GPUs busy by processing multiple smaller batches simultaneously.

Pipeline Schedules: GPipe vs 1F1B

[Figure: GPipe schedule across 4 GPUs and 4 micro-batches. Each stage runs forward passes F0-F3, then backward passes B3-B0; the idle slots before each stage's work begins and ends form the pipeline bubble.]
GPipe

Run all micro-batch forward passes first, then all backward passes. Micro-batching already improves on the naive schedule, but idle slots remain: with p = 4 pipeline stages and m = 4 micro-batches, the bubble ratio is (p-1)/m = 0.75 relative to useful compute, which works out to roughly 60% efficiency.

The Bubble Problem

In naive pipelining, GPUs sit idle waiting for activations from previous stages. This creates "bubbles" of wasted compute time. The bubble ratio depends on the schedule:

  • Naive: ~75% bubble time with 4 stages
  • GPipe: (p-1)/m bubble ratio (p=stages, m=micro-batches)
  • 1F1B: (p-1)/(m+p-1) bubble ratio, plus lower memory usage
1F1B Memory Advantage

1F1B (One Forward One Backward) interleaves forward and backward passes, allowing earlier activation release. GPipe stores m activations per stage until backward begins. With p stages, 1F1B uses only p activations in flight regardless of micro-batch count.
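The bubble formulas above are easy to tabulate; the snippet below simply evaluates them for a 4-stage pipeline and a few micro-batch counts.

```python
# Evaluate the bubble ratios quoted above: GPipe (p-1)/m vs 1F1B (p-1)/(m+p-1).
def gpipe_bubble(p, m):
    return (p - 1) / m

def one_f_one_b_bubble(p, m):
    return (p - 1) / (m + p - 1)

p = 4
for m in (4, 8, 16, 32):
    print(f"p={p}, m={m:>2}: GPipe bubble = {gpipe_bubble(p, m):.2f}, "
          f"1F1B bubble = {one_f_one_b_bubble(p, m):.2f}")
# More micro-batches per stage shrink the bubble for both schedules.
```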

3D Parallelism: Combining All Three

For training the largest models (GPT-3, Llama 70B+), we combine all three strategies. The key insight is to match parallelism type to interconnect bandwidth:

3D Parallelism Configuration

An example 2 × 2 × 2 layout (8 GPUs):

  • DP = 2: split the batch across replicas
  • TP = 2: split layers within a node
  • PP = 2: split the model into stages

[Figure: the 8 GPUs arranged as a cube with TP, PP, and DP as its three axes.]
TP Communication

Highest bandwidth needed. Use NVLink within node (600+ GB/s). All-reduce/all-gather per layer.

PP Communication

Point-to-point between stages. Moderate bandwidth needed. Can span nodes via InfiniBand.

DP Communication

Gradient sync once per iteration. Can span multiple nodes. All-reduce over network.
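One way to realize this layout in PyTorch is a 3-D device mesh, with TP as the innermost dimension so its ranks share a node. A minimal sketch, assuming a recent PyTorch with torch.distributed.device_mesh and an 8-process job launched with torchrun:

```python
# Sketch: carve 8 ranks into DP x PP x TP = 2 x 2 x 2 process groups, with TP
# innermost so its ranks share a node (NVLink), PP across nodes, DP outermost.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group(backend="nccl")
mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "pp", "tp"))

tp_group = mesh["tp"].get_group()   # all-reduce/all-gather per layer
pp_group = mesh["pp"].get_group()   # point-to-point between pipeline stages
dp_group = mesh["dp"].get_group()   # gradient all-reduce once per iteration
```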

Example: Training Llama 70B on 64 GPUs
```bash
# Megatron-LM style configuration
TP=8   # 8-way tensor parallel within DGX node (NVLink)
PP=4   # 4-way pipeline parallel across nodes
DP=2   # 2-way data parallel for larger batches
# Total: 8 × 4 × 2 = 64 GPUs
```

Communication Hierarchy

| Dimension | Communication Pattern | Frequency | Best Interconnect |
|---|---|---|---|
| TP | All-reduce/all-gather per layer | Very high | NVLink (600+ GB/s) |
| PP | Point-to-point between stages | Medium | InfiniBand (200+ GB/s) |
| DP | All-reduce gradients per iteration | Low | Network (any bandwidth) |

Megatron-LM 3D Parallelism Configuration

```bash
# Training Llama 70B on 64 H100 GPUs (8 nodes × 8 GPUs)
# TP=8 within each node (NVLink), PP=4 across nodes;
# DP=2 follows from the world size: 64 / (8 × 4) = 2, for larger batches.
python train.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 4 \
    --micro-batch-size 1 \
    --global-batch-size 1024 \
    --num-layers 80 \
    --hidden-size 8192 \
    --num-attention-heads 64
# Total: 8 × 4 × 2 = 64 GPUs
```

ZeRO: Memory-Efficient Data Parallelism

ZeRO (Zero Redundancy Optimizer) reduces memory footprint by partitioning optimizer states, gradients, and parameters across data-parallel ranks instead of replicating them.

ZeRO Memory Optimization Stages

[Figure: per-GPU memory for a model with Φ parameters, partitioned across 4 data-parallel ranks. Standard DDP: 16Φ (12Φ optimizer states + 2Φ gradients + 2Φ parameters). ZeRO-1: 7Φ. ZeRO-2: 5.5Φ. ZeRO-3: 4Φ.]

  • Standard DDP: full replication on every GPU
  • ZeRO-1: partition optimizer states
  • ZeRO-2: partition optimizer states and gradients
  • ZeRO-3: partition optimizer states, gradients, and parameters

Communication Tradeoff

Each ZeRO stage reduces memory at the cost of more communication. ZeRO-1 has minimal overhead (optimizer states rarely accessed). ZeRO-3 requires all-gather for every forward pass, adding ~50% communication overhead but enabling training of models 8× larger.

DeepSpeed ZeRO-3 Config
{ "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu" }, "offload_param": { "device": "cpu" }, "overlap_comm": true, "contiguous_gradients": true } }

Memory Breakdown

For a model with Φ parameters using Adam optimizer with mixed precision:

| Component | Memory (per GPU) | Notes |
|---|---|---|
| FP16 Parameters | 2Φ | Model weights |
| FP16 Gradients | 2Φ | Computed during backward |
| FP32 Optimizer States | 12Φ | Adam: params + momentum + variance |
| Total | 16Φ | Standard DDP per GPU |

ZeRO Stages

  • ZeRO-1: Partition optimizer states → 4Φ per GPU (4× memory reduction)
  • ZeRO-2: + Partition gradients → 2.5Φ per GPU (8× memory reduction)
  • ZeRO-3: + Partition parameters → ~1Φ per GPU (16× memory reduction)
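The reductions in this list follow directly from the 16Φ breakdown; a small helper makes the arithmetic explicit for different data-parallel degrees (the dp values below are illustrative, and dp = 4 reproduces the per-GPU figures in the diagram above).

```python
# Bytes per parameter on each GPU: 2 (fp16 params) + 2 (fp16 grads) + 12 (fp32
# Adam states), each divided by dp once its ZeRO stage partitions it.
def zero_bytes_per_param(stage, dp):
    params, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= dp
    if stage >= 2:
        grads /= dp
    if stage >= 3:
        params /= dp
    return params + grads + optim

for dp in (4, 16, 64):
    row = ", ".join(f"stage {s}: {zero_bytes_per_param(s, dp):.2f}" for s in range(4))
    print(f"dp={dp:>2} -> {row}   (multiples of the parameter count Φ)")
```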

DeepSpeed ZeRO-3 Configuration

{ "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto" }, "fp16": { "enabled": true, "loss_scale_window": 100 } }
ZeRO vs FSDP

PyTorch FSDP (Fully Sharded Data Parallel) implements ZeRO-3 natively in PyTorch. DeepSpeed ZeRO offers more configuration options and offloading capabilities, while FSDP provides tighter integration with the PyTorch ecosystem.
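For comparison with the DeepSpeed config, here is a minimal FSDP sketch in native PyTorch; it assumes a torchrun-launched job and reuses the MyModel placeholder from earlier.

```python
# Sketch: ZeRO-3-style sharding with PyTorch FSDP.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = FSDP(
    MyModel(),
    device_id=local_rank,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# The training loop looks like plain DDP; FSDP all-gathers and re-shards
# parameters around each forward/backward pass automatically.
```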

Choosing the Right Strategy

Parallelism Decision Tree

Does the model fit in a single GPU's memory?

  • Yes, and you need more throughput: DDP.
  • No, but it fits within one node: Tensor Parallel over NVLink.
  • No, and it must span nodes: Pipeline Parallel, split by layers.
  • Memory is the constraint but you want to stay data-parallel: ZeRO/FSDP.

Strategy Comparison

| Strategy | Memory Efficiency | Communication | Complexity | Best For |
|---|---|---|---|---|
| DDP | Low (full replication) | Low (once per iter) | Simple | Small models, throughput scaling |
| ZeRO-1/2 | Medium | Low-Medium | Moderate | Memory-constrained DDP |
| ZeRO-3/FSDP | High | High | Moderate | Very large models, simple API |
| TP | High | Very High | Complex | Large models within node |
| PP | High | Medium | Complex | Cross-node, layer-wise split |
| 3D Parallel | Highest | Optimized | Most Complex | Largest models (100B+) |

Real-World Applications

Fine-tuning LLMs

Fine-tuning a 7B model on a handful of GPUs: ZeRO-3 + gradient checkpointing on 4×A100 40GB.

Pre-training Foundation Models

Training models from scratch at massive scale: 3D parallelism with TP=8, PP=8, DP=16 on 1024 H100s.

Multi-GPU Inference

Serving large models that exceed single-GPU memory: tensor parallelism for Llama 70B across 8 GPUs.

Research with Limited Resources

Training large models on university clusters: ZeRO-2 with CPU offloading on 8 V100s.

Production Training Pipelines

Consistent, reproducible large-scale training: DDP with gradient accumulation for batch-size flexibility.

Mixture-of-Experts Training

Training sparse MoE models: expert parallelism combined with DP for the MoE layers.

Best Practices

  • Start with DDP: Begin with data parallelism and only add complexity when needed. DDP is well-tested and has minimal overhead.
  • Match Parallelism to Interconnect: Use TP within nodes (NVLink), PP across nodes (InfiniBand), and DP anywhere. Never use TP across nodes without NVSwitch.
  • Profile Before Optimizing: Use NVIDIA Nsight Systems to identify whether you are compute-bound or communication-bound before adding parallelism.
  • Tune Micro-batch Size: For pipeline parallelism, use at least 4× as many micro-batches as stages to minimize bubble overhead.
  • Enable Gradient Checkpointing: Trade compute for memory by recomputing activations during backward pass. Essential for very deep models.
  • Use Mixed Precision: FP16/BF16 training halves memory usage and doubles throughput on Tensor Cores. Always enable when possible; a short sketch combining this with gradient checkpointing follows below.
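A minimal sketch of the last two practices together, bf16 autocast plus activation checkpointing; model.blocks, dataloader, and optimizer are placeholders for your own training setup.

```python
# Sketch: bf16 autocast + activation checkpointing (recompute in backward).
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(model, x):
    for block in model.blocks:                         # placeholder module list
        x = checkpoint(block, x, use_reentrant=False)  # recompute instead of storing
    return x

for inputs, labels in dataloader:
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = forward_with_checkpointing(model, inputs.cuda())
        loss = F.cross_entropy(logits, labels.cuda())
    loss.backward()          # bf16 needs no GradScaler; fp16 would
    optimizer.step()
    optimizer.zero_grad()
```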

Common Pitfalls to Avoid

Tensor Parallel Across Nodes

Using TP across nodes without NVLink results in 10-30× slower layer computation due to InfiniBand latency. Solution: keep TP within a single DGX node and use PP for cross-node distribution.

Incorrect World Size Calculation

3D parallelism requires world_size = DP × TP × PP; mismatched sizes cause deadlocks. Solution: verify the process group setup, with TP groups within nodes, PP groups across nodes, and DP groups spanning everything.

Memory Fragmentation with ZeRO-3

Frequent all-gather and release of parameters can fragment GPU memory. Solution: use contiguous_gradients=true and tune the prefetch bucket sizes.

Unbalanced Pipeline Stages

Stages with different computational costs create bubbles even with good scheduling. Solution: profile layer times and balance parameters per stage, accounting for attention vs FFN differences.

Advantages & Limitations

Advantages

  • Train models 100× larger than single GPU memory allows
  • Near-linear scaling efficiency with proper configuration
  • Flexibility to trade memory for compute and vice versa
  • Production-proven at scale (GPT-4, Llama, PaLM)
  • Well-supported by PyTorch, DeepSpeed, Megatron-LM
  • CPU/NVMe offloading extends effective memory further

Limitations

  • Significant engineering complexity for 3D parallelism
  • Requires high-bandwidth interconnects for efficiency
  • Debugging distributed training failures is challenging
  • Communication overhead limits scaling efficiency
  • Requires careful hyperparameter tuning for each configuration
  • Different parallelism strategies have different numerical behaviors
