Distributed Parallelism in Deep Learning

GPU distributed parallelism: Data Parallel (DDP), Tensor Parallel, Pipeline Parallel, and ZeRO optimization for training large AI models.


Overview

Modern deep learning models have grown too large to fit on a single GPU. Training GPT-4 or Llama 70B requires distributing computation across hundreds or thousands of GPUs. Distributed parallelism provides three orthogonal strategies to scale training: Data Parallel (split the batch), Tensor Parallel (split the layers), and Pipeline Parallel (split the model stages).

Understanding when to use each strategy—and how to combine them in 3D parallelism—is essential for efficient large-scale training. This guide covers the fundamentals of each approach, their trade-offs, and practical implementation with PyTorch and DeepSpeed.

Key Concepts

Data Parallelism (DDP)

Replicate the full model on each GPU and split the training batch. All-reduce gradients after backward pass to keep models synchronized.

Tensor Parallelism (TP)

Split individual weight matrices across GPUs. Each GPU computes partial results that are combined via all-reduce or all-gather per layer.

Pipeline Parallelism (PP)

Partition the model into sequential stages. Each GPU processes a different set of layers, passing activations forward through the pipeline.

3D Parallelism

Combine DP, TP, and PP to scale to thousands of GPUs. TP within nodes (NVLink), PP across nodes, DP for throughput scaling.

ZeRO Optimization

Partition optimizer states, gradients, and parameters across data-parallel ranks to reduce memory footprint without changing the parallelism model.

Communication Collectives

All-reduce (average gradients), all-gather (collect distributed data), reduce-scatter (scatter reduced results) form the backbone of distributed training.
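These collectives map directly onto torch.distributed calls. Below is a minimal sketch (not from the original article) of all three on a single multi-GPU node; it assumes the script is launched with torchrun so that rank and world-size environment variables are set.

```python
# Sketch: the three collectives via torch.distributed (NCCL), single node.
import torch
import torch.distributed as dist

def demo_collectives() -> None:
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)                  # one GPU per process (single node)
    device = torch.device("cuda", rank)

    # All-reduce: every rank ends up with the sum; divide by world for the mean.
    grad = torch.full((4,), float(rank), device=device)
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= world

    # All-gather: every rank collects one shard from every other rank.
    shard = torch.full((4,), float(rank), device=device)
    gathered = [torch.empty_like(shard) for _ in range(world)]
    dist.all_gather(gathered, shard)

    # Reduce-scatter: each rank receives one chunk of the summed result.
    chunks = [torch.ones(4, device=device) for _ in range(world)]
    out = torch.empty(4, device=device)
    dist.reduce_scatter(out, chunks, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    demo_collectives()   # launch with: torchrun --nproc_per_node=<gpus> this_file.py
```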

Three Dimensions of Parallelism

[Figure: three-panel overview. Data parallelism splits the batch (each GPU holds a full model copy; gradients are all-reduced after backward). Tensor parallelism splits each layer's weight matrices across GPUs (all-reduce per layer; needs NVLink-class bandwidth; memory is roughly 1/N per GPU). Pipeline parallelism splits the layers into stages (micro-batches keep GPUs busy; bubbles are idle time between stages; memory is roughly 1/P of the layers per GPU).]

Data Parallel

Same model on all GPUs. Split batch, sync gradients. Simple and efficient when model fits in memory.

Tensor Parallel

Split weight matrices across GPUs. High communication overhead—needs NVLink within nodes.

Pipeline Parallel

Split model by layers. Point-to-point communication. Use micro-batching to reduce bubble time.

Data Parallelism: Split the Batch

Data parallelism is the simplest and most widely used strategy. Each GPU holds a complete copy of the model and processes a different portion of the training batch. After computing local gradients, GPUs synchronize via all-reduce to average gradients before updating weights.

DDP Training Flow

[Figure: four GPUs (ranks 0-3), each holding a full model copy and running its slice of the batch (samples 0-8, 8-16, 16-24, 24-32) through forward, backward, and gradient computation.]

  • Load data: each GPU loads its portion of the batch.
  • Synchronous training: all GPUs must complete their batch before gradients are synchronized; the slowest GPU determines throughput.
  • Gradient averaging: all-reduce computes the mean gradient across all replicas, ensuring all models stay synchronized.

When to Use Data Parallelism

  • Model fits comfortably in single GPU memory
  • Need to increase training throughput
  • Simple setup with minimal code changes

PyTorch DistributedDataParallel Setup

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group with NCCL backend
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

# Create model and wrap with DDP
model = MyModel().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# Training loop - gradients are automatically synchronized
for batch in dataloader:
    inputs, labels = batch
    inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
    loss = criterion(model(inputs), labels)
    loss.backward()        # All-reduce happens here
    optimizer.step()
    optimizer.zero_grad()
```
Key DDP Behaviors
  • Synchronous training: All GPUs must complete before the next iteration
  • Gradient averaging: All-reduce divides by world size automatically
  • Bucket fusion: Small gradients are grouped for efficient communication
  • Find unused parameters: Set find_unused_parameters=True for dynamic graphs
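The snippet above leaves out how each rank receives its slice of the batch. A minimal sketch of the usual approach with DistributedSampler follows; the toy dataset and the train.py launch command are placeholders, and it assumes the process group from the DDP setup above is already initialized.

```python
# Sketch: shard the dataset across ranks so each GPU sees a disjoint slice.
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))  # toy data

sampler = DistributedSampler(dataset, shuffle=True)        # one shard per rank
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)    # reshuffle so shards differ between epochs
    for inputs, labels in loader:
        pass                    # forward/backward exactly as in the DDP loop above

# Typical launch, one process per GPU on a 4-GPU node:
#   torchrun --nproc_per_node=4 train.py
```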

Tensor Parallelism: Split Within Layers

When a single layer is too large for one GPU, tensor parallelism splits weight matrices horizontally. Each GPU performs part of the matrix multiplication, then combines results via collective communication.

Tensor Parallel Matrix Splitting

[Figure: column-parallel split of a [4096, 16384] linear layer across two GPUs. The input X [batch, 4096] is identical on both GPUs; each holds a [4096, 8192] weight shard and computes a partial output Yᵢ = X @ Aᵢ of shape [batch, 8192]; an all-gather concatenates [Y₀ | Y₁] into the full [batch, 16384] output.]
Column Parallel (Split Output Features)
  • Weight matrix split along columns (output dimension)
  • Each GPU computes partial output features
  • All-Gather concatenates results
  • Used in: MLP first linear layer, attention QKV projection
Row Parallel (Split Input Features)
  • Weight matrix split along rows (input dimension)
  • Input must already be partitioned across GPUs
  • All-Reduce sums partial outputs
  • Used in: MLP second linear layer, attention output projection
Why Pair Column + Row Parallel?

In transformer MLPs, the first linear layer uses column parallelism (splitting the hidden dimension) and the second uses row parallelism (whose input is already partitioned). Because the column-parallel output is already split exactly the way the row-parallel layer expects, no communication is needed between the two layers; the whole MLP block requires only a single all-reduce at the end of the forward pass.

Column vs Row Parallel

In transformer MLPs, we pair column parallelism (first linear) with row parallelism (second linear):

  1. Column Parallel: split the output dimension; each GPU computes a partial slice of the features. On its own, recombining them would require an all-gather.
  2. Row Parallel: split the input dimension (already partitioned by the previous layer); each GPU computes a partial sum, combined with an all-reduce.

Chained together, the intermediate all-gather disappears: the column-parallel output feeds the row-parallel layer directly, leaving a single all-reduce per MLP block in the forward pass.

Column vs Row Parallel Deep Dive

In tensor parallelism, we split weight matrices across GPUs. But there are exactly two ways to slice a matrix: by columns or by rows. These aren't just different cuts—they have fundamentally different properties for how inputs must be distributed and how outputs must be combined.

Column Parallel: Split by Columns

In column-parallel mode, we slice the weight matrix vertically—each GPU gets a subset of the columns. Since each column of W produces one element of the output, each GPU computes a portion of the output features.


The Key Property

Input is replicated (same X on all GPUs), output is split (each GPU has part of Y). To get the full output, we must concatenate the partial results—this requires an all-gather operation.

[Figure: column-parallel walkthrough with TP = 4. Step 1: replicate X [B, S, H] on all GPUs. Step 2: partition W by columns into W₀…W₃. Step 3: each GPU computes Yᵢ = X · Wᵢ. Step 4: each GPU holds one quarter of the output features. Step 5: an all-gather concatenates [Y₀ | Y₁ | Y₂ | Y₃] along the last dimension, so every GPU ends up with the complete output Y.]

Column-Parallel Mathematics

If W has shape [H_in, H_out] and we use TP = N GPUs:

W = [W₀ | W₁ | ... | W_N-1] where each Wᵢ has shape [H_in, H_out/N]

Each GPU i computes:

Yᵢ = X · Wᵢ → shape [B, S, H_out/N]

Final output requires concatenation:

Y = AllGather([Y₀, Y₁, ..., Y_N-1]) → shape [B, S, H_out]
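The column-parallel algebra can be checked in a few lines on a single process; torch.chunk plays the role of the per-GPU shards and concatenation stands in for the all-gather. The shapes below are illustrative.

```python
# Single-process check: splitting W by columns and concatenating the partial
# outputs ("all-gather") reproduces the full matmul.
import torch

B, S, H_in, H_out, N = 2, 16, 64, 256, 4
X = torch.randn(B, S, H_in)                  # replicated input
W = torch.randn(H_in, H_out)

W_shards = torch.chunk(W, N, dim=1)          # each [H_in, H_out/N]
Y_partials = [X @ Wi for Wi in W_shards]     # each [B, S, H_out/N]
Y = torch.cat(Y_partials, dim=-1)            # concat = all-gather

assert torch.allclose(Y, X @ W, atol=1e-5)
```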

Row Parallel: Split by Rows

In row-parallel mode, we slice the weight matrix horizontally—each GPU gets a subset of the rows. Since each row of W corresponds to one input feature, the input X must also be split. Each GPU computes a partial sum that must be combined.


The Key Property

Input is split (each GPU gets part of X along the last dimension), and the output needs summation. Each GPU produces a partial result with the same shape; to get the final output, we must sum them via all-reduce.

[Figure: row-parallel walkthrough with TP = 4. Step 1: split X along its last dimension into X₀…X₃. Step 2: partition W by rows into W₀…W₃. Step 3: each GPU computes Yᵢ = Xᵢ · Wᵢ, one term of Y = X₀·W₀ + X₁·W₁ + X₂·W₂ + X₃·W₃. Step 4: all partial outputs share the shape [B, S, H_out] but hold partial sums. Step 5: an all-reduce (sum) gives every GPU the complete Y.]

Row-Parallel Mathematics

If W has shape [H_in, H_out] and we use TP = N GPUs:

W = [W₀; W₁; ...; W_N-1] where each Wᵢ has shape [H_in/N, H_out]

Input X must also be split (along last dimension):

X = [X₀ | X₁ | ... | X_N-1] where each Xᵢ has shape [B, S, H_in/N]

Each GPU i computes a partial product:

Yᵢ = Xᵢ · Wᵢ → shape [B, S, H_out]

Final output requires summation:

Y = AllReduce(Y₀ + Y₁ + ... + Y_N-1) → shape [B, S, H_out]
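The same kind of single-process check works for row parallelism; here summing the partial products plays the role of the all-reduce. Shapes are again illustrative.

```python
# Single-process check: splitting X and W along the contracted dimension and
# summing the partial products ("all-reduce") reproduces the full matmul.
import torch

B, S, H_in, H_out, N = 2, 16, 256, 64, 4
X = torch.randn(B, S, H_in)
W = torch.randn(H_in, H_out)

X_shards = torch.chunk(X, N, dim=-1)         # each [B, S, H_in/N]
W_shards = torch.chunk(W, N, dim=0)          # each [H_in/N, H_out]
Y = sum(Xi @ Wi for Xi, Wi in zip(X_shards, W_shards))  # sum = all-reduce

assert torch.allclose(Y, X @ W, atol=1e-4)
```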

The Critical Difference

| Aspect | Column Parallel | Row Parallel |
|---|---|---|
| Weight split direction | Vertically (by columns) | Horizontally (by rows) |
| Input requirement | Replicated (same on all GPUs) | Split along last dimension |
| Partial output shape | [B, S, H_out/N], different features | [B, S, H_out], partial sums |
| Communication op | All-Gather (concatenate) | All-Reduce (sum) |
| Output state | Replicated | Replicated |
| Comm volume | ~M bytes | ~2M bytes (reduce-scatter + all-gather) |

The Perfect Pair: Why They're Complementary

Here's the magic insight that makes Megatron-style tensor parallelism work: column-parallel output is already split in exactly the format row-parallel needs as input!

[Figure: Megatron-style MLP block. The replicated input X [B, S, H] passes through the column-parallel FC1 (W₁: [H, 4H]) and GeLU, the hidden state stays split with no communication, then flows into the row-parallel FC2 (W₂: [4H, H]). The single all-reduce that sums the row-parallel partial outputs is the only communication in the entire MLP forward pass, yielding the replicated output Y [B, S, H].]
Why This Pairing is Brilliant
  • Column-parallel output state: Each GPU has [B, S, H/N] — split along last dim
  • Row-parallel input requirement: Needs [B, S, H/N] — split along last dim
  • Result: We can feed column-parallel output directly to row-parallel input with zero communication!
  • Total comm: Just 1 all-reduce at the end (after row-parallel), not 2 separate operations
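Putting the two halves together, here is a single-process sketch of the column-then-row MLP pattern described above. The dimensions and the per-shard loop are illustrative; the final sum is the only step that would be an all-reduce on real GPUs.

```python
# Sketch: Megatron-style MLP, column-parallel FC1 -> GeLU -> row-parallel FC2,
# with no communication in between and a single "all-reduce" (sum) at the end.
import torch
import torch.nn.functional as F

B, S, H, N = 2, 16, 64, 4
X  = torch.randn(B, S, H)                       # replicated input
W1 = torch.randn(H, 4 * H)                      # FC1 (up projection)
W2 = torch.randn(4 * H, H)                      # FC2 (down projection)

W1_cols = torch.chunk(W1, N, dim=1)             # column-parallel shards of FC1
W2_rows = torch.chunk(W2, N, dim=0)             # row-parallel shards of FC2

# Per-"GPU" work: FC1's split output is exactly the split input FC2 expects.
partials = [F.gelu(X @ W1_cols[i]) @ W2_rows[i] for i in range(N)]
Y = sum(partials)                               # the one all-reduce

assert torch.allclose(Y, F.gelu(X @ W1) @ W2, atol=1e-3)
```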

Communication Volume Analysis

ALL-GATHER (Column Parallel)

Each GPU sends: M/N bytes

Each GPU receives: M × (N-1)/N bytes

Total per GPU: ≈ M bytes

ALL-REDUCE (Row Parallel)

Reduce-scatter: M × (N-1)/N

All-gather: M × (N-1)/N

Total per GPU: ≈ 2M bytes

PRACTICAL IMPLICATION

All-reduce costs roughly twice the bandwidth of all-gather. This is why the Megatron pattern places the row-parallel layer at the end of each block, where we need replicated output anyway (for residual connections).
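For a feel of the absolute numbers, the formulas above can be plugged into a short helper; the batch, sequence length, and hidden size below are illustrative, and M is the size of the full activation tensor in bytes.

```python
# Rough per-GPU communication volume for all-gather vs all-reduce of a
# [batch, seq, hidden] activation in fp16 (2 bytes), ignoring latency/overlap.
def comm_volume_gb(batch, seq, hidden, tp, dtype_bytes=2):
    M = batch * seq * hidden * dtype_bytes
    all_gather = M * (tp - 1) / tp          # ≈ M bytes
    all_reduce = 2 * M * (tp - 1) / tp      # reduce-scatter + all-gather ≈ 2M bytes
    return all_gather / 1e9, all_reduce / 1e9

ag, ar = comm_volume_gb(batch=8, seq=4096, hidden=8192, tp=8)
print(f"all-gather ≈ {ag:.2f} GB/GPU, all-reduce ≈ {ar:.2f} GB/GPU")
```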

Column vs Row: Two Sides of the Same Coin

Column Parallel: Keep input whole, split output features. Good for the "up projection" where we expand dimensions.

Row Parallel: Split input features, keep output whole. Good for the "down projection" where we reduce dimensions.

Together: They form a perfect pair. Column-parallel's split output is exactly what row-parallel needs as split input. Chain them to eliminate intermediate communication. This is the foundation of all modern tensor parallel implementations.

High Bandwidth Required

Tensor parallelism requires communication after every layer, not just after the backward pass. This demands high-bandwidth interconnects like NVLink (600+ GB/s). Using TP across nodes with only InfiniBand (100-400 GB/s) will severely bottleneck training.

Pipeline Parallelism: Split by Stages

Pipeline parallelism partitions the model into sequential stages, with each GPU responsible for a subset of layers. Micro-batching keeps GPUs busy by processing multiple smaller batches simultaneously.

Pipeline Schedules: GPipe vs 1F1B

[Figure: GPipe schedule across 4 GPUs and 4 micro-batches. Each stage runs forward passes F0-F3, then backward passes B3-B0; the idle slots before each stage's work begins and ends form the pipeline bubble.]
GPipe

Run all micro-batch forward passes first, then all backward passes. Micro-batching already improves on the naive schedule, but idle slots remain: with p = 4 pipeline stages and m = 4 micro-batches, the bubble ratio is (p-1)/m = 0.75 relative to useful compute, which works out to roughly 60% efficiency.

The Bubble Problem

In naive pipelining, GPUs sit idle waiting for activations from previous stages. This creates "bubbles" of wasted compute time. The bubble ratio depends on the schedule:

  • Naive: ~75% bubble time with 4 stages
  • GPipe: (p-1)/m bubble ratio (p=stages, m=micro-batches)
  • 1F1B: (p-1)/(m+p-1) bubble ratio, plus lower memory usage
1F1B Memory Advantage

1F1B (One Forward One Backward) interleaves forward and backward passes, allowing earlier activation release. GPipe stores m activations per stage until backward begins. With p stages, 1F1B uses only p activations in flight regardless of micro-batch count.
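The bubble formulas above are easy to tabulate; the snippet below simply evaluates them for a 4-stage pipeline and a few micro-batch counts.

```python
# Evaluate the bubble ratios quoted above: GPipe (p-1)/m vs 1F1B (p-1)/(m+p-1).
def gpipe_bubble(p, m):
    return (p - 1) / m

def one_f_one_b_bubble(p, m):
    return (p - 1) / (m + p - 1)

p = 4
for m in (4, 8, 16, 32):
    print(f"p={p}, m={m:>2}: GPipe bubble = {gpipe_bubble(p, m):.2f}, "
          f"1F1B bubble = {one_f_one_b_bubble(p, m):.2f}")
# More micro-batches per stage shrink the bubble for both schedules.
```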

3D Parallelism: Combining All Three

For training the largest models (GPT-3, Llama 70B+), we combine all three strategies. The key insight is to match parallelism type to interconnect bandwidth:

3D Parallelism Configuration

An example 2 × 2 × 2 layout (8 GPUs):

  • DP = 2: split the batch across replicas
  • TP = 2: split layers within a node
  • PP = 2: split the model into stages

[Figure: the 8 GPUs arranged as a cube with TP, PP, and DP as its three axes.]
TP Communication

Highest bandwidth needed. Use NVLink within node (600+ GB/s). All-reduce/all-gather per layer.

PP Communication

Point-to-point between stages. Moderate bandwidth needed. Can span nodes via InfiniBand.

DP Communication

Gradient sync once per iteration. Can span multiple nodes. All-reduce over network.
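One way to realize this layout in PyTorch is a 3-D device mesh, with TP as the innermost dimension so its ranks share a node. A minimal sketch, assuming a recent PyTorch with torch.distributed.device_mesh and an 8-process job launched with torchrun:

```python
# Sketch: carve 8 ranks into DP x PP x TP = 2 x 2 x 2 process groups, with TP
# innermost so its ranks share a node (NVLink), PP across nodes, DP outermost.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group(backend="nccl")
mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "pp", "tp"))

tp_group = mesh["tp"].get_group()   # all-reduce/all-gather per layer
pp_group = mesh["pp"].get_group()   # point-to-point between pipeline stages
dp_group = mesh["dp"].get_group()   # gradient all-reduce once per iteration
```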

Example: Training Llama 70B on 64 GPUs
```bash
# Megatron-LM style configuration
TP=8   # 8-way tensor parallel within DGX node (NVLink)
PP=4   # 4-way pipeline parallel across nodes
DP=2   # 2-way data parallel for larger batches
# Total: 8 × 4 × 2 = 64 GPUs
```

Communication Hierarchy

| Dimension | Communication Pattern | Frequency | Best Interconnect |
|---|---|---|---|
| TP | All-reduce/all-gather per layer | Very high | NVLink (600+ GB/s) |
| PP | Point-to-point between stages | Medium | InfiniBand (200+ GB/s) |
| DP | All-reduce gradients per iteration | Low | Network (any bandwidth) |

Megatron-LM 3D Parallelism Configuration

```bash
# Training Llama 70B on 64 H100 GPUs (8 nodes × 8 GPUs)
# TP=8 within each node (NVLink), PP=4 across nodes;
# DP=2 follows from the world size: 64 / (8 × 4) = 2, for larger batches.
python train.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 4 \
    --micro-batch-size 1 \
    --global-batch-size 1024 \
    --num-layers 80 \
    --hidden-size 8192 \
    --num-attention-heads 64
# Total: 8 × 4 × 2 = 64 GPUs
```

ZeRO: Memory-Efficient Data Parallelism

ZeRO (Zero Redundancy Optimizer) reduces memory footprint by partitioning optimizer states, gradients, and parameters across data-parallel ranks instead of replicating them.

ZeRO Memory Optimization Stages

[Figure: per-GPU memory for a model with Φ parameters, partitioned across 4 data-parallel ranks. Standard DDP: 16Φ (12Φ optimizer states + 2Φ gradients + 2Φ parameters). ZeRO-1: 7Φ. ZeRO-2: 5.5Φ. ZeRO-3: 4Φ.]

  • Standard DDP: full replication on every GPU
  • ZeRO-1: partition optimizer states
  • ZeRO-2: partition optimizer states and gradients
  • ZeRO-3: partition optimizer states, gradients, and parameters

Communication Tradeoff

Each ZeRO stage reduces memory at the cost of more communication. ZeRO-1 has minimal overhead (optimizer states rarely accessed). ZeRO-3 requires all-gather for every forward pass, adding ~50% communication overhead but enabling training of models 8× larger.

DeepSpeed ZeRO-3 Config
{ "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu" }, "offload_param": { "device": "cpu" }, "overlap_comm": true, "contiguous_gradients": true } }

Memory Breakdown

For a model with Φ parameters using Adam optimizer with mixed precision:

| Component | Memory (per GPU) | Notes |
|---|---|---|
| FP16 Parameters | 2Φ | Model weights |
| FP16 Gradients | 2Φ | Computed during backward |
| FP32 Optimizer States | 12Φ | Adam: params + momentum + variance |
| Total | 16Φ | Standard DDP per GPU |

ZeRO Stages

  • ZeRO-1: Partition optimizer states → 4Φ per GPU (4× memory reduction)
  • ZeRO-2: + Partition gradients → 2.5Φ per GPU (8× memory reduction)
  • ZeRO-3: + Partition parameters → ~1Φ per GPU (16× memory reduction)
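The reductions in this list follow directly from the 16Φ breakdown; a small helper makes the arithmetic explicit for different data-parallel degrees (the dp values below are illustrative, and dp = 4 reproduces the per-GPU figures in the diagram above).

```python
# Bytes per parameter on each GPU: 2 (fp16 params) + 2 (fp16 grads) + 12 (fp32
# Adam states), each divided by dp once its ZeRO stage partitions it.
def zero_bytes_per_param(stage, dp):
    params, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= dp
    if stage >= 2:
        grads /= dp
    if stage >= 3:
        params /= dp
    return params + grads + optim

for dp in (4, 16, 64):
    row = ", ".join(f"stage {s}: {zero_bytes_per_param(s, dp):.2f}" for s in range(4))
    print(f"dp={dp:>2} -> {row}   (multiples of the parameter count Φ)")
```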

DeepSpeed ZeRO-3 Configuration

{ "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto" }, "fp16": { "enabled": true, "loss_scale_window": 100 } }
ZeRO vs FSDP

PyTorch FSDP (Fully Sharded Data Parallel) implements ZeRO-3 natively in PyTorch. DeepSpeed ZeRO offers more configuration options and offloading capabilities, while FSDP provides tighter integration with the PyTorch ecosystem.
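For comparison with the DeepSpeed config, here is a minimal FSDP sketch in native PyTorch; it assumes a torchrun-launched job and reuses the MyModel placeholder from earlier.

```python
# Sketch: ZeRO-3-style sharding with PyTorch FSDP.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = FSDP(
    MyModel(),
    device_id=local_rank,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# The training loop looks like plain DDP; FSDP all-gathers and re-shards
# parameters around each forward/backward pass automatically.
```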

Choosing the Right Strategy

Parallelism Decision Tree

Does the model fit in a single GPU's memory?

  • Yes, and you need more throughput: DDP.
  • No, but it fits within one node: Tensor Parallel over NVLink.
  • No, and it must span nodes: Pipeline Parallel, split by layers.
  • Memory is the constraint but you want to stay data-parallel: ZeRO/FSDP.

Strategy Comparison

| Strategy | Memory Efficiency | Communication | Complexity | Best For |
|---|---|---|---|---|
| DDP | Low (full replication) | Low (once per iter) | Simple | Small models, throughput scaling |
| ZeRO-1/2 | Medium | Low-Medium | Moderate | Memory-constrained DDP |
| ZeRO-3/FSDP | High | High | Moderate | Very large models, simple API |
| TP | High | Very High | Complex | Large models within node |
| PP | High | Medium | Complex | Cross-node, layer-wise split |
| 3D Parallel | Highest | Optimized | Most Complex | Largest models (100B+) |

Real-World Applications

Fine-tuning LLMs

Fine-tuning a 7B model on a handful of GPUs: ZeRO-3 + gradient checkpointing on 4×A100 40GB.

Pre-training Foundation Models

Training models from scratch at massive scale: 3D parallelism with TP=8, PP=8, DP=16 on 1024 H100s.

Multi-GPU Inference

Serving large models that exceed single-GPU memory: tensor parallelism for Llama 70B across 8 GPUs.

Research with Limited Resources

Training large models on university clusters: ZeRO-2 with CPU offloading on 8 V100s.

Production Training Pipelines

Consistent, reproducible large-scale training: DDP with gradient accumulation for batch-size flexibility.

Mixture-of-Experts Training

Training sparse MoE models: expert parallelism combined with DP for the MoE layers.

Best Practices

  • Start with DDP: Begin with data parallelism and only add complexity when needed. DDP is well-tested and has minimal overhead.
  • Match Parallelism to Interconnect: Use TP within nodes (NVLink), PP across nodes (InfiniBand), and DP anywhere. Never use TP across nodes without NVSwitch.
  • Profile Before Optimizing: Use NVIDIA Nsight Systems to identify whether you are compute-bound or communication-bound before adding parallelism.
  • Tune Micro-batch Size: For pipeline parallelism, use at least 4× as many micro-batches as stages to minimize bubble overhead.
  • Enable Gradient Checkpointing: Trade compute for memory by recomputing activations during backward pass. Essential for very deep models.
  • Use Mixed Precision: FP16/BF16 training halves memory usage and doubles throughput on Tensor Cores. Always enable when possible; a short sketch combining this with gradient checkpointing follows below.
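A minimal sketch of the last two practices together, bf16 autocast plus activation checkpointing; model.blocks, dataloader, and optimizer are placeholders for your own training setup.

```python
# Sketch: bf16 autocast + activation checkpointing (recompute in backward).
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(model, x):
    for block in model.blocks:                         # placeholder module list
        x = checkpoint(block, x, use_reentrant=False)  # recompute instead of storing
    return x

for inputs, labels in dataloader:
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = forward_with_checkpointing(model, inputs.cuda())
        loss = F.cross_entropy(logits, labels.cuda())
    loss.backward()          # bf16 needs no GradScaler; fp16 would
    optimizer.step()
    optimizer.zero_grad()
```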

Common Pitfalls to Avoid

Tensor Parallel Across Nodes

Using TP across nodes without NVLink results in 10-30× slower layer computation due to InfiniBand latency. Solution: keep TP within a single DGX node and use PP for cross-node distribution.

Incorrect World Size Calculation

3D parallelism requires world_size = DP × TP × PP; mismatched sizes cause deadlocks. Solution: verify the process group setup, with TP groups within nodes, PP groups across nodes, and DP groups spanning everything.

Memory Fragmentation with ZeRO-3

Frequent all-gather and release of parameters can fragment GPU memory. Solution: use contiguous_gradients=true and tune the prefetch bucket sizes.

Unbalanced Pipeline Stages

Stages with different computational costs create bubbles even with good scheduling. Solution: profile layer times and balance parameters per stage, accounting for attention vs FFN differences.

Advantages & Limitations

Advantages

  • Train models 100× larger than single GPU memory allows
  • Near-linear scaling efficiency with proper configuration
  • Flexibility to trade memory for compute and vice versa
  • Production-proven at scale (GPT-4, Llama, PaLM)
  • Well-supported by PyTorch, DeepSpeed, Megatron-LM
  • CPU/NVMe offloading extends effective memory further

Limitations

  • Significant engineering complexity for 3D parallelism
  • Requires high-bandwidth interconnects for efficiency
  • Debugging distributed training failures is challenging
  • Communication overhead limits scaling efficiency
  • Requires careful hyperparameter tuning for each configuration
  • Different parallelism strategies have different numerical behaviors
