DataParallel vs DistributedDataParallel
When scaling PyTorch training to multiple GPUs, you have two main options: DataParallel (DP) and DistributedDataParallel (DDP). While DP is simpler, DDP is almost always the better choice for serious multi-GPU training.
DataParallel Issues
- Single process limited by GIL
- Model replicated every forward pass
- GPU 0 memory imbalance (gathers outputs)
- Poor scaling beyond 2-4 GPUs
DDP Advantages
- Multi-process, no GIL contention
- Model stays on each GPU (no replication)
- Balanced memory usage across GPUs
- Near-linear scaling to many GPUs
The Core Difference
DataParallel (DP): Single Process
```python
# Simple, but limited
model = nn.DataParallel(model)
model = model.to('cuda:0')  # Primary GPU
```
DP runs in a single Python process. This means:
- One Python interpreter
- One GIL (Global Interpreter Lock)
- All GPU operations managed by one process
DistributedDataParallel (DDP): Multiple Processes
```python
# More setup, but scales better
import torch.distributed as dist

dist.init_process_group(backend='nccl')
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```
DDP launches one process per GPU. This means:
- Multiple Python interpreters
- No GIL contention between GPUs
- Each process manages its own GPU
Why the GIL Matters
Python's Global Interpreter Lock prevents true parallel execution of Python bytecode:
```python
# With DataParallel (single process)
# Thread 1: Forward on GPU 0 ─┐
# Thread 2: Forward on GPU 1 ─┼─► Only ONE thread runs Python at a time!
# Thread 3: Forward on GPU 2 ─┤
# Thread 4: Forward on GPU 3 ─┘

# With DDP (multiple processes)
# Process 0: Forward on GPU 0 ───► Independent Python interpreter
# Process 1: Forward on GPU 1 ───► Independent Python interpreter
# Process 2: Forward on GPU 2 ───► Independent Python interpreter
# Process 3: Forward on GPU 3 ───► Independent Python interpreter
```
The GIL is why DP's scaling degrades significantly beyond 2-4 GPUs.
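You can see the effect outside PyTorch with a toy benchmark. The sketch below (the `busy` function and timings are purely illustrative) runs the same CPU-bound loop on a thread pool and on a process pool; the threaded version is serialized by the GIL, much like DP's per-GPU worker threads contend for the interpreter, while the process version runs truly in parallel, as DDP's processes do.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n: int) -> int:
    # CPU-bound Python loop; holds the GIL while it runs
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, workers: int = 4, n: int = 2_000_000) -> float:
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as ex:
        list(ex.map(busy, [n] * workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"threads:   {timed(ThreadPoolExecutor):.2f}s")   # serialized by the GIL
    print(f"processes: {timed(ProcessPoolExecutor):.2f}s")  # true parallelism
```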
DataParallel Deep Dive
How DP Works
1. Scatter: Split the batch across GPUs
2. Replicate: Copy the model to each GPU (every forward pass!)
3. Forward: Run the forward pass on each GPU
4. Gather: Collect outputs on GPU 0
5. Backward: Compute gradients (GIL contention here)
6. Update: Update the model on GPU 0
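For intuition, here is a hedged sketch of roughly what these steps look like when spelled out with the functional primitives `nn.DataParallel` is built on (`torch.nn.parallel.scatter`, `replicate`, `parallel_apply`, `gather`). The `Linear` model and random batch are just placeholders, and this is an illustration, not DataParallel's actual source.

```python
import torch
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

# Placeholder model and batch, purely to illustrate the per-iteration steps
model = torch.nn.Linear(128, 10).to('cuda:0')
batch = torch.randn(64, 128, device='cuda:0')
devices = list(range(torch.cuda.device_count()))

inputs = scatter(batch, devices)                     # 1. split the batch along dim 0
replicas = replicate(model, devices[:len(inputs)])   # 2. copy the model to every GPU (each iteration!)
outputs = parallel_apply(replicas, inputs)           # 3. one forward pass per GPU, one thread each
result = gather(outputs, target_device=0)            # 4. collect all outputs back onto GPU 0
```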
DP Code
```python
import torch
import torch.nn as nn

model = YourModel()
model = nn.DataParallel(model)
model = model.to('cuda:0')

for batch in dataloader:
    inputs = batch['input'].to('cuda:0')    # Always to GPU 0
    targets = batch['target'].to('cuda:0')
    outputs = model(inputs)                 # Automatically distributed
    loss = criterion(outputs, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
DP Problems
- Model replication every forward pass
  - Entire model copied to each GPU
  - Significant overhead for large models
- Memory imbalance
  - GPU 0 gathers all outputs
  - GPU 0 uses more memory than the others
- GIL bottleneck
  - Python operations serialized
  - Limits parallel efficiency
- No gradient bucketing
  - Gradients synchronized less efficiently
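If you want to confirm the memory imbalance on your own setup, a minimal sketch that prints per-GPU allocation after a few training steps (the helper name is ours, not a PyTorch API):

```python
import torch

def print_gpu_memory() -> None:
    # Rough view of per-GPU usage; with DataParallel, cuda:0 is typically
    # noticeably higher because it gathers outputs and holds the loss
    # and optimizer state.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024**2
        reserved = torch.cuda.memory_reserved(i) / 1024**2
        print(f"cuda:{i}: {allocated:.0f} MiB allocated, {reserved:.0f} MiB reserved")
```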
DistributedDataParallel Deep Dive
How DDP Works
1. Initialize: Each process loads its own model copy
2. Data sharding: Each process gets a different slice of the data
3. Forward: Independent forward passes
4. Backward: Local gradient computation
5. AllReduce: NCCL synchronizes gradients across processes
6. Update: Each process updates its own model replica
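To make step 5 concrete, this is roughly what gradient synchronization would look like if you did it by hand after `loss.backward()`. DDP performs the equivalent automatically, bucketed and overlapped with the backward pass; the helper below is illustrative, not DDP's internal code.

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module, world_size: int) -> None:
    # Sum each parameter's gradient across all processes, then divide
    # so every rank ends up with the same averaged gradient.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```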
DDP Code
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main(rank, world_size):
    # Initialize process group
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=world_size,
        rank=rank
    )

    # Create model on this GPU
    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])

    # Use DistributedSampler so each process sees a different shard
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for batch in dataloader:
            inputs = batch['input'].to(rank)
            targets = batch['target'].to(rank)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            optimizer.zero_grad()
            loss.backward()   # Gradients synced automatically!
            optimizer.step()

    dist.destroy_process_group()

# Launch with torchrun:
#   torchrun --nproc_per_node=4 train.py
```
DDP Advantages
- No model replication
  - Model stays on each GPU
  - No per-iteration replication overhead
- Balanced memory
  - Each GPU has the same workload
  - No master-GPU bottleneck
- Efficient gradient sync
  - NCCL ring allreduce
  - Gradient bucketing overlaps communication with the backward pass
- No GIL contention
  - Separate processes
  - True parallelism
Gradient Synchronization
DDP's Ring AllReduce
DDP uses NCCL's ring allreduce algorithm:
```
GPU 0 ──► GPU 1 ──► GPU 2 ──► GPU 3
  ▲                             │
  └─────────────────────────────┘
```
Each GPU:
- Sends a chunk of gradients to its next neighbor in the ring
- Receives a chunk from its previous neighbor
- After 2(n-1) steps (a reduce-scatter followed by an all-gather), every GPU holds the full summed gradients
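To make the arithmetic concrete, here is a toy, pure-Python simulation of those two phases. It stands in for what NCCL does on real GPU buffers; it is not NCCL code, and the chunk layout is simplified to one value per chunk.

```python
def ring_allreduce(data):
    """Simulate a ring allreduce: data[r][c] is GPU r's contribution to
    gradient chunk c. Returns each GPU's final buffer (all identical)."""
    n = len(data)                      # number of GPUs == number of chunks
    chunks = [list(d) for d in data]   # each GPU's local working copy

    # Phase 1: reduce-scatter. After n-1 steps, GPU r owns the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n]) for r in range(n)]
        for src, c, value in sends:
            chunks[(src + 1) % n][c] += value

    # Phase 2: all-gather. Circulate the finished chunks so every GPU
    # ends up with every summed chunk.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n]) for r in range(n)]
        for src, c, value in sends:
            chunks[(src + 1) % n][c] = value

    return chunks

print(ring_allreduce([[1, 2], [10, 20]]))
# [[11, 22], [11, 22]] -- every "GPU" holds the element-wise sum after 2(n-1) steps
```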
Gradient Bucketing
DDP groups gradients into buckets for overlapped communication:
```
# DDP starts communication before backward completes!
layer_3.backward() ─► Start AllReduce(bucket_3)
layer_2.backward() ─► Start AllReduce(bucket_2)   # Overlaps with bucket_3
layer_1.backward() ─► Start AllReduce(bucket_1)   # Overlaps with bucket_2
```
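Bucket size is tunable through the DDP constructor; a hedged sketch, reusing `model` and `local_rank` from the setup code above:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(
    model,
    device_ids=[local_rank],
    bucket_cap_mb=25,              # bucket size in MiB (25 is the default)
    gradient_as_bucket_view=True,  # let gradients alias bucket storage, saving a copy
)
```

Larger buckets mean fewer, larger AllReduce calls; smaller buckets let communication start earlier in the backward pass. Which wins depends on the model, so it is worth profiling.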
Launching DDP
Using torchrun (Recommended)
```bash
# Single node, 4 GPUs
torchrun --nproc_per_node=4 train.py

# Multi-node: 2 nodes, 4 GPUs each
torchrun --nnodes=2 --nproc_per_node=4 \
    --rdzv_id=job1 --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py
```
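Inside `train.py`, each process launched by torchrun can read its identity from the environment variables torchrun sets; a minimal sketch:

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node
global_rank = int(os.environ["RANK"])        # unique rank across all nodes
world_size = int(os.environ["WORLD_SIZE"])   # total number of processes

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")      # picks up rank/world size from the env
```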
Using torch.multiprocessing
```python
import torch
import torch.multiprocessing as mp

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size,), nprocs=world_size)
```
Comparison Table
| Aspect | DataParallel | DistributedDataParallel |
|---|---|---|
| Processes | 1 | N (one per GPU) |
| GIL Impact | Bottleneck | None |
| Model Location | Replicated per forward | Stays on each GPU |
| Memory Balance | GPU 0 heavy | Balanced |
| Communication | Scatter/Gather | AllReduce (NCCL) |
| Gradient Sync | Implicit | Bucketed, overlapped |
| Scaling | 2-4 GPUs max | 100s of GPUs |
| Code Complexity | Simple | More setup |
When to Use Each
Use DataParallel When:
- Quick prototyping
- Single machine, 2-4 GPUs
- Don't want to modify training loop
- Small models where replication is cheap
Use DDP When:
- Production training
- More than 2-4 GPUs
- Multi-node training
- Large models
- Need best performance
Migration from DP to DDP
```python
# Before (DataParallel)
model = nn.DataParallel(model)
model.to('cuda:0')

for batch in dataloader:
    outputs = model(batch['input'].to('cuda:0'))
    loss.backward()
    optimizer.step()

# After (DDP)
dist.init_process_group(backend='nccl')
model = DDP(model.to(local_rank), device_ids=[local_rank])
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler)

for epoch in range(epochs):
    sampler.set_epoch(epoch)
    for batch in dataloader:
        outputs = model(batch['input'].to(local_rank))
        loss.backward()   # AllReduce happens here
        optimizer.step()
```
Related Concepts
- DataLoader Pipeline - How data flows to GPUs
- num_workers - Parallel data loading
- NCCL Communication - GPU-to-GPU communication
- Distributed Parallelism - Parallelism strategies
- Python GIL - Why DDP beats DP
