DataParallel vs DistributedDataParallel
When scaling PyTorch training to multiple GPUs, you have two main options: DataParallel (DP) and DistributedDataParallel (DDP). While DP is simpler, DDP is almost always the better choice for serious multi-GPU training.
The Core Difference
DataParallel (DP): Single Process
```python
# Simple, but limited
model = nn.DataParallel(model)
model = model.to('cuda:0')  # Primary GPU
```
DP runs in a single Python process. This means:
- One Python interpreter
- One GIL (Global Interpreter Lock)
- All GPU operations managed by one process
DistributedDataParallel (DDP): Multiple Processes
```python
# More setup, but scales better
import torch.distributed as dist

dist.init_process_group(backend='nccl')
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```
DDP launches one process per GPU. This means:
- Multiple Python interpreters
- No GIL contention between GPUs
- Each process manages its own GPU
Why the GIL Matters
Python's Global Interpreter Lock prevents true parallel execution of Python bytecode:
```
# With DataParallel (single process)
# Thread 1: Forward on GPU 0 ─┐
# Thread 2: Forward on GPU 1 ─┼─► Only ONE thread runs Python at a time!
# Thread 3: Forward on GPU 2 ─┤
# Thread 4: Forward on GPU 3 ─┘

# With DDP (multiple processes)
# Process 0: Forward on GPU 0 ───► Independent Python interpreter
# Process 1: Forward on GPU 1 ───► Independent Python interpreter
# Process 2: Forward on GPU 2 ───► Independent Python interpreter
# Process 3: Forward on GPU 3 ───► Independent Python interpreter
```
The GIL is why DP's scaling degrades significantly beyond 2-4 GPUs.
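To see the effect in isolation, here is a minimal sketch with no PyTorch or GPUs involved (the busy-work function and worker counts are illustrative assumptions): four CPU-bound tasks run on threads finish roughly one after another because of the GIL, while the same tasks run on processes execute genuinely in parallel.

```python
# Minimal GIL demonstration: CPU-bound work on threads vs. processes.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound_work(n: int) -> int:
    # Pure-Python loop: holds the GIL for its entire duration
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, workers: int, n: int) -> float:
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        list(pool.map(cpu_bound_work, [n] * workers))
    return time.perf_counter() - start

if __name__ == '__main__':
    # Threads: serialized by the GIL, wall time ~ workers * single-task time
    print(f"threads:   {timed(ThreadPoolExecutor, 4, 2_000_000):.2f}s")
    # Processes: independent interpreters, wall time ~ single-task time
    print(f"processes: {timed(ProcessPoolExecutor, 4, 2_000_000):.2f}s")
```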
DataParallel Deep Dive
How DP Works
- Scatter: Split batch across GPUs
- Replicate: Copy model to each GPU (every forward pass!)
- Forward: Run forward pass on each GPU
- Gather: Collect outputs on GPU 0
- Backward: Compute gradients (GIL contention here)
- Update: Update model on GPU 0
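These steps can be written out by hand with PyTorch's low-level parallel primitives. The sketch below is illustrative rather than DataParallel's actual source, and assumes a 4-GPU machine, a `model` already on GPU 0, and an `inputs` batch tensor:

```python
# Hand-written sketch of one DataParallel forward pass (illustrative).
# Assumes `model` lives on cuda:0 and `inputs` is a batch tensor.
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

device_ids = [0, 1, 2, 3]

# 1. Scatter: split the batch along dim 0 across GPUs
input_chunks = scatter(inputs, device_ids)

# 2. Replicate: copy the model to every GPU (repeated on every forward!)
replicas = replicate(model, device_ids[:len(input_chunks)])

# 3. Forward: run each replica on its chunk (threads, so GIL-bound)
outputs = parallel_apply(replicas, input_chunks)

# 4. Gather: concatenate all outputs back onto GPU 0
result = gather(outputs, target_device=0)
```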
DP Code
```python
import torch
import torch.nn as nn

model = YourModel()
model = nn.DataParallel(model)
model = model.to('cuda:0')

for batch in dataloader:
    inputs = batch['input'].to('cuda:0')    # Always to GPU 0
    targets = batch['target'].to('cuda:0')  # Labels also go to GPU 0
    outputs = model(inputs)                 # Automatically distributed
    loss = criterion(outputs, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
DP Problems
- Model Replication Every Forward Pass
  - Entire model copied to each GPU on every iteration
  - Significant overhead for large models
- Memory Imbalance
  - GPU 0 gathers all outputs
  - GPU 0 uses more memory than the other GPUs (see the sketch after this list)
- GIL Bottleneck
  - Python operations serialized
  - Limits parallel efficiency
- No Gradient Bucketing
  - Gradients synchronized less efficiently
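The memory imbalance is easy to observe directly. A quick check, assuming the `model`, `dataloader`, and `criterion` from the DP example above (and that the batch dict carries a `'target'` key), is to print per-GPU allocated memory after one forward/backward pass:

```python
# Observe DataParallel's memory imbalance: GPU 0 typically reports the most.
import torch

batch = next(iter(dataloader))
outputs = model(batch['input'].to('cuda:0'))
loss = criterion(outputs, batch['target'].to('cuda:0'))
loss.backward()

for i in range(torch.cuda.device_count()):
    mb = torch.cuda.memory_allocated(i) / 1024**2
    print(f"GPU {i}: {mb:.0f} MiB allocated")
```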
DistributedDataParallel Deep Dive
How DDP Works
- Initialize: Each process loads its own model copy
- Data Sharding: Each process gets different data
- Forward: Independent forward passes
- Backward: Local gradient computation
- AllReduce: NCCL synchronizes gradients
- Update: Each process updates its own model
DDP Code
```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main(rank, world_size):
    # Initialize process group
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=world_size,
        rank=rank
    )

    # Create model on this GPU
    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # any optimizer works
    criterion = nn.CrossEntropyLoss()  # pick a loss appropriate to the task

    # Use DistributedSampler for data sharding
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for batch in dataloader:
            inputs = batch['input'].to(rank)
            targets = batch['target'].to(rank)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            optimizer.zero_grad()
            loss.backward()   # Gradients synced automatically!
            optimizer.step()

    dist.destroy_process_group()

# Launch with torchrun:
# torchrun --nproc_per_node=4 train.py
```
DDP Advantages
- No Model Replication
  - Model stays on each GPU
  - No per-iteration replication overhead
- Balanced Memory
  - Each GPU has the same workload
  - No master-GPU bottleneck
- Efficient Gradient Sync
  - NCCL ring allreduce
  - Gradient bucketing for overlapped communication
- No GIL Contention
  - Separate processes
  - True parallelism
Gradient Synchronization
DDP's Ring AllReduce
DDP uses NCCL's ring allreduce algorithm:
```
GPU 0 ──► GPU 1 ──► GPU 2 ──► GPU 3
  ▲                             │
  └─────────────────────────────┘
```
Each GPU:
- Sends a chunk of its gradients to the next GPU in the ring
- Receives a chunk from the previous GPU and adds it to its own
- After `n-1` rounds, each GPU holds the complete sum for one chunk; another `n-1` rounds pass the finished chunks around until every GPU has the full summed gradient
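To make the rounds concrete, here is a small pure-Python simulation of ring allreduce over four simulated workers (an illustrative sketch, not NCCL's implementation):

```python
# Pure-Python simulation of ring allreduce with 4 workers.
# Each worker's "gradient" is split into n chunks (here: one float per chunk).
n = 4
# Worker w starts with gradient value (w + 1) in every chunk.
chunks = [[float(w + 1)] * n for w in range(n)]
expected = float(sum(range(1, n + 1)))  # 1 + 2 + 3 + 4 = 10

# Phase 1: reduce-scatter (n - 1 rounds).
# In round s, worker w sends its partial sum of chunk (w - s) % n to the
# next worker, which accumulates it into its own copy.
for s in range(n - 1):
    for w in range(n):
        c = (w - s) % n
        chunks[(w + 1) % n][c] += chunks[w][c]

# Now worker w owns the complete sum of chunk (w + 1) % n.

# Phase 2: allgather (n - 1 more rounds).
# In round s, worker w forwards the finished chunk (w + 1 - s) % n.
for s in range(n - 1):
    for w in range(n):
        c = (w + 1 - s) % n
        chunks[(w + 1) % n][c] = chunks[w][c]

assert all(v == expected for worker in chunks for v in worker)
print(chunks)  # every worker now holds [10.0, 10.0, 10.0, 10.0]
```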
Gradient Bucketing
DDP groups gradients into buckets for overlapped communication:
```
# DDP starts communication before backward completes!
layer_3.backward() ─► Start AllReduce(bucket_3)
layer_2.backward() ─► Start AllReduce(bucket_2)  # Overlaps with bucket_3
layer_1.backward() ─► Start AllReduce(bucket_1)  # Overlaps with bucket_2
```
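Bucket size is configurable through the `bucket_cap_mb` argument of the DDP constructor (the PyTorch default is 25 MB); the value below is only an illustrative choice:

```python
# Tune how much gradient data DDP accumulates into one bucket before
# launching an allreduce. Larger buckets mean fewer, bigger collectives;
# smaller buckets start communication earlier during backward.
model = DDP(
    model.to(rank),
    device_ids=[rank],
    bucket_cap_mb=50,  # illustrative value; PyTorch's default is 25
)
```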
Launching DDP
Using torchrun (Recommended)
```bash
# Single node, 4 GPUs
torchrun --nproc_per_node=4 train.py

# Multi-node: 2 nodes, 4 GPUs each
torchrun --nnodes=2 --nproc_per_node=4 \
    --rdzv_id=job1 --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py
```
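When launched this way, torchrun sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` as environment variables, so the training script reads them from the environment instead of receiving them as function arguments. A sketch of such an entry point (the helper name is ours):

```python
# Entry point for a script launched with torchrun: ranks come from the
# environment rather than from mp.spawn arguments.
import os
import torch
import torch.distributed as dist

def setup_from_env():
    # init_method='env://' reads RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT
    dist.init_process_group(backend='nccl', init_method='env://')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    return local_rank, dist.get_rank(), dist.get_world_size()

if __name__ == '__main__':
    local_rank, rank, world_size = setup_from_env()
    # ... build model, wrap in DDP(device_ids=[local_rank]), train ...
```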
Using torch.multiprocessing
```python
import torch
import torch.multiprocessing as mp

if __name__ == '__main__':
    # MASTER_ADDR and MASTER_PORT must be set in the environment for
    # init_method='env://' to work when spawning processes manually.
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size,), nprocs=world_size)
```
Comparison Table
| Aspect | DataParallel | DistributedDataParallel |
|---|---|---|
| Processes | 1 | N (one per GPU) |
| GIL Impact | Bottleneck | None |
| Model Location | Replicated per forward | Stays on each GPU |
| Memory Balance | GPU 0 heavy | Balanced |
| Communication | Scatter/Gather | AllReduce (NCCL) |
| Gradient Sync | Reduced onto GPU 0 during backward | Bucketed, overlapped AllReduce |
| Scaling | 2-4 GPUs max | 100s of GPUs |
| Code Complexity | Simple | More setup |
When to Use Each
Use DataParallel When:
- Quick prototyping
- Single machine, 2-4 GPUs
- Don't want to modify training loop
- Small models where replication is cheap
Use DDP When:
- Production training
- More than 2-4 GPUs
- Multi-node training
- Large models
- Need best performance
Migration from DP to DDP
```python
# Before (DataParallel)
model = nn.DataParallel(model)
model.to('cuda:0')

for batch in dataloader:
    outputs = model(batch['input'].to('cuda:0'))
    loss = criterion(outputs, batch['target'].to('cuda:0'))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After (DDP)
dist.init_process_group(backend='nccl')
model = DDP(model.to(local_rank), device_ids=[local_rank])
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler)

for epoch in range(epochs):
    sampler.set_epoch(epoch)
    for batch in dataloader:
        outputs = model(batch['input'].to(local_rank))
        loss = criterion(outputs, batch['target'].to(local_rank))
        optimizer.zero_grad()
        loss.backward()   # AllReduce happens here
        optimizer.step()
```
Related Concepts
- DataLoader Pipeline - How data flows to GPUs
- num_workers - Parallel data loading
- NCCL Communication - GPU-to-GPU communication
- Distributed Parallelism - Parallelism strategies
- Python GIL - Why DDP beats DP
