DataParallel vs DistributedDataParallel


When scaling PyTorch training to multiple GPUs, you have two main options: DataParallel (DP) and DistributedDataParallel (DDP). While DP is simpler, DDP is almost always the better choice for serious multi-GPU training.

At a Glance

DataParallel (DP) runs in a single process with a single Python interpreter: the GIL limits parallelism, and the model is replicated to every GPU on every forward pass. DDP avoids all of this by running one process per GPU.

Scaling efficiency (relative speedup):

GPUs | DataParallel | DDP
1    | 1.0x         | 1.0x
2    | 1.6x         | 1.9x
4    | 2.5x         | 3.8x
8    | 3.2x         | 7.5x

DataParallel Issues
  • Single process limited by GIL
  • Model replicated every forward pass
  • GPU 0 memory imbalance (gathers outputs)
  • Poor scaling beyond 2-4 GPUs

DDP Advantages
  • Multi-process, no GIL contention
  • Model stays on each GPU (no replication)
  • Balanced memory usage across GPUs
  • Near-linear scaling to many GPUs

The Core Difference

DataParallel (DP): Single Process

# Simple, but limited
model = nn.DataParallel(model)
model = model.to('cuda:0')  # Primary GPU

DP runs in a single Python process. This means:

  • One Python interpreter
  • One GIL (Global Interpreter Lock)
  • All GPU operations managed by one process

DistributedDataParallel (DDP): Multiple Processes

# More setup, but scales better
import torch.distributed as dist

dist.init_process_group(backend='nccl')
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

DDP launches one process per GPU. This means:

  • Multiple Python interpreters
  • No GIL contention between GPUs
  • Each process manages its own GPU

Why the GIL Matters

Python's Global Interpreter Lock prevents true parallel execution of Python bytecode:

# With DataParallel (single process)
# Thread 1: Forward on GPU 0 ─┐
# Thread 2: Forward on GPU 1 ─┼─► Only ONE thread runs Python at a time!
# Thread 3: Forward on GPU 2 ─┤
# Thread 4: Forward on GPU 3 ─┘

# With DDP (multiple processes)
# Process 0: Forward on GPU 0 ───► Independent Python interpreter
# Process 1: Forward on GPU 1 ───► Independent Python interpreter
# Process 2: Forward on GPU 2 ───► Independent Python interpreter
# Process 3: Forward on GPU 3 ───► Independent Python interpreter

The GIL is why DP's scaling degrades significantly beyond 2-4 GPUs.
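You can see the effect even without PyTorch. A minimal, self-contained sketch: the same CPU-bound Python function run across four threads (DP's model) barely speeds up, while four processes (DDP's model) do.

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n=10_000_000):
    # Pure-Python, CPU-bound work that holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i
    return total

if __name__ == '__main__':
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.time()
        with pool_cls(max_workers=4) as pool:
            list(pool.map(busy, [10_000_000] * 4))
        print(f'{pool_cls.__name__}: {time.time() - start:.2f}s')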

DataParallel Deep Dive

How DP Works

  1. Scatter: Split batch across GPUs
  2. Replicate: Copy model to each GPU (every forward pass!)
  3. Forward: Run forward pass on each GPU
  4. Gather: Collect outputs on GPU 0
  5. Backward: Compute gradients (GIL contention here)
  6. Update: Update model on GPU 0

DP Code

import torch
import torch.nn as nn

model = YourModel()
model = nn.DataParallel(model)
model = model.to('cuda:0')

for batch in dataloader:
    inputs = batch['input'].to('cuda:0')  # Always to GPU 0
    optimizer.zero_grad()
    outputs = model(inputs)               # Automatically distributed
    loss = criterion(outputs, targets)    # targets also moved to 'cuda:0'
    loss.backward()
    optimizer.step()

DP Problems

  1. Model Replication Every Forward Pass

    • Entire model copied to each GPU
    • Significant overhead for large models
  2. Memory Imbalance

    • GPU 0 gathers all outputs
    • GPU 0 uses more memory than others
  3. GIL Bottleneck

    • Python operations serialized
    • Limits parallel efficiency
  4. No Gradient Bucketing

    • Gradients synchronized less efficiently
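The memory imbalance in problem 2 is easy to verify. A quick (hedged) check after a few DataParallel training steps, using PyTorch's per-device allocation counter:

import torch

# GPU 0 typically reports noticeably more allocated memory than the
# other GPUs, since it gathers all outputs and holds the optimizer state.
for i in range(torch.cuda.device_count()):
    mem_gb = torch.cuda.memory_allocated(i) / 1e9
    print(f'GPU {i}: {mem_gb:.2f} GB allocated')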

DistributedDataParallel Deep Dive

How DDP Works

  1. Initialize: Each process loads its own model copy
  2. Data Sharding: Each process gets different data
  3. Forward: Independent forward passes
  4. Backward: Local gradient computation
  5. AllReduce: NCCL synchronizes gradients
  6. Update: Each process updates its own model

DDP Code

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main(rank, world_size):
    # Initialize process group
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=world_size,
        rank=rank
    )

    # Create model on this GPU
    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])

    # Use DistributedSampler for data sharding
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for batch in dataloader:
            inputs = batch['input'].to(rank)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()   # Gradients synced automatically!
            optimizer.step()

    dist.destroy_process_group()

# Launch with torchrun
# torchrun --nproc_per_node=4 train.py

DDP Advantages

  1. No Model Replication

    • Model stays on each GPU
    • Zero per-iteration overhead
  2. Balanced Memory

    • Each GPU has same workload
    • No master GPU bottleneck
  3. Efficient Gradient Sync

    • NCCL ring allreduce
    • Gradient bucketing for overlapped communication
  4. No GIL Contention

    • Separate processes
    • True parallelism

Gradient Synchronization

DDP's Ring AllReduce

DDP uses NCCL's ring allreduce algorithm:

GPU 0 ──► GPU 1 ──► GPU 2 ──► GPU 3
  ▲                             │
  └─────────────────────────────┘

Each GPU:

  1. Sends a chunk of its gradients to one neighbor
  2. Receives a chunk from the other neighbor and adds it to its own
  3. After n-1 reduce-scatter steps and n-1 allgather steps, every GPU holds the full summed gradients
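Conceptually, the synchronization DDP runs during backward is just gradient averaging across processes. A minimal sketch of the manual equivalent (DDP performs the same AllReduce, but bucketed and overlapped with backward rather than one tensor at a time):

import torch.distributed as dist

def manual_grad_sync(model, world_size):
    # Sum each gradient across all ranks, then average.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size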

Gradient Bucketing

DDP groups gradients into buckets for overlapped communication:

# DDP starts communication before backward completes!
layer_3.backward() ─► Start AllReduce(bucket_3)
layer_2.backward() ─► Start AllReduce(bucket_2)   # Overlaps with bucket_3
layer_1.backward() ─► Start AllReduce(bucket_1)   # Overlaps with bucket_2
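The bucket size is tunable via the bucket_cap_mb argument of the DDP constructor (default 25 MB). A sketch of the wrapping line from the earlier DDP example with this knob spelled out:

from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(
    model,                   # already moved to this rank's GPU
    device_ids=[local_rank],
    bucket_cap_mb=25,        # larger buckets: fewer AllReduce calls;
                             # smaller buckets: communication starts earlier
)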

Launching DDP

# Single node, 4 GPUs
torchrun --nproc_per_node=4 train.py

# Multi-node: 2 nodes, 4 GPUs each
torchrun --nnodes=2 --nproc_per_node=4 \
    --rdzv_id=job1 --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py
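torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT in each process's environment, which is exactly what init_method='env://' reads. A typical way for train.py to pick them up (sketch):

import os
import torch
import torch.distributed as dist

# Populated by torchrun for every process it launches
local_rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])

torch.cuda.set_device(local_rank)
dist.init_process_group(backend='nccl', init_method='env://')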

Using torch.multiprocessing

import torch.multiprocessing as mp

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size,), nprocs=world_size)
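One caveat when spawning processes yourself: init_method='env://' still expects MASTER_ADDR and MASTER_PORT to be set, since torchrun is no longer providing them. A minimal sketch, placed before mp.spawn:

import os

# Required by init_method='env://' when torchrun is not doing the launching
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')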

Comparison Table

Aspect           | DataParallel           | DistributedDataParallel
Processes        | 1                      | N (one per GPU)
GIL Impact       | Bottleneck             | None
Model Location   | Replicated per forward | Stays on each GPU
Memory Balance   | GPU 0 heavy            | Balanced
Communication    | Scatter/Gather         | AllReduce (NCCL)
Gradient Sync    | Implicit               | Bucketed, overlapped
Scaling          | 2-4 GPUs max           | 100s of GPUs
Code Complexity  | Simple                 | More setup

When to Use Each

Use DataParallel When:

  • Quick prototyping
  • Single machine, 2-4 GPUs
  • Don't want to modify training loop
  • Small models where replication is cheap

Use DDP When:

  • Production training
  • More than 2-4 GPUs
  • Multi-node training
  • Large models
  • Need best performance

Migration from DP to DDP

# Before (DataParallel)
model = nn.DataParallel(model)
model.to('cuda:0')

for batch in dataloader:
    outputs = model(batch['input'].to('cuda:0'))
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

# After (DDP)
dist.init_process_group(backend='nccl')
model = DDP(model.to(local_rank), device_ids=[local_rank])
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler)

for epoch in range(epochs):
    sampler.set_epoch(epoch)
    for batch in dataloader:
        outputs = model(batch['input'].to(local_rank))
        loss = criterion(outputs, targets)
        loss.backward()   # AllReduce happens here
        optimizer.step()
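One more detail worth carrying over during the migration (a common DDP convention, sketched here as an extension of the loop above): every process holds an identical copy of the model, so only rank 0 should write checkpoints, and the DDP wrapper's .module attribute is unwrapped so the saved state_dict keys match the single-GPU model.

# Save from a single process; .module strips the DDP wrapper so the
# checkpoint loads directly into an unwrapped model later.
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), 'checkpoint.pt')
dist.barrier()  # keep the other ranks from racing ahead of the save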
