Skip to main content

DataParallel vs DistributedDataParallel

Compare PyTorch DataParallel vs DistributedDataParallel for multi-GPU training. Learn GIL limitations, NCCL AllReduce, and DDP best practices.

DataParallel vs DistributedDataParallel

When scaling PyTorch training to multiple GPUs, you have two main options: DataParallel (DP) and DistributedDataParallel (DDP). While DP is simpler, DDP is almost always the better choice for serious multi-GPU training.

Interactive Visualization

The Core Difference

DataParallel (DP): Single Process

# Simple, but limited model = nn.DataParallel(model) model = model.to('cuda:0') # Primary GPU

DP runs in a single Python process. This means:

  • One Python interpreter
  • One GIL (Global Interpreter Lock)
  • All GPU operations managed by one process

DistributedDataParallel (DDP): Multiple Processes

# More setup, but scales better import torch.distributed as dist dist.init_process_group(backend='nccl') model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

DDP launches one process per GPU. This means:

  • Multiple Python interpreters
  • No GIL contention between GPUs
  • Each process manages its own GPU

Why the GIL Matters

Python's Global Interpreter Lock prevents true parallel execution of Python bytecode:

# With DataParallel (single process) # Thread 1: Forward on GPU 0 ─┐ # Thread 2: Forward on GPU 1 ─┼─► Only ONE thread runs Python at a time! # Thread 3: Forward on GPU 2 ─┤ # Thread 4: Forward on GPU 3 ─┘ # With DDP (multiple processes) # Process 0: Forward on GPU 0 ───► Independent Python interpreter # Process 1: Forward on GPU 1 ───► Independent Python interpreter # Process 2: Forward on GPU 2 ───► Independent Python interpreter # Process 3: Forward on GPU 3 ───► Independent Python interpreter

The GIL is why DP's scaling degrades significantly beyond 2-4 GPUs.

DataParallel Deep Dive

How DP Works

  1. Scatter: Split batch across GPUs
  2. Replicate: Copy model to each GPU (every forward pass!)
  3. Forward: Run forward pass on each GPU
  4. Gather: Collect outputs on GPU 0
  5. Backward: Compute gradients (GIL contention here)
  6. Update: Update model on GPU 0

DP Code

import torch import torch.nn as nn model = YourModel() model = nn.DataParallel(model) model = model.to('cuda:0') for batch in dataloader: inputs = batch['input'].to('cuda:0') # Always to GPU 0 outputs = model(inputs) # Automatically distributed loss = criterion(outputs, targets) loss.backward() optimizer.step()

DP Problems

  1. Model Replication Every Forward Pass

    • Entire model copied to each GPU
    • Significant overhead for large models
  2. Memory Imbalance

    • GPU 0 gathers all outputs
    • GPU 0 uses more memory than others
  3. GIL Bottleneck

    • Python operations serialized
    • Limits parallel efficiency
  4. No Gradient Bucketing

    • Gradients synchronized less efficiently

DistributedDataParallel Deep Dive

How DDP Works

  1. Initialize: Each process loads its own model copy
  2. Data Sharding: Each process gets different data
  3. Forward: Independent forward passes
  4. Backward: Local gradient computation
  5. AllReduce: NCCL synchronizes gradients
  6. Update: Each process updates its own model

DDP Code

import torch import torch.distributed as dist from torch.nn.parallel import DistributedDataParallel as DDP from torch.utils.data.distributed import DistributedSampler def main(rank, world_size): # Initialize process group dist.init_process_group( backend='nccl', init_method='env://', world_size=world_size, rank=rank ) # Create model on this GPU model = YourModel().to(rank) model = DDP(model, device_ids=[rank]) # Use DistributedSampler for data sharding sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank) dataloader = DataLoader(dataset, sampler=sampler, batch_size=32) for epoch in range(epochs): sampler.set_epoch(epoch) # Shuffle differently each epoch for batch in dataloader: inputs = batch['input'].to(rank) outputs = model(inputs) loss = criterion(outputs, targets) loss.backward() # Gradients synced automatically! optimizer.step() dist.destroy_process_group() # Launch with torchrun # torchrun --nproc_per_node=4 train.py

DDP Advantages

  1. No Model Replication

    • Model stays on each GPU
    • Zero per-iteration overhead
  2. Balanced Memory

    • Each GPU has same workload
    • No master GPU bottleneck
  3. Efficient Gradient Sync

    • NCCL ring allreduce
    • Gradient bucketing for overlapped communication
  4. No GIL Contention

    • Separate processes
    • True parallelism

Gradient Synchronization

DDP's Ring AllReduce

DDP uses NCCL's ring allreduce algorithm:

GPU 0 ──► GPU 1 ──► GPU 2 ──► GPU 3 ▲ │ └───────────────────────────────┘

Each GPU:

  1. Sends gradients to neighbor
  2. Receives gradients from other neighbor
  3. After n-1 rounds, all GPUs have the sum

Gradient Bucketing

DDP groups gradients into buckets for overlapped communication:

# DDP starts communication before backward completes! layer_3.backward() ─► Start AllReduce(bucket_3) layer_2.backward() ─► Start AllReduce(bucket_2) # Overlaps with bucket_3 layer_1.backward() ─► Start AllReduce(bucket_1) # Overlaps with bucket_2

Launching DDP

# Single node, 4 GPUs torchrun --nproc_per_node=4 train.py # Multi-node, 4 GPUs each, 2 nodes torchrun --nnodes=2 --nproc_per_node=4 \ --rdzv_id=job1 --rdzv_backend=c10d \ --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \ train.py

Using torch.multiprocessing

import torch.multiprocessing as mp if __name__ == '__main__': world_size = torch.cuda.device_count() mp.spawn(main, args=(world_size,), nprocs=world_size)

Comparison Table

AspectDataParallelDistributedDataParallel
Processes1N (one per GPU)
GIL ImpactBottleneckNone
Model LocationReplicated per forwardStays on each GPU
Memory BalanceGPU 0 heavyBalanced
CommunicationScatter/GatherAllReduce (NCCL)
Gradient SyncImplicitBucketed, overlapped
Scaling2-4 GPUs max100s of GPUs
Code ComplexitySimpleMore setup

When to Use Each

Use DataParallel When:

  • Quick prototyping
  • Single machine, 2-4 GPUs
  • Don't want to modify training loop
  • Small models where replication is cheap

Use DDP When:

  • Production training
  • More than 2-4 GPUs
  • Multi-node training
  • Large models
  • Need best performance

Migration from DP to DDP

# Before (DataParallel) model = nn.DataParallel(model) model.to('cuda:0') for batch in dataloader: outputs = model(batch['input'].to('cuda:0')) loss.backward() optimizer.step() # After (DDP) dist.init_process_group(backend='nccl') model = DDP(model.to(local_rank), device_ids=[local_rank]) sampler = DistributedSampler(dataset) dataloader = DataLoader(dataset, sampler=sampler) for epoch in range(epochs): sampler.set_epoch(epoch) for batch in dataloader: outputs = model(batch['input'].to(local_rank)) loss.backward() # AllReduce happens here optimizer.step()

If you found this explanation helpful, consider sharing it with others.

Mastodon