DataParallel vs DistributedDataParallel
When scaling PyTorch training to multiple GPUs, you have two main options: DataParallel (DP) and DistributedDataParallel (DDP). While DP is simpler, DDP is almost always the better choice for serious multi-GPU training.
The Core Difference
DataParallel (DP): Single Process
```python
# Simple, but limited
model = nn.DataParallel(model)
model = model.to('cuda:0')  # Primary GPU
```
DP runs in a single Python process. This means:
- One Python interpreter
- One GIL (Global Interpreter Lock)
- All GPU operations managed by one process
DistributedDataParallel (DDP): Multiple Processes
```python
# More setup, but scales better
import torch.distributed as dist

dist.init_process_group(backend='nccl')
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```
DDP launches one process per GPU. This means:
- Multiple Python interpreters
- No GIL contention between GPUs
- Each process manages its own GPU
Why the GIL Matters
Python's Global Interpreter Lock prevents true parallel execution of Python bytecode:
```
# With DataParallel (single process)
# Thread 1: Forward on GPU 0 ─┐
# Thread 2: Forward on GPU 1 ─┼─► Only ONE thread runs Python at a time!
# Thread 3: Forward on GPU 2 ─┤
# Thread 4: Forward on GPU 3 ─┘

# With DDP (multiple processes)
# Process 0: Forward on GPU 0 ───► Independent Python interpreter
# Process 1: Forward on GPU 1 ───► Independent Python interpreter
# Process 2: Forward on GPU 2 ───► Independent Python interpreter
# Process 3: Forward on GPU 3 ───► Independent Python interpreter
```
The GIL is why DP's scaling degrades significantly beyond 2-4 GPUs.
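To see the effect in isolation, here is a minimal sketch with no PyTorch or GPUs involved (the busy-work function and worker counts are illustrative assumptions): four CPU-bound tasks run on threads finish roughly one after another because of the GIL, while the same tasks run on processes execute genuinely in parallel.

```python
# Minimal GIL demonstration: CPU-bound work on threads vs. processes.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound_work(n: int) -> int:
    # Pure-Python loop: holds the GIL for its entire duration
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, workers: int, n: int) -> float:
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        list(pool.map(cpu_bound_work, [n] * workers))
    return time.perf_counter() - start

if __name__ == '__main__':
    # Threads: serialized by the GIL, wall time ~ workers * single-task time
    print(f"threads:   {timed(ThreadPoolExecutor, 4, 2_000_000):.2f}s")
    # Processes: independent interpreters, wall time ~ single-task time
    print(f"processes: {timed(ProcessPoolExecutor, 4, 2_000_000):.2f}s")
```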
DataParallel Deep Dive
How DP Works
- Scatter: Split batch across GPUs
- Replicate: Copy model to each GPU (every forward pass!)
- Forward: Run forward pass on each GPU
- Gather: Collect outputs on GPU 0
- Backward: Compute gradients (GIL contention here)
- Update: Update model on GPU 0
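These steps can be written out by hand with PyTorch's low-level parallel primitives. The sketch below is illustrative rather than DataParallel's actual source, and assumes a 4-GPU machine, a `model` already on GPU 0, and an `inputs` batch tensor:

```python
# Hand-written sketch of one DataParallel forward pass (illustrative).
# Assumes `model` lives on cuda:0 and `inputs` is a batch tensor.
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

device_ids = [0, 1, 2, 3]

# 1. Scatter: split the batch along dim 0 across GPUs
input_chunks = scatter(inputs, device_ids)

# 2. Replicate: copy the model to every GPU (repeated on every forward!)
replicas = replicate(model, device_ids[:len(input_chunks)])

# 3. Forward: run each replica on its chunk (threads, so GIL-bound)
outputs = parallel_apply(replicas, input_chunks)

# 4. Gather: concatenate all outputs back onto GPU 0
result = gather(outputs, target_device=0)
```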
DP Code
```python
import torch
import torch.nn as nn

model = YourModel()
model = nn.DataParallel(model)
model = model.to('cuda:0')

for batch in dataloader:
    inputs = batch['input'].to('cuda:0')    # Always to GPU 0
    targets = batch['target'].to('cuda:0')  # Labels also go to GPU 0
    outputs = model(inputs)                 # Automatically distributed
    loss = criterion(outputs, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
DP Problems
- Model Replication Every Forward Pass
  - Entire model copied to each GPU on every iteration
  - Significant overhead for large models
- Memory Imbalance
  - GPU 0 gathers all outputs
  - GPU 0 uses more memory than the other GPUs (see the sketch after this list)
- GIL Bottleneck
  - Python operations serialized
  - Limits parallel efficiency
- No Gradient Bucketing
  - Gradients synchronized less efficiently
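The memory imbalance is easy to observe directly. A quick check, assuming the `model`, `dataloader`, and `criterion` from the DP example above (and that the batch dict carries a `'target'` key), is to print per-GPU allocated memory after one forward/backward pass:

```python
# Observe DataParallel's memory imbalance: GPU 0 typically reports the most.
import torch

batch = next(iter(dataloader))
outputs = model(batch['input'].to('cuda:0'))
loss = criterion(outputs, batch['target'].to('cuda:0'))
loss.backward()

for i in range(torch.cuda.device_count()):
    mb = torch.cuda.memory_allocated(i) / 1024**2
    print(f"GPU {i}: {mb:.0f} MiB allocated")
```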
DistributedDataParallel Deep Dive
How DDP Works
- Initialize: Each process loads its own model copy
- Data Sharding: Each process gets different data
- Forward: Independent forward passes
- Backward: Local gradient computation
- AllReduce: NCCL synchronizes gradients
- Update: Each process updates its own model
DDP Code
```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main(rank, world_size):
    # Initialize process group
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=world_size,
        rank=rank
    )

    # Create model on this GPU
    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # any optimizer works
    criterion = nn.CrossEntropyLoss()  # pick a loss appropriate to the task

    # Use DistributedSampler for data sharding
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for batch in dataloader:
            inputs = batch['input'].to(rank)
            targets = batch['target'].to(rank)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            optimizer.zero_grad()
            loss.backward()   # Gradients synced automatically!
            optimizer.step()

    dist.destroy_process_group()

# Launch with torchrun:
# torchrun --nproc_per_node=4 train.py
```
DDP Advantages
- No Model Replication
  - Model stays on each GPU
  - No per-iteration replication overhead
- Balanced Memory
  - Each GPU has the same workload
  - No master-GPU bottleneck
- Efficient Gradient Sync
  - NCCL ring allreduce
  - Gradient bucketing for overlapped communication
- No GIL Contention
  - Separate processes
  - True parallelism
Gradient Synchronization
DDP's Ring AllReduce
DDP uses NCCL's ring allreduce algorithm:
```
GPU 0 ──► GPU 1 ──► GPU 2 ──► GPU 3
  ▲                             │
  └─────────────────────────────┘
```
Each GPU:
- Sends a chunk of its gradients to the next GPU in the ring
- Receives a chunk from the previous GPU and adds it to its own
- After `n-1` rounds, each GPU holds the complete sum for one chunk; another `n-1` rounds pass the finished chunks around until every GPU has the full summed gradient
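To make the rounds concrete, here is a small pure-Python simulation of ring allreduce over four simulated workers (an illustrative sketch, not NCCL's implementation):

```python
# Pure-Python simulation of ring allreduce with 4 workers.
# Each worker's "gradient" is split into n chunks (here: one float per chunk).
n = 4
# Worker w starts with gradient value (w + 1) in every chunk.
chunks = [[float(w + 1)] * n for w in range(n)]
expected = float(sum(range(1, n + 1)))  # 1 + 2 + 3 + 4 = 10

# Phase 1: reduce-scatter (n - 1 rounds).
# In round s, worker w sends its partial sum of chunk (w - s) % n to the
# next worker, which accumulates it into its own copy.
for s in range(n - 1):
    for w in range(n):
        c = (w - s) % n
        chunks[(w + 1) % n][c] += chunks[w][c]

# Now worker w owns the complete sum of chunk (w + 1) % n.

# Phase 2: allgather (n - 1 more rounds).
# In round s, worker w forwards the finished chunk (w + 1 - s) % n.
for s in range(n - 1):
    for w in range(n):
        c = (w + 1 - s) % n
        chunks[(w + 1) % n][c] = chunks[w][c]

assert all(v == expected for worker in chunks for v in worker)
print(chunks)  # every worker now holds [10.0, 10.0, 10.0, 10.0]
```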
Gradient Bucketing
DDP groups gradients into buckets for overlapped communication:
```
# DDP starts communication before backward completes!
layer_3.backward() ─► Start AllReduce(bucket_3)
layer_2.backward() ─► Start AllReduce(bucket_2)  # Overlaps with bucket_3
layer_1.backward() ─► Start AllReduce(bucket_1)  # Overlaps with bucket_2
```
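Bucket size is configurable through the `bucket_cap_mb` argument of the DDP constructor (the PyTorch default is 25 MB); the value below is only an illustrative choice:

```python
# Tune how much gradient data DDP accumulates into one bucket before
# launching an allreduce. Larger buckets mean fewer, bigger collectives;
# smaller buckets start communication earlier during backward.
model = DDP(
    model.to(rank),
    device_ids=[rank],
    bucket_cap_mb=50,  # illustrative value; PyTorch's default is 25
)
```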
Launching DDP
Using torchrun (Recommended)
```bash
# Single node, 4 GPUs
torchrun --nproc_per_node=4 train.py

# Multi-node: 2 nodes, 4 GPUs each
torchrun --nnodes=2 --nproc_per_node=4 \
    --rdzv_id=job1 --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py
```
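When launched this way, torchrun sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` as environment variables, so the training script reads them from the environment instead of receiving them as function arguments. A sketch of such an entry point (the helper name is ours):

```python
# Entry point for a script launched with torchrun: ranks come from the
# environment rather than from mp.spawn arguments.
import os
import torch
import torch.distributed as dist

def setup_from_env():
    # init_method='env://' reads RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT
    dist.init_process_group(backend='nccl', init_method='env://')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    return local_rank, dist.get_rank(), dist.get_world_size()

if __name__ == '__main__':
    local_rank, rank, world_size = setup_from_env()
    # ... build model, wrap in DDP(device_ids=[local_rank]), train ...
```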
Using torch.multiprocessing
```python
import torch
import torch.multiprocessing as mp

if __name__ == '__main__':
    # MASTER_ADDR and MASTER_PORT must be set in the environment for
    # init_method='env://' to work when spawning processes manually.
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size,), nprocs=world_size)
```
Comparison Table
| Aspect | DataParallel | DistributedDataParallel |
|---|---|---|
| Processes | 1 | N (one per GPU) |
| GIL Impact | Bottleneck | None |
| Model Location | Replicated per forward | Stays on each GPU |
| Memory Balance | GPU 0 heavy | Balanced |
| Communication | Scatter/Gather | AllReduce (NCCL) |
| Gradient Sync | Reduced onto GPU 0 during backward | Bucketed, overlapped AllReduce |
| Scaling | 2-4 GPUs max | 100s of GPUs |
| Code Complexity | Simple | More setup |
When to Use Each
Use DataParallel When:
- Quick prototyping
- Single machine, 2-4 GPUs
- Don't want to modify training loop
- Small models where replication is cheap
Use DDP When:
- Production training
- More than 2-4 GPUs
- Multi-node training
- Large models
- Need best performance
Migration from DP to DDP
```python
# Before (DataParallel)
model = nn.DataParallel(model)
model.to('cuda:0')

for batch in dataloader:
    outputs = model(batch['input'].to('cuda:0'))
    loss = criterion(outputs, batch['target'].to('cuda:0'))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After (DDP)
dist.init_process_group(backend='nccl')
model = DDP(model.to(local_rank), device_ids=[local_rank])
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler)

for epoch in range(epochs):
    sampler.set_epoch(epoch)
    for batch in dataloader:
        outputs = model(batch['input'].to(local_rank))
        loss = criterion(outputs, batch['target'].to(local_rank))
        optimizer.zero_grad()
        loss.backward()   # AllReduce happens here
        optimizer.step()
```
Related Concepts
- DataLoader Pipeline - How data flows to GPUs
- num_workers - Parallel data loading
- NCCL Communication - GPU-to-GPU communication
- Distributed Parallelism - Parallelism strategies
- Python GIL - Why DDP beats DP
