DataParallel vs DistributedDataParallel

When scaling PyTorch training to multiple GPUs, you have two main options: DataParallel (DP) and DistributedDataParallel (DDP). While DP is simpler, DDP is almost always the better choice for serious multi-GPU training.

Interactive Visualization

The Core Difference

DataParallel (DP): Single Process

# Simple, but limited
model = nn.DataParallel(model)
model = model.to('cuda:0')  # Primary GPU

DP runs in a single Python process. This means:

One Python interpreter
One GIL (Global Interpreter Lock)
All GPU operations managed by one process

DistributedDataParallel (DDP): Multiple Processes

# More setup, but scales better
import torch.distributed as dist

dist.init_process_group(backend='nccl')
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

DDP launches one process per GPU. This means:

Multiple Python interpreters
No GIL contention between GPUs
Each process manages its own GPU

Why the GIL Matters

Python's Global Interpreter Lock prevents true parallel execution of Python bytecode:

# With DataParallel (single process)
# Thread 1: Forward on GPU 0  ─┐
# Thread 2: Forward on GPU 1  ─┼─► Only ONE thread runs Python at a time!
# Thread 3: Forward on GPU 2  ─┤
# Thread 4: Forward on GPU 3  ─┘

# With DDP (multiple processes)
# Process 0: Forward on GPU 0  ───► Independent Python interpreter
# Process 1: Forward on GPU 1  ───► Independent Python interpreter
# Process 2: Forward on GPU 2  ───► Independent Python interpreter
# Process 3: Forward on GPU 3  ───► Independent Python interpreter

The GIL is why DP's scaling degrades significantly beyond 2-4 GPUs.

DataParallel Deep Dive

How DP Works

Scatter: Split batch across GPUs
Replicate: Copy model to each GPU (every forward pass!)
Forward: Run forward pass on each GPU
Gather: Collect outputs on GPU 0
Backward: Compute gradients (GIL contention here)
Update: Update model on GPU 0

DP Code

import torch
import torch.nn as nn

model = YourModel()
model = nn.DataParallel(model)
model = model.to('cuda:0')

for batch in dataloader:
    inputs = batch['input'].to('cuda:0')  # Always to GPU 0
    outputs = model(inputs)  # Automatically distributed
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

DP Problems

Model Replication Every Forward Pass
- Entire model copied to each GPU
- Significant overhead for large models
Memory Imbalance
- GPU 0 gathers all outputs
- GPU 0 uses more memory than others
GIL Bottleneck
- Python operations serialized
- Limits parallel efficiency
No Gradient Bucketing
- Gradients synchronized less efficiently

DistributedDataParallel Deep Dive

How DDP Works

Initialize: Each process loads its own model copy
Data Sharding: Each process gets different data
Forward: Independent forward passes
Backward: Local gradient computation
AllReduce: NCCL synchronizes gradients
Update: Each process updates its own model

DDP Code

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

def main(rank, world_size):
    # Initialize process group
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=world_size,
        rank=rank
    )

    # Create model on this GPU
    model = YourModel().to(rank)
    model = DDP(model, device_ids=[rank])

    # Use DistributedSampler for data sharding
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # Shuffle differently each epoch
        for batch in dataloader:
            inputs = batch['input'].to(rank)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()  # Gradients synced automatically!
            optimizer.step()

    dist.destroy_process_group()

# Launch with torchrun
# torchrun --nproc_per_node=4 train.py

DDP Advantages

No Model Replication
- Model stays on each GPU
- Zero per-iteration overhead
Balanced Memory
- Each GPU has same workload
- No master GPU bottleneck
Efficient Gradient Sync
- NCCL ring allreduce
- Gradient bucketing for overlapped communication
No GIL Contention
- Separate processes
- True parallelism

Gradient Synchronization

DDP's Ring AllReduce

DDP uses NCCL's ring allreduce algorithm:

GPU 0 ──► GPU 1 ──► GPU 2 ──► GPU 3
  ▲                               │
  └───────────────────────────────┘

Each GPU:

Sends gradients to neighbor
Receives gradients from other neighbor
After n-1 rounds, all GPUs have the sum

Gradient Bucketing

DDP groups gradients into buckets for overlapped communication:

# DDP starts communication before backward completes!
layer_3.backward()  ─► Start AllReduce(bucket_3)
layer_2.backward()  ─► Start AllReduce(bucket_2)  # Overlaps with bucket_3
layer_1.backward()  ─► Start AllReduce(bucket_1)  # Overlaps with bucket_2

Launching DDP

Using torchrun (Recommended)

# Single node, 4 GPUs
torchrun --nproc_per_node=4 train.py

# Multi-node, 4 GPUs each, 2 nodes
torchrun --nnodes=2 --nproc_per_node=4 \
    --rdzv_id=job1 --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py

Using torch.multiprocessing

import torch.multiprocessing as mp

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size,), nprocs=world_size)

Comparison Table

Aspect	DataParallel	DistributedDataParallel
Processes	1	N (one per GPU)
GIL Impact	Bottleneck	None
Model Location	Replicated per forward	Stays on each GPU
Memory Balance	GPU 0 heavy	Balanced
Communication	Scatter/Gather	AllReduce (NCCL)
Gradient Sync	Implicit	Bucketed, overlapped
Scaling	2-4 GPUs max	100s of GPUs
Code Complexity	Simple	More setup

When to Use Each

Use DataParallel When:

Quick prototyping
Single machine, 2-4 GPUs
Don't want to modify training loop
Small models where replication is cheap

Use DDP When:

Production training
More than 2-4 GPUs
Multi-node training
Large models
Need best performance

Migration from DP to DDP

# Before (DataParallel)
model = nn.DataParallel(model)
model.to('cuda:0')

for batch in dataloader:
    outputs = model(batch['input'].to('cuda:0'))
    loss.backward()
    optimizer.step()

# After (DDP)
dist.init_process_group(backend='nccl')
model = DDP(model.to(local_rank), device_ids=[local_rank])
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler)

for epoch in range(epochs):
    sampler.set_epoch(epoch)
    for batch in dataloader:
        outputs = model(batch['input'].to(local_rank))
        loss.backward()  # AllReduce happens here
        optimizer.step()

GPU & High-Performance Computing

Multi-GPU Communication: NVLink, PCIe, and NCCL

How GPUs talk: the bandwidth cliff from HBM to Ethernet, NVLink 5 and GB200 NVL72 topologies, ring AllReduce step by step, and choosing between NCCL, Gloo, and MPI.

GPU & High-Performance Computing

NCCL: How NVIDIA Collective Communication Works

A deep dive into NCCL internals: communicators and channels, how it picks ring/tree/NVLS algorithms and LL/LL128/Simple protocols, reading NCCL_DEBUG logs, and tuning and debugging distributed training.

Language & Framework Internals

PyTorch DataLoader Pipeline

PyTorch DataLoader deep dive — Dataset, Sampler, Workers, Collate internals, num_workers throughput profiling, memory analysis, serialization costs, production patterns (LMDB, WebDataset), and bottleneck diagnosis.

Language & Framework Internals

Understanding num_workers

Deep dive into PyTorch DataLoader num_workers parameter: how parallel workers prefetch data, optimal configuration, and common pitfalls.

Language & Framework Internals

Pinned Memory and DMA Transfers in PyTorch

Complete guide to PyTorch pin_memory — how DMA transfers work, when pinning helps vs hurts, NUMA effects, profiling with torch.profiler, num_workers interaction, and debugging slow data loading.

Deep Learning

Batch Normalization in Deep Learning

Learn batch normalization in deep learning: how normalizing layer inputs accelerates training, improves gradient flow, and acts as regularization.