Pinned Memory and DMA Transfers in PyTorch

The pin_memory=True parameter in PyTorch DataLoader enables faster CPU→GPU data transfers by using page-locked (pinned) memory. When host-to-device transfer is the bottleneck, this seemingly simple flag can improve training throughput by as much as 30-50%; the gain shrinks toward zero when the GPU is compute-bound.

Understanding Memory Types

Pageable Memory (Default)

When you allocate memory in Python, the operating system can swap it to disk:

import numpy as np
import torch

# Standard Python/NumPy allocation uses pageable memory
data = np.zeros((1000, 1000))  # Can be swapped to disk

# PyTorch tensors also use pageable memory by default
tensor = torch.zeros(1000, 1000)  # Pageable

Characteristics:

Can be swapped to disk by OS
Virtual memory backed
GPU cannot access directly

Pinned Memory (Page-Locked)

Pinned memory is “locked” in physical RAM — the OS cannot swap it:

# Explicitly pinned tensor
pinned_tensor = torch.zeros(1000, 1000).pin_memory()

# Or via DataLoader
loader = DataLoader(dataset, pin_memory=True)

Characteristics:

Locked in physical RAM
Cannot be swapped to disk
GPU can access via DMA

How DMA Transfers Work

Without pinned memory, the CPU must copy data to a temporary pinned staging buffer before DMA can transfer it to the GPU. This means two copies: RAM → staging buffer → GPU VRAM. The CPU is blocked during the staging copy.

With pinned memory, the DMA controller reads directly from the pinned allocation. One copy: RAM → GPU VRAM. The CPU is free to do other work (like preparing the next batch) while the transfer runs asynchronously.
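
The difference between the two paths can be measured directly. The following is a minimal sketch rather than a rigorous benchmark: time_h2d_ms and compare_transfer_times are hypothetical helper names, the 64 MB payload is an arbitrary choice, and the comparison returns None when no CUDA device is present.

```python
import torch

def time_h2d_ms(src, repeats=10):
    """Average host-to-device copy time in milliseconds, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    src.to("cuda")                      # warm-up: CUDA context and allocator setup
    torch.cuda.synchronize()
    start.record()
    for _ in range(repeats):
        src.to("cuda", non_blocking=True)
    end.record()
    torch.cuda.synchronize()            # wait until all copies have finished
    return start.elapsed_time(end) / repeats

def compare_transfer_times(num_floats=16 * 1024 * 1024):
    """Return timings for pageable vs pinned sources, or None without a GPU."""
    if not torch.cuda.is_available():
        return None
    pageable = torch.randn(num_floats)  # float32, 64 MB at the default size
    pinned = pageable.pin_memory()      # page-locked copy of the same data
    return {
        "pageable_ms": time_h2d_ms(pageable),
        "pinned_ms": time_h2d_ms(pinned),
    }

if __name__ == "__main__":
    print(compare_transfer_times())
```

The absolute numbers depend on the PCIe generation, the driver, and the payload size, so treat the output as a relative comparison on one machine.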

The Code Behind It

PyTorch’s Implementation

When pin_memory=True, the DataLoader calls pin_memory() on each batch:

# Simplified DataLoader logic (shown as a method, hence self)
def fetch_batch(self, samples):
    batch = collate_fn(samples)
    if self.pin_memory:
        batch = pin_memory(batch)
    return batch

def pin_memory(data):
    if isinstance(data, torch.Tensor):
        return data.pin_memory()
    elif isinstance(data, dict):
        return {k: pin_memory(v) for k, v in data.items()}
    elif isinstance(data, (list, tuple)):
        return type(data)(pin_memory(v) for v in data)
    return data
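
The simplified function above returns unknown types unchanged, so a custom batch class would pass through unpinned. The DataLoader documentation describes a way around this: give the custom type its own pin_memory() method. Here is a hedged sketch of that pattern; PairBatch and first_batch are hypothetical names, and the random data is only a stand-in.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

class PairBatch:
    """Hypothetical custom batch type holding inputs and targets."""

    def __init__(self, samples):
        inputs, targets = zip(*samples)
        self.inputs = torch.stack(inputs)
        self.targets = torch.stack(targets)

    def pin_memory(self):
        # Called by the DataLoader's pinning step when pin_memory=True
        self.inputs = self.inputs.pin_memory()
        self.targets = self.targets.pin_memory()
        return self

def first_batch(use_pinning):
    """Build a tiny loader that collates into PairBatch and return its first batch."""
    dataset = TensorDataset(torch.randn(8, 4), torch.randint(0, 2, (8,)))
    loader = DataLoader(dataset, batch_size=4, collate_fn=PairBatch,
                        pin_memory=use_pinning)
    return next(iter(loader))

if __name__ == "__main__":
    batch = first_batch(use_pinning=torch.cuda.is_available())
    print(type(batch).__name__, tuple(batch.inputs.shape))
```

Pinning is only requested when CUDA is available, since page-locking goes through the CUDA driver.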

The Pinning Cost

Calling .pin_memory() is not free. It copies the tensor into a buffer obtained from CUDA’s page-locked allocator (cudaHostAlloc()), which:

Allocates page-aligned memory
Locks those pages in physical RAM so the OS cannot swap them (the same effect as mlock())
Registers the memory with the CUDA driver for DMA access

This typically takes on the order of 10-100µs per call, much slower than a regular malloc. That’s why PyTorch’s DataLoader pins memory in a dedicated thread (pin_memory_thread) rather than in the main training loop.

# PyTorch uses a separate thread for pinning:
def _pin_memory_loop(in_queue, out_queue, device_id):
    while True:
        batch = in_queue.get()
        batch = pin_memory(batch)  # 10-100us, hidden behind data loading
        out_queue.put(batch)
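
To see why a separate thread hides the pinning cost, here is a toy version of the same queue pattern using only the standard library. It is a sketch under stated assumptions: pin_fn stands in for the real pinning call, the None sentinel for shutdown is an illustrative choice, and the real DataLoader loop carries more bookkeeping than this.

```python
import queue
import threading

def pin_loop(in_queue, out_queue, pin_fn):
    """Move batches from in_queue to out_queue, applying pin_fn; stop on None."""
    while True:
        batch = in_queue.get()
        if batch is None:               # sentinel: no more batches
            out_queue.put(None)
            break
        out_queue.put(pin_fn(batch))

def run_pipeline(batches, pin_fn):
    """Feed batches through a background pinning thread and collect the results."""
    in_q, out_q = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=pin_loop, args=(in_q, out_q, pin_fn), daemon=True)
    worker.start()
    for b in batches:
        in_q.put(b)
    in_q.put(None)
    results = []
    while (item := out_q.get()) is not None:
        results.append(item)
    worker.join()
    return results

print(run_pipeline([1, 2, 3], lambda b: ("pinned", b)))
```

The consumer only ever blocks on out_q.get(), so any work done inside pin_fn overlaps with whatever the consumer does between calls.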

Async Transfer with Pinned Memory

Pinned memory enables truly asynchronous transfers:

# Create CUDA stream
stream = torch.cuda.Stream()

# Synchronous (blocks until complete)
data_gpu = data_cpu.to('cuda')

# Asynchronous with pinned memory
data_pinned = data_cpu.pin_memory()
with torch.cuda.stream(stream):
    data_gpu = data_pinned.to('cuda', non_blocking=True)
# CPU continues immediately, transfer happens in background
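
One detail the snippet above leaves implicit: a tensor copied on a side stream must not be read on another stream until the copy has finished. The sketch below adds that ordering step. copy_on_side_stream is a hypothetical helper name, and the function returns None on machines without CUDA.

```python
import torch

def copy_on_side_stream(data_cpu):
    """Copy a CPU tensor to the GPU on a side stream; return None without CUDA."""
    if not torch.cuda.is_available():
        return None
    copy_stream = torch.cuda.Stream()
    data_pinned = data_cpu.pin_memory()
    with torch.cuda.stream(copy_stream):
        data_gpu = data_pinned.to("cuda", non_blocking=True)
    # Make the current stream wait for the copy before any kernel reads data_gpu.
    torch.cuda.current_stream().wait_stream(copy_stream)
    # Tell the caching allocator that data_gpu is also used on the current stream.
    data_gpu.record_stream(torch.cuda.current_stream())
    return data_gpu

if __name__ == "__main__":
    print(copy_on_side_stream(torch.arange(6, dtype=torch.float32)))
```

wait_stream only orders the two GPU streams; it does not block the CPU, so the host thread still continues immediately.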

Performance: When Pinning Helps

The throughput impact of pin_memory depends on batch size, GPU compute time, and whether data loading overlaps with compute.

The Overlap Principle

The real power of pinned memory is overlapping transfers with computation:

Batch N is on the GPU, model is computing
Batch N+1 is transferring via DMA (CPU is free)
Batch N+2 is being loaded from disk by worker processes

All three happen simultaneously. Without pinned memory, step 2 blocks the main process during the staging copy, so it cannot dequeue the next batch or launch further GPU work until the copy finishes.
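
A back-of-the-envelope model makes the benefit concrete. This sketch compares the two extremes, fully serial and fully overlapped; the stage durations are made-up illustrative numbers, and step_time_ms is a hypothetical helper.

```python
def step_time_ms(load_ms, transfer_ms, compute_ms, overlapped):
    """Steady-state time per batch for a three-stage pipeline."""
    if overlapped:
        # All stages run concurrently, so the slowest stage sets the pace.
        return max(load_ms, transfer_ms, compute_ms)
    # Stages run back to back for every batch.
    return load_ms + transfer_ms + compute_ms

serial = step_time_ms(load_ms=8, transfer_ms=4, compute_ms=20, overlapped=False)
pipelined = step_time_ms(load_ms=8, transfer_ms=4, compute_ms=20, overlapped=True)
print(serial, pipelined)  # 32 20
```

Real pipelines fall somewhere between the two, but the model shows why overlap matters most when loading and transfer together are a large fraction of compute time.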

Interaction with DataLoader Settings

num_workers

pin_memory and num_workers are complementary:

num_workers > 0: data loading runs in separate processes
pin_memory=True: GPU transfer uses DMA instead of staged copy

The combination is powerful: workers prepare data in parallel, and pinned memory transfers it without blocking.

DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,           # parallel data loading
    pin_memory=True,         # fast GPU transfer
    persistent_workers=True, # keep workers alive between epochs
    prefetch_factor=2,       # each worker prefetches 2 batches
)

prefetch_factor

prefetch_factor controls how many batches each worker prepares ahead. With pin_memory=True, pinned batches sit in RAM waiting for transfer.

# Memory impact (approximate upper bound):
# pinned_memory <= num_workers * prefetch_factor * batch_memory
# Example: 4 workers * 2 prefetch * 50MB batch = 400MB pinned

If RAM is limited, reduce prefetch_factor before disabling pin_memory.
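
The estimate can be turned into a small calculator. This is a sketch: batch_bytes and pinned_upper_bound_mb are hypothetical helper names, and the image shape is just an example input assuming float32 data.

```python
def batch_bytes(batch_size, sample_shape, bytes_per_element=4):
    """Size in bytes of one dense batch (float32 by default)."""
    n = batch_size
    for dim in sample_shape:
        n *= dim
    return n * bytes_per_element

def pinned_upper_bound_mb(num_workers, prefetch_factor, batch_mb):
    """Approximate upper bound on pinned memory held by prefetched batches."""
    return num_workers * prefetch_factor * batch_mb

image_batch_mb = batch_bytes(64, (3, 224, 224)) / 1e6
print(round(image_batch_mb, 1))             # 38.5
print(pinned_upper_bound_mb(4, 2, 50))      # 400
```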

NUMA Considerations

On multi-socket servers, pinned memory is allocated on the NUMA node of the calling thread. If the GPU is on a different socket, DMA crosses the inter-socket link, adding latency.

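# Check CPU and memory NUMA topology
numactl --hardware
# Check which CPUs and NUMA node each GPU is attached to
nvidia-smi topo -m
# Example layout: node 0 has CPUs 0-31 and GPUs 0-3; node 1 has CPUs 32-63 and GPUs 4-7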

# Pin process to correct NUMA node
numactl --cpunodebind=0 --membind=0 python train.py

This matters most for multi-GPU training. PyTorch’s DataLoader doesn’t manage NUMA affinity — use numactl or taskset at the process level.
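
To confirm from inside the training process that the binding took effect, the process can compare its CPU affinity against the kernel's NUMA layout. This is a Linux-only sketch that assumes the usual sysfs path /sys/devices/system/node; parse_cpulist and numa_nodes_for_process are hypothetical helper names.

```python
import glob
import os

def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def numa_nodes_for_process():
    """Return the NUMA node ids whose CPUs this process may run on (Linux only)."""
    if not hasattr(os, "sched_getaffinity"):
        return []
    allowed = os.sched_getaffinity(0)
    nodes = []
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/cpulist"):
        with open(path) as f:
            if parse_cpulist(f.read()) & allowed:
                # path looks like .../node/node3/cpulist; extract the 3
                nodes.append(int(path.split("/")[-2][len("node"):]))
    return sorted(nodes)

print(numa_nodes_for_process())
```

If the result lists more than one node after launching under numactl, the binding did not apply to this process.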

When pin_memory Hurts

Memory Pressure

Pinned memory cannot be swapped. On a 32 GB system:

Pinned: 4 workers * 2 prefetch * 128 MB = 1 GB locked
Model + optimizer: ~4 GB
Python + libraries: ~2 GB
Available for OS: 32 - 7 = 25 GB  (fine)

But on a 16 GB system, the same config leaves only 9 GB. If dataset preprocessing allocates temporary buffers, the OS starts swapping OTHER processes — worse than the staging copy pin_memory was avoiding.

Monitoring

# Watch for swap during training
watch -n 1 'free -h | grep Swap'

# If swap used > 0, reduce pinned memory:
# 1. Lower num_workers or prefetch_factor
# 2. Reduce batch_size
# 3. Disable pin_memory (last resort)
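
The same check can run inside the training script by reading /proc/meminfo. This is a Linux-only sketch; parse_meminfo and swap_used_kb are hypothetical helpers, and the sample text is fabricated purely to exercise the parser.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values

def swap_used_kb(meminfo):
    """Swap currently in use, in kB."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)

# Fabricated sample input; on Linux use open("/proc/meminfo").read() instead.
sample = "MemTotal: 16384000 kB\nSwapTotal: 2097152 kB\nSwapFree: 2000000 kB\n"
print(swap_used_kb(parse_meminfo(sample)))  # 97152
```

Calling this every few hundred steps and logging a warning when the value grows is usually enough to catch a pinned-memory budget that is too large.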

Memory Considerations

Monitoring Usage

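# Check that batches are pinned and estimate their size
import torch

# torch.cuda.memory_stats() reports device memory only, so inspect the batch itself
# (assumes the loader yields dict batches, as in the other examples)
batch = next(iter(loader))
pinned = [t for t in batch.values() if torch.is_tensor(t) and t.is_pinned()]
print(f"Pinned: {sum(t.numel() * t.element_size() for t in pinned) / 1e9:.2f} GB per batch")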

Best Practices

# Good: Pin memory in DataLoader
loader = DataLoader(dataset, pin_memory=True)

# Bad: Manually pinning large datasets
big_data = torch.randn(1000000, 1000).pin_memory()  # Don't do this!

# Good: Pin only what's needed for transfer
batch = next(iter(loader))  # Already pinned
batch = {k: v.to('cuda', non_blocking=True) for k, v in batch.items()}
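
The dict comprehension above only handles flat dict batches. For nested batches, a recursive helper that mirrors the structure of the pinning function works well. This is a sketch: to_device is a hypothetical name, and it assumes plain lists and tuples (namedtuples would need special handling).

```python
import torch

def to_device(data, device, non_blocking=True):
    """Recursively move every tensor in a nested batch to the given device."""
    if isinstance(data, torch.Tensor):
        return data.to(device, non_blocking=non_blocking)
    if isinstance(data, dict):
        return {k: to_device(v, device, non_blocking) for k, v in data.items()}
    if isinstance(data, (list, tuple)):
        return type(data)(to_device(v, device, non_blocking) for v in data)
    return data  # strings, ints, and other non-tensor values pass through

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    batch = {"images": torch.zeros(2, 3), "meta": [torch.ones(1), "id-0"]}
    print(to_device(batch, device))
```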

Implementation Details

CUDA’s cudaHostAlloc

Under the hood, PyTorch uses CUDA’s pinned memory allocator:

// PyTorch C++ implementation (simplified)
#include <cuda_runtime.h>

void* pin_memory(size_t size) {
    void* ptr = nullptr;
    // cudaHostAllocDefault: standard page-locked memory
    // cudaHostAllocMapped: also maps to GPU address space
    cudaError_t err = cudaHostAlloc(&ptr, size, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        return nullptr;  // allocation failed, e.g. not enough lockable RAM
    }
    return ptr;
}

Memory Pool

PyTorch uses a pinned memory pool to avoid allocation overhead:

# First pin: allocates from CUDA
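t1 = torch.randn(1000).pin_memory()
del t1  # frees the block back to PyTorch's pinned-memory cache

# Second pin: may reuse the cached block instead of calling into CUDA again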
t2 = torch.randn(1000).pin_memory()
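
The same reuse idea can be applied by hand when batch shapes are fixed: allocate one pinned buffer up front and copy each batch into it. The sketch below is illustrative only. PinnedStagingBuffer is a hypothetical class, and it synchronizes after each transfer so the buffer can be overwritten safely, which trades some overlap for simplicity.

```python
import torch

class PinnedStagingBuffer:
    """Hypothetical helper: one pinned buffer reused for every batch of a fixed shape."""

    def __init__(self, shape, dtype=torch.float32):
        self.use_cuda = torch.cuda.is_available()
        # Request pinned memory only when a CUDA device is present.
        self.buffer = torch.empty(shape, dtype=dtype, pin_memory=self.use_cuda)

    def stage(self, src):
        """Copy src into the buffer, then transfer it to the GPU if one is available."""
        self.buffer.copy_(src)
        if not self.use_cuda:
            return self.buffer.clone()
        out = self.buffer.to("cuda", non_blocking=True)
        # Wait for the copy so the next stage() call can overwrite the buffer.
        torch.cuda.current_stream().synchronize()
        return out

if __name__ == "__main__":
    staging = PinnedStagingBuffer((4,))
    print(staging.stage(torch.arange(4, dtype=torch.float32)))
```

In most training code the DataLoader's own pinning is simpler; a hand-managed buffer is mainly useful for custom input pipelines that bypass the DataLoader.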

Profiling

torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log'),
) as prof:
    for step, batch in enumerate(train_loader):
        images = batch['images'].to('cuda', non_blocking=True)
        labels = batch['labels'].to('cuda', non_blocking=True)
        optimizer.zero_grad()  # clear gradients from the previous step
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        prof.step()

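In the trace, look at the host-to-device copies. PyTorch profiler traces typically label them by source memory type, for example Memcpy HtoD (Pinned -> Device) versus Memcpy HtoD (Pageable -> Device). Pinned copies should overlap with compute kernels on the timeline.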

nvidia-smi

# Monitor GPU util during training
nvidia-smi dmon -s u -d 1
# If GPU util drops to 0% between batches, data loading is the bottleneck
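
The same utilization figure can be polled from Python, for example to log it alongside training metrics. This sketch shells out to nvidia-smi using its CSV query mode; parse_utilization and gpu_utilization are hypothetical helpers, and the function returns an empty list when nvidia-smi is not installed.

```python
import shutil
import subprocess

def parse_utilization(csv_text):
    """Turn nvidia-smi CSV output (one integer percentage per GPU) into a list."""
    return [int(line) for line in csv_text.splitlines() if line.strip().isdigit()]

def gpu_utilization():
    """Return per-GPU utilization percentages, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_utilization(result.stdout) if result.returncode == 0 else []

print(gpu_utilization())
```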

Complete Example

import torch
from torch.utils.data import DataLoader

# A reasonable starting configuration for GPU training
# (train_dataset, model, criterion, and optimizer are assumed to be defined)
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    num_workers=4,
    pin_memory=True,          # Enable DMA transfers
    persistent_workers=True,   # Keep workers alive
    prefetch_factor=2          # Prefetch 2 batches per worker
)

# Training loop with non-blocking transfer
for batch in train_loader:
    # Non-blocking transfer (returns immediately)
    images = batch['images'].to('cuda', non_blocking=True)
    labels = batch['labels'].to('cuda', non_blocking=True)

    # Kernels are queued on the same stream as the copy, so they run after it completes
    optimizer.zero_grad()
    outputs = model(images)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
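
For a version that runs end to end, including on machines without a GPU, the sketch below fills in the missing pieces with stand-ins. RandomImages, the tiny linear model, and train_one_epoch are all hypothetical. Pinning and non-blocking copies are enabled only when CUDA is available, and num_workers is left at its default of 0 to keep the example portable.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

class RandomImages(Dataset):
    """Hypothetical stand-in dataset yielding dict samples of random data."""

    def __init__(self, n=32):
        self.images = torch.randn(n, 3, 8, 8)
        self.labels = torch.randint(0, 10, (n,))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return {"images": self.images[i], "labels": self.labels[i]}

def train_one_epoch():
    """Run one epoch on random data and return the per-batch losses."""
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    loader = DataLoader(RandomImages(), batch_size=8, pin_memory=use_cuda)
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10)).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    losses = []
    for batch in loader:
        images = batch["images"].to(device, non_blocking=use_cuda)
        labels = batch["labels"].to(device, non_blocking=use_cuda)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

if __name__ == "__main__":
    print(train_one_epoch())
```

Gating pin_memory on torch.cuda.is_available() keeps the same script usable on CPU-only development machines.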