
Pinned Memory and DMA Transfers in PyTorch

Complete guide to PyTorch pin_memory — how DMA transfers work, when pinning helps vs hurts, NUMA effects, profiling with torch.profiler, num_workers interaction, and debugging slow data loading.

16 min read | pytorch · gpu · memory · dma · cuda · performance

The pin_memory=True parameter in PyTorch DataLoader enables faster CPU→GPU data transfers by using page-locked (pinned) memory. This seemingly simple flag can improve training throughput by 30-50% in I/O-bound workloads.


Understanding Memory Types

Pageable Memory (Default)

When you allocate memory in Python, the operating system can swap it to disk:

```python
import numpy as np
import torch

# Standard Python/NumPy allocation uses pageable memory
data = np.zeros((1000, 1000))  # Can be swapped to disk

# PyTorch tensors also use pageable memory by default
tensor = torch.zeros(1000, 1000)  # Pageable
```

Characteristics:

  • Can be swapped to disk by OS
  • Virtual memory backed
  • GPU cannot access directly

Pinned Memory (Page-Locked)

Pinned memory is “locked” in physical RAM — the OS cannot swap it:

```python
# Explicitly pinned tensor
pinned_tensor = torch.zeros(1000, 1000).pin_memory()

# Or via DataLoader
loader = DataLoader(dataset, pin_memory=True)
```

Characteristics:

  • Locked in physical RAM
  • Cannot be swapped to disk
  • GPU can access via DMA
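You can confirm whether a tensor is pinned with `Tensor.is_pinned()`. A minimal sketch — the helper `to_pinned_if_cuda` is ours, not a PyTorch API; pinning requires a CUDA runtime, so it falls back to the unpinned tensor on CPU-only machines:

```python
import torch

def to_pinned_if_cuda(t: torch.Tensor) -> torch.Tensor:
    # pin_memory() needs the CUDA runtime; return the tensor unchanged without it
    if torch.cuda.is_available():
        return t.pin_memory()
    return t

t = torch.zeros(1000, 1000)
p = to_pinned_if_cuda(t)
print(p.is_pinned())  # True with a CUDA GPU, False otherwise
```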

How DMA Transfers Work

Without pinned memory, the CPU must copy data to a temporary pinned staging buffer before DMA can transfer it to the GPU. This means two copies: RAM → staging buffer → GPU VRAM. The CPU is blocked during the staging copy.

With pinned memory, the DMA controller reads directly from the pinned allocation. One copy: RAM → GPU VRAM. The CPU is free to do other work (like preparing the next batch) while the transfer runs asynchronously.
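One way to see the staging cost directly is to time the same host-to-device copy from a pageable and a pinned source. A rough sketch — `h2d_time_ms` is our helper, not a PyTorch API, and the comparison only runs when a CUDA GPU is present:

```python
import time
import torch

def h2d_time_ms(t: torch.Tensor, iters: int = 20) -> float:
    # Average one host-to-device copy; synchronize so we time the copies themselves
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        t.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

if torch.cuda.is_available():
    src = torch.randn(64, 3, 224, 224)                          # pageable source: staged copy
    print(f"pageable: {h2d_time_ms(src):.2f} ms")
    print(f"pinned:   {h2d_time_ms(src.pin_memory()):.2f} ms")  # direct DMA
```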

The Code Behind It

PyTorch’s Implementation

When pin_memory=True, the DataLoader calls pin_memory() on each batch:

```python
# Simplified DataLoader logic
def fetch_batch():
    batch = collate_fn(samples)
    if self.pin_memory:
        batch = pin_memory(batch)
    return batch

def pin_memory(data):
    if isinstance(data, torch.Tensor):
        return data.pin_memory()
    elif isinstance(data, dict):
        return {k: pin_memory(v) for k, v in data.items()}
    elif isinstance(data, (list, tuple)):
        return type(data)(pin_memory(v) for v in data)
    return data
```

The Pinning Cost

Calling .pin_memory() is not free — it calls cudaHostAlloc() which:

  1. Allocates page-aligned memory
  2. Locks those pages in physical RAM so the OS cannot swap them out
  3. Registers the memory with the CUDA driver for DMA access

This takes 10-100µs per call, much slower than a regular malloc. That’s why PyTorch’s DataLoader pins memory in a dedicated thread (pin_memory_thread) rather than in the main training loop.

```python
# PyTorch uses a separate thread for pinning:
def _pin_memory_loop(in_queue, out_queue, device_id):
    while True:
        batch = in_queue.get()
        batch = pin_memory(batch)  # 10-100us, hidden behind data loading
        out_queue.put(batch)
```

Async Transfer with Pinned Memory

Pinned memory enables truly asynchronous transfers:

```python
# Create CUDA stream
stream = torch.cuda.Stream()

# Synchronous (blocks until complete)
data_gpu = data_cpu.to('cuda')

# Asynchronous with pinned memory
data_pinned = data_cpu.pin_memory()
with torch.cuda.stream(stream):
    data_gpu = data_pinned.to('cuda', non_blocking=True)
    # CPU continues immediately, transfer happens in background
```

Performance: When Pinning Helps

The throughput impact of pin_memory depends on batch size, GPU compute time, and whether data loading overlaps with compute.

The Overlap Principle

The real power of pinned memory is overlapping transfers with computation:

  1. Batch N is on the GPU, model is computing
  2. Batch N+1 is transferring via DMA (CPU is free)
  3. Batch N+2 is being loaded from disk by worker processes

All three happen simultaneously. Without pinned memory, step 2 blocks the CPU, preventing step 3 from starting.
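The three-stage pipeline can be sketched as a small prefetcher that launches the copy of batch N+1 on a side stream while batch N computes. `CUDAPrefetcher` is an illustrative sketch, not a PyTorch class; it assumes tensor batches and falls back to plain iteration on machines without CUDA:

```python
import torch

class CUDAPrefetcher:
    # Copies batch N+1 to the GPU on a side stream while batch N computes.
    # Works best when the source batches are already pinned (pin_memory=True).
    def __init__(self, loader, device="cuda"):
        self.it = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream() if torch.cuda.is_available() else None
        self._preload()

    def _preload(self):
        try:
            batch = next(self.it)
        except StopIteration:
            self.next_batch = None
            return
        if self.stream is not None:
            with torch.cuda.stream(self.stream):
                batch = batch.to(self.device, non_blocking=True)
        self.next_batch = batch

    def __iter__(self):
        return self

    def __next__(self):
        if self.next_batch is None:
            raise StopIteration
        if self.stream is not None:
            # Make the compute stream wait for the in-flight copy
            torch.cuda.current_stream().wait_stream(self.stream)
        batch = self.next_batch
        self._preload()  # kick off the next copy before returning
        return batch

# Usage with a plain list standing in for a DataLoader of tensor batches
for batch in CUDAPrefetcher([torch.zeros(2, 3) for _ in range(3)]):
    pass  # forward/backward on `batch` runs while the next copy is in flight
```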

Interaction with DataLoader Settings

num_workers

pin_memory and num_workers are complementary:

  • num_workers > 0: data loading runs in separate processes
  • pin_memory=True: GPU transfer uses DMA instead of staged copy

The combination is powerful: workers prepare data in parallel, and pinned memory transfers it without blocking.

```python
DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,            # parallel data loading
    pin_memory=True,          # fast GPU transfer
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=2,        # each worker prefetches 2 batches
)
```

prefetch_factor

prefetch_factor controls how many batches each worker prepares ahead. With pin_memory=True, pinned batches sit in RAM waiting for transfer.

```python
# Memory impact:
# pinned_memory = num_workers * prefetch_factor * batch_memory
# Example: 4 workers * 2 prefetch * 50MB batch = 400MB pinned
```

If RAM is limited, reduce prefetch_factor before disabling pin_memory.
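The rule of thumb can be wrapped in a small helper for sizing configurations before training. `pinned_ram_bytes` is our illustrative name, purely arithmetic:

```python
def pinned_ram_bytes(num_workers: int, prefetch_factor: int, batch_bytes: int) -> int:
    # Upper bound on RAM held by pinned, prefetched batches waiting for transfer
    return num_workers * prefetch_factor * batch_bytes

mb = 1024 ** 2
print(pinned_ram_bytes(4, 2, 50 * mb) // mb)  # 400 (MB)
```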

NUMA Considerations

On multi-socket servers, pinned memory is allocated on the NUMA node of the calling thread. If the GPU is on a different socket, DMA crosses the inter-socket link, adding latency.

```shell
# Check NUMA topology
numactl --hardware
# node 0: CPUs 0-31, GPU 0-3
# node 1: CPUs 32-63, GPU 4-7

# Pin process to correct NUMA node
numactl --cpunodebind=0 --membind=0 python train.py
```

This matters most for multi-GPU training. PyTorch’s DataLoader doesn’t manage NUMA affinity — use numactl or taskset at the process level.

When pin_memory Hurts

Memory Pressure

Pinned memory cannot be swapped. On a 32 GB system:

```
Pinned: 4 workers * 2 prefetch * 128 MB = 1 GB locked
Model + optimizer: ~4 GB
Python + libraries: ~2 GB
Available for OS: 32 - 7 = 25 GB (fine)
```

But on a 16 GB system, the same config leaves only 9 GB. If dataset preprocessing allocates temporary buffers, the OS starts swapping OTHER processes — worse than the staging copy pin_memory was avoiding.

Monitoring

```shell
# Watch for swap during training
watch -n 1 'free -h | grep Swap'

# If swap used > 0, reduce pinned memory:
# 1. Lower num_workers or prefetch_factor
# 2. Reduce batch_size
# 3. Disable pin_memory (last resort)
```

Memory Considerations

Monitoring Usage

```python
# torch.cuda.memory_stats() tracks device memory, not pinned host memory,
# so monitor host RAM directly instead:
import psutil

vm = psutil.virtual_memory()
print(f"Available RAM: {vm.available / 1e9:.2f} GB")
```

Best Practices

```python
# Good: Pin memory in DataLoader
loader = DataLoader(dataset, pin_memory=True)

# Bad: Manually pinning large datasets
big_data = torch.randn(1000000, 1000).pin_memory()  # Don't do this!

# Good: Pin only what's needed for transfer
batch = next(iter(loader))  # Already pinned
batch = {k: v.to('cuda', non_blocking=True) for k, v in batch.items()}
```

Implementation Details

CUDA’s cudaHostAlloc

Under the hood, PyTorch uses CUDA’s pinned memory allocator:

```cpp
// PyTorch C++ implementation (simplified)
void* pin_memory(size_t size) {
    void* ptr;
    // cudaHostAllocDefault: standard page-locked memory
    // cudaHostAllocMapped: also maps to GPU address space
    cudaError_t err = cudaHostAlloc(&ptr, size, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        return nullptr;
    }
    return ptr;
}
```

Memory Pool

PyTorch uses a pinned memory pool to avoid allocation overhead:

```python
# First pin: allocates from CUDA
t1 = torch.randn(1000).pin_memory()

# Second pin: may reuse pooled memory
t2 = torch.randn(1000).pin_memory()
```

Should You Use pin_memory?

Profiling Your Pipeline

torch.profiler

```python
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log'),
) as prof:
    for step, batch in enumerate(train_loader):
        images = batch['images'].to('cuda', non_blocking=True)
        labels = batch['labels'].to('cuda', non_blocking=True)
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        prof.step()
```

Look for cudaMemcpyAsync (pinned) vs cudaMemcpy (pageable) in the trace.
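You can also scan the profiler's aggregated events programmatically. `copy_ops` is a hypothetical helper (exact event names vary by PyTorch/CUDA version, but pinned copies typically appear as "Memcpy HtoD (Pinned -> Device)"):

```python
def copy_ops(prof):
    # Collect memcpy events from a torch.profiler run via key_averages()
    return [e.key for e in prof.key_averages() if "memcpy" in e.key.lower()]
```

Call it on `prof` after the profiling run; if the listed copies are pageable rather than pinned, `pin_memory` is not taking effect.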

nvidia-smi

```shell
# Monitor GPU util during training
nvidia-smi dmon -s u -d 1

# If GPU util drops to 0% between batches, data loading is the bottleneck
```

Complete Example

```python
import torch
from torch.utils.data import DataLoader

# Optimal configuration for GPU training
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    num_workers=4,
    pin_memory=True,          # Enable DMA transfers
    persistent_workers=True,  # Keep workers alive
    prefetch_factor=2         # Prefetch 2 batches per worker
)

# Training loop with non-blocking transfer
for batch in train_loader:
    # Non-blocking transfer (returns immediately)
    images = batch['images'].to('cuda', non_blocking=True)
    labels = batch['labels'].to('cuda', non_blocking=True)

    # Operations queued on the same CUDA stream see the completed copy
    optimizer.zero_grad()
    outputs = model(images)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
```

Key Takeaways

  1. Pinned memory eliminates the staging copy — DMA transfers directly from RAM to GPU, saving one memcpy per batch.

  2. The real win is overlap — with non_blocking=True, CPU prepares the next batch while DMA transfers the current one.

  3. Almost always enable it — pin_memory=True in DataLoader is free performance unless RAM is critically tight.

  4. Watch for memory pressure — pinned pages can’t be swapped. Monitor with free -h if you see unexpected swap usage.

  5. Profile before and after — use torch.profiler to verify transfers changed from cudaMemcpy to cudaMemcpyAsync.
