Pinned Memory and DMA Transfers

Understanding PyTorch pin_memory for faster CPU to GPU data transfers using DMA (Direct Memory Access) and page-locked memory.



The pin_memory=True parameter in PyTorch's DataLoader enables faster CPU→GPU data transfers by using page-locked (pinned) memory. This seemingly simple flag can improve training throughput by 30-50% in workloads that are bottlenecked by host-to-device data transfer.
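To check the effect on your own hardware, a quick micro-benchmark like the one below (a rough sketch; exact numbers depend on tensor size, PCIe generation, and driver) times raw host-to-device copies from pageable versus pinned memory:

import time
import torch

# Requires a CUDA-capable GPU
x_pageable = torch.randn(64, 3, 224, 224)     # typical image batch, pageable
x_pinned = x_pageable.clone().pin_memory()    # page-locked copy of the same data

def bench(t, iters=100):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        t.to('cuda', non_blocking=True)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3   # ms per copy

bench(x_pinned, iters=5)   # warm-up (CUDA context, allocator)
print(f"pageable: {bench(x_pageable):.2f} ms/copy")
print(f"pinned:   {bench(x_pinned):.2f} ms/copy")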

Pageable vs Pinned Memory Transfer

[Interactive visualization: with pin_memory=False, the tensor starts in standard pageable memory (which can be swapped to disk), is copied into a pinned staging buffer, and only then reaches GPU VRAM (~1.2ms per batch). With pin_memory=True, the tensor is allocated in page-locked memory and transferred directly to GPU VRAM via DMA (~0.8ms per batch).]

Without pin_memory
  • 2 memory copies (CPU → Staging → GPU)
  • CPU involved in first copy
  • Higher latency (~1.2ms per batch)
  • Less host memory used
With pin_memory
  • 1 memory copy (Direct DMA)
  • CPU not blocked during transfer
  • Lower latency (~0.8ms per batch)
  • More host memory used (page-locked)

Understanding Memory Types

Pageable Memory (Default)

When you allocate memory in Python, the operating system can swap it to disk:

# Standard Python/NumPy allocation uses pageable memory
data = np.zeros((1000, 1000))     # Can be swapped to disk

# PyTorch tensors also use pageable memory by default
tensor = torch.zeros(1000, 1000)  # Pageable

Characteristics:

  • Can be swapped to disk by OS
  • Virtual memory backed
  • GPU cannot access directly

Pinned Memory (Page-Locked)

Pinned memory is "locked" in physical RAM—the OS cannot swap it:

# Explicitly pinned tensor
pinned_tensor = torch.zeros(1000, 1000).pin_memory()

# Or via DataLoader
loader = DataLoader(dataset, pin_memory=True)

Characteristics:

  • Locked in physical RAM
  • Cannot be swapped to disk
  • GPU can access via DMA
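Beyond .pin_memory(), tensor factory functions accept a pin_memory flag, and .is_pinned() lets you verify what you got; a small sketch:

import torch

# Allocate directly into pinned memory
buf = torch.empty(1000, 1000, pin_memory=True)
print(buf.is_pinned())               # True

# .pin_memory() on a pageable tensor returns a pinned copy
t = torch.zeros(1000, 1000)
print(t.is_pinned())                 # False
print(t.pin_memory().is_pinned())    # True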

How DMA Works

Without Pinned Memory

CPU Process              OS Kernel                 GPU
     │                       │                       │
     │────────tensor────────►│                       │
     │  (check if swapped)   │                       │
     │                       │──copy to staging      │
     │                       │  (pinned buffer)      │
     │                       │─────────DMA──────────►│
     │◄─────────done─────────│◄──────────────────────│

The CPU must:

  1. Check if pages are in RAM (page fault if swapped)
  2. Copy data to a pinned staging buffer
  3. Wait for DMA to complete

With Pinned Memory

CPU Process           DMA Controller               GPU
     │                       │                       │
     │─────DMA request──────►│──────direct copy─────►│
     │                       │                       │
     │◄───────────────────done──────────────────────│

Benefits:

  1. No staging buffer needed
  2. CPU can continue other work
  3. DMA transfers asynchronously
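One way to see the asynchrony (a rough sketch; exact timings vary by machine) is to measure how long the .to() call takes to return, rather than how long the copy itself takes. With a pinned source and non_blocking=True the call returns almost immediately; with a pageable source the host stays involved in staging the data:

import time
import torch

x_pageable = torch.randn(256, 3, 224, 224)
x_pinned = x_pageable.clone().pin_memory()
x_pinned.to('cuda')                       # warm-up: CUDA context + device block

def call_latency(t):
    torch.cuda.synchronize()
    start = time.perf_counter()
    t.to('cuda', non_blocking=True)       # request an async copy
    return (time.perf_counter() - start) * 1e3   # ms until the call returns

print(f"pageable source: call returns after {call_latency(x_pageable):.2f} ms")
print(f"pinned source:   call returns after {call_latency(x_pinned):.2f} ms")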

The Code Behind It

PyTorch's Implementation

When pin_memory=True, the DataLoader calls pin_memory() on each batch:

# Simplified DataLoader logic
def fetch_batch():
    batch = collate_fn(samples)
    if self.pin_memory:
        batch = pin_memory(batch)
    return batch

def pin_memory(data):
    if isinstance(data, torch.Tensor):
        return data.pin_memory()
    elif isinstance(data, dict):
        return {k: pin_memory(v) for k, v in data.items()}
    elif isinstance(data, (list, tuple)):
        return type(data)(pin_memory(v) for v in data)
    return data

Async Transfer with Pinned Memory

Pinned memory enables truly asynchronous transfers:

# Create a CUDA stream
stream = torch.cuda.Stream()

# Synchronous (blocks until complete)
data_gpu = data_cpu.to('cuda')

# Asynchronous with pinned memory
data_pinned = data_cpu.pin_memory()
with torch.cuda.stream(stream):
    data_gpu = data_pinned.to('cuda', non_blocking=True)
    # CPU continues immediately; the transfer happens in the background
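One caveat the snippet above glosses over: before kernels on the default stream read data_gpu, they should wait for the copy stream. A minimal self-contained sketch:

import torch

stream = torch.cuda.Stream()
src = torch.randn(1024, 1024).pin_memory()

with torch.cuda.stream(stream):
    dst = src.to('cuda', non_blocking=True)   # copy issued on the side stream

# Order the default stream after the copy before using `dst`
torch.cuda.current_stream().wait_stream(stream)
result = dst.sum()                            # safe: runs after the copy completes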

Performance Impact

When It Helps

Scenario         Without pin_memory   With pin_memory   Improvement
Large batches    1.2 ms/batch         0.8 ms/batch      33%
High GPU util    85%                  95%               12%
Throughput       100 samples/s        140 samples/s     40%

When It Doesn't Help

  • Small tensors: DMA overhead dominates
  • CPU-bound training: GPU isn't waiting for data
  • Memory pressure: Pinning uses more RAM

Memory Considerations

The Trade-off

Pinned memory has a cost:

# Each pinned allocation reduces available system memory,
# because the OS cannot swap pinned pages.
#
# Example: 8 GB of pinned data on a 16 GB system
# leaves only 8 GB for everything else!
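As a rough back-of-envelope estimate (the real footprint depends on how many batches sit in the DataLoader's pin-memory queue and in your training loop; the numbers below are purely illustrative):

# Rough estimate of pinned host memory held by a DataLoader
batch_size = 64
sample_bytes = 3 * 224 * 224 * 4      # one float32 image
batches_in_flight = 4                 # assumption: prefetched + current batches

pinned_bytes = batch_size * sample_bytes * batches_in_flight
print(f"~{pinned_bytes / 1e6:.0f} MB of page-locked host memory")   # ~154 MB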

Monitoring Usage

# Check pinned memory usage
import torch
import psutil

proc = psutil.Process()
before = proc.memory_info().rss

t = torch.zeros(1000, 1000).pin_memory()   # ~4 MB, page-locked
print(f"Pinned tensor: {t.is_pinned()}")
print(f"RSS grew by ~{(proc.memory_info().rss - before) / 1e6:.1f} MB")

# Note: torch.cuda.memory_stats() reports device (GPU) memory, so it won't
# show pinned *host* memory; process RSS is only a rough proxy for it.

Best Practices

# Good: pin memory in the DataLoader
loader = DataLoader(dataset, pin_memory=True)

# Bad: manually pinning large datasets
big_data = torch.randn(1000000, 1000).pin_memory()  # Don't do this!

# Good: pin only what's needed for transfer
batch = next(iter(loader))  # Already pinned
batch = {k: v.to('cuda', non_blocking=True) for k, v in batch.items()}

Implementation Details

CUDA's cudaHostAlloc

Under the hood, PyTorch uses CUDA's pinned memory allocator:

// PyTorch C++ implementation (simplified)
void* pin_memory(size_t size) {
  void* ptr;
  // cudaHostAllocDefault: standard page-locked memory
  // cudaHostAllocMapped:  also maps into the GPU address space
  cudaError_t err = cudaHostAlloc(&ptr, size, cudaHostAllocDefault);
  // (error handling of `err` elided)
  return ptr;
}

Memory Pool

PyTorch uses a pinned memory pool to avoid allocation overhead:

# First pin: allocates fresh page-locked memory from CUDA
t1 = torch.randn(1000).pin_memory()

# Second pin: may reuse memory from the pinned-memory pool
t2 = torch.randn(1000).pin_memory()

When to Use pin_memory

Use It When:

  1. GPU training with DataLoader - Almost always beneficial
  2. Large batch transfers - More data = more benefit from DMA
  3. GPU-bound training - Keeps GPU fed with data
  4. Multi-GPU training - Each GPU needs fast data access

Skip It When:

  1. CPU-only training - No GPU transfers
  2. Very small batches - DMA overhead dominates
  3. Memory constrained - Need all RAM for other purposes
  4. Custom data pipeline - Not going through the DataLoader's pin-memory thread

Complete Example

import torch
from torch.utils.data import DataLoader

# Optimal configuration for GPU training
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    num_workers=4,
    pin_memory=True,          # Enable DMA transfers
    persistent_workers=True,  # Keep workers alive between epochs
    prefetch_factor=2,        # Prefetch 2 batches per worker
)

# Training loop with non-blocking transfer
for batch in train_loader:
    # Non-blocking transfer (returns immediately)
    images = batch['images'].to('cuda', non_blocking=True)
    labels = batch['labels'].to('cuda', non_blocking=True)

    # Kernels queued on the default stream run after the copies complete
    optimizer.zero_grad()
    outputs = model(images)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
