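The pin_memory=True parameter in PyTorch's DataLoader speeds up CPU→GPU data transfers by placing batches in page-locked (pinned) memory. This seemingly simple flag can improve training throughput by as much as 30-50% in workloads bottlenecked on host-to-device transfer; the actual gain depends on batch size and on how much of each step is spent copying data.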
Understanding Memory Types
Pageable Memory (Default)
When you allocate memory in Python, the operating system can swap it to disk:
import numpy as np
import torch

# Standard Python/NumPy allocation uses pageable memory
data = np.zeros((1000, 1000))  # Can be swapped to disk

# PyTorch tensors also use pageable memory by default
tensor = torch.zeros(1000, 1000)  # Pageable
Characteristics:
- Can be swapped to disk by OS
- Virtual memory backed
- GPU cannot access directly
Pinned Memory (Page-Locked)
Pinned memory is “locked” in physical RAM — the OS cannot swap it:
# Explicitly pinned tensor
pinned_tensor = torch.zeros(1000, 1000).pin_memory()

# Or via DataLoader
loader = DataLoader(dataset, pin_memory=True)
Characteristics:
- Locked in physical RAM
- Cannot be swapped to disk
- GPU can access via DMA
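A quick way to tell the two memory types apart is Tensor.is_pinned(). The sketch below assumes a CUDA-enabled PyTorch build; the helper name describe is made up for illustration, and the pinning step is skipped when no GPU is present because pin_memory() needs the CUDA runtime.

```python
import torch


def describe(t: torch.Tensor) -> str:
    """Return 'pinned' or 'pageable' for a CPU tensor (illustrative helper)."""
    return "pinned" if t.is_pinned() else "pageable"


pageable = torch.zeros(1000, 1000)
print(describe(pageable))  # pageable

if torch.cuda.is_available():
    # pin_memory() copies the data into a new page-locked buffer;
    # the original tensor stays pageable.
    pinned = pageable.pin_memory()
    print(describe(pinned))
    print(describe(pageable))
```

Note that pin_memory() returns a copy, so pinning a tensor temporarily doubles its host memory footprint until the original is released.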
How DMA Transfers Work
Without pinned memory, the CPU must copy data to a temporary pinned staging buffer before DMA can transfer it to the GPU. This means two copies: RAM → staging buffer → GPU VRAM. The CPU is blocked during the staging copy.
With pinned memory, the DMA controller reads directly from the pinned allocation. One copy: RAM → GPU VRAM. The CPU is free to do other work (like preparing the next batch) while the transfer runs asynchronously.
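To see the difference between the two paths on your own hardware, you can time the same copy from pageable and from pinned memory. This is a rough measurement sketch, not a rigorous benchmark: the function name h2d_ms and the batch shape are assumptions chosen for illustration, and results vary with PCIe generation and system load.

```python
import time

import torch


def h2d_ms(src: torch.Tensor, repeats: int = 10) -> float:
    """Average wall-clock time (ms) of a host-to-device copy of `src`."""
    src.to("cuda")            # warm-up: creates the CUDA context
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(repeats):
        src.to("cuda", non_blocking=True)
    torch.cuda.synchronize()  # wait until every queued copy has finished
    return (time.perf_counter() - t0) * 1e3 / repeats


if torch.cuda.is_available():
    pageable = torch.randn(64, 3, 224, 224)  # roughly 38 MB of float32
    pinned = pageable.pin_memory()
    print(f"pageable: {h2d_ms(pageable):.2f} ms per copy")
    print(f"pinned:   {h2d_ms(pinned):.2f} ms per copy")
else:
    print("CUDA not available; nothing to measure")
```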
The Code Behind It
PyTorch’s Implementation
When pin_memory=True, the DataLoader calls pin_memory() on each batch:
# Simplified DataLoader logic
def fetch_batch(self, samples):
    batch = self.collate_fn(samples)
    if self.pin_memory:
        batch = pin_memory(batch)
    return batch

def pin_memory(data):
    # Recurse through containers and pin every tensor found
    if isinstance(data, torch.Tensor):
        return data.pin_memory()
    elif isinstance(data, dict):
        return {k: pin_memory(v) for k, v in data.items()}
    elif isinstance(data, (list, tuple)):
        return type(data)(pin_memory(v) for v in data)
    return data  # anything else is passed through unchanged
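The recursion above only covers tensors, dicts, lists, and tuples. If your collate_fn returns a custom object, the DataLoader documentation describes defining a pin_memory() method on that type so the loader can pin it. A minimal sketch follows; the class name PairBatch and the toy dataset are assumptions for illustration, and pinning is only requested when CUDA is available.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


class PairBatch:
    """Hypothetical custom batch type holding stacked inputs and targets."""

    def __init__(self, samples):
        inputs, targets = zip(*samples)
        self.inputs = torch.stack(inputs)
        self.targets = torch.stack(targets)

    def pin_memory(self):
        # Called by the DataLoader when pin_memory=True
        self.inputs = self.inputs.pin_memory()
        self.targets = self.targets.pin_memory()
        return self


dataset = TensorDataset(torch.randn(32, 8), torch.randint(0, 2, (32,)))
use_pinning = torch.cuda.is_available()
loader = DataLoader(dataset, batch_size=4, collate_fn=PairBatch, pin_memory=use_pinning)

batch = next(iter(loader))
print(type(batch).__name__, tuple(batch.inputs.shape), batch.inputs.is_pinned())
```

Without the pin_memory() method, a custom batch object would be returned unpinned and the later non_blocking transfer would silently fall back to the staged path.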
The Pinning Cost
Calling .pin_memory() is not free. It allocates through cudaHostAlloc(), which:
- Allocates page-aligned memory
- Locks the pages so the OS cannot swap them (the same effect as mlock())
- Registers the memory with the CUDA driver for DMA access
This takes 10-100µs per call, much slower than a regular malloc. That’s why PyTorch’s DataLoader pins memory in a dedicated thread (pin_memory_thread) rather than in the main training loop.
# PyTorch uses a separate thread for pinning:
def _pin_memory_loop(in_queue, out_queue, device_id):
    while True:
        batch = in_queue.get()
        batch = pin_memory(batch)  # 10-100us, hidden behind data loading
        out_queue.put(batch)
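The queue hand-off can be tried without a GPU. The sketch below mirrors the loop's structure using only the standard library; fake_pin is a hypothetical stand-in for the real pinning call and the _STOP sentinel is an assumption added so the thread can exit cleanly. It is not PyTorch's actual implementation.

```python
import queue
import threading

_STOP = object()  # sentinel telling the thread to shut down


def fake_pin(batch):
    """Hypothetical stand-in for pin_memory(batch): tags the batch instead of pinning it."""
    return {"pinned": True, "data": batch}


def pin_loop(in_queue, out_queue, pin_fn):
    """Take batches from in_queue, 'pin' them, and pass them on in order."""
    while True:
        batch = in_queue.get()
        if batch is _STOP:
            out_queue.put(_STOP)
            return
        out_queue.put(pin_fn(batch))


in_q, out_q = queue.Queue(), queue.Queue()
worker = threading.Thread(target=pin_loop, args=(in_q, out_q, fake_pin), daemon=True)
worker.start()

# The "data loading" side: enqueue three batches, then the stop signal
for i in range(3):
    in_q.put([i, i + 1])
in_q.put(_STOP)

# The "training loop" side: consume batches as they become ready
results = []
while (item := out_q.get()) is not _STOP:
    results.append(item)
worker.join()
print(results)
```

Because the consumer only ever blocks on out_q.get(), the cost of pin_fn is paid on the background thread rather than in the loop that drives the GPU.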
Async Transfer with Pinned Memory
Pinned memory enables truly asynchronous transfers:
# Create CUDA stream
stream = torch.cuda.Stream()

# Synchronous (blocks until complete)
data_gpu = data_cpu.to('cuda')

# Asynchronous with pinned memory
data_pinned = data_cpu.pin_memory()
with torch.cuda.stream(stream):
    data_gpu = data_pinned.to('cuda', non_blocking=True)
# CPU continues immediately, transfer happens in background

# Before using data_gpu on the default stream, order it after the copy
torch.cuda.current_stream().wait_stream(stream)
Performance: When Pinning Helps
The throughput impact of pin_memory depends on batch size, GPU compute time, and whether data loading overlaps with compute.
The Overlap Principle
The real power of pinned memory is overlapping transfers with computation:
1. Batch N is on the GPU, and the model is computing on it
2. Batch N+1 is transferring via DMA (the CPU is free)
3. Batch N+2 is being loaded from disk by worker processes
All three happen simultaneously. Without pinned memory, step 2 blocks the main process during the staged copy, so it cannot queue more GPU work or pick up the next batch from the workers until the copy finishes.
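One common way to get this overlap explicitly is a small prefetching wrapper that starts the copy of the next batch on a side stream before handing out the current one. The sketch below is an illustration under stated assumptions, not part of the DataLoader API: CudaPrefetcher is a hypothetical name, each batch is assumed to be a single CPU tensor, and the wrapper falls back to plain iteration when CUDA is unavailable.

```python
import torch


class CudaPrefetcher:
    """Yield batches on the GPU, starting the copy of batch N+1 before batch N is used."""

    def __init__(self, loader):
        self.loader = loader
        self.enabled = torch.cuda.is_available()
        self.copy_stream = torch.cuda.Stream() if self.enabled else None

    def _start_copy(self, batch):
        if not self.enabled:
            return batch
        with torch.cuda.stream(self.copy_stream):
            # Only truly asynchronous when `batch` lives in pinned memory
            return batch.to("cuda", non_blocking=True)

    def _make_ready(self, batch):
        if self.enabled:
            compute_stream = torch.cuda.current_stream()
            compute_stream.wait_stream(self.copy_stream)  # run compute after the copy
            batch.record_stream(compute_stream)           # keep the memory alive for compute
        return batch

    def __iter__(self):
        it = iter(self.loader)
        try:
            pending = self._start_copy(next(it))
        except StopIteration:
            return
        for cpu_batch in it:
            ready = self._make_ready(pending)
            pending = self._start_copy(cpu_batch)  # overlaps with work on `ready`
            yield ready
        yield self._make_ready(pending)


batches = [torch.full((2, 2), float(i)) for i in range(3)]
sums = [float(b.sum()) for b in CudaPrefetcher(batches)]
print(sums)  # [0.0, 4.0, 8.0]
```

In a real pipeline the wrapped iterable would be a DataLoader created with pin_memory=True, so that the copies issued on the side stream can run concurrently with the model's kernels.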
Interaction with DataLoader Settings
num_workers
pin_memory and num_workers are complementary:
- num_workers > 0: data loading runs in separate processes
- pin_memory=True: GPU transfer uses DMA instead of a staged copy
The combination is powerful: workers prepare data in parallel, and pinned memory transfers it without blocking.
DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,            # parallel data loading
    pin_memory=True,          # fast GPU transfer
    persistent_workers=True,  # keep workers alive between epochs
    prefetch_factor=2,        # each worker prefetches 2 batches
)
prefetch_factor
prefetch_factor controls how many batches each worker prepares ahead. With pin_memory=True, pinned batches sit in RAM waiting for transfer.
# Memory impact:
# pinned_memory = num_workers * prefetch_factor * batch_memory
# Example: 4 workers * 2 prefetch * 50MB batch = 400MB pinned
If RAM is limited, reduce prefetch_factor before disabling pin_memory.
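The estimate is simple enough to compute before launching a job. The helpers below are a sketch with made-up names (batch_size_mb, pinned_budget_mb); they apply the formula from this section as a rough upper bound and assume float32 data unless told otherwise.

```python
def batch_size_mb(batch_size: int, sample_shape: tuple, bytes_per_element: int = 4) -> float:
    """Size of one batch in MB (10**6 bytes), assuming a dense tensor per sample."""
    elements = batch_size
    for dim in sample_shape:
        elements *= dim
    return elements * bytes_per_element / 1e6


def pinned_budget_mb(num_workers: int, prefetch_factor: int, batch_mb: float) -> float:
    """Rough upper bound: num_workers * prefetch_factor * batch_memory."""
    return num_workers * prefetch_factor * batch_mb


# The example from the text: 4 workers, prefetch_factor=2, 50 MB batches
print(pinned_budget_mb(4, 2, 50))  # 400

# A batch of 64 RGB images at 224x224 in float32
image_batch = batch_size_mb(64, (3, 224, 224))
print(f"{image_batch:.1f} MB per batch")                        # 38.5 MB per batch
print(f"{pinned_budget_mb(4, 2, image_batch):.1f} MB in flight")  # 308.3 MB in flight
```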
NUMA Considerations
On multi-socket servers, pinned memory is allocated on the NUMA node of the calling thread. If the GPU is on a different socket, DMA crosses the inter-socket link, adding latency.
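# Check NUMA topology (CPU and memory nodes)
numactl --hardware
# Check which NUMA node each GPU is attached to
nvidia-smi topo -m
# Example layout:
# node 0: CPUs 0-31, GPU 0-3
# node 1: CPUs 32-63, GPU 4-7

# Pin process to correct NUMA node
numactl --cpunodebind=0 --membind=0 python train.py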
This matters most for multi-GPU training. PyTorch’s DataLoader doesn’t manage NUMA affinity — use numactl or taskset at the process level.
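If you also want the worker processes to stay on the GPU's socket, one option is to set CPU affinity inside a DataLoader worker_init_fn. The sketch below is Linux-only and makes several assumptions: the helper names are hypothetical, the node number must come from your own topology check, and it reads the standard sysfs file /sys/devices/system/node/node<N>/cpulist. It restricts CPU placement only and does not bind memory, and pinning itself still happens in the main process, so numactl on the parent remains the primary tool.

```python
import os


def parse_cpulist(text: str) -> set[int]:
    """Parse a Linux cpulist string such as '0-3,8,10-11' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_of_node(node: int) -> set[int]:
    """Read the CPUs that belong to a NUMA node from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())


def make_worker_init_fn(node: int):
    """Build a worker_init_fn that restricts each DataLoader worker to one node's CPUs."""
    def init(worker_id: int) -> None:
        os.sched_setaffinity(0, cpus_of_node(node))  # 0 means the calling process
    return init


print(sorted(parse_cpulist("0-3,8,10-11")))  # [0, 1, 2, 3, 8, 10, 11]
```

It would be passed as DataLoader(..., worker_init_fn=make_worker_init_fn(0)) for GPUs attached to node 0.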
When pin_memory Hurts
Memory Pressure
Pinned memory cannot be swapped. On a 32 GB system:
Pinned: 4 workers * 2 prefetch * 128 MB = 1 GB locked
Model + optimizer: ~4 GB
Python + libraries: ~2 GB
Available for OS: 32 - 7 = 25 GB (fine)
But on a 16 GB system, the same config leaves only 9 GB. If dataset preprocessing allocates temporary buffers, the OS starts swapping OTHER processes — worse than the staging copy pin_memory was avoiding.
Monitoring
# Watch for swap during training
watch -n 1 'free -h | grep Swap'

# If swap used > 0, reduce pinned memory:
# 1. Lower num_workers or prefetch_factor
# 2. Reduce batch_size
# 3. Disable pin_memory (last resort)
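The same check can be done from inside the training script by reading /proc/meminfo, for example once per epoch. This is a Linux-only sketch with hypothetical helper names; the sample text stands in for the real file so the parsing can be tried anywhere.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}; most values are in kB."""
    info: dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def swap_used_kb(info: dict[str, int]) -> int:
    """Swap currently in use, in kB."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


# Sample text in the format of /proc/meminfo (values are made up)
sample = (
    "MemTotal:       32768000 kB\n"
    "MemAvailable:   20000000 kB\n"
    "SwapTotal:       2097152 kB\n"
    "SwapFree:        2000000 kB\n"
)
print(swap_used_kb(parse_meminfo(sample)))  # 97152
```

On a real system, replace sample with the contents of open("/proc/meminfo").read() and log a warning when the value grows during training.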
Memory Considerations
Monitoring Usage
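# Check whether a tensor is pinned and how much host memory it locks
import torch

t = torch.zeros(1000, 1000).pin_memory()
print(t.is_pinned())  # True
print(f"Pinned: {t.numel() * t.element_size() / 1e9:.3f} GB")

# Note: torch.cuda.memory_stats() reports device (GPU) allocator statistics
# and has no 'pinned_memory_allocated' key.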
Best Practices
# Good: Pin memory in DataLoader
loader = DataLoader(dataset, pin_memory=True)

# Bad: Manually pinning large datasets
big_data = torch.randn(1000000, 1000).pin_memory()  # Don't do this!

# Good: Pin only what's needed for transfer
batch = next(iter(loader))  # Already pinned
batch = {k: v.to('cuda', non_blocking=True) for k, v in batch.items()}
Implementation Details
CUDA’s cudaHostAlloc
Under the hood, PyTorch uses CUDA’s pinned memory allocator:
// PyTorch C++ implementation (simplified)
void* pin_memory(size_t size) {
    void* ptr = nullptr;
    // cudaHostAllocDefault: standard page-locked memory
    // cudaHostAllocMapped: also maps to GPU address space
    cudaError_t err = cudaHostAlloc(&ptr, size, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        return nullptr;  // allocation failed, e.g. not enough lockable RAM
    }
    return ptr;
}
Memory Pool
PyTorch uses a pinned memory pool to avoid allocation overhead:
# First pin: allocates from CUDA
t1 = torch.randn(1000).pin_memory()

# Second pin: may reuse pooled memory
t2 = torch.randn(1000).pin_memory()
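When you control the transfer yourself, a related technique (separate from the allocator's pool) is to allocate one pinned staging buffer up front and copy each batch into it, so no pinned allocation happens per step. The sketch below assumes fixed-size batches; the names staging and upload are hypothetical, and it synchronizes after each copy so the buffer can be safely overwritten, which trades away overlap for simplicity.

```python
import torch

use_cuda = torch.cuda.is_available()

# One page-locked buffer, allocated once and reused for every batch
staging = torch.empty(64, 1000, pin_memory=use_cuda)


def upload(batch_cpu: torch.Tensor) -> torch.Tensor:
    """Copy a (64, 1000) CPU batch to the GPU via the reusable staging buffer."""
    staging.copy_(batch_cpu)  # pageable -> pinned, a plain host-side copy
    if not use_cuda:
        return staging.clone()  # CPU fallback so the sketch runs anywhere
    gpu = staging.to("cuda", non_blocking=True)
    # Wait for the DMA transfer before the staging buffer is reused
    torch.cuda.current_stream().synchronize()
    return gpu


x = torch.randn(64, 1000)
y = upload(x)
print(torch.equal(y.cpu(), x))  # True
```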
Should You Use pin_memory?
Profiling Your Pipeline
torch.profiler
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log'),
) as prof:
    for step, batch in enumerate(train_loader):
        images = batch['images'].to('cuda', non_blocking=True)
        labels = batch['labels'].to('cuda', non_blocking=True)
        optimizer.zero_grad()  # clear gradients from the previous step
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        prof.step()  # advance the profiler schedule
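In the trace, compare the host-to-device copies. Copies from pinned memory appear as cudaMemcpyAsync calls that return immediately and overlap with kernels, while copies from pageable memory keep the host waiting until the staged copy is done. Trace viewers typically label the copy's source as Pinned or Pageable.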
nvidia-smi
# Monitor GPU util during training
nvidia-smi dmon -s u -d 1

# If GPU util drops to 0% between batches, data loading is the bottleneck
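A complementary check from inside the script is to measure how much wall time the loop spends waiting for the next batch. The sketch below uses only the standard library; data_wait_fraction and slow_loader are hypothetical names, and the demo uses sleeps in place of real loading and training. With a real model, step_fn should end with something that synchronizes (for example reading loss.item()), because GPU work is queued asynchronously and would otherwise be attributed to the wrong phase.

```python
import time


def data_wait_fraction(loader, step_fn, max_steps: int = 50) -> float:
    """Fraction of wall time spent blocked on `next(loader)` over up to max_steps steps."""
    wait = 0.0
    steps = 0
    it = iter(loader)
    start = time.perf_counter()
    while steps < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            break
        wait += time.perf_counter() - t0
        step_fn(batch)
        steps += 1
    total = time.perf_counter() - start
    return wait / total if total > 0 else 0.0


def slow_loader(n: int, delay: float):
    """Stand-in for a data loader that takes `delay` seconds per batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


# Loading takes 20 ms per batch, the "training step" only 5 ms
frac = data_wait_fraction(slow_loader(5, 0.02), lambda batch: time.sleep(0.005))
print(f"{frac:.0%} of wall time spent waiting for data")
```

A high fraction points at data loading (more workers, faster storage, lighter preprocessing); a low one means the GPU side dominates and pin_memory alone is unlikely to change much.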
Complete Example
import torch
from torch.utils.data import DataLoader

# Optimal configuration for GPU training
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    num_workers=4,
    pin_memory=True,          # Enable DMA transfers
    persistent_workers=True,  # Keep workers alive
    prefetch_factor=2,        # Prefetch 2 batches per worker
)

# Training loop with non-blocking transfer
for batch in train_loader:
    # Non-blocking transfer (returns immediately)
    images = batch['images'].to('cuda', non_blocking=True)
    labels = batch['labels'].to('cuda', non_blocking=True)

    # Kernels queued on the same stream run after the copies complete
    optimizer.zero_grad()
    outputs = model(images)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
