
Anatomy of a GPU Crash: Understanding Xid 31 MMU Faults

Deep dive into NVIDIA GPU Xid 31 MMU faults: how GPU virtual memory works, what causes page table walk failures, and how we eliminated 28 daily crashes in a production video pipeline processing 7,000+ videos.

Abhik Sarkar
25 min read · nvidia, gpu, cuda, mmu, +11

Your GPU inference pipeline is processing thousands of videos per day. Then you see this in journalctl:

NVRM: Xid (PCI:0000:c1:00): 31, pid=2646416, name=python, channel 0x0000001a, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC10 GPCCLIENT_T1_5 faulted @ 0x725f_fb800000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

The CUDA context is dead. Your pipeline resets. And you have no idea why.

This article decodes every field in that error message, explains the GPU virtual memory system that produced it, and walks through the multi-allocator race condition that causes these faults in production video pipelines. We went from 28 crashes per day to zero — and this is exactly how.

The Error Message

Every field in an Xid 31 error carries diagnostic information; the challenge is knowing how to read it. The sections that follow decode each field and show how to use it for debugging.

Part 1: The GPU’s Virtual Memory System

Why GPUs Have Virtual Memory

Modern NVIDIA GPUs (Pascal and later) implement a full hardware Memory Management Unit (MMU), similar to a CPU’s MMU. This unified memory architecture means every CUDA context gets its own 49-bit virtual address space — 512 TB of addressable memory. This is more than enough to cover all physical GPU memory plus all system memory combined.

When a CUDA kernel, TensorRT engine, or NVDEC decoder accesses memory, it doesn’t use physical addresses directly. It uses virtual addresses. The GPU MMU translates these virtual addresses to physical DRAM addresses on every memory access, just like a CPU.

The Page Table Walk

The GPU MMU uses a multi-level page table to translate virtual addresses. On Pascal+ GPUs, this is a 5-level hierarchy. Each level is a 4 KB table with 512 entries. Each entry (PDE — Page Directory Entry) either points to the next level of the hierarchy, maps a large page directly (2 MB or larger), or is empty/invalid — meaning no mapping exists for that address range.

At the bottom level, a PTE (Page Table Entry) maps a 4 KB or 64 KB page of virtual address space to a physical page in GPU DRAM.
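To make the walk concrete, here is a small Python sketch that splits a 49-bit virtual address into per-level table indices. The level widths are an assumption for illustration (a 12-bit offset for 4 KB pages, four 9-bit levels for the 512-entry tables, and a 1-bit top level); NVIDIA's actual page-table layout varies by architecture and page size.

```python
# Illustrative only: decompose a 49-bit GPU virtual address into the
# per-level indices an MMU would use during a page table walk.
PAGE_OFFSET_BITS = 12            # 4 KB pages
LEVEL_BITS = [1, 9, 9, 9, 9]     # assumed layout: top level ... PTE level

def split_virtual_address(va: int) -> dict:
    """Return the table index for each level plus the in-page offset."""
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)
    indices = []
    shift = PAGE_OFFSET_BITS
    for bits in reversed(LEVEL_BITS):       # walk from the PTE level upward
        indices.append((va >> shift) & ((1 << bits) - 1))
        shift += bits
    indices.reverse()                       # report top level first
    return {"indices": indices, "offset": offset}

walk = split_virtual_address(0x725F_FB80_0000)
# An empty entry at any of the first four levels -> FAULT_PDE;
# an empty entry at the final (PTE) level -> FAULT_PTE.
```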

FAULT_PDE vs FAULT_PTE

When the Xid error says FAULT_PDE, it means the MMU walked the page table hierarchy and found an empty Page Directory Entry — one of the intermediate levels had no mapping. The GPU literally cannot translate the virtual address because the page table entry that should point to the next level doesn’t exist.

This is different from FAULT_PTE, which means the MMU made it all the way to the final level but found no physical page mapping. Both are fatal, but FAULT_PDE typically indicates a larger region of memory was unmapped (at least 2 MB at once), while FAULT_PTE could be a single 4 KB page.

Key Diagnostic Insight

FAULT_PDE means wholesale deallocation — an entire memory region had its page directory entry removed. This points to an allocator freeing a large block, not a single-page corruption. Look for pool teardown operations like free_all_blocks() or empty_cache().

Part 2: Decoding the Fault Address

The 49-Bit Virtual Address Space

CUDA’s 49-bit virtual address space (0x0 through 0x1_FFFF_FFFF_FFFF) is divided into regions by the CUDA driver and GPU memory allocators. Low addresses are used for driver internals and small allocations. High addresses are where the action — and the danger — lies: large tensors, TensorRT workspace, NVDEC surfaces, and CuPy pool blocks all compete for space.

Our fault address is 0x725f_fb800000. The upper bits (0x725f) place this roughly 22% of the way into the 49-bit address space. The alignment (0x800000 = 8 MB) matches large GPU allocations. This is firmly in the contested high-address zone where multiple allocators compete.
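Both observations, position in the address space and alignment, can be checked mechanically with a small helper (plain Python, no CUDA required):

```python
# Quick triage helper for a faulting address, based on the heuristics
# in this article: how far into the 49-bit space it falls, and its
# largest power-of-two alignment (large pool blocks tend to be MB-aligned).
def analyze_fault_address(va: int) -> dict:
    space = 1 << 49                      # 512 TB virtual address space
    alignment = va & -va if va else space  # lowest set bit = alignment
    return {
        "fraction_of_space": va / space,
        "alignment_mb": alignment / (1 << 20),
    }

info = analyze_fault_address(0x725F_FB80_0000)
# fraction_of_space is about 0.22 -> zone shared by the big allocators
# alignment_mb == 8.0 -> 8 MB aligned, consistent with large pool blocks
```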

Who Lives at These High Addresses?

In a multi-component GPU pipeline, several allocators carve out regions of the virtual address space independently:

| Allocator | What It Allocates |
| --- | --- |
| CUDA driver internals | Page tables, channel descriptors |
| PyTorch caching allocator | Tensors, model weights, intermediates |
| CuPy memory pool | Kernel outputs, NV12→RGB results |
| TensorRT workspace | Engine workspace, activation memory |
| NVDEC mapped surfaces | Decoded video frame DPB surfaces |

The address 0x725f_fb800000 is consistent with a large allocation from PyTorch, CuPy, or TensorRT — the kind of memory that gets allocated and freed repeatedly during video inference.

Part 3: Who Faulted and Why

ENGINE GRAPHICS GPC10 GPCCLIENT_T1_5

This identifies exactly which hardware unit on the GPU tried to access the invalid address. ENGINE GRAPHICS means a CUDA kernel was executing — not a memory copy or NVDEC decode operation. GPC10 is Graphics Processing Cluster #10, one of many parallel compute clusters on the die. GPCCLIENT_T1_5 is Texture Unit Level 1, pipe 5 — the hardware unit that handles memory loads for CUDA kernels.

When a CUDA kernel reads from global memory (e.g., reading a tensor element), the request goes through the T1 texture path. This tells us a compute kernel was trying to read data that used to exist but no longer does.

ACCESS_TYPE_VIRT_READ

VIRT_READ means a compute kernel tried to read from the faulting address. This is distinct from VIRT_WRITE, which would indicate a DMA engine trying to write decoded pixels into a surface. VIRT_READ faults are the hallmark of use-after-free: the data existed once, the kernel has a valid pointer, but the backing pages were removed between pointer capture and access.

The Timeline of a Use-After-Free Fault

  1. Allocator creates memory at 0x725f... → PDE chain created, address is valid.
  2. Tensor/buffer lives at this address → kernels can read/write successfully.
  3. Allocator frees the memory (pool teardown) → PDE chain torn down, address is invalid.
  4. Another kernel still holds the pointer → issues load instruction → MMU walks page table → finds empty PDE → Xid 31.
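The timeline above can be modeled as a toy page table in plain Python (not real CUDA; here a "PDE" is just a dictionary entry covering a 2 MB region):

```python
# Toy model of the four-step use-after-free timeline. An allocator maps
# and unmaps 2 MB regions; a "kernel" that kept a raw pointer faults on
# its next read once the mapping is gone.
REGION = 2 * 1024 * 1024  # one PDE covers 2 MB in this model

class ToyMMU:
    def __init__(self):
        self.pde = {}                       # region index -> backing store

    def map_region(self, va):
        self.pde[va // REGION] = bytearray(REGION)

    def unmap_region(self, va):
        del self.pde[va // REGION]          # PDE chain torn down

    def read(self, va):
        region = self.pde.get(va // REGION)
        if region is None:                  # page walk hit an empty PDE
            raise RuntimeError(f"Xid 31: FAULT_PDE @ {va:#x}")
        return region[va % REGION]

mmu = ToyMMU()
ptr = 0x725F_FB80_0000
mmu.map_region(ptr)        # step 1: allocation, PDE chain created
mmu.read(ptr)              # step 2: kernel reads successfully
mmu.unmap_region(ptr)      # step 3: pool teardown removes the PDE
try:
    mmu.read(ptr)          # step 4: stale pointer, page walk fails
except RuntimeError as e:
    print(e)
```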

Part 4: Why This Happens in GPU Video Pipelines

The Multi-Allocator Problem

A GPU video inference pipeline has at least four independent memory allocators sharing one 49-bit virtual address space:

  1. PyTorch caching allocator — manages tensor memory with lazy deallocation
  2. CuPy memory pool — manages CuPy array memory separately from PyTorch
  3. TensorRT workspace — pre-allocated scratch space for inference kernels
  4. NVDEC DPB surfaces — hardware-managed decoded frame buffers

Each allocator has its own view of what memory is “in use” vs “free.” The danger arises when allocators share memory across boundaries without coordinating lifecycle — especially through DLPack zero-copy transfers.

The DLPack Zero-Copy Trap

DLPack enables zero-copy sharing between frameworks — torch.from_dlpack(cupy_array) creates a PyTorch tensor that points to CuPy-owned memory without copying. This is fast but dangerous:

  • CuPy’s memory pool tracks the allocation
  • PyTorch’s tensor holds a pointer to the same physical memory
  • If CuPy’s pool releases the block (via free_all_blocks()), the PDE is torn down
  • PyTorch’s tensor now points to invalid virtual memory
  • Next kernel that reads the tensor → FAULT_PDE → Xid 31
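A minimal sketch of this failure pattern, assuming a machine with CuPy, PyTorch, and a GPU. Whether the pool can actually release the block at that moment depends on DLPack deleter timing, which is exactly the race described above:

```python
# Sketch only; requires CuPy + PyTorch + a GPU. Do not run in production.
import cupy as cp
import torch

rgb = cp.zeros((1080, 1920, 3), dtype=cp.uint8)  # CuPy pool owns this block
tensor = torch.from_dlpack(rgb)                  # zero-copy PyTorch view

del rgb                                          # CuPy-side reference drops
cp.get_default_memory_pool().free_all_blocks()   # pool teardown may unmap pages
# `tensor` still holds the raw device pointer; the next kernel that
# touches it can walk into an empty PDE -> Xid 31.

# The safe pattern: pay for one copy at the allocator boundary.
safe = torch.from_dlpack(cp.zeros((8,), dtype=cp.float32)).clone()
```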

NVDEC Surface Mapping Lifecycle

NVDEC adds another layer of complexity. The cuvidMapVideoFrame() API maps decoded frames from the hardware DPB (Decoded Picture Buffer) into the CUDA virtual address space. The critical constraint: at most ulNumOutputSurfaces frames can be mapped simultaneously. If you hold too many surfaces mapped (via DLPack views, slow processing, etc.), cuvidMapVideoFrame() fails on the next frame. If you unmap too eagerly (via garbage collection of DLPack capsules), in-flight CUDA kernels lose their source data.

The DLPack Trap in Practice

DLPack zero-copy is not free. Every zero-copy handoff creates a shared-ownership problem. If either side frees the memory, the other gets an Xid 31. In production pipelines, always clone tensors across allocator boundaries.

Part 5: How We Fixed It

We encountered three distinct Xid 31 variants over two days of debugging:

| Day | Count | Type | Address Range | Root Cause |
| --- | --- | --- | --- | --- |
| Day 0 | 28 | VIRT_WRITE | Various | NVDEC DPB exhaustion — too many RGB surfaces (3 B/px) held simultaneously |
| Day 1 | 2 | VIRT_READ | High (0x7b77, 0x0_48) | CuPy kernel reading directly from NVDEC surface — cross-engine hazard |
| Day 1 | 3 | VIRT_READ | High (0x7f91, 0x7f3c, 0x7e64) | CuPy pool teardown racing with DLPack-backed PyTorch tensors |
| Day 2 | 1 | VIRT_READ | High (0x725f) | Residual allocator race — same mechanism as Day 1 batch 3 |

Fix 1: NV12 Output Mode (28 → 2 Xid/day)

Switched NVDEC from RGB output (3 bytes/pixel) to NV12 native output (1.5 bytes/pixel). Halving the per-surface memory footprint lets more frames fit within the DPB limit simultaneously, which eliminated the VIRT_WRITE faults caused by DPB exhaustion.

Fix 2: Two-Phase Clone (2 → 3 Xid/day, different type)

Cloned NV12 surfaces out of NVDEC DPB memory before running the CuPy conversion kernel. This eliminated the cross-engine hazard where a GPC compute kernel read directly from NVDEC-mapped surfaces. The crash count slightly increased (3 vs 2) because a different race surfaced: CuPy free_all_blocks() during RGB conversion.

Fix 3: Clone RGB Output + Remove free_all_blocks() (3 → 0 Xid/day)

Two changes that together eliminated the allocator race:

  1. Clone CuPy RGB output into PyTorch memory — the final tensor lives in PyTorch’s caching allocator, completely decoupled from CuPy’s pool
  2. Remove free_all_blocks() from chunk cleanup — CuPy’s pool manages reuse internally; aggressive teardown was racing with pending DLPack deleters
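A sketch of what the fixed handoff looks like. Here convert_nv12_to_rgb() is a hypothetical stand-in for the pipeline's CuPy conversion kernel, which this article does not show:

```python
# Sketch only; requires CuPy + PyTorch + a GPU.
import cupy as cp
import torch

def nv12_to_torch_rgb(nv12_surface: cp.ndarray) -> torch.Tensor:
    rgb = convert_nv12_to_rgb(nv12_surface)  # hypothetical CuPy kernel; pool-owned output
    # Fix 3a: clone into PyTorch's caching allocator before returning, so
    # the tensor's lifetime no longer depends on CuPy's pool.
    return torch.from_dlpack(rgb).clone()

def cleanup_after_chunk():
    # Fix 3b: no cp.get_default_memory_pool().free_all_blocks() here.
    # The pool reuses blocks internally; forced teardown raced with
    # pending DLPack deleters.
    pass
```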

Results

After all three fixes: 0 Xid 31 faults across 7,000+ videos processed, sustaining ~630 videos/minute on 4 inference workers. The complete data path is now fully isolated — NVDEC surfaces are cloned into PyTorch-owned memory immediately after decode, and no allocator can tear down pages that another framework references.

Quick Reference for Xid 31 Diagnosis

Error Message Field Reference

| Field | Meaning |
| --- | --- |
| PCI:0000:c1:00 | GPU PCI bus address |
| 31 | Xid error code — MMU fault |
| pid=2646416 | Linux process ID that triggered the fault |
| channel 0x1a | GPU hardware channel (execution queue) |
| ENGINE GRAPHICS | Fault from compute/graphics engine (not COPY or NVDEC) |
| GPC10 | Graphics Processing Cluster #10 |
| GPCCLIENT_T1_5 | Texture unit #5 within GPC10 (handles memory loads) |
| 0x725f_fb800000 | Virtual address that failed translation |
| FAULT_PDE | Page Directory Entry missing (large region unmapped) |
| ACCESS_TYPE_VIRT_READ | A kernel tried to read from the address |

Fault Type Comparison

| Fault | Level | Unmapped Region | Typical Cause |
| --- | --- | --- | --- |
| FAULT_PDE | Intermediate page directory | ≥ 2 MB | Large allocation freed, pool teardown |
| FAULT_PTE | Final page table | 4 KB page | Single page evicted or corrupted |

Access Type Comparison

| Access | Engine | Meaning |
| --- | --- | --- |
| VIRT_WRITE | Usually NVDEC DMA | Hardware engine writing to unmapped surface |
| VIRT_READ | Usually GPC/SM | Compute kernel reading from freed memory |

Address Range Heuristics

| Address Range | Likely Source |
| --- | --- |
| 0x0 – 0x1_00000000 | Driver internals, channel memory |
| 0x1_00000000 – 0x10_00000000 | Small CUDA allocations, constants |
| 0x10_00000000 – 0x7FFF_FFFFFFFF | Large allocations: tensors, TRT workspace, NVDEC, CuPy |


Abhik Sarkar

Machine Learning Consultant specializing in Computer Vision and Deep Learning. Leading ML teams and building innovative solutions.
