Your GPU inference pipeline is processing thousands of videos per day. Then you see this in journalctl:
NVRM: Xid (PCI:0000:c1:00): 31, pid=2646416, name=python, channel 0x0000001a, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC10 GPCCLIENT_T1_5 faulted @ 0x725f_fb800000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
The CUDA context is dead. Your pipeline resets. And you have no idea why.
This article decodes every field in that error message, explains the GPU virtual memory system that produced it, and walks through the multi-allocator race condition that causes these faults in production video pipelines. We went from 28 crashes per day to zero — and this is exactly how.
The Error Message
Every field in an Xid 31 error carries diagnostic information; the challenge is knowing how to read it. The sections below break down what each field means and how to use it for debugging.
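As a first step, the fields can be extracted mechanically from the journalctl line. A minimal sketch (the regex and the field names are my own, not an NVIDIA-defined schema):

```python
import re

# Hypothetical parser for the Xid 31 line shown above.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-f:]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>\S+), channel (?P<channel>0x[0-9a-f]+).*?"
    r"MMU Fault: ENGINE (?P<engine>\S+) (?P<gpc>\S+) (?P<client>\S+) "
    r"faulted @ (?P<addr>0x[0-9a-f_]+)\. "
    r"Fault is of type (?P<fault>\S+) ACCESS_TYPE_(?P<access>\S+)"
)

line = ("NVRM: Xid (PCI:0000:c1:00): 31, pid=2646416, name=python, "
        "channel 0x0000001a, intr 00000000. MMU Fault: ENGINE GRAPHICS "
        "GPC10 GPCCLIENT_T1_5 faulted @ 0x725f_fb800000. "
        "Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ")

fields = XID_RE.search(line).groupdict()
# fields["fault"] == "FAULT_PDE", fields["access"] == "VIRT_READ", etc.
```

Feeding every Xid line in journalctl through something like this makes it easy to aggregate faults by engine, address range, and access type — the three axes the rest of this article uses.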
Part 1: The GPU’s Virtual Memory System
Why GPUs Have Virtual Memory
Modern NVIDIA GPUs (Pascal and later) implement a full hardware Memory Management Unit (MMU), similar to a CPU’s MMU. Every CUDA context gets its own 49-bit virtual address space — 512 TB of addressable memory, more than enough to cover all physical GPU memory plus all system memory combined.
When a CUDA kernel, TensorRT engine, or NVDEC decoder accesses memory, it doesn’t use physical addresses directly. It uses virtual addresses. The GPU MMU translates these virtual addresses to physical DRAM addresses on every memory access, just like a CPU.
The Page Table Walk
The GPU MMU uses a multi-level page table to translate virtual addresses. On Pascal+ GPUs, this is a 5-level hierarchy. Each level is a 4 KB table with 512 entries. Each entry (PDE — Page Directory Entry) either points to the next level of the hierarchy, maps a large page directly (2 MB or larger), or is empty/invalid — meaning no mapping exists for that address range.
At the bottom level, a PTE (Page Table Entry) maps a 4 KB or 64 KB page of virtual address space to a physical page in GPU DRAM.
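To make the walk concrete, here is a toy breakdown of our fault address into per-level table indices. This assumes 4 KB pages (12-bit offset) and 512-entry tables (9 bits per level); the real per-generation layout differs (the top level, for instance, is smaller than 512 entries), so treat this as an illustration of the mechanism, not the actual hardware format:

```python
# Toy split of a 49-bit GPU virtual address into page-table indices,
# assuming 4 KB pages and 512-entry directories (layout is illustrative).
def split_va(va: int):
    offset = va & 0xFFF            # byte offset within a 4 KB page
    indices = []
    va >>= 12
    for _ in range(4):             # four 9-bit directory/table indices
        indices.append(va & 0x1FF)
        va >>= 9
    indices.append(va)             # whatever remains selects the top level
    return list(reversed(indices)), offset

idx, off = split_va(0x725f_fb800000)
# idx == [0, 228, 383, 476, 0], off == 0
```

A FAULT_PDE means one of the intermediate indices in this chain landed on an empty entry, so the walk never reached the final PTE level at all.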
FAULT_PDE vs FAULT_PTE
When the Xid error says FAULT_PDE, it means the MMU walked the page table hierarchy and found an empty Page Directory Entry — one of the intermediate levels had no mapping. The GPU literally cannot translate the virtual address because the page table entry that should point to the next level doesn’t exist.
This is different from FAULT_PTE, which means the MMU made it all the way to the final level but found no physical page mapping. Both are fatal, but FAULT_PDE typically indicates a larger region of memory was unmapped (at least 2 MB at once), while FAULT_PTE could be a single 4 KB page.
Key Diagnostic Insight
FAULT_PDE means wholesale deallocation — an entire memory region had its page directory entry removed. This points to an allocator freeing a large block, not a single-page corruption. Look for pool teardown operations like free_all_blocks() or empty_cache().
Part 2: Decoding the Fault Address
The 49-Bit Virtual Address Space
CUDA’s 49-bit virtual address space (0x0 through 0x1_FFFF_FFFF_FFFF) is divided into regions by the CUDA driver and GPU memory allocators. Low addresses are used for driver internals and small allocations. High addresses are where the action — and the danger — lies: large tensors, TensorRT workspace, NVDEC surfaces, and CuPy pool blocks all compete for space.
Our fault address is 0x725f_fb800000. The upper bits (0x725f) place this roughly 22% of the way into the 49-bit address space. The alignment (the address is a multiple of 0x800000, i.e. 8 MiB-aligned) matches large GPU allocations. This is firmly in the contested high-address zone where multiple allocators compete.
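Both observations are quick back-of-envelope checks you can run on any fault address (this is my own arithmetic, not an official layout rule):

```python
# Rough placement of the fault address within the 49-bit VA space.
addr = 0x725f_fb800000
va_top = 1 << 49                 # 512 TiB of virtual address space
frac = addr / va_top             # ~0.22 -> about 22% into the space

# Alignment check: a multiple of 8 MiB, typical of large pool or
# workspace allocations rather than small driver-internal objects.
aligned_8mib = (addr % (8 << 20) == 0)
```

The same two lines applied to the other fault addresses from our logs (0x7b77..., 0x7f91..., 0x7e64...) place them all in the same high, large-allocation region — a first hint that one allocator class was involved in every crash.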
Who Lives at These High Addresses?
In a multi-component GPU pipeline, several allocators carve out regions of the virtual address space independently:
| Allocator | What It Allocates |
|---|---|
| CUDA driver internals | Page tables, channel descriptors |
| PyTorch caching allocator | Tensors, model weights, intermediates |
| CuPy memory pool | Kernel outputs, NV12→RGB results |
| TensorRT workspace | Engine workspace, activation memory |
| NVDEC mapped surfaces | Decoded video frame DPB surfaces |
The address 0x725f_fb800000 is consistent with a large allocation from PyTorch, CuPy, or TensorRT — the kind of memory that gets allocated and freed repeatedly during video inference.
Part 3: Who Faulted and Why
ENGINE GRAPHICS GPC10 GPCCLIENT_T1_5
This identifies exactly which hardware unit on the GPU tried to access the invalid address. ENGINE GRAPHICS means a CUDA kernel was executing — not a memory copy or NVDEC decode operation. GPC10 is Graphics Processing Cluster #10, one of many parallel compute clusters on the die. GPCCLIENT_T1_5 is Texture Unit Level 1, pipe 5 — the hardware unit that handles memory loads for CUDA kernels.
When a CUDA kernel reads from global memory (e.g., reading a tensor element), the request goes through the T1 texture path. This tells us a compute kernel was trying to read data that used to exist but no longer does.
ACCESS_TYPE_VIRT_READ
VIRT_READ means a compute kernel tried to read from the faulting address. This is distinct from VIRT_WRITE, which would indicate a DMA engine trying to write decoded pixels into a surface. VIRT_READ faults are the hallmark of use-after-free: the data existed once, the kernel has a valid pointer, but the backing pages were removed between pointer capture and access.
The Timeline of a Use-After-Free Fault
1. Allocator creates memory at 0x725f... → PDE chain created, address is valid.
2. Tensor/buffer lives at this address → kernels can read/write successfully.
3. Allocator frees the memory (pool teardown) → PDE chain torn down, address is invalid.
4. Another kernel still holds the pointer → issues a load instruction → MMU walks the page table → finds an empty PDE → Xid 31.
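The four steps above can be modeled in a few lines of plain Python — a toy "page table" as a dict, with wholesale teardown standing in for a pool's free_all_blocks(). This is not CUDA, just the mechanism:

```python
# Toy model of the use-after-free timeline. A 2 MB granule stands in
# for a PDE-mapped region; clearing the dict stands in for teardown.
PAGE = 2 << 20  # 2 MiB

class ToyAllocator:
    def __init__(self):
        self.page_table = {}                    # va_base -> backing bytes

    def alloc(self, va, size):
        for base in range(va, va + size, PAGE):
            self.page_table[base] = bytearray(PAGE)  # "PDE chain" created
        return va                               # pointer handed out

    def free_all_blocks(self):
        self.page_table.clear()                 # wholesale teardown

    def read(self, va):
        base = va - (va % PAGE)
        if base not in self.page_table:         # empty "PDE"
            raise RuntimeError(f"FAULT_PDE: no mapping @ {va:#x}")
        return self.page_table[base][va % PAGE]

gpu = ToyAllocator()
ptr = gpu.alloc(0x725f_fb800000, 8 * PAGE)   # step 1: mapping exists
_ = gpu.read(ptr)                            # step 2: read succeeds
gpu.free_all_blocks()                        # step 3: pool teardown
try:
    gpu.read(ptr)                            # step 4: stale pointer
except RuntimeError as e:
    fault = str(e)
```

The point of the model: the pointer itself never changes and looks perfectly valid; only the translation behind it disappears, which is why nothing catches the bug before the hardware does.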
Part 4: Why This Happens in GPU Video Pipelines
The Multi-Allocator Problem
A GPU video inference pipeline has at least four independent memory allocators sharing one 49-bit virtual address space:
- PyTorch caching allocator — manages tensor memory with lazy deallocation
- CuPy memory pool — manages CuPy array memory separately from PyTorch
- TensorRT workspace — pre-allocated scratch space for inference kernels
- NVDEC DPB surfaces — hardware-managed decoded frame buffers
Each allocator has its own view of what memory is “in use” vs “free.” The danger arises when allocators share memory across boundaries without coordinating lifecycle — especially through DLPack zero-copy transfers.
The DLPack Zero-Copy Trap
DLPack enables zero-copy sharing between frameworks — torch.from_dlpack(cupy_array) creates a PyTorch tensor that points to CuPy-owned memory without copying. This is fast but dangerous:
- CuPy’s memory pool tracks the allocation
- PyTorch’s tensor holds a pointer to the same physical memory
- If CuPy’s pool releases the block (via free_all_blocks()), the PDE is torn down
- PyTorch’s tensor now points to invalid virtual memory
- Next kernel that reads the tensor → FAULT_PDE → Xid 31
NVDEC Surface Mapping Lifecycle
NVDEC adds another layer of complexity. The cuvidMapVideoFrame() API maps decoded frames from the hardware DPB (Decoded Picture Buffer) into the CUDA virtual address space. The critical constraint: at most ulNumOutputSurfaces frames can be mapped simultaneously. If you hold too many surfaces mapped (via DLPack views, slow processing, etc.), cuvidMapVideoFrame() fails on the next frame. If you unmap too eagerly (via garbage collection of DLPack capsules), in-flight CUDA kernels lose their source data.
The DLPack Trap in Practice
DLPack zero-copy is not free. Every zero-copy handoff creates a shared-ownership problem. If either side frees the memory, the other gets an Xid 31. In production pipelines, always clone tensors across allocator boundaries.
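The ownership hazard and the fix are easy to see even on the CPU, where NumPy also speaks DLPack. The sketch below is a CPU analogue of the GPU pattern — on the GPU the equivalent fix is torch.from_dlpack(cupy_array).clone(), so the tensor lives in PyTorch-owned memory:

```python
import numpy as np

# CPU analogue of the DLPack trap: a zero-copy view shares the
# producer's memory; a clone owns an independent allocation.
buf = np.arange(4, dtype=np.float32)   # "producer" allocation
view = np.from_dlpack(buf)             # zero-copy: shares buf's memory
owned = view.copy()                    # clone across the ownership boundary

buf[0] = 99.0                          # producer mutates (or frees) its side
# view still tracks the original memory; the clone is unaffected
```

On the GPU the stakes are higher: instead of silently seeing stale data, the view's backing pages can vanish entirely, and the next kernel read is an Xid 31.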
Part 5: How We Fixed It
We encountered three distinct Xid 31 variants across three days of debugging (Day 0 through Day 2):
| Day | Count | Type | Address Range | Root Cause |
|---|---|---|---|---|
| Day 0 | 28 | VIRT_WRITE | Various | NVDEC DPB exhaustion — too many RGB surfaces (3 B/px) held simultaneously |
| Day 1 | 2 | VIRT_READ | High (0x7b77, 0x0_48) | CuPy kernel reading directly from NVDEC surface — cross-engine hazard |
| Day 1 | 3 | VIRT_READ | High (0x7f91, 0x7f3c, 0x7e64) | CuPy pool teardown racing with DLPack-backed PyTorch tensors |
| Day 2 | 1 | VIRT_READ | High (0x725f) | Residual allocator race — same mechanism as Day 1 batch 3 |
Fix 1: NV12 Output Mode (28 → 2 Xid/day)
Switched NVDEC from RGB output (3 bytes/pixel) to NV12 native output (1.5 bytes/pixel). This halved the DPB surface memory footprint, eliminating the VIRT_WRITE faults caused by DPB exhaustion. The hardware decoder uses half the memory per surface, so more frames fit within the DPB limit simultaneously.
Fix 2: Two-Phase Clone (2 → 3 Xid/day, different type)
Cloned NV12 surfaces out of NVDEC DPB memory before running the CuPy conversion kernel. This eliminated the cross-engine hazard where a GPC compute kernel read directly from NVDEC-mapped surfaces. The crash count slightly increased (3 vs 2) because a different race surfaced: CuPy free_all_blocks() during RGB conversion.
Fix 3: Clone RGB Output + Remove free_all_blocks() (3 → 0 Xid/day)
Two changes that together eliminated the allocator race:
- Clone CuPy RGB output into PyTorch memory — the final tensor lives in PyTorch’s caching allocator, completely decoupled from CuPy’s pool
- Remove free_all_blocks() from chunk cleanup — CuPy’s pool manages reuse internally; aggressive teardown was racing with pending DLPack deleters
Results
After all three fixes: 0 Xid 31 faults across 7,000+ videos processed, sustaining ~630 videos/minute on 4 inference workers. The complete data path is now fully isolated — NVDEC surfaces are cloned into PyTorch-owned memory immediately after decode, and no allocator can tear down pages that another framework references.
Quick Reference for Xid 31 Diagnosis
Error Message Field Reference
| Field | Meaning |
|---|---|
| PCI:0000:c1:00 | GPU PCI bus address |
| 31 | Xid error code — MMU fault |
| pid=2646416 | Linux process ID that triggered the fault |
| channel 0x1a | GPU hardware channel (execution queue) |
| ENGINE GRAPHICS | Fault from compute/graphics engine (not COPY or NVDEC) |
| GPC10 | Graphics Processing Cluster #10 |
| GPCCLIENT_T1_5 | Texture unit #5 within GPC10 (handles memory loads) |
| 0x725f_fb800000 | Virtual address that failed translation |
| FAULT_PDE | Page Directory Entry missing (large region unmapped) |
| ACCESS_TYPE_VIRT_READ | A kernel tried to read from the address |
Fault Type Comparison
| Fault | Level | Unmapped Region | Typical Cause |
|---|---|---|---|
| FAULT_PDE | Intermediate page directory | ≥ 2 MB | Large allocation freed, pool teardown |
| FAULT_PTE | Final page table | 4 KB page | Single page evicted or corrupted |
Access Type Comparison
| Access | Engine | Meaning |
|---|---|---|
| VIRT_WRITE | Usually NVDEC DMA | Hardware engine writing to unmapped surface |
| VIRT_READ | Usually GPC/SM | Compute kernel reading from freed memory |
Address Range Heuristics
| Address Range | Likely Source |
|---|---|
| 0x0 – 0x1_0000_0000 | Driver internals, channel memory |
| 0x1_0000_0000 – 0x10_0000_0000 | Small CUDA allocations, constants |
| 0x10_0000_0000 – 0x7FFF_FFFF_FFFF | Large allocations: tensors, TRT workspace, NVDEC, CuPy |
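These heuristics are easy to encode for triage scripts. A sketch using the table's boundaries (the ranges are approximate and empirical, not a documented layout):

```python
# Heuristic triage of a fault address, following the table above.
# Boundaries are approximate; treat the labels as hints, not facts.
def classify(addr: int) -> str:
    if addr < 0x1_0000_0000:       # below 4 GiB
        return "driver internals / channel memory"
    if addr < 0x10_0000_0000:      # below 64 GiB
        return "small CUDA allocations, constants"
    return "large allocations (tensors, TRT workspace, NVDEC, CuPy)"

label = classify(0x725f_fb800000)
```

Combined with the log parser from the start of the article, this turns a stream of raw Xid lines into a rough histogram of which allocator neighborhood is faulting.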
Sources
- NVIDIA Xid Error Documentation
- Analyzing Xid Errors with the Xid Catalog
- CUDA Virtual Memory Management
- NVDEC Video Decoder API Programming Guide
- Introducing Low-Level GPU Virtual Memory Management
- Pascal MMU Format Changes
- NVIDIA Developer Forums: FAULT_PDE Discussion
- Optimizing Video Memory with NVDECODE API
