Your GPU inference pipeline is processing thousands of videos per day. Then you see this in journalctl:
NVRM: Xid (PCI:0000:c1:00): 31, pid=2646416, name=python, channel 0x0000001a, intr 00000000. MMU Fault: ENGINE GRAPHICS GPC10 GPCCLIENT_T1_5 faulted @ 0x725f_fb800000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
The CUDA context is dead. Your pipeline resets. And you have no idea why.
This article decodes every field in that error message, explains the GPU virtual memory system that produced it, and walks through the multi-allocator race condition that causes these faults in production video pipelines. We went from 28 crashes per day to zero — and this is exactly how.
The Error Message
Every field in an Xid 31 error carries diagnostic information; the challenge is knowing how to read it. The sections below break down what each field means and how to use it for debugging.
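As a first step, the fields can be extracted mechanically from the journalctl line. A minimal sketch (the regex and the field names are my own, not an NVIDIA-defined schema):

```python
import re

# Hypothetical parser for the Xid 31 line shown above.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-f:]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>\S+), channel (?P<channel>0x[0-9a-f]+).*?"
    r"MMU Fault: ENGINE (?P<engine>\S+) (?P<gpc>\S+) (?P<client>\S+) "
    r"faulted @ (?P<addr>0x[0-9a-f_]+)\. "
    r"Fault is of type (?P<fault>\S+) ACCESS_TYPE_(?P<access>\S+)"
)

line = ("NVRM: Xid (PCI:0000:c1:00): 31, pid=2646416, name=python, "
        "channel 0x0000001a, intr 00000000. MMU Fault: ENGINE GRAPHICS "
        "GPC10 GPCCLIENT_T1_5 faulted @ 0x725f_fb800000. "
        "Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ")

fields = XID_RE.search(line).groupdict()
# fields["fault"] == "FAULT_PDE", fields["access"] == "VIRT_READ", etc.
```

Feeding every Xid line in journalctl through something like this makes it easy to aggregate faults by engine, address range, and access type — the three axes the rest of this article uses.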
Part 1: The GPU’s Virtual Memory System
Why GPUs Have Virtual Memory
Modern NVIDIA GPUs (Pascal and later) implement a full hardware Memory Management Unit (MMU), similar to a CPU’s MMU. Every CUDA context gets its own 49-bit virtual address space — 512 TB of addressable memory, more than enough to cover all physical GPU memory plus all system memory combined.
When a CUDA kernel, TensorRT engine, or NVDEC decoder accesses memory, it doesn’t use physical addresses directly. It uses virtual addresses. The GPU MMU translates these virtual addresses to physical DRAM addresses on every memory access, just like a CPU.
The Page Table Walk
The GPU MMU uses a multi-level page table to translate virtual addresses. On Pascal+ GPUs, this is a 5-level hierarchy. Each level is a 4 KB table with 512 entries. Each entry (PDE — Page Directory Entry) either points to the next level of the hierarchy, maps a large page directly (2 MB or larger), or is empty/invalid — meaning no mapping exists for that address range.
At the bottom level, a PTE (Page Table Entry) maps a 4 KB or 64 KB page of virtual address space to a physical page in GPU DRAM.
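To make the walk concrete, here is a toy breakdown of our fault address into per-level table indices. This assumes 4 KB pages (12-bit offset) and 512-entry tables (9 bits per level); the real per-generation layout differs (the top level, for instance, is smaller than 512 entries), so treat this as an illustration of the mechanism, not the actual hardware format:

```python
# Toy split of a 49-bit GPU virtual address into page-table indices,
# assuming 4 KB pages and 512-entry directories (layout is illustrative).
def split_va(va: int):
    offset = va & 0xFFF            # byte offset within a 4 KB page
    indices = []
    va >>= 12
    for _ in range(4):             # four 9-bit directory/table indices
        indices.append(va & 0x1FF)
        va >>= 9
    indices.append(va)             # whatever remains selects the top level
    return list(reversed(indices)), offset

idx, off = split_va(0x725f_fb800000)
# idx == [0, 228, 383, 476, 0], off == 0
```

A FAULT_PDE means one of the intermediate indices in this chain landed on an empty entry, so the walk never reached the final PTE level at all.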
FAULT_PDE vs FAULT_PTE
When the Xid error says FAULT_PDE, it means the MMU walked the page table hierarchy and found an empty Page Directory Entry — one of the intermediate levels had no mapping. The GPU literally cannot translate the virtual address because the page table entry that should point to the next level doesn’t exist.
This is different from FAULT_PTE, which means the MMU made it all the way to the final level but found no physical page mapping. Both are fatal, but FAULT_PDE typically indicates a larger region of memory was unmapped (at least 2 MB at once), while FAULT_PTE could be a single 4 KB page.
Key Diagnostic Insight
FAULT_PDE means wholesale deallocation — an entire memory region had its page directory entry removed. This points to an allocator freeing a large block, not a single-page corruption. Look for pool teardown operations like free_all_blocks() or empty_cache().
Part 2: Decoding the Fault Address
The 49-Bit Virtual Address Space
CUDA’s 49-bit virtual address space (0x0 through 0x1_FFFF_FFFF_FFFF) is divided into regions by the CUDA driver and GPU memory allocators. Low addresses are used for driver internals and small allocations. High addresses are where the action — and the danger — lies: large tensors, TensorRT workspace, NVDEC surfaces, and CuPy pool blocks all compete for space.
Our fault address is 0x725f_fb800000. The upper bits (0x725f) place this roughly 22% of the way into the 49-bit address space. The alignment (the address is a multiple of 0x800000, i.e. 8 MiB-aligned) matches large GPU allocations. This is firmly in the contested high-address zone where multiple allocators compete.
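Both observations are quick back-of-envelope checks you can run on any fault address (this is my own arithmetic, not an official layout rule):

```python
# Rough placement of the fault address within the 49-bit VA space.
addr = 0x725f_fb800000
va_top = 1 << 49                 # 512 TiB of virtual address space
frac = addr / va_top             # ~0.22 -> about 22% into the space

# Alignment check: a multiple of 8 MiB, typical of large pool or
# workspace allocations rather than small driver-internal objects.
aligned_8mib = (addr % (8 << 20) == 0)
```

The same two lines applied to the other fault addresses from our logs (0x7b77..., 0x7f91..., 0x7e64...) place them all in the same high, large-allocation region — a first hint that one allocator class was involved in every crash.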
Who Lives at These High Addresses?
In a multi-component GPU pipeline, several allocators carve out regions of the virtual address space independently:
| Allocator | What It Allocates |
|---|---|
| CUDA driver internals | Page tables, channel descriptors |
| PyTorch caching allocator | Tensors, model weights, intermediates |
| CuPy memory pool | Kernel outputs, NV12→RGB results |
| TensorRT workspace | Engine workspace, activation memory |
| NVDEC mapped surfaces | Decoded video frame DPB surfaces |
The address 0x725f_fb800000 is consistent with a large allocation from PyTorch, CuPy, or TensorRT — the kind of memory that gets allocated and freed repeatedly during video inference.
Part 3: Who Faulted and Why
ENGINE GRAPHICS GPC10 GPCCLIENT_T1_5
This identifies exactly which hardware unit on the GPU tried to access the invalid address. ENGINE GRAPHICS means a CUDA kernel was executing — not a memory copy or NVDEC decode operation. GPC10 is Graphics Processing Cluster #10, one of many parallel compute clusters on the die. GPCCLIENT_T1_5 is Texture Unit Level 1, pipe 5 — the hardware unit that handles memory loads for CUDA kernels.
When a CUDA kernel reads from global memory (e.g., reading a tensor element), the request goes through the T1 texture path. This tells us a compute kernel was trying to read data that used to exist but no longer does.
ACCESS_TYPE_VIRT_READ
VIRT_READ means a compute kernel tried to read from the faulting address. This is distinct from VIRT_WRITE, which would indicate a DMA engine trying to write decoded pixels into a surface. VIRT_READ faults are the hallmark of use-after-free: the data existed once, the kernel has a valid pointer, but the backing pages were removed between pointer capture and access.
The Timeline of a Use-After-Free Fault
1. Allocator creates memory at 0x725f... → PDE chain created, address is valid.
2. Tensor/buffer lives at this address → kernels can read/write successfully.
3. Allocator frees the memory (pool teardown) → PDE chain torn down, address is invalid.
4. Another kernel still holds the pointer → issues a load instruction → MMU walks the page table → finds an empty PDE → Xid 31.
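The four steps above can be modeled in a few lines of plain Python — a toy "page table" as a dict, with wholesale teardown standing in for a pool's free_all_blocks(). This is not CUDA, just the mechanism:

```python
# Toy model of the use-after-free timeline. A 2 MB granule stands in
# for a PDE-mapped region; clearing the dict stands in for teardown.
PAGE = 2 << 20  # 2 MiB

class ToyAllocator:
    def __init__(self):
        self.page_table = {}                    # va_base -> backing bytes

    def alloc(self, va, size):
        for base in range(va, va + size, PAGE):
            self.page_table[base] = bytearray(PAGE)  # "PDE chain" created
        return va                               # pointer handed out

    def free_all_blocks(self):
        self.page_table.clear()                 # wholesale teardown

    def read(self, va):
        base = va - (va % PAGE)
        if base not in self.page_table:         # empty "PDE"
            raise RuntimeError(f"FAULT_PDE: no mapping @ {va:#x}")
        return self.page_table[base][va % PAGE]

gpu = ToyAllocator()
ptr = gpu.alloc(0x725f_fb800000, 8 * PAGE)   # step 1: mapping exists
_ = gpu.read(ptr)                            # step 2: read succeeds
gpu.free_all_blocks()                        # step 3: pool teardown
try:
    gpu.read(ptr)                            # step 4: stale pointer
except RuntimeError as e:
    fault = str(e)
```

The point of the model: the pointer itself never changes and looks perfectly valid; only the translation behind it disappears, which is why nothing catches the bug before the hardware does.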
Part 4: Why This Happens in GPU Video Pipelines
The Multi-Allocator Problem
A GPU video inference pipeline has at least four independent memory allocators sharing one 49-bit virtual address space:
- PyTorch caching allocator — manages tensor memory with lazy deallocation
- CuPy memory pool — manages CuPy array memory separately from PyTorch
- TensorRT workspace — pre-allocated scratch space for inference kernels
- NVDEC DPB surfaces — hardware-managed decoded frame buffers
Each allocator has its own view of what memory is “in use” vs “free.” The danger arises when allocators share memory across boundaries without coordinating lifecycle — especially through DLPack zero-copy transfers.
The DLPack Zero-Copy Trap
DLPack enables zero-copy sharing between frameworks — torch.from_dlpack(cupy_array) creates a PyTorch tensor that points to CuPy-owned memory without copying. This is fast but dangerous:
- CuPy’s memory pool tracks the allocation
- PyTorch’s tensor holds a pointer to the same physical memory
- If CuPy’s pool releases the block (via free_all_blocks()), the PDE is torn down
- PyTorch’s tensor now points to invalid virtual memory
- Next kernel that reads the tensor → FAULT_PDE → Xid 31
NVDEC Surface Mapping Lifecycle
NVDEC adds another layer of complexity. The cuvidMapVideoFrame() API maps decoded frames from the hardware DPB (Decoded Picture Buffer) into the CUDA virtual address space. The critical constraint: at most ulNumOutputSurfaces frames can be mapped simultaneously. If you hold too many surfaces mapped (via DLPack views, slow processing, etc.), cuvidMapVideoFrame() fails on the next frame. If you unmap too eagerly (via garbage collection of DLPack capsules), in-flight CUDA kernels lose their source data.
The DLPack Trap in Practice
DLPack zero-copy is not free. Every zero-copy handoff creates a shared-ownership problem. If either side frees the memory, the other gets an Xid 31. In production pipelines, always clone tensors across allocator boundaries.
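The ownership hazard and the fix are easy to see even on the CPU, where NumPy also speaks DLPack. The sketch below is a CPU analogue of the GPU pattern — on the GPU the equivalent fix is torch.from_dlpack(cupy_array).clone(), so the tensor lives in PyTorch-owned memory:

```python
import numpy as np

# CPU analogue of the DLPack trap: a zero-copy view shares the
# producer's memory; a clone owns an independent allocation.
buf = np.arange(4, dtype=np.float32)   # "producer" allocation
view = np.from_dlpack(buf)             # zero-copy: shares buf's memory
owned = view.copy()                    # clone across the ownership boundary

buf[0] = 99.0                          # producer mutates (or frees) its side
# view still tracks the original memory; the clone is unaffected
```

On the GPU the stakes are higher: instead of silently seeing stale data, the view's backing pages can vanish entirely, and the next kernel read is an Xid 31.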
Part 5: How We Fixed It
We encountered three distinct Xid 31 variants across three days of debugging (Day 0 through Day 2):
| Day | Count | Type | Address Range | Root Cause |
|---|---|---|---|---|
| Day 0 | 28 | VIRT_WRITE | Various | NVDEC DPB exhaustion — too many RGB surfaces (3 B/px) held simultaneously |
| Day 1 | 2 | VIRT_READ | High (0x7b77, 0x0_48) | CuPy kernel reading directly from NVDEC surface — cross-engine hazard |
| Day 1 | 3 | VIRT_READ | High (0x7f91, 0x7f3c, 0x7e64) | CuPy pool teardown racing with DLPack-backed PyTorch tensors |
| Day 2 | 1 | VIRT_READ | High (0x725f) | Residual allocator race — same mechanism as Day 1 batch 3 |
Fix 1: NV12 Output Mode (28 → 2 Xid/day)
Switched NVDEC from RGB output (3 bytes/pixel) to NV12 native output (1.5 bytes/pixel). This halved the DPB surface memory footprint, eliminating the VIRT_WRITE faults caused by DPB exhaustion. The hardware decoder uses half the memory per surface, so more frames fit within the DPB limit simultaneously.
Fix 2: Two-Phase Clone (2 → 3 Xid/day, different type)
Cloned NV12 surfaces out of NVDEC DPB memory before running the CuPy conversion kernel. This eliminated the cross-engine hazard where a GPC compute kernel read directly from NVDEC-mapped surfaces. The crash count slightly increased (3 vs 2) because a different race surfaced: CuPy free_all_blocks() during RGB conversion.
Fix 3: Clone RGB Output + Remove free_all_blocks() (3 → 0 Xid/day)
Two changes that together eliminated the allocator race:
- Clone CuPy RGB output into PyTorch memory — the final tensor lives in PyTorch’s caching allocator, completely decoupled from CuPy’s pool
- Remove free_all_blocks() from chunk cleanup — CuPy’s pool manages reuse internally; aggressive teardown was racing with pending DLPack deleters
Results
After all three fixes: 0 Xid 31 faults across 7,000+ videos processed, sustaining ~630 videos/minute on 4 inference workers. The complete data path is now fully isolated — NVDEC surfaces are cloned into PyTorch-owned memory immediately after decode, and no allocator can tear down pages that another framework references.
Quick Reference for Xid 31 Diagnosis
Error Message Field Reference
| Field | Meaning |
|---|---|
| PCI:0000:c1:00 | GPU PCI bus address |
| 31 | Xid error code — MMU fault |
| pid=2646416 | Linux process ID that triggered the fault |
| channel 0x1a | GPU hardware channel (execution queue) |
| ENGINE GRAPHICS | Fault from compute/graphics engine (not COPY or NVDEC) |
| GPC10 | Graphics Processing Cluster #10 |
| GPCCLIENT_T1_5 | Texture unit #5 within GPC10 (handles memory loads) |
| 0x725f_fb800000 | Virtual address that failed translation |
| FAULT_PDE | Page Directory Entry missing (large region unmapped) |
| ACCESS_TYPE_VIRT_READ | A kernel tried to read from the address |
Fault Type Comparison
| Fault | Level | Unmapped Region | Typical Cause |
|---|---|---|---|
| FAULT_PDE | Intermediate page directory | ≥ 2 MB | Large allocation freed, pool teardown |
| FAULT_PTE | Final page table | 4 KB page | Single page evicted or corrupted |
Access Type Comparison
| Access | Engine | Meaning |
|---|---|---|
| VIRT_WRITE | Usually NVDEC DMA | Hardware engine writing to unmapped surface |
| VIRT_READ | Usually GPC/SM | Compute kernel reading from freed memory |
Address Range Heuristics
| Address Range | Likely Source |
|---|---|
| 0x0 – 0x1_0000_0000 | Driver internals, channel memory |
| 0x1_0000_0000 – 0x10_0000_0000 | Small CUDA allocations, constants |
| 0x10_0000_0000 – 0x7FFF_FFFF_FFFF | Large allocations: tensors, TRT workspace, NVDEC, CuPy |
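These heuristics are easy to encode for triage scripts. A sketch using the table's boundaries (the ranges are approximate and empirical, not a documented layout):

```python
# Heuristic triage of a fault address, following the table above.
# Boundaries are approximate; treat the labels as hints, not facts.
def classify(addr: int) -> str:
    if addr < 0x1_0000_0000:       # below 4 GiB
        return "driver internals / channel memory"
    if addr < 0x10_0000_0000:      # below 64 GiB
        return "small CUDA allocations, constants"
    return "large allocations (tensors, TRT workspace, NVDEC, CuPy)"

label = classify(0x725f_fb800000)
```

Combined with the log parser from the start of the article, this turns a stream of raw Xid lines into a rough histogram of which allocator neighborhood is faulting.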
Sources
- NVIDIA Xid Error Documentation
- Analyzing Xid Errors with the Xid Catalog
- CUDA Virtual Memory Management
- NVDEC Video Decoder API Programming Guide
- Introducing Low-Level GPU Virtual Memory Management
- Pascal MMU Format Changes
- NVIDIA Developer Forums: FAULT_PDE Discussion
- Optimizing Video Memory with NVDECODE API
