You see Xid in dmesg and your stomach drops. Maybe it’s a single line buried in thousands of kernel messages. Maybe it’s a flood of them, one per GPU process, cascading across your cluster like dominoes. Either way, your GPU workload is dead, and all you have is a cryptic error code — a two-digit number that could mean anything from “a cosmic ray flipped a bit and hardware already fixed it” to “your GPU is physically dead and needs to be shipped back to NVIDIA.”
This guide covers every major Xid error code you’ll encounter in production: what each one means, how severe it is, and — most importantly — whether you need to fix your code or RMA your hardware. After spending years debugging these errors across training clusters, inference pipelines, and video processing systems, I’ve learned that the difference between a 10-minute fix and a two-week RMA process often comes down to reading the Xid code correctly the first time.
What Are Xid Errors?
Every NVIDIA GPU runs a kernel module called NVRM — the NVIDIA Resource Manager. This is the lowest-level software layer between your CUDA code and the GPU hardware. When the GPU encounters a condition that the driver considers reportable — whether it’s an unrecoverable hardware fault, a corrected memory error, or a thermal throttling event — NVRM emits an Xid error to the Linux kernel log.
Each Xid error has a numeric code (the Xid number) plus contextual fields like the PCI bus address, the process ID that triggered the error, and sometimes a faulting memory address or engine identifier. The format looks like this: NVRM: Xid (PCI:0000:XX:00): CODE, pid=XXXX, name=PROCESS.... You’ll find these in dmesg, journalctl -k, or /var/log/syslog — wherever your system logs kernel messages.
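When you need these fields programmatically, they can be pulled out with standard text tools. A minimal sketch, assuming the common message shape shown above (layouts vary slightly across driver versions):

```shell
# Extract the Xid code, PCI bus address, and pid from an NVRM log line.
# The field layout is an assumption based on typical driver output.
parse_xid() {
    # $1 = one kernel log line
    echo "$1" | sed -n 's/.*Xid (\(PCI:[^)]*\)): \([0-9]*\), pid=\([0-9]*\).*/code=\2 bus=\1 pid=\3/p'
}

parse_xid "NVRM: Xid (PCI:0000:3b:00): 31, pid=1234, name=python3, Ch 00000008"
# → code=31 bus=PCI:0000:3b:00 pid=1234
```

Piping `dmesg` through a function like this gives you structured fields you can feed into an alerting system instead of eyeballing raw log lines.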
Not all Xid codes are bad news; the severity range is enormous. Xid 63, for example, is largely informational: it records that the driver retired a degraded memory page, and the GPU carries on with slightly less usable memory. Xid 79, on the other hand, means the GPU is physically unreachable over the PCIe bus. These two errors could not be more different, yet they both show up in the same log format with the same NVRM: Xid prefix. Knowing which codes demand immediate action and which are routine background noise is the single most valuable skill for anyone operating GPU infrastructure.
Where to Find Xid Errors
Check dmesg | grep -i xid, journalctl -k | grep -i xid, or
/var/log/syslog. For persistent monitoring, use nvidia-smi daemon or
forward kernel logs to your monitoring system. In containerized environments,
Xid errors appear in the host kernel log, not inside the container.
Start Here: Triage Your Error
When an Xid error fires, the first question is always: how bad is this? Before diving into the specific error code, start by identifying the symptoms you’re observing. Did a single CUDA process crash while others continue running? Did every GPU process on the machine die simultaneously? Can you still see the GPU in nvidia-smi (which relies on NVIDIA device files), or is it completely gone? The answers narrow down the category of error before you even look at the code.
The flowchart below walks through the initial triage process. Start with what you can observe, and it will guide you to the relevant Xid category and next steps.
The Severity Spectrum
Not all Xid errors are created equal. The NVIDIA documentation groups them loosely, but in practice, you need a clear mental model of four severity levels to make fast operational decisions.
Info errors are the background radiation of GPU operation. The hardware detected a condition, handled it internally, and reported it for your records. No process was affected, no data was corrupted. The only reason to care about info-level Xid codes is if their frequency increases over time — a rising rate of corrected ECC errors, for example, is the canary in the coal mine for an eventual uncorrectable failure.
Warning errors mean something went wrong, but the system may recover. A CUDA kernel hit an exception, or the GPU throttled due to temperature. The specific workload likely failed, but the GPU itself is still operational. Investigate the root cause, but don’t panic.
Critical errors kill CUDA contexts. Whatever was running on the GPU when the error fired is dead — tensors are garbage, inference results are lost, training progress since the last checkpoint is gone. The GPU may still be operational after a reset, but the workload needs to restart from scratch. These demand debugging.
Fatal errors mean the GPU is unreachable, unreliable, or producing corrupted data. At this severity level, you’re not debugging software — you’re diagnosing hardware. The GPU may need to be power-cycled, reseated, or returned to NVIDIA.
Category 1: Memory Errors
Memory errors are the most common Xid codes in production data center GPUs, and they span the full severity spectrum from completely harmless to catastrophic. Understanding the difference between a corrected single-bit flip and an uncorrectable double-bit corruption is essential for every GPU operator.
Xid 48: Double Bit ECC Error
This is the most feared memory Xid. A double-bit error means two bits flipped in the same memory word — ECC can detect this but cannot correct it. The data in that memory location is corrupted, full stop. If this was a tensor in your training run, those gradients are garbage. If it was part of the GPU’s internal page table, the MMU will fault on the next memory access and you’ll see a cascade of Xid 31 errors following the Xid 48.
Double-bit errors are almost always caused by physical DRAM cell degradation. Cosmic rays can cause single-bit errors (which ECC handles silently), but the probability of two bits flipping in the same word from external radiation is astronomically low. When you see Xid 48, the GPU memory is failing. It’s not a question of if you’ll need to replace the GPU, but when. The only variable is whether the failing cells get retired (Xid 63) fast enough to keep the GPU operational until its scheduled replacement.
Xid 63: ECC Page Retirement
When a memory row accumulates too many single-bit ECC errors, the driver permanently retires that row — like marking a bad sector on a hard drive. The GPU continues operating with slightly less usable memory. This is a self-healing mechanism, and in isolation, a single page retirement is not cause for alarm. It means the system is working as designed: identifying failing memory and quarantining it before it causes data corruption.
The concern is the trend. NVIDIA GPUs support up to 64 retired pages before the driver considers the GPU unreliable. You can check the current count with nvidia-smi -q -d RETIRED_PAGES. A GPU that retired 2 pages in its first year of operation is probably fine. A GPU that retired 10 pages in the last month is on a trajectory toward that 64-page limit, and you should start planning its replacement.
There are two retirement types reported in the output: single-bit retirements (caused by accumulated correctable errors) and double-bit retirements (caused by uncorrectable errors). Double-bit retirements are more concerning because each one represents an event where data was actually corrupted before the page was retired.
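A trend check along these lines can be scripted. This is a sketch: the nvidia-smi text layout in the sample below is an assumption, so verify it against your driver's actual output before relying on it:

```shell
# Warn when the total retired-page count crosses a planning threshold
# (e.g. half of the 64-page limit). Assumes the "Single Bit ECC : N" /
# "Double Bit ECC : N" line shape from `nvidia-smi -q -d RETIRED_PAGES`.
check_retired() {
    # $1 = nvidia-smi RETIRED_PAGES output, $2 = threshold
    total=$(echo "$1" | grep -E 'Single Bit ECC|Double Bit ECC' \
        | grep -oE '[0-9]+$' | awk '{s+=$1} END {print s+0}')
    if [ "$total" -ge "$2" ]; then
        echo "WARN: $total retired pages (threshold $2) - plan replacement"
    else
        echo "OK: $total retired pages"
    fi
}

sample="    Retired Pages
        Single Bit ECC              : 30
        Double Bit ECC              : 5"
check_retired "$sample" 32
```

Run it from cron against each GPU and route the WARN lines into your paging system.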
Xid 94 and Xid 95: Contained vs Uncontained ECC
Despite a common misreading, Xid 94 does not report a corrected single-bit error. Truly corrected ECC events (from cosmic rays, alpha particles emitted by chip packaging, or thermal noise in DRAM cells) are handled silently by the hardware and show up only in the nvidia-smi error counters, never as an Xid. Xid 94 is a contained ECC error: an uncorrectable error occurred, but the hardware and driver confined the damage to the application that hit it. That application's work is lost and it must restart, but other processes sharing the GPU are unaffected, and the GPU can be reset at a convenient time to clear the error state.

Xid 95 is its dangerous twin: an uncorrectable error that could not be contained to the faulting process. Where Xid 94 guarantees the corruption stopped at one application, Xid 95 means it may have propagated; other processes sharing the GPU, or even the GPU's own internal management structures, may have received corrupted data. When you see Xid 95, you cannot trust any workload that was running on the GPU at the time.

The operational difference is stark: Xid 94 means restarting one application and scheduling a GPU reset at the next convenient window. Xid 95 is a page-the-on-call event that requires restarting all affected workloads, resetting the GPU immediately, and potentially validating outputs.
ECC Monitoring Commands
nvidia-smi -q -d ECC shows current volatile and aggregate error counts.
nvidia-smi -q -d RETIRED_PAGES shows permanently retired memory pages and
pending retirements. Monitor these proactively — don’t wait for
Xid 48 to tell you the GPU is failing when the retirement count was already
climbing for weeks.
Category 2: MMU Faults
MMU faults are the GPU equivalent of segmentation faults on a CPU. They occur when the GPU’s Memory Management Unit tries to translate a virtual address and discovers that no valid mapping exists. The CUDA context that triggered the fault is immediately killed.
Xid 31: GPU Page Table Fault
The GPU’s MMU walked the page table hierarchy and found an empty entry — either a Page Directory Entry (PDE) or a Page Table Entry (PTE) was missing. The virtual address that a CUDA kernel, DMA engine, or hardware decoder tried to access simply does not map to any physical memory. This is the most common “your CUDA process crashed” Xid error.
Xid 31 is almost always a software bug. The three most common causes are use-after-free (a framework freed GPU memory while another framework still held a pointer to it), allocator race conditions (multiple memory pools — PyTorch, CuPy, TensorRT — competing for the same virtual address space), and DLPack zero-copy transfers where one framework frees the backing memory while another still references it through a shared tensor.
The error message contains rich diagnostic information: the faulting virtual address, the engine that triggered the fault (GRAPHICS, COPY, NVDEC), the fault type (FAULT_PDE vs FAULT_PTE), and the access type (VIRT_READ vs VIRT_WRITE). Each of these fields narrows down the root cause significantly.
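Those fields can be extracted mechanically. The message layout in the example below is an assumption modeled on typical driver output, so adjust the patterns to whatever your driver version actually emits:

```shell
# Pull the engine, faulting address, fault type, and access type out of an
# Xid 31 MMU fault report. The sample message shape is an assumption.
parse_fault() {
    echo "$1" | grep -oE 'ENGINE [A-Z0-9_]+|faulted @ 0x[0-9a-f_]+|FAULT_[A-Z]+|ACCESS_TYPE_[A-Z_]+'
}

parse_fault "MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x7f58_a0000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ"
```

A FAULT_PDE on a VIRT_READ from the COPY engine points at a very different bug than a FAULT_PTE on a VIRT_WRITE from GRAPHICS, so splitting these fields out early saves triage time.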
For a complete deep dive into Xid 31 diagnosis, including page table walk mechanics, fault address decoding, and multi-allocator race conditions, see Anatomy of a GPU Crash: Understanding Xid 31 MMU Faults.
Xid 32: Invalid Push Buffer
The GPU command processor found an invalid instruction in the push buffer — the stream of commands that the CPU-side driver sends to the GPU for execution. Every CUDA kernel launch, memory copy, and synchronization primitive is encoded as a series of commands in the push buffer. When the GPU encounters an instruction it doesn’t recognize or a malformed command structure, it fires Xid 32.
This error is rarer than Xid 31 and has different implications. While Xid 31 points to memory lifecycle bugs in your application, Xid 32 usually indicates a problem in the driver itself or host-side memory corruption affecting the command buffer. Mismatched CUDA toolkit and driver versions are a common trigger. If you see Xid 32 consistently, update your GPU driver first. If it persists across driver versions, check for host-side memory errors with memtest86+ — a corrupted command buffer often means something is wrong with system RAM, not GPU RAM.
Category 3: GPU Reset & Recovery
These are the Xid codes that tell you the GPU itself has stopped working or become unreachable. They range from recoverable timeouts to complete physical disconnection.
Xid 79: Fallen Off the Bus
The most dramatic Xid code in the catalog. The CPU tried to communicate with the GPU over the PCIe bus and received no response. The GPU is simply gone from the system’s perspective. If you run lspci after an Xid 79, the GPU will be missing from the PCI device list entirely. nvidia-smi will either hang or report “No devices were found.”
There are three common causes, and they have very different fixes. First: a loose GPU in the PCIe slot. This is more common than most engineers expect, especially in servers that have been shipped, rack-mounted, or had other components serviced. The physical connection between the GPU’s edge connector and the PCIe slot is surprisingly fragile, and even a small displacement can cause intermittent or permanent loss of the link. Reseat the GPU firmly and check that the retention clip is engaged.
Second: inadequate power delivery. High-end GPUs draw 300–700W under full load. If the power supply cannot maintain stable voltage on the 12V rail during peak consumption, the GPU will brown out and disappear from the bus. Check all power cable connections (both the 8-pin/12-pin connectors to the GPU and the PSU-side connections), and verify that the PSU has sufficient wattage for the total system configuration.
Third: genuine hardware failure. If reseating the GPU and verifying power doesn’t resolve the issue, and especially if the same GPU falls off the bus in a different PCIe slot or a different server, the GPU’s PCIe controller has likely failed. This is an RMA situation.
Xid 43: GPU Stopped Processing
The GPU has a hardware watchdog timer that monitors whether submitted work is making progress. When a CUDA kernel, copy engine, or other GPU operation runs for longer than the timeout period without completing, the watchdog fires and the driver reports Xid 43. The driver will then attempt to reset the GPU engine and recover — whether that recovery succeeds depends entirely on the root cause.
An infinite loop in a CUDA kernel will trigger Xid 43 reliably, and no amount of resetting will fix it because the same kernel will hang again on the next launch. A transient PCIe communication glitch might cause a one-time Xid 43 that never recurs after the reset. A GPU with degraded clock speeds due to thermal throttling might hit the timeout on workloads that normally complete well within the limit.
On Linux, the timeout behavior can be adjusted through driver module parameters (via NVreg_RegistryDwords); the exact registry key varies by driver branch, so consult NVIDIA's driver documentation before changing it. If you're running legitimately long-running kernels (some scientific computing workloads), you may need to increase the timeout. But be cautious: a longer timeout also means a genuinely hung GPU takes longer to detect.
Xid 45: Preemptive GPU Removal
The system or driver forcibly removed the GPU from operation, typically after multiple failed recovery attempts. Think of Xid 45 as the step before Xid 79 — the GPU is still physically connected to the PCIe bus, but the driver has given up trying to use it. The hardware is present, but the software has declared it dead.
Xid 45 often follows a sequence of Xid 43 events: the GPU hangs, the driver resets it, it hangs again, the driver resets it again, and after enough failed recoveries, the driver removes the GPU from its device list. At this point, the only way to bring the GPU back is a full system reboot (or, on some platforms, a PCIe hot-reset via echo 1 > /sys/bus/pci/devices/0000:XX:00.0/remove followed by a rescan).
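The remove-and-rescan sequence can be wrapped in a small script. This sketch defaults to a dry run, since the real thing requires root and kills anything still using the GPU; the bus address in the example is a placeholder:

```shell
# Sketch of the PCIe hot-reset sequence: remove the device node, then
# rescan the bus. Prints the steps unless DO_IT=1 is set, because the
# real commands need root and will terminate anything using the GPU.
pci_hot_reset() {
    # $1 = PCI bus address, e.g. 0000:3b:00.0
    dev="/sys/bus/pci/devices/$1"
    for step in "echo 1 > $dev/remove" "echo 1 > /sys/bus/pci/rescan"; do
        if [ "${DO_IT:-0}" = "1" ]; then
            eval "$step"
        else
            echo "would run: $step"
        fi
    done
}

pci_hot_reset "0000:3b:00.0"
```

Whether the GPU actually comes back after the rescan is platform-dependent; on many servers a full reboot remains the only reliable recovery.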
PCIe Diagnostic Commands
lspci -vv -s XX:00.0 shows PCIe link status and negotiated speed. Look for
“LnkSta” — if it shows a reduced link speed (Gen1 instead of
Gen4) or reduced width (x8 instead of x16), the PCIe connection is degraded.
nvidia-smi -q -d PCIE shows GPU-side PCIe error counters including replay
counts and NAK errors. Rising replay counts indicate a flaky physical
connection.
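Checking for a degraded link can be automated. A sketch that assumes the usual LnkSta line format from lspci -vv:

```shell
# Report the negotiated PCIe speed and width from `lspci -vv` output.
# Assumes the standard "LnkSta: Speed X, Width xN" line shape.
check_link() {
    # $1 = lspci -vv output for the GPU
    echo "$1" | sed -n 's/.*LnkSta:[[:space:]]*Speed \([^,]*\), Width \(x[0-9]*\).*/speed=\1 width=\2/p'
}

sample="LnkCap: Port #0, Speed 16GT/s, Width x16
LnkSta: Speed 2.5GT/s (downgraded), Width x16 (ok)"
check_link "$sample"
# → speed=2.5GT/s (downgraded) width=x16
```

Comparing the LnkSta result against LnkCap (the card's capability) is what tells you the link has actually degraded rather than simply being a slower-generation slot.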
Category 4: Engine & Software Errors
These Xid codes indicate problems within specific GPU engines or firmware components. They’re generally more debuggable than hardware failures because they often correlate with specific workloads or driver versions.
Xid 13: Graphics Engine Exception
A running shader or CUDA kernel caused a hardware exception on one of the GPU’s Streaming Multiprocessors. This is often recoverable at the GPU level — the specific kernel that faulted is terminated, but other kernels running on different SMs may continue. The CUDA context that launched the faulting kernel is typically destroyed, but the GPU itself remains operational.
Common causes include out-of-bounds memory access within a kernel (reading past the end of a buffer), integer division by zero in shader code, and executing illegal instructions (which can happen if you run a kernel compiled for a newer GPU architecture on an older GPU). Unlike Xid 31, which is a page table fault at the MMU level, Xid 13 is an execution-level exception — the memory address might be valid, but the operation performed on it was illegal.
If Xid 13 only appears with a specific workload, it’s almost certainly a software bug. Run the workload under compute-sanitizer (the successor to cuda-memcheck) to identify the exact instruction and memory access that triggers the exception.
Xid 61 and Xid 62: Internal Microcontroller Errors
Modern NVIDIA GPUs contain multiple internal microcontrollers that run firmware independently of the main GPU engines. The PMU (Power Management Unit) controls voltage and clock domains. SEC2 handles secure boot and encryption. The GSP (GPU System Processor), introduced in Ampere and later architectures, offloads much of the driver’s management work to an on-die RISC-V processor.
Xid 61 indicates that one of these microcontrollers hit a diagnostic breakpoint — a warning-level event where the firmware detected an unexpected condition but continued operating. Xid 62 is more severe: the microcontroller halted execution entirely. When the GSP halts, the GPU effectively loses its management brain, and most operations will fail until the system is power-cycled.
These errors are firmware-level issues. If you encounter them, the first step is always to update the GPU driver (which includes updated firmware for these microcontrollers). If Xid 62 persists across driver versions, check for VBIOS updates from your GPU vendor (NVIDIA, or the board partner like Dell, HPE, or Supermicro). Persistent Xid 62 after firmware updates is an RMA indicator.
Xid 68: Video Processor Exception
The hardware video decoder (NVDEC) or encoder (NVENC) encountered an error during a video processing operation. NVDEC and NVENC are dedicated fixed-function engines on the GPU die, separate from the CUDA cores, and they have their own error reporting path.
The most common triggers are corrupted input video streams (a truncated H.264/HEVC bitstream will cause NVDEC to fault), exceeding the maximum number of concurrent decode sessions (which varies by GPU generation), and using codec parameters that the hardware doesn’t support (like trying to decode AV1 on a GPU that only supports H.264 and HEVC).
If you’re running a video processing pipeline and see Xid 68, this is your “check your inputs” signal. Validate that the input video files are not corrupted, that you’re not exceeding the session limit documented in the Video Codec SDK, and that the codec profile and level are within the GPU’s capabilities. The NVIDIA Video Codec SDK documentation provides per-GPU capability matrices that list exactly which formats each GPU generation supports.
Category 5: Power & Thermal
Power and thermal Xid codes often feel like hardware failures, but they can also be caused by environmental factors — inadequate cooling, overloaded power supplies, or poorly configured power management. Diagnosing these requires looking beyond the GPU itself at the entire system.
Xid 64: ECC/Power Error
This Xid often combines memory and power issues in a way that makes root cause analysis tricky. Insufficient power delivery can cause voltage droops under heavy load — momentary dips in the supply voltage that cause DRAM cells to flip. These voltage-induced bit flips look identical to physical cell degradation from the ECC hardware’s perspective, so you’ll see Xid 64 alongside Xid 48 or Xid 94, making it look like a memory failure when the actual root cause is the power supply.
The diagnostic approach is to check whether the ECC errors correlate with high GPU power draw. If Xid 48 or 94 only appear during peak workloads (large batch training, full-occupancy inference) and never during idle or light use, suspect power delivery. Check the PSU rating, the 12V rail voltage under load (most server BMCs report this), and whether other GPUs on the same power rail show similar symptoms. A failing PSU will often cause correlated errors across multiple GPUs, while genuine DRAM degradation is isolated to a single GPU.
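One way to gather that correlation data is to log power draw next to the volatile corrected-ECC counter over time. The sketch below just builds the nvidia-smi invocation so it can be checked on a machine without a GPU; the query fields are standard nvidia-smi fields, and the interval is an arbitrary choice:

```shell
# Build an nvidia-smi command that samples timestamp, power draw, and the
# volatile corrected-ECC count at a fixed interval. Run the printed
# command on the GPU host and redirect it to a CSV for later correlation.
power_ecc_cmd() {
    # $1 = sampling interval in seconds (default 60)
    echo "nvidia-smi --query-gpu=timestamp,power.draw,ecc.errors.corrected.volatile.total --format=csv,noheader -l ${1:-60}"
}

power_ecc_cmd 60
```

If the ECC counter only ticks up in rows where power draw is near the board limit, that supports the power-delivery hypothesis over DRAM degradation.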
Xid 69: GPU Thermal Event
The GPU’s thermal sensors detected temperatures approaching or exceeding the maximum operating threshold — typically 83–90°C for data center GPUs, though the exact limit varies by model and board partner configuration. The driver’s response is progressive: it first throttles GPU clocks to reduce power consumption and heat generation, then throttles further if temperatures continue rising, and finally shuts down the GPU entirely if the thermal limit is exceeded.
You’ll see Xid 69 most often in dense server configurations where multiple GPUs share an airflow path, in data centers with inadequate cooling capacity, or in systems where dust has accumulated on heatsinks and fans over months of operation. Before assuming hardware failure, check the basics: are all fans spinning? Is the heatsink clear of dust? Is the ambient air temperature in the data center within specification (typically 18–27°C)? Is the server positioned so that exhaust air from one GPU feeds directly into the intake of another?
If the cooling is adequate and Xid 69 still occurs, the GPU’s thermal interface material (TIM) between the die and heatsink may have degraded. This is less common in data center GPUs than in consumer cards but does happen over multi-year deployments.
Xid 74: NVLink Error
Only relevant in multi-GPU configurations connected via NVLink bridges — the high-bandwidth GPU-to-GPU interconnect that bypasses the PCIe bus. NVLink provides 600+ GB/s of bidirectional bandwidth between GPUs, but the physical connection is susceptible to the same kinds of issues as any high-speed serial link: connector contamination, mechanical stress, and signal integrity degradation.
When Xid 74 fires, run nvidia-smi nvlink -s to see per-link error statistics. NVLink has its own error correction, similar to ECC for memory, so occasional corrected errors are normal. Rising uncorrected error counts or complete link failures indicate a physical problem with the NVLink bridge (the physical connector between GPUs) or the NVLink interface on one of the GPUs.
The first step is always to reseat the NVLink bridge. If errors persist, try a different bridge (if available) to isolate whether the problem is the bridge or the GPU. If the same GPU shows NVLink errors with multiple bridges, the GPU’s NVLink interface may be damaged — this is an RMA scenario, especially in systems where NVLink is critical for training performance (like DGX or HGX configurations).
Reading the Error Patterns
Individual Xid codes tell you what happened. The pattern of Xid errors over time tells you why. A slowly ticking corrected-ECC counter is noise. A rising corrected-ECC rate, followed by an Xid 63 page retirement, followed by an Xid 48 double-bit error the next day: that's a GPU telling you its memory is dying, in slow motion, with increasing urgency.
Similarly, a cluster of Xid 43 events (GPU hung) followed by Xid 45 (preemptive removal) followed by Xid 79 (fallen off bus) tells a clear story: the GPU is progressively failing, each recovery attempt buys less time, and eventually the hardware gives up entirely.
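A quick way to surface these patterns is a per-code histogram of the kernel log. A sketch using standard text tools; feed it the output of dmesg or journalctl -k:

```shell
# Count occurrences of each Xid code on stdin, most frequent first.
# Useful for spotting escalation patterns across a day or week of logs.
xid_histogram() {
    grep -oE 'Xid \(PCI:[^)]*\): [0-9]+' \
        | awk '{print $NF}' | sort | uniq -c | sort -rn
}

# Example on canned log lines:
printf '%s\n' \
  "NVRM: Xid (PCI:0000:3b:00): 94, pid=10" \
  "NVRM: Xid (PCI:0000:3b:00): 94, pid=11" \
  "NVRM: Xid (PCI:0000:3b:00): 48, pid=12" | xid_histogram
```

Splitting the histogram by PCI bus address as well (group on both fields) tells you whether the pattern is one failing GPU or a fleet-wide software issue.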
The dashboard below shows common temporal patterns and what they indicate about underlying root causes.
Is It Hardware or Software?
This is the million-dollar question — sometimes literally, when you’re deciding whether to RMA a $30,000 GPU or spend another week debugging your CUDA code. The wrong call in either direction is expensive: shipping a working GPU back to NVIDIA wastes weeks of cluster capacity, while continuing to debug “software” on a physically failing GPU wastes engineering time chasing a ghost.
The key diagnostic principle is reproducibility with a different workload. If the same Xid error occurs regardless of what software is running on the GPU — different CUDA applications, different frameworks, even NVIDIA’s own diagnostic tools — the hardware is the problem. If the error only appears with a specific workload, driver version, or software configuration, the software is the problem (or at least a contributing factor).
There are exceptions to this heuristic. Memory errors (Xid 48, 94, 95) can be hardware failures that only manifest under specific memory access patterns — a particular allocation size or address range might stress the failing cells in a way that other workloads don’t. This is why NVIDIA’s GPU diagnostic tools (like dcgm diagnostics and the field diagnostic suite) include stress tests that specifically target memory with randomized patterns.
Quick Reference
Need to look up a specific code fast? Search or filter the complete Xid catalog below.
Understanding the Categories
Every Xid error falls into one of several functional categories based on which GPU subsystem reported it. Understanding these categories helps you build intuition: when you see a new Xid code for the first time, knowing its category immediately tells you which part of the GPU stack to investigate and which diagnostic tools to reach for.
When to RMA
Knowing when to RMA a GPU is as much about pattern recognition as it is about individual error codes. Here are the clear indicators that hardware replacement is the right call.
RMA your GPU when Xid 48 (double-bit ECC) occurs more than once in a week — a single double-bit error might be a fluke, but repeated events indicate progressive DRAM failure. RMA when the retired page count (checked via nvidia-smi -q -d RETIRED_PAGES) exceeds 32 — that’s halfway to the 64-page limit, and the trend is unlikely to reverse. RMA when Xid 79 persists after reseating the GPU and trying a different PCIe slot, which rules out mechanical and slot-specific causes. RMA when Xid 95 (uncontained ECC) occurs alongside an increasing rate of Xid 94 (contained ECC), which indicates memory degradation that has progressed beyond ECC’s ability to contain. And RMA when you see multiple different Xid codes appearing without a clear software cause — a GPU producing Xid 48, 43, and 79 in the same week is failing in multiple subsystems.
Do NOT RMA over a slowly ticking corrected-ECC counter in nvidia-smi; occasional corrected single-bit errors are normal, expected, and handled by hardware. Do not RMA when Xid 31 or 32 only appears with a specific workload: that's a software bug in the workload, not a hardware defect. Do not RMA when Xid 69 is caused by environmental factors like poor airflow, high ambient temperature, or dust accumulation; fix the cooling first. And do not RMA when Xid 43 only happens with a specific CUDA kernel: the kernel is hanging, not the GPU. Debug the kernel.
The gray area is Xid 64 (ECC/power), where the root cause could be the GPU or the power supply. Before initiating an RMA, swap the suspect GPU into a different server or a different power rail. If the errors follow the GPU, it’s the GPU. If they stay with the server, it’s the power infrastructure.
Monitoring Setup
For production GPU clusters, reactive debugging is not enough. By the time an engineer reads an Xid error in a log, the workload has already failed, the training run has lost progress, and the customer has noticed the latency spike. Proactive monitoring turns Xid errors from firefighting triggers into early warning signals.
Set up alerts at four levels of urgency. Immediate attention for any Xid 48 (double-bit ECC), Xid 64 (ECC/power), Xid 79 (fallen off bus), or Xid 45 (preemptive removal): these are fatal errors that indicate a GPU is dead or dying and needs immediate intervention. Urgent investigation for Xid 95 (uncontained ECC), which means workload outputs may be corrupted, and Xid 94 (contained ECC), which means the affected application must restart and the GPU needs a reset at the next convenient window. Trending alerts for corrected-ECC counter rates exceeding roughly 10 per day per GPU and for increasing Xid 63 counts (page retirement trend): these don't require immediate action but mark a GPU that will eventually fail. Periodic review for Xid 13 (engine exception) and Xid 43 (GPU stopped), which may indicate software bugs in deployed workloads.
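These tiers can be encoded directly in an alerting pipeline. A sketch of the mapping; the groupings mirror the tiers described above and should be tuned for your own fleet:

```shell
# Map an Xid code to an alert tier. The groupings follow the four-level
# scheme in this guide and are a starting point, not a standard.
xid_severity() {
    case "$1" in
        48|64|79|45) echo immediate ;;  # fatal: GPU dead or dying
        94|95)       echo urgent ;;     # ECC containment events
        63)          echo trending ;;   # page retirement trend
        13|43)       echo review ;;     # likely software bugs
        *)           echo unknown ;;
    esac
}

xid_severity 79   # → immediate
```

Combined with the log-parsing helpers earlier in this guide, this is enough to route each kernel log line to the right on-call channel.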
Use nvidia-smi dmon -s pce -d 60 for continuous monitoring of power, clocks, and ECC counters in 60-second intervals. For cluster-wide visibility, the NVIDIA Data Center GPU Manager (DCGM) provides a centralized monitoring framework with Prometheus-compatible metrics export. Forward kernel logs to your SIEM or monitoring system and alert on the regex pattern NVRM: Xid.
Monitoring One-Liner
journalctl -k -f | grep --line-buffered -i 'xid' | tee -a /var/log/gpu-xid.log will tail kernel messages for Xid errors in real time
and save them to a dedicated log file. Combine with logrotate to prevent
unbounded growth.
