Choosing between NVIDIA and AMD for deep learning is really two choices stacked on top of each other: the silicon (how fast it does the matmuls, how much model fits in memory) and the software (whether your training stack runs at all). For most teams the software choice dominates — NVIDIA's CUDA ecosystem is the default the entire field is built on — but AMD's accelerators have closed most of the hardware gap and lead on one axis that matters a lot for large models: memory capacity per GPU.
This page compares the two at both layers: the CUDA-vs-ROCm software moat, the microarchitecture down to warp vs wavefront, and the datacenter accelerators head-to-head.
At a glance: which one for which problem?
| Situation | Pick | Why |
|---|---|---|
| Standard PyTorch/JAX training that has to "just work" | NVIDIA | CUDA is the default; day-0 support for every framework, library, and kernel |
| Serving a very large model that is memory-bound | AMD | MI300X/MI325X ship 192–256 GB HBM per GPU vs 80–192 GB on NVIDIA |
| Large multi-node training, collective-heavy communication | NVIDIA | NVLink + NVSwitch scale coherent bandwidth further than Infinity Fabric today |
| Stack is plain PyTorch ops, no custom CUDA kernels | Either | ROCm runs mainstream PyTorch; the gap is custom/CUDA-only code |
| You depend on TensorRT, custom CUDA kernels, niche libraries | NVIDIA | Many are CUDA-only; porting to HIP is real work |
| Cost/availability pressure, willing to invest in ROCm | AMD | Often better $/GB and availability; budget for ecosystem friction |
The real divide is software
CUDA is not just a language. It is a stack that has been accumulating since 2007: the compiler (nvcc), the libraries (cuDNN, cuBLAS, CUTLASS, NCCL, TensorRT), and the fact that every framework targets it first. AMD's answer is ROCm, an open stack with matching pieces: HIP (the CUDA-like language), MIOpen (≈ cuDNN), rocBLAS / hipBLASLt (≈ cuBLAS), RCCL (≈ NCCL), and Composable Kernel (≈ CUTLASS).
The moat is not any single library — it is default support. New models, kernels, and research code are written and tested on CUDA first. ROCm support arrives later and ranges from "works out of the box" to "needs a port." That lag is the single biggest risk in choosing AMD.
HIP is AMD's portability bet: a thin CUDA-like API whose hipify tools (hipify-perl and hipify-clang) mechanically translate most CUDA source to HIP, which then compiles for either vendor. As a rough rule of thumb that gets you 80–90% of the way; hand-tuned kernels and vendor-specific intrinsics are the remainder that needs manual work. (For the CUDA execution model these build on, see CUDA context vs streams.)
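To make "mechanically translates" concrete, here is a toy sketch of the idea in Python. It is not the hipify tool and does not reflect how that tool is implemented; the function name `toy_hipify` and the six-entry table are assumptions chosen for illustration. The CUDA and HIP identifiers in the table are real runtime names, but the real tools cover far more of the API.

```python
import re

# Hand-picked subset of CUDA runtime names and their HIP counterparts.
# Illustrative only: the real hipify tools carry a much larger table.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaFree": "hipFree",
}

# Whole-identifier matches only, so names missing from the table stay untouched.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in CUDA_TO_HIP) + r")\b"
)


def toy_hipify(source: str) -> str:
    """Rename known CUDA identifiers to HIP ones; leave everything else alone."""
    return _PATTERN.sub(lambda match: CUDA_TO_HIP[match.group(0)], source)


cuda_src = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n);
cudaMemcpy(d_x, h_x, n, cudaMemcpyHostToDevice);
v = __shfl_sync(0xffffffff, v, 0);  // 32-bit lane mask: needs a human to review
cudaFree(d_x);
"""

print(toy_hipify(cuda_src))
```

The API calls are renamed one for one, while the warp intrinsic with its hard-coded 32-lane mask passes through unchanged. That untouched line is the kind of code that falls into the hand-ported remainder.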
Execution model and microarchitecture
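[Figure: branch divergence in a 32-lane NVIDIA warp vs a 64-lane AMD wavefront. A half-taken branch idles lanes in groups of 32 on the warp; the same divergence can idle up to 64 lanes on the wavefront. Consumer RDNA adds a wave32 mode; the accelerators stay wave64.]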
Warp vs wavefront
The most-quoted hardware difference: NVIDIA executes threads in warps of 32; AMD's datacenter GPUs (the CDNA / MI series) execute wavefronts of 64. Both are SIMT — one instruction stream drives many lanes in lockstep — so that number is the granularity at which the hardware schedules and, crucially, at which branch divergence is paid.
When threads in a warp or wavefront take different branches, the hardware runs each path with the non-participating lanes masked off. Divergence on NVIDIA wastes lanes in groups of 32; on a 64-wide wavefront the same divergent code can idle up to 64 lanes. The coarser group also changes occupancy math, because a block is padded up to a whole number of groups. A 40-thread block pads to 64 lanes either way (2 warps on NVIDIA, 1 wavefront on AMD, 24 idle lanes in both cases), but a 72-thread block is 3 warps with 24 idle lanes on NVIDIA and 2 wavefronts with 56 idle lanes on AMD. AMD's consumer RDNA architecture added a wave32 mode precisely to narrow this gap, but the CDNA accelerators you train on are wave64.
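The padding and divergence arithmetic can be checked with a small model. This is a simplified sketch, not a hardware simulator: it assumes a single two-way branch, and that a diverged group issues once per distinct branch outcome with the other lanes masked off. The function names are made up for this example, and the 72-thread block is a size picked here to show a case where the two widths pad differently.

```python
def padded_groups(block_threads: int, width: int) -> tuple[int, int]:
    """Return (groups, idle_lanes) when a block is padded to whole warps/wavefronts."""
    groups = -(-block_threads // width)  # ceiling division
    return groups, groups * width - block_threads


def lane_utilization(takes_branch: list[bool], width: int) -> float:
    """Fraction of issued lane-slots that do useful work under one two-way branch."""
    useful = 0
    issued = 0
    for start in range(0, len(takes_branch), width):
        group = takes_branch[start:start + width]
        # A uniform group issues once; a diverged group issues once per outcome.
        passes = len(set(group))
        issued += passes * width
        useful += len(group)
    return useful / issued


for threads in (40, 72):
    print(threads, "threads:",
          "warp32", padded_groups(threads, 32),
          "wave64", padded_groups(threads, 64))
# 40 threads: warp32 (2, 24) wave64 (1, 24)
# 72 threads: warp32 (3, 24) wave64 (2, 56)

# Threads 0..31 take the branch, threads 32..63 do not.
branch = [tid < 32 for tid in range(64)]
print(lane_utilization(branch, 32))  # 1.0: each warp is uniform, nothing is masked
print(lane_utilization(branch, 64))  # 0.5: the single wavefront runs both paths
```

The second half is the useful intuition: a branch that splits cleanly on a 32-thread boundary costs nothing at warp granularity but halves utilization at wave64 granularity.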
Both are flavors of SIMD execution — see Flynn's taxonomy.
SM vs CU
NVIDIA's Streaming Multiprocessor (SM) and AMD's Compute Unit (CU) are the same idea — the schedulable core that holds the resident warps/wavefronts and the on-chip memory. An H100 has ~132 SMs, each with 4 warp schedulers and 128 FP32 lanes. An MI300X has ~304 CUs, each with 4 SIMD units of 16 lanes that issue a wave64 over 4 cycles. More CUs and narrower-but-more SIMDs is a different path to the same throughput.
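A quick worked version of those counts, using only the figures quoted above. Treat it as arithmetic on lane counts, not a throughput comparison: clock speeds, issue rates, and the matrix engines are deliberately left out, and the `CoreLayout` class is a helper invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreLayout:
    name: str
    cores: int           # SMs on NVIDIA, CUs on AMD
    lanes_per_core: int  # FP32 lanes inside one SM or CU

    @property
    def total_lanes(self) -> int:
        return self.cores * self.lanes_per_core


h100 = CoreLayout("H100", cores=132, lanes_per_core=128)
mi300x = CoreLayout("MI300X", cores=304, lanes_per_core=4 * 16)  # 4 SIMDs x 16 lanes

WAVE_WIDTH = 64
SIMD_LANES = 16
cycles_per_wave64 = WAVE_WIDTH // SIMD_LANES  # one wave64 instruction on one SIMD

for gpu in (h100, mi300x):
    print(f"{gpu.name}: {gpu.cores} x {gpu.lanes_per_core} = {gpu.total_lanes} FP32 lanes")
print(f"wave64 on a {SIMD_LANES}-lane SIMD: {cycles_per_wave64} cycles per instruction")
# H100: 132 x 128 = 16896 FP32 lanes
# MI300X: 304 x 64 = 19456 FP32 lanes
# wave64 on a 16-lane SIMD: 4 cycles per instruction
```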
Tensor Cores vs Matrix Cores
The matmul engines have different names and instructions. NVIDIA's Tensor Cores (programmed via WMMA, or for free through cuBLAS/cuDNN/CUTLASS) accelerate FP16/BF16/FP8/INT8, plus FP4 on Blackwell. AMD's equivalent is Matrix Cores, driven by MFMA instructions (via rocWMMA / Composable Kernel), covering FP16/BF16/FP8/INT8. For mainstream training you rarely touch these directly — you rely on the library — which is exactly why library maturity matters so much. See tensor cores.
On-chip memory
Both follow the same hierarchy: registers → a programmer-managed scratchpad → L2 → HBM. The scratchpad has two names — NVIDIA's shared memory (configurable up to 228 KB per SM on Hopper) and AMD's LDS (Local Data Share) (64 KB per CU). Larger shared memory lets NVIDIA stage bigger tiles per block; AMD offsets with a very large last-level cache (256 MB Infinity Cache on MI300X). See memory hierarchy.
Terminology: CUDA ↔ HIP/ROCm
Same concepts, different words — the Rosetta stone for reading AMD docs:
| CUDA / NVIDIA | HIP / AMD (ROCm) |
|---|---|
| Thread | Work-item |
| Warp (32) | Wavefront (64) |
| Thread block | Workgroup |
| Grid | Grid |
| SM (Streaming Multiprocessor) | CU (Compute Unit) |
| Shared memory | LDS (Local Data Share) |
| Tensor Core | Matrix Core (MFMA) |
| CUDA core | Stream processor / SIMD lane |
| nvcc | hipcc |
The accelerators, side by side
Specs date fast; this is a mid-2026 snapshot of the parts most teams actually train on. The numbers move, but the shape of the comparison is stable.
| | H100 SXM | H200 | B200 | MI300X | MI325X |
|---|---|---|---|---|---|
| Vendor | NVIDIA | NVIDIA | NVIDIA | AMD | AMD |
| Architecture | Hopper | Hopper | Blackwell | CDNA 3 | CDNA 3 |
| HBM capacity | 80 GB | 141 GB | 192 GB | 192 GB | 256 GB |
| HBM bandwidth | 3.35 TB/s | 4.8 TB/s | ~8 TB/s | 5.3 TB/s | 6.0 TB/s |
| Scale-up link | NVLink 900 GB/s | NVLink 900 GB/s | NVLink 1.8 TB/s | Infinity Fabric | Infinity Fabric |
Read it this way. AMD leads on memory capacity per GPU — the MI300X matched NVIDIA's flagship at 192 GB while the H100 was at 80, and the MI325X pushes to 256 GB. That directly decides how large a model (or how long a context) fits without sharding across GPUs. NVIDIA leads on the fabric — NVLink plus NVSwitch give coherent all-to-all bandwidth across 8 GPUs (and, with NVL72, 72) that Infinity Fabric does not yet match at rack scale, which matters most for collective-heavy multi-node training. Newer parts (NVIDIA's Blackwell Ultra, AMD's MI350 / CDNA 4) continue the same shape: AMD pushing capacity, NVIDIA pushing fabric and software. For the communication angle see NCCL and multi-GPU communication.
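The capacity argument is easy to turn into a back-of-envelope check. The sketch below counts weights only, using the HBM capacities from the table above; KV cache, activations, and optimizer state all add to the real footprint, so read the result as a lower bound on GPU count. The 70B and 405B parameter counts are example sizes chosen for illustration, and the function names are hypothetical.

```python
import math

# HBM capacity per GPU in GB, taken from the comparison table.
HBM_GB = {"H100 SXM": 80, "H200": 141, "B200": 192, "MI300X": 192, "MI325X": 256}


def weight_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Weight footprint in GB: 1e9 params x bytes per param / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def min_gpus_for_weights(params_billion: float, bytes_per_param: int = 2) -> dict[str, int]:
    """Fewest GPUs of each type whose combined HBM can hold the weights alone."""
    need = weight_gb(params_billion, bytes_per_param)
    return {name: math.ceil(need / capacity) for name, capacity in HBM_GB.items()}


for size in (70, 405):  # billions of parameters, stored as BF16/FP16 (2 bytes each)
    print(f"{size}B -> {weight_gb(size):.0f} GB of weights:", min_gpus_for_weights(size))
# 70B -> 140 GB of weights: {'H100 SXM': 2, 'H200': 1, 'B200': 1, 'MI300X': 1, 'MI325X': 1}
# 405B -> 810 GB of weights: {'H100 SXM': 11, 'H200': 6, 'B200': 5, 'MI300X': 5, 'MI325X': 4}
```

Fewer GPUs per model replica means less sharding and less traffic over the scale-up link, which is why capacity and fabric trade off against each other in this comparison.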
Ecosystem reality
What actually runs on AMD today, and where the friction is:
- PyTorch — official ROCm builds; mainstream models train and serve. The happy path is genuinely happy.
- vLLM / SGLang — ROCm support exists and is used in production for inference; flagship models are supported, though the newest features sometimes lag.
- Triton — has an AMD backend, so Triton-authored kernels are portable across vendors.
- FlashAttention — available on ROCm via Composable Kernel / Triton ports, not always the newest variant on day one.
- The friction — anything CUDA-only: TensorRT, hand-written CUDA kernels in a research repo, and brand-new techniques that ship CUDA-first. These need a HIP port or a wait.
Rule of thumb: if your workload is standard PyTorch ops and well-known models, AMD is viable today. If it is custom kernels or bleeding-edge research code, NVIDIA removes a whole class of problems.
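One reason the PyTorch happy path works is that ROCm builds reuse the `torch.cuda` namespace, so device-selection code written for NVIDIA usually runs unchanged. The sketch below shows one way to tell the builds apart. It relies on `torch.version.hip` being a version string on ROCm builds and `None` on CUDA builds; the helper names `describe_backend` and `pick_device` are made up for this example, and the import is guarded so the snippet also runs on a machine without PyTorch.

```python
def describe_backend() -> str:
    """Report which GPU backend this PyTorch install can use, if any."""
    try:
        import torch
    except ImportError:
        return "torch-not-installed"
    if not torch.cuda.is_available():
        return "cpu-only"
    # ROCm builds set torch.version.hip to a version string; CUDA builds leave it None.
    if getattr(torch.version, "hip", None):
        return "rocm"
    return "cuda"


def pick_device() -> str:
    """AMD GPUs are still addressed as "cuda" devices on ROCm builds."""
    return "cuda" if describe_backend() in {"rocm", "cuda"} else "cpu"


print(describe_backend(), "->", pick_device())
```

Code that goes through `pick_device()` and standard PyTorch ops needs no vendor branch at all; a check like `describe_backend()` only becomes necessary where a CUDA-only extension has to be swapped for a ROCm alternative.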
When to choose which
Choose NVIDIA when you want the default that everything targets, you depend on CUDA-only libraries (TensorRT, custom kernels), you are doing large multi-node training that leans on NVLink/NVSwitch, or engineering time is scarcer than hardware budget.
Choose AMD when you are memory-bound and want 192–256 GB per GPU to fit big models or long contexts with less sharding, your stack is mainstream PyTorch, you have cost or availability pressure, and you can budget some ecosystem friction.
Either is fine when you run standard models through PyTorch, do more inference than training, and never touch custom kernels — ROCm covers this case well.
The deciding question: "Does my stack depend on CUDA-only code, and do I need raw per-GPU memory or coherent multi-GPU scale-up?" CUDA-only dependencies point to NVIDIA; a memory-bound single-GPU fit points to AMD; rack-scale collective training points back to NVIDIA.
FAQ
Can I run PyTorch on AMD? Yes — there are official ROCm PyTorch builds, and most models run with no code changes. Custom CUDA kernels are the exception.
Is ROCm production-ready? For mainstream training and inference on supported GPUs, yes — it is deployed at scale. "Production-ready" is workload-specific: standard models, yes; exotic kernels, test first.
Does CUDA run on AMD? No. CUDA is NVIDIA-only. AMD's path is HIP — recompile (mostly mechanically) from CUDA-like source. Compatibility-layer projects such as ZLUDA exist but are not a production strategy.
Why is a warp 32 but a wavefront 64? Different microarchitectural choices. NVIDIA standardized on 32-wide SIMT groups; AMD's GCN/CDNA line used 64-wide. Neither is universally better — 32 makes divergence and small blocks cheaper, 64 amortizes scheduling over more lanes. AMD's RDNA added wave32 for divergence-heavy consumer workloads.
TL;DR
NVIDIA wins on software (CUDA is the default the whole field targets) and rack-scale fabric (NVLink/NVSwitch); AMD wins on memory capacity per GPU (192–256 GB) and often cost/availability. The hardware is close — warp-32 vs wavefront-64, SM vs CU, Tensor vs Matrix Cores are different names for the same ideas. The real question is your software: standard PyTorch runs on both, but CUDA-only kernels and bleeding-edge research code still pull you to NVIDIA.
