NVIDIA vs AMD for Deep Learning: CUDA vs ROCm and the Datacenter Accelerators

Summary: NVIDIA vs AMD for deep learning compared at both layers: the CUDA vs ROCm software moat, the microarchitecture (warp vs wavefront, SM vs CU, Tensor vs Matrix Cores), and the datacenter accelerators (H100/H200/B200 vs MI300X/MI325X).

Choosing between NVIDIA and AMD for deep learning is really two choices stacked on top of each other: the silicon (how fast it does the matmuls, how much model fits in memory) and the software (whether your training stack runs at all). For most teams the software choice dominates — NVIDIA's CUDA ecosystem is the default the entire field is built on — but AMD's accelerators have closed most of the hardware gap and lead on one axis that matters a lot for large models: memory capacity per GPU.

This page compares the two at both layers: the software (the CUDA-vs-ROCm moat) and the silicon (the microarchitecture down to warp vs wavefront, and the datacenter accelerators head-to-head).

At a glance: which one for which problem?

Situation	Pick	Why
Standard PyTorch/JAX training that has to "just work"	NVIDIA	CUDA is the default; day-0 support for every framework, library, and kernel
Serving a very large model that is memory-bound	AMD	MI300X/MI325X ship 192–256 GB HBM per GPU vs 80–192 GB on NVIDIA
Large multi-node training, collective-heavy communication	NVIDIA	NVLink + NVSwitch scale coherent bandwidth further than Infinity Fabric today
Stack is plain PyTorch ops, no custom CUDA kernels	Either	ROCm runs mainstream PyTorch; the gap is custom/CUDA-only code
You depend on TensorRT, custom CUDA kernels, niche libraries	NVIDIA	Many are CUDA-only; porting to HIP is real work
Cost/availability pressure, willing to invest in ROCm	AMD	Often better $/GB and availability; budget for ecosystem friction

The real divide is software

CUDA is not just a language. It is a stack that has been accumulating since 2007: the compiler (nvcc), the libraries (cuDNN, cuBLAS, CUTLASS, NCCL, TensorRT), and the fact that every framework targets it first. AMD's answer is ROCm, an open stack with matching pieces: HIP (the CUDA-like language), MIOpen (≈ cuDNN), rocBLAS / hipBLASLt (≈ cuBLAS), RCCL (≈ NCCL), and Composable Kernel (≈ CUTLASS).

The moat is not any single library — it is default support. New models, kernels, and research code are written and tested on CUDA first. ROCm support arrives later and ranges from "works out of the box" to "needs a port." That lag is the single biggest risk in choosing AMD.

HIP is AMD's portability bet: a thin CUDA-like API where the hipify tools (hipify-perl and hipify-clang) mechanically translate most CUDA source to HIP, which then compiles for either vendor. As a rough rule of thumb it gets you 80–90% of the way; hand-tuned kernels and vendor-specific intrinsics are the rest. (For the CUDA execution model these build on, see CUDA context vs streams.)
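To make "mechanically translates" concrete, here is a toy sketch in Python of the kind of rename involved. It is not the real hipify tool and covers only the runtime-API prefix and one header; `toy_hipify`, `HEADER_MAP`, and the sample snippet (including the `scale` kernel) are hypothetical names used for illustration.

```python
import re

# Toy illustration only: NOT the real hipify tool. It handles just the
# runtime-API prefix (cudaFoo -> hipFoo) and one header rename.
HEADER_MAP = {"cuda_runtime.h": "hip/hip_runtime.h"}


def toy_hipify(cuda_source: str) -> str:
    """Rename CUDA runtime identifiers and the runtime header to their HIP names."""
    out = cuda_source
    for cuda_header, hip_header in HEADER_MAP.items():
        out = out.replace(f"<{cuda_header}>", f"<{hip_header}>")
    # cudaMalloc -> hipMalloc, cudaMemcpyHostToDevice -> hipMemcpyHostToDevice, ...
    return re.sub(r"\bcuda([A-Z]\w*)", r"hip\1", out)


# Hypothetical CUDA host code; `scale` stands in for any user kernel.
cuda_snippet = """#include <cuda_runtime.h>
float *d_x;
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
scale<<<blocks, threads>>>(d_x, n);
cudaDeviceSynchronize();
cudaFree(d_x);
"""

hip_snippet = toy_hipify(cuda_snippet)
print(hip_snippet)
```

The kernel launch line passes through unchanged, which is the easy part of a port. What a rename like this cannot do is retune a kernel written around 32-wide warps or replace vendor-specific intrinsics.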

Execution model and microarchitecture

[Figure: warp vs wavefront and SM vs CU. NVIDIA: warp of 32 lanes, ~132 Streaming Multiprocessors (SMs) per H100. AMD CDNA (MI series): wavefront (wave64) of 64 lanes, ~304 Compute Units (CUs) per MI300X. Each core is drawn with its warp/wavefront grouping, its lanes and scheduler/SIMD sub-blocks, and its matrix unit.]

Warp vs wavefront

The most-quoted hardware difference: NVIDIA executes threads in warps of 32; AMD's datacenter GPUs (the CDNA / MI series) execute wavefronts of 64. Both are SIMT — one instruction stream drives many lanes in lockstep — so that number is the granularity at which the hardware schedules and, crucially, at which branch divergence is paid.

When threads in a warp or wavefront take different branches, the hardware runs each path in turn with the non-participating lanes masked off. On NVIDIA that cost is paid per group of 32; on a 64-wide wavefront the same divergent code can idle nearly twice as many lanes, and a branch that splits cleanly on a 32-thread boundary diverges on AMD but not on NVIDIA. The coarser group also changes occupancy math: a 96-thread block is exactly 3 warps on NVIDIA, with no idle lanes, but 2 wavefronts with 32 idle lanes on AMD. AMD's consumer RDNA architecture added a wave32 mode precisely to narrow this gap, but the CDNA accelerators you train on are wave64.
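The padding and divergence arithmetic can be checked with a few lines of Python. This is a simplified model, assuming threads are packed into groups in index order and ignoring everything else the scheduler does; `idle_lanes` and `divergent_groups` are hypothetical helper names.

```python
import math


def idle_lanes(block_threads: int, group_width: int) -> int:
    """Lanes left idle when a block is padded up to whole warps/wavefronts."""
    groups = math.ceil(block_threads / group_width)
    return groups * group_width - block_threads


def divergent_groups(block_threads: int, group_width: int, predicate) -> int:
    """Count warps/wavefronts whose lanes disagree on a branch predicate."""
    count = 0
    for start in range(0, block_threads, group_width):
        lanes = range(start, min(start + group_width, block_threads))
        outcomes = {bool(predicate(tid)) for tid in lanes}
        if len(outcomes) > 1:  # both branch paths must be executed
            count += 1
    return count


for block in (40, 96):
    print(block, "threads:",
          "warp32 idle =", idle_lanes(block, 32),
          "| wave64 idle =", idle_lanes(block, 64))

# A branch that splits a 64-thread block at thread 32.
first_half = lambda tid: tid < 32
print("warp32 divergent groups:", divergent_groups(64, 32, first_half))
print("wave64 divergent groups:", divergent_groups(64, 64, first_half))
```

Under this model a 40-thread block pads to 64 lanes either way, so the idle count is the same; the difference shows up at sizes such as 96 threads, and at branches aligned to 32 but not to 64.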

Both are flavors of SIMD execution — see Flynn's taxonomy.

SM vs CU

NVIDIA's Streaming Multiprocessor (SM) and AMD's Compute Unit (CU) are the same idea: the schedulable core that holds the resident warps/wavefronts and the on-chip memory. An H100 has ~132 SMs, each with 4 warp schedulers and 128 FP32 lanes. An MI300X has ~304 CUs, each with 4 SIMD units of 16 lanes that issue a wave64 over 4 cycles. More CUs, each built from narrower SIMDs, is a different path to comparable throughput.
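Multiplying out the figures quoted above gives a feel for the two layouts. This sketch counts raw lanes only, using the approximate unit counts from the text; it says nothing about clocks, issue rates, or matrix units, so it is not a throughput comparison.

```python
# Unit counts as quoted in the text (approximate).
H100_SMS = 132
H100_FP32_LANES_PER_SM = 128

MI300X_CUS = 304
MI300X_SIMDS_PER_CU = 4
MI300X_LANES_PER_SIMD = 16
WAVEFRONT_WIDTH = 64

h100_lanes = H100_SMS * H100_FP32_LANES_PER_SM
mi300x_lanes = MI300X_CUS * MI300X_SIMDS_PER_CU * MI300X_LANES_PER_SIMD
# A 16-lane SIMD needs this many cycles to issue one 64-wide wavefront.
cycles_per_wave64 = WAVEFRONT_WIDTH // MI300X_LANES_PER_SIMD

print("H100 FP32 lanes:  ", h100_lanes)
print("MI300X SIMD lanes:", mi300x_lanes)
print("Cycles per wave64:", cycles_per_wave64)
```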

Tensor Cores vs Matrix Cores

The matmul engines have different names and instructions. NVIDIA's Tensor Cores (programmed via WMMA, or for free through cuBLAS/cuDNN/CUTLASS) accelerate FP16/BF16/FP8/INT8, plus FP4 on Blackwell. AMD's equivalent is Matrix Cores, driven by MFMA instructions (via rocWMMA / Composable Kernel), covering FP16/BF16/FP8/INT8. For mainstream training you rarely touch these directly — you rely on the library — which is exactly why library maturity matters so much. See tensor cores.

On-chip memory

Both follow the same hierarchy: registers → a programmer-managed scratchpad → L2 → HBM. The scratchpad has two names — NVIDIA's shared memory (configurable up to 228 KB per SM on Hopper) and AMD's LDS (Local Data Share) (64 KB per CU). Larger shared memory lets NVIDIA stage bigger tiles per block; AMD offsets with a very large last-level cache (256 MB Infinity Cache on MI300X). See memory hierarchy.
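A rough sketch of what "bigger tiles per block" means, using the scratchpad sizes above. It assumes a matmul kernel that stages two square FP16 tiles (one from each input matrix) with the side rounded down to a multiple of 16. Real kernels also double-buffer and face per-block limits below the per-SM figure, so treat the numbers as upper bounds; `max_square_tile` is a hypothetical helper.

```python
import math


def max_square_tile(scratchpad_bytes: int, bytes_per_elem: int = 2,
                    tiles: int = 2, align: int = 16) -> int:
    """Largest tile side (multiple of `align`) so that `tiles` square tiles fit."""
    elems_per_tile = scratchpad_bytes // (tiles * bytes_per_elem)
    side = math.isqrt(elems_per_tile)
    return side - side % align


KB = 1024
hopper_tile = max_square_tile(228 * KB)  # NVIDIA shared memory, per SM
cdna3_tile = max_square_tile(64 * KB)    # AMD LDS, per CU

print("Hopper shared memory:", hopper_tile, "x", hopper_tile)
print("CDNA 3 LDS:          ", cdna3_tile, "x", cdna3_tile)
```

Under these assumptions the Hopper scratchpad holds 240 x 240 tiles against 128 x 128 for the LDS, which is the gap the large last-level cache on MI300X helps offset.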

Terminology: CUDA ↔ HIP/ROCm

Same concepts, different words — the Rosetta stone for reading AMD docs:

CUDA / NVIDIA	HIP / AMD (ROCm)
Thread	Work-item
Warp (32)	Wavefront (64)
Thread block	Workgroup
Grid	Grid
SM (Streaming Multiprocessor)	CU (Compute Unit)
Shared memory	LDS (Local Data Share)
Tensor Core	Matrix Core (MFMA)
CUDA core	Stream processor / SIMD lane
`nvcc`	`hipcc`

The accelerators, side by side

Specs date fast; this is a mid-2026 snapshot of the parts most teams actually train on. The numbers move, but the shape of the comparison is stable.

	H100 SXM	H200	B200	MI300X	MI325X
Vendor	NVIDIA	NVIDIA	NVIDIA	AMD	AMD
Architecture	Hopper	Hopper	Blackwell	CDNA 3	CDNA 3
HBM capacity	80 GB	141 GB	192 GB	192 GB	256 GB
HBM bandwidth	3.35 TB/s	4.8 TB/s	~8 TB/s	5.3 TB/s	6.0 TB/s
Scale-up link	NVLink 900 GB/s	NVLink 900 GB/s	NVLink 1.8 TB/s	Infinity Fabric	Infinity Fabric

Read it this way. AMD leads on memory capacity per GPU: the MI300X shipped with 192 GB when the H100 offered 80 GB, a figure NVIDIA only reached with the B200, and the MI325X pushes to 256 GB. That directly decides how large a model (or how long a context) fits without sharding across GPUs. NVIDIA leads on the fabric: NVLink plus NVSwitch give coherent all-to-all bandwidth across 8 GPUs (and, with NVL72, 72) that Infinity Fabric does not yet match at rack scale, which matters most for collective-heavy multi-node training. Newer parts (NVIDIA's Blackwell Ultra, AMD's MI350 / CDNA 4) continue the same shape: AMD pushing capacity, NVIDIA pushing fabric and software. For the communication angle see NCCL and multi-GPU communication.
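The capacity argument reduces to simple arithmetic. The sketch below estimates the minimum number of GPUs needed just to hold a model's weights, using the HBM capacities from the table. The 70B-parameter model, the 2 bytes per parameter (FP16/BF16), and the 20% headroom reserved for activations and KV cache are illustrative assumptions, not measurements.

```python
import math

# HBM capacity in GB, from the table above.
HBM_GB = {"H100 SXM": 80, "H200": 141, "B200": 192, "MI300X": 192, "MI325X": 256}


def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (1e9 params x bytes, divided by 1e9 bytes per GB)."""
    return params_billion * bytes_per_param


def min_gpus(params_billion: float, bytes_per_param: float,
             hbm_gb: float, headroom: float = 0.2) -> int:
    """GPUs needed to hold the weights, keeping `headroom` of each GPU free."""
    usable_gb = hbm_gb * (1.0 - headroom)
    return math.ceil(weight_gb(params_billion, bytes_per_param) / usable_gb)


# Assumed example: 70B parameters in FP16/BF16.
gpus_needed = {name: min_gpus(70, 2, gb) for name, gb in HBM_GB.items()}
for name, count in gpus_needed.items():
    print(f"{name:9s} {count} GPU(s)")
```

With these assumptions the model needs 3 H100s or 2 H200s but fits on a single B200, MI300X, or MI325X. Changing the headroom or the precision moves the thresholds, which is why the estimate is worth rerunning for your own workload.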

Ecosystem reality

What actually runs on AMD today, and where the friction is:

PyTorch — official ROCm builds; mainstream models train and serve. The happy path is genuinely happy.
vLLM / SGLang — ROCm support exists and is used in production for inference; flagship models are supported, though the newest features sometimes lag.
Triton — has an AMD backend, so Triton-authored kernels are portable across vendors.
FlashAttention — available on ROCm via Composable Kernel / Triton ports, not always the newest variant on day one.
The friction — anything CUDA-only: TensorRT, hand-written CUDA kernels in a research repo, and brand-new techniques that ship CUDA-first. These need a HIP port or a wait.

Rule of thumb: if your workload is standard PyTorch ops and well-known models, AMD is viable today. If it is custom kernels or bleeding-edge research code, NVIDIA removes a whole class of problems.
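One reason the PyTorch happy path works is that ROCm builds reuse the `torch.cuda` namespace and the `"cuda"` device string, so most scripts need no vendor branch at all. If you do need to know which backend you are on, a minimal check looks like the sketch below; `detect_backend` is a hypothetical helper, and it relies on `torch.version.hip` being set on ROCm builds and `None` on CUDA builds.

```python
def detect_backend() -> str:
    """Report which accelerator backend the installed PyTorch build targets."""
    try:
        import torch
    except ImportError:
        return "torch-not-installed"
    if not torch.cuda.is_available():
        # No usable GPU; this is also what a CPU-only build reports.
        return "cpu"
    # ROCm builds expose a version string here; CUDA builds leave it as None.
    return "rocm" if getattr(torch.version, "hip", None) else "cuda"


backend = detect_backend()
print("backend:", backend)
```

On both vendors, tensors are still created with `device="cuda"`; the branch above is only needed for logging or for gating a CUDA-only extension.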

When to choose which

Choose NVIDIA when you want the default that everything targets, you depend on CUDA-only libraries (TensorRT, custom kernels), you are doing large multi-node training that leans on NVLink/NVSwitch, or engineering time is scarcer than hardware budget.

Choose AMD when you are memory-bound and want 192–256 GB per GPU to fit big models or long contexts with less sharding, your stack is mainstream PyTorch, you have cost or availability pressure, and you can budget some ecosystem friction.

Either is fine when you run standard models through PyTorch, do more inference than training, and never touch custom kernels — ROCm covers this case well.

The deciding question: "Does my stack depend on CUDA-only code, and do I need raw per-GPU memory or coherent multi-GPU scale-up?" CUDA-only dependencies point to NVIDIA; a memory-bound single-GPU fit points to AMD; rack-scale collective training points back to NVIDIA.
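The deciding question can be written down as a small function. The order of the checks is an assumption: it treats a CUDA-only dependency as a hard constraint and rack-scale training as outweighing per-GPU memory, which matches the emphasis of this page but is not a rule. `pick_vendor` is a hypothetical helper.

```python
def pick_vendor(cuda_only_deps: bool, rack_scale_training: bool,
                memory_bound: bool) -> str:
    """Encode the deciding question; the precedence order is an assumption."""
    if cuda_only_deps:
        return "NVIDIA"   # CUDA-only code needs a HIP port or a wait on AMD
    if rack_scale_training:
        return "NVIDIA"   # collective-heavy training leans on NVLink/NVSwitch
    if memory_bound:
        return "AMD"      # more HBM per GPU means less sharding
    return "Either"       # standard PyTorch ops run on both


print(pick_vendor(cuda_only_deps=False, rack_scale_training=False, memory_bound=True))
```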

FAQ

Can I run PyTorch on AMD? Yes — there are official ROCm PyTorch builds, and most models run with no code changes. Custom CUDA kernels are the exception.

Is ROCm production-ready? For mainstream training and inference on supported GPUs, yes — it is deployed at scale. "Production-ready" is workload-specific: standard models, yes; exotic kernels, test first.

Does CUDA run on AMD? No. CUDA is NVIDIA-only. AMD's path is HIP: translate the CUDA source to HIP (mostly mechanically) and recompile. Compatibility-layer projects such as ZLUDA exist but are not a production strategy.

Why is a warp 32 but a wavefront 64? Different microarchitectural choices. NVIDIA standardized on 32-wide SIMT groups; AMD's GCN/CDNA line used 64-wide. Neither is universally better — 32 makes divergence and small blocks cheaper, 64 amortizes scheduling over more lanes. AMD's RDNA added wave32 for divergence-heavy consumer workloads.

TL;DR

NVIDIA wins on software (CUDA is the default the whole field targets) and rack-scale fabric (NVLink/NVSwitch); AMD wins on memory capacity per GPU (192–256 GB) and often cost/availability. The hardware is close — warp-32 vs wavefront-64, SM vs CU, Tensor vs Matrix Cores are different names for the same ideas. The real question is your software: standard PyTorch runs on both, but CUDA-only kernels and bleeding-edge research code still pull you to NVIDIA.
