Overview
Tensor Cores are specialized processing units found in modern NVIDIA GPUs that accelerate matrix multiplication and convolution operations, the fundamental building blocks of deep learning. Introduced with the Volta architecture, Tensor Cores can deliver large speedups for eligible AI workloads by executing hardware-accelerated matrix multiply-accumulate operations at lower precision while preserving higher precision where it matters. The actual gain depends on architecture, precision mode, tensor shapes, memory behavior, and kernel implementation. Their reduced-precision paths are central to modern quantization workflows that shrink model footprints while preserving accuracy.
Unlike traditional CUDA cores that process scalar and vector operations, Tensor Cores operate on matrix fragments cooperatively across a warp, making them useful for the dense matrix computations required in neural network training and inference.
Key Concepts
Mixed Precision Computing
Uses lower precision formats such as FP16, BF16, TF32, INT8, or FP8 for eligible math while retaining FP32 accumulation, parameters, or optimizer state where needed for accuracy
Matrix Multiply-Accumulate
Accelerates D = A x B + C on warp-level matrix fragments. Public fragment shapes vary by architecture, precision, and API rather than being one fixed hardware tile size
Warp-Level Operations
Coordinates a warp of 32 threads, with each thread contributing part of the fragment load, multiply-accumulate, and store sequence
Automatic Mixed Precision
Framework support chooses eligible Tensor Core paths while keeping numerically sensitive operations in safer precision
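To see why the accumulator precision matters, the following is a minimal sketch that emulates FP16 and FP32 rounding with Python's `struct` module. It is not how Tensor Cores are implemented; the helper names (`to_fp16`, `to_fp32`, `dot`) and the input values are assumptions chosen for illustration.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot(a, b, round_acc):
    """Dot product of FP16-rounded inputs with a configurable accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        # Inputs are rounded to FP16; the running sum is rounded after each add.
        acc = round_acc(acc + to_fp16(x) * to_fp16(y))
    return acc

n = 4096
a = [0.01] * n
b = [0.01] * n

fp16_acc = dot(a, b, to_fp16)   # low-precision accumulator
fp32_acc = dot(a, b, to_fp32)   # FP32 accumulator, as in mixed precision
reference = sum(to_fp16(x) * to_fp16(y) for x, y in zip(a, b))

print(f"FP16 accumulate: {fp16_acc:.6f}")
print(f"FP32 accumulate: {fp32_acc:.6f}")
print(f"reference:       {reference:.6f}")
```

With these inputs the FP16 accumulator stops growing once each product is smaller than half the spacing between adjacent FP16 values, while the FP32 accumulator stays close to the reference.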
Precision and alignment
Two factors decide whether Tensor Cores are used and how efficiently: the precision you compute in, and whether your matrix shapes line up with the tile sizes the kernels expect.
How It Works
Fragment Loading
Load matrix fragments from shared or global memory into WMMA fragments
wmma::load_matrix_sync(a_frag, tile_ptr, stride);
Matrix Multiply-Accumulate
Perform a warp-level multiply-accumulate on the loaded fragments
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
Result Storage
Store the accumulator fragment back to shared or global memory (the store call also takes the memory layout)
wmma::store_matrix_sync(out_ptr, c_frag, stride, wmma::mem_row_major);
Precision Management
Use framework AMP to choose eligible lower-precision kernels and manage gradient scaling
with autocast("cuda", dtype=torch.float16):
    output = model(input)
Inside one Tensor Core operation
A warp-level Tensor Core operation is a short sequence of load, multiply-accumulate, and store steps. Real kernels overlap the loads for the next tile with the math on the current tile instead of running the steps serially.
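The load, multiply-accumulate, and store sequence can be sketched in plain Python as a tiled GEMM. This is only an emulation of the data flow, not a WMMA kernel: the 16x16x16 tile size and the function names (`load_fragment`, `mma`, `store_fragment`, `tiled_gemm`) are assumptions for illustration, and a real kernel spreads each fragment across the 32 threads of a warp.

```python
TILE = 16  # assumed fragment edge; real shapes depend on architecture and dtype

def load_fragment(matrix, row0, col0):
    """Copy a TILE x TILE block out of a row-major matrix (list of lists)."""
    return [row[col0:col0 + TILE] for row in matrix[row0:row0 + TILE]]

def mma(c_frag, a_frag, b_frag):
    """Return a_frag @ b_frag + c_frag for one tile (the D = A x B + C step)."""
    return [
        [c_frag[i][j] + sum(a_frag[i][k] * b_frag[k][j] for k in range(TILE))
         for j in range(TILE)]
        for i in range(TILE)
    ]

def store_fragment(out, frag, row0, col0):
    """Write an accumulator tile back into the output matrix."""
    for i in range(TILE):
        out[row0 + i][col0:col0 + TILE] = frag[i]

def tiled_gemm(a, b):
    m, k, n = len(a), len(b), len(b[0])
    assert m % TILE == 0 and k % TILE == 0 and n % TILE == 0
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, TILE):
        for j0 in range(0, n, TILE):
            c_frag = [[0.0] * TILE for _ in range(TILE)]   # accumulator tile
            for k0 in range(0, k, TILE):
                a_frag = load_fragment(a, i0, k0)
                b_frag = load_fragment(b, k0, j0)
                c_frag = mma(c_frag, a_frag, b_frag)
            store_fragment(out, c_frag, i0, j0)
    return out

# Small integer-valued inputs so the tiled and naive results match exactly.
size = 32
a = [[float((i + j) % 5) for j in range(size)] for i in range(size)]
b = [[float((i * j) % 3) for j in range(size)] for i in range(size)]
naive = [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
         for i in range(size)]
result = tiled_gemm(a, b)
print("tiled result matches naive:", result == naive)
```

The accumulator tile stays live across the whole inner `k0` loop and is stored once, which is the same reason WMMA kernels keep `c_frag` in registers while streaming A and B fragments past it.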
Real-World Applications
Deep Learning Training
Accelerate eligible forward and backward GEMM/convolution work in neural network training
Large Language Models
Enable high-throughput transformer training and inference
Computer Vision
Speed up convolution and matrix-heavy CNN operations
Scientific Computing
Accelerate dense linear algebra where reduced precision is acceptable
Recommendation Systems
Improve dense projection and MLP throughput around embedding-heavy models
Quantized Inference
Use INT8, INT4, or FP8 paths on supported architectures after calibration or quantization-aware training
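As a rough illustration of what calibration provides for INT8 inference, the sketch below uses symmetric per-tensor quantization: a scale is derived from observed values, inputs are rounded to integers, and the dot product accumulates in integer arithmetic before a single rescale. The scheme, the function names, and the sample values are assumptions; production toolchains use more elaborate calibration.

```python
def calibrate_scale(values):
    """Symmetric per-tensor scale: map the largest magnitude to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize(values, scale):
    """Round to the nearest integer step and clamp to the INT8 range."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def int8_dot(qa, qb, scale_a, scale_b):
    """Integer multiply-accumulate, then one rescale back to real units."""
    acc = sum(x * y for x, y in zip(qa, qb))   # exact integer accumulation
    return acc * scale_a * scale_b

# Hypothetical activation and weight samples.
activations = [0.12, -0.50, 0.33, 0.91, -0.07, 0.25, -0.64, 0.48]
weights = [0.80, 0.10, -0.45, 0.30, 0.95, -0.20, 0.05, -0.60]

sa = calibrate_scale(activations)
sw = calibrate_scale(weights)
qa = quantize(activations, sa)
qw = quantize(weights, sw)

exact = sum(x * y for x, y in zip(activations, weights))
approx = int8_dot(qa, qw, sa, sw)
print(f"FP reference: {exact:.5f}  INT8 estimate: {approx:.5f}")
```

The gap between the two numbers is the quantization error that accuracy validation needs to bound for a real model.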
Performance Characteristics
| Metric | Value | Notes |
|---|---|---|
| A100 FP16/BF16 Tensor Core | 312 TFLOPS | Dense theoretical peak; 624 TFLOPS with sparsity |
| A100 TF32 Tensor Core | 156 TFLOPS | Dense theoretical peak; 312 TFLOPS with sparsity |
| A100 INT8 Tensor Core | 624 TOPS | Integer inference peak; 1,248 TOPS with sparsity |
| Memory Bandwidth | 1.6-2.0 TB/s | A100 configuration dependent; memory-bound workloads may not reach compute peak |
| Observed Speedup | Workload dependent | Requires Tensor Core eligible ops, aligned dimensions, and enough arithmetic intensity |
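A back-of-the-envelope roofline estimate shows how the peak and bandwidth rows interact. The sketch below takes the 312 TFLOPS and 2.0 TB/s figures from the table and assumes each matrix is moved through memory exactly once, which is optimistic; the function names are illustrative.

```python
PEAK_FLOPS = 312e12       # A100 dense FP16/BF16 Tensor Core peak (from the table)
PEAK_BANDWIDTH = 2.0e12   # bytes/s, upper end of the table's range

def gemm_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C = A @ B, assuming each matrix moves once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

def attainable_tflops(intensity):
    """Roofline bound: the lower of the compute and memory ceilings."""
    return min(PEAK_FLOPS, intensity * PEAK_BANDWIDTH) / 1e12

ridge = PEAK_FLOPS / PEAK_BANDWIDTH   # FLOPs per byte needed to be compute-bound
print(f"ridge point: {ridge:.0f} FLOP/B")

shapes = {"4096^3 GEMM": (4096, 4096, 4096), "batch-1 matvec": (1, 4096, 4096)}
for name, shape in shapes.items():
    ai = gemm_intensity(*shape)
    print(f"{name}: {ai:.1f} FLOP/B -> at most {attainable_tflops(ai):.1f} TFLOPS")
```

Under these assumptions a large square GEMM sits well above the ridge point, while a batch-1 matrix-vector product is limited by bandwidth to a small fraction of the Tensor Core peak.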
Using Tensor Cores with PyTorch
    import torch
    from torch.amp import autocast, GradScaler

    # Keep model parameters in their normal dtype. AMP chooses eligible
    # lower-precision kernels without manually calling model.half().
    device = "cuda"
    model = model.to(device)
    scaler = GradScaler("cuda")

    for epoch in range(num_epochs):
        for batch in dataloader:
            inputs = batch["input"].to(device, non_blocking=True)
            targets = batch["target"].to(device, non_blocking=True)

            optimizer.zero_grad(set_to_none=True)

            with autocast("cuda", dtype=torch.float16):
                outputs = model(inputs)
                loss = criterion(outputs, targets)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

    # Explicit Tensor Core eligible matmul with CUDA tensors.
    # Multiples of 8 or 16 are common alignment targets, depending on
    # precision, architecture, and library kernel.
    N = 4096
    A = torch.randn((N, N), dtype=torch.float16, device=device)
    B = torch.randn((N, N), dtype=torch.float16, device=device)
    C = A @ B
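The reason the loop scales the loss before `backward()` can be shown without a GPU. The sketch below emulates FP16 rounding with `struct`; the gradient value is hypothetical, and the scale of 65536 is chosen to match GradScaler's default initial scale. It illustrates the idea only and leaves out the dynamic scale adjustment and inf/NaN checks that GradScaler performs.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

LOSS_SCALE = 65536.0   # power-of-two scale factor (illustrative)
true_grad = 1e-8       # hypothetical small gradient value

# Without scaling, the gradient is below FP16's smallest subnormal (~6e-8)
# and rounds to zero.
unscaled = to_fp16(true_grad)

# Scaling the loss scales every gradient, lifting it into FP16's range.
scaled = to_fp16(true_grad * LOSS_SCALE)

# The scale is divided back out in higher precision before the update.
recovered = scaled / LOSS_SCALE

print(f"unscaled FP16 gradient: {unscaled}")
print(f"recovered gradient:     {recovered:.3e}")
```

BF16 has the same exponent range as FP32, so this underflow does not arise in the same way, which is why the BF16 path usually runs without a scaler.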
Advantages & Limitations
Advantages
- ✓ Large throughput gains for eligible GEMM and convolution workloads
- ✓ Reduced activation memory with mixed precision training
- ✓ Automatic framework integration through PyTorch AMP and other libraries
- ✓ Maintains model accuracy when autocast and gradient scaling are used correctly
- ✓ Enables larger models or batches on the same hardware
- ✓ Strong power efficiency for supported matrix operations
Limitations
- × Requires supported NVIDIA architectures and eligible kernels
- × Best performance depends on dimension alignment and memory layout
- × Not all layers or workloads are matrix-bound
- × Quantized inference requires calibration and accuracy validation
- × Manual dtype casts can reduce stability or bypass AMP decisions
- × Peak hardware numbers are theoretical and workload dependent
Common Pitfalls to Avoid
Misaligned Matrix Dimensions
Matrix and channel sizes that do not align with library kernel requirements can reduce Tensor Core utilization
Manual Casting In Training
Calling model.half() or manually casting many tensors can bypass AMP's per-op precision choices
Assuming Peak Throughput
Theoretical TFLOPS or TOPS numbers require ideal shapes, precision modes, sparsity support, and compute-bound kernels
Skipping Tensor Core Profiling
A workload may run correctly with AMP while still falling back to non-Tensor Core kernels
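For the alignment pitfall, a small helper can flag which dimensions miss a target multiple and how much padding would cost. This is a sketch: the function names are illustrative, the multiple of 8 assumes FP16 or BF16, and the sizes are hypothetical apart from 50257, which is the GPT-2 vocabulary size.

```python
def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple

def padding_report(dims: dict, multiple: int = 8) -> dict:
    """Map each misaligned dimension to (original, padded, waste fraction)."""
    report = {}
    for name, size in dims.items():
        padded = round_up(size, multiple)
        if padded != size:
            report[name] = (size, padded, (padded - size) / padded)
    return report

# Hypothetical transformer-style sizes.
dims = {"vocab": 50257, "hidden": 768, "ffn": 3072, "batch_tokens": 1000}
for name, (size, padded, waste) in padding_report(dims).items():
    print(f"{name}: {size} -> {padded} ({waste:.2%} padding)")
```

Whether padding actually helps depends on the kernels the library selects, so a report like this is a starting point for profiling rather than a guarantee.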
Best Practices
- Use Automatic Mixed Precision: Start with torch.amp.autocast and GradScaler for training instead of hand-converting the model to FP16
- Align Dimensions Deliberately: Use multiples of 8 or 16 where the target precision and library kernels benefit, and pad only when measurements support it
- Profile Tensor Core Utilization: Use Nsight Compute, PyTorch profiler, or vendor library logs to verify Tensor Core kernels and memory bottlenecks
- Choose Precision By Workload: Use FP16, BF16, TF32, FP8, INT8, or INT4 based on architecture support, accuracy targets, and deployment constraints
- Validate Numerics: Track loss scaling, overflow behavior, and model accuracy whenever changing precision or quantization settings
- Prefer Library Kernels: Use cuBLAS, cuDNN, CUTLASS-backed libraries, and framework integrations before writing custom WMMA kernels
Further Reading
- NVIDIA A100 Tensor Core GPU specifications
- NVIDIA H100 Tensor Core GPU specifications
- CUDA C++ Programming Guide: WMMA
- Automatic Mixed Precision in PyTorch
- Mixed Precision Training Paper
When to enable Tensor Cores in PyTorch (and when to stay in FP32)
On paper, Tensor Cores offer several times the matrix-multiply throughput of CUDA cores (on A100, 312 TFLOPS dense FP16 versus 19.5 TFLOPS FP32, a 16× gap in peak numbers), but only on the dtypes and shapes they were designed for, and measured gains are usually smaller than the peak ratio. The decision is not whether to "turn them on", since they activate automatically when the math matches; it is whether to commit your training or inference path to a dtype the hardware accelerates.
Enable mixed precision (use Tensor Cores) when:
- You train or fine-tune any model with > 100M parameters on Volta or newer (V100, A100, H100, RTX 20-series and later): wrap the step in torch.amp.autocast('cuda', dtype=torch.bfloat16) (or torch.float16 with GradScaler). BF16 Tensor Core support starts with Ampere, so on Volta and Turing use torch.float16 with GradScaler.
- Your matmul shapes have dimensions divisible by 8 (FP16) or 16 (INT8). Recent cuBLAS and cuDNN versions can still use Tensor Cores on other shapes, but misaligned dimensions typically run less efficiently and shrink the speedup.
- You can use BF16: it has the dynamic range of FP32 and skips the loss-scaling dance FP16 needs. On A100/H100, default to BF16.
- You are deploying inference with TensorRT, vLLM, or torch.compile: these tools select Tensor Core paths automatically; you just need the model in FP16/BF16/INT8.
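The range-versus-precision trade-off between FP16 and BF16 in the list above can be checked with a small emulation. The sketch assumes BF16 is the top 16 bits of the FP32 encoding with round-to-nearest-even, and ignores NaN handling; `to_bf16` and `to_fp16` are illustrative helpers, not library functions.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to FP16; struct raises OverflowError beyond the FP16 range."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 bits being dropped (NaN not handled).
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

big = 70000.0   # above FP16's largest finite value (65504)
try:
    fp16_big = to_fp16(big)
except OverflowError:
    fp16_big = float("inf")   # treat the packing error as an FP16 overflow

fine = 1.001    # needs more mantissa bits than BF16 has near 1.0

print("70000 -> FP16:", fp16_big, " BF16:", to_bf16(big))
print("1.001 -> FP16:", to_fp16(fine), " BF16:", to_bf16(fine))
```

BF16 keeps the large value finite but loses the small increment near 1.0, while FP16 does the opposite, which is the trade that makes BF16 convenient for training without loss scaling.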
Stay in FP32 (do not chase Tensor Cores) when:
- The model is small enough that fixed kernel-launch overhead dominates — at < 10M parameters, FP32 may be just as fast in practice.
- You have unresolved numerical issues — FP16 overflows in attention logits or layer norms can mask real bugs. Get the FP32 path stable first, then promote.
- Your hardware is pre-Volta (Pascal P100, Maxwell, Kepler) — there are no Tensor Cores; FP16 just costs you precision without buying speed.
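- You need full FP32 accuracy or bitwise-reproducible results across runs. Tensor Core matmuls usually accumulate in FP32, but TF32 rounds FP32 inputs to a 10-bit mantissa, and library kernels may use a different reduction order or algorithm depending on shape, version, and configuration, so results can differ in the low bits. Set torch.backends.cuda.matmul.allow_tf32 = False to keep FP32 matmuls at full precision, and consider torch.use_deterministic_algorithms(True) if reproducibility outranks speed.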
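Low-bit differences between runs or kernels usually trace back to floating-point addition not being associative: summing the same terms in a different order can round differently. The sketch below emulates FP32 accumulation with `struct`; the helper names and the three-term example are illustrative.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def fp32_sum(values):
    """Sum left to right, rounding to FP32 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp32(acc + to_fp32(v))
    return acc

order_a = fp32_sum([1.0e8, 1.0, -1.0e8])   # (1e8 + 1) - 1e8
order_b = fp32_sum([1.0e8, -1.0e8, 1.0])   # (1e8 - 1e8) + 1

print("order A:", order_a)
print("order B:", order_b)
```

Parallel reductions on a GPU choose their own summation order, so two kernels that are both correct can return results that differ in the last bits.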
The pragmatic default on Ampere or Hopper: bf16 with torch.amp.autocast. It gives most of the speed-up, none of the loss-scaling drama, and works across every layer PyTorch wraps for AMP.
Automatic Mixed Precision, end to end
In practice you rarely touch WMMA directly — AMP wires these precision choices into your training loop.
