Overview
Tensor Cores are specialized processing units found in modern NVIDIA GPUs that accelerate matrix multiplication and convolution operations, the fundamental building blocks of deep learning. Introduced with the Volta architecture, Tensor Cores can deliver large speedups for eligible AI workloads by executing hardware-accelerated matrix multiply-accumulate operations at lower precision while preserving higher precision where it matters. The actual gain depends on architecture, precision mode, tensor shapes, memory behavior, and kernel implementation. Their reduced-precision paths are central to modern quantization workflows that shrink model footprints while preserving accuracy.
Unlike traditional CUDA cores that process scalar and vector operations, Tensor Cores operate on matrix fragments cooperatively across a warp, making them useful for the dense matrix computations required in neural network training and inference.
Key Concepts
Mixed Precision Computing
Uses lower-precision formats such as FP16, BF16, TF32, FP8, or INT8 for eligible math while retaining higher-precision accumulation (typically FP32 for floating-point inputs and INT32 for INT8), parameters, or optimizer state where needed for accuracy
Matrix Multiply-Accumulate
Accelerates D = A × B + C on warp-level matrix fragments. Public fragment shapes vary by architecture, precision, and API rather than being one fixed hardware tile size.
Warp-Level Operations
Coordinates a warp of 32 threads, with each thread contributing part of the fragment load, multiply-accumulate, and store sequence
Automatic Mixed Precision
Framework support chooses eligible Tensor Core paths while keeping numerically sensitive operations in safer precision
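The value of keeping a higher-precision accumulator can be illustrated on the CPU. The following is a minimal sketch that emulates rounding with Python's `struct` module; it does not run on Tensor Cores, and the helper names (`to_fp16`, `to_fp32`, `dot`) are hypothetical, chosen only for this illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(a, b, accumulate):
    """Dot product with FP16 inputs; `accumulate` rounds the running sum."""
    total = 0.0
    for x, y in zip(a, b):
        product = to_fp16(x) * to_fp16(y)  # exact in Python's 64-bit floats
        total = accumulate(total + product)
    return total


ones = [1.0] * 4096
fp32_result = dot(ones, ones, to_fp32)  # FP16 inputs, FP32 accumulator
fp16_result = dot(ones, ones, to_fp16)  # FP16 inputs, FP16 accumulator
print(fp32_result, fp16_result)
```

With an FP16 accumulator the running sum stalls at 2048, because adjacent FP16 values are 2 apart at that magnitude and adding 1 rounds back down. The FP32 accumulator reaches the exact answer of 4096.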
How It Works
Fragment Loading
Load matrix fragments from shared or global memory into WMMA fragments
    wmma::load_matrix_sync(a_frag, tile_ptr, stride);
Matrix Multiply-Accumulate
Perform a warp-level multiply-accumulate on the loaded fragments
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
Result Storage
Store the accumulator fragment back to shared or global memory
    wmma::store_matrix_sync(out_ptr, c_frag, stride, wmma::mem_row_major);
Precision Management
Use framework AMP to choose eligible lower-precision kernels and manage gradient scaling
    with autocast("cuda", dtype=torch.float16):
        output = model(input)
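The load, multiply-accumulate, and store sequence above can be sketched as a tiled matrix multiply in plain Python. This is a CPU emulation for illustration only: the tile edge `TILE = 4` and the function names are assumptions made here, and real WMMA fragment shapes depend on precision and architecture.

```python
TILE = 4  # illustrative tile edge, not a real WMMA fragment shape


def load_fragment(matrix, row, col, tile=TILE):
    """Copy a tile x tile block starting at (row, col), in the role of load_matrix_sync."""
    return [matrix[row + i][col:col + tile] for i in range(tile)]


def mma(c_frag, a_frag, b_frag):
    """Return a_frag @ b_frag + c_frag, in the role of mma_sync."""
    tile = len(c_frag)
    return [
        [c_frag[i][j] + sum(a_frag[i][k] * b_frag[k][j] for k in range(tile))
         for j in range(tile)]
        for i in range(tile)
    ]


def store_fragment(out, row, col, frag):
    """Write a fragment into the output matrix, in the role of store_matrix_sync."""
    for i, frag_row in enumerate(frag):
        out[row + i][col:col + len(frag_row)] = frag_row


def tiled_matmul(a, b, tile=TILE):
    """Multiply square matrices whose size is a multiple of `tile`."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for row in range(0, n, tile):
        for col in range(0, n, tile):
            acc = [[0] * tile for _ in range(tile)]  # accumulator fragment
            for k in range(0, n, tile):
                a_frag = load_fragment(a, row, k, tile)
                b_frag = load_fragment(b, k, col, tile)
                acc = mma(acc, a_frag, b_frag)
            store_fragment(out, row, col, acc)
    return out


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[(i * j) % 5 for j in range(n)] for i in range(n)]
reference = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
print(tiled_matmul(a, b) == reference)
```

Each output tile keeps its accumulator across the whole inner loop and is stored once at the end. A WMMA kernel typically follows the same pattern to avoid repeated round trips to memory.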
Real-World Applications
Deep Learning Training
Accelerate eligible forward and backward GEMM/convolution work in neural network training
Large Language Models
Enable high-throughput transformer training and inference
Computer Vision
Speed up convolution and matrix-heavy CNN operations
Scientific Computing
Accelerate dense linear algebra where reduced precision is acceptable
Recommendation Systems
Improve dense projection and MLP throughput around embedding-heavy models
Quantized Inference
Use INT8, INT4, or FP8 paths on supported architectures after calibration or quantization-aware training
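To make the calibration step for quantized inference concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It assumes the simplest possible calibration (the scale comes from the maximum magnitude of the tensor itself), whereas production toolchains usually calibrate activations on representative data. All function names and sample values are hypothetical.

```python
def calibrate_scale(values):
    """Symmetric per-tensor scale that maps the largest magnitude to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Round to the nearest integer step and clamp to the INT8 range."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dot(weights, activations):
    """Quantize both operands, accumulate in integers, then dequantize."""
    w_scale = calibrate_scale(weights)
    x_scale = calibrate_scale(activations)
    w_q = quantize(weights, w_scale)
    x_q = quantize(activations, x_scale)
    acc = sum(w * x for w, x in zip(w_q, x_q))  # integer accumulation
    return acc * w_scale * x_scale


weights = [0.12, -0.5, 0.33, 0.9]       # made-up sample values
activations = [1.5, -2.0, 0.25, 0.75]   # made-up sample values
exact = sum(w * x for w, x in zip(weights, activations))
approx = int8_dot(weights, activations)
print(f"exact={exact:.4f} int8={approx:.4f} abs_error={abs(exact - approx):.4f}")
```

The gap between `exact` and `approx` is the quantization error. It should be checked against the accuracy target on real data before deployment.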
Performance Characteristics
| Metric | Value | Notes |
|---|---|---|
| A100 FP16/BF16 Tensor Core | 312 TFLOPS | Dense theoretical peak; 624 TFLOPS with sparsity |
| A100 TF32 Tensor Core | 156 TFLOPS | Dense theoretical peak; 312 TFLOPS with sparsity |
| A100 INT8 Tensor Core | 624 TOPS | Integer inference peak; 1,248 TOPS with sparsity |
| Memory Bandwidth | 1.6-2.0 TB/s | A100 configuration dependent; memory-bound workloads may not reach compute peak |
| Observed Speedup | Workload dependent | Requires Tensor Core eligible ops, aligned dimensions, and enough arithmetic intensity |
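A rough roofline estimate shows why the peak figures in the table are not reached by every shape. The sketch below uses the 312 TFLOPS FP16 peak and the lower 1.6 TB/s bandwidth figure from the table. It assumes each matrix crosses memory exactly once, which is optimistic, and the function names are hypothetical.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C = A @ B, assuming each matrix moves through memory once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Arithmetic intensity above which a kernel can become compute-bound."""
    return peak_flops / bandwidth_bytes_per_s


PEAK_FP16 = 312e12  # dense FP16 Tensor Core peak from the table, FLOP/s
BANDWIDTH = 1.6e12  # lower end of the bandwidth range from the table, bytes/s

ridge = ridge_point(PEAK_FP16, BANDWIDTH)
for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:
    intensity = gemm_arithmetic_intensity(*shape)
    bound = "compute" if intensity >= ridge else "memory"
    print(shape, f"{intensity:.1f} FLOP/byte", f"{bound}-bound")
```

Under these assumptions the large square GEMM sits well above the ridge point of about 195 FLOP/byte. The single-row case, typical of batch-size-1 inference, lands near 1 FLOP/byte and is limited by memory bandwidth regardless of Tensor Core throughput.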
Using Tensor Cores with PyTorch
    import torch
    from torch.amp import autocast, GradScaler

    # model, optimizer, criterion, dataloader, and num_epochs are assumed to be
    # defined elsewhere.
    # Keep model parameters in their normal dtype. AMP chooses eligible
    # lower-precision kernels without manually calling model.half().
    device = "cuda"
    model = model.to(device)
    scaler = GradScaler("cuda")

    for epoch in range(num_epochs):
        for batch in dataloader:
            inputs = batch["input"].to(device, non_blocking=True)
            targets = batch["target"].to(device, non_blocking=True)

            optimizer.zero_grad(set_to_none=True)

            with autocast("cuda", dtype=torch.float16):
                outputs = model(inputs)
                loss = criterion(outputs, targets)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

    # Explicit Tensor Core eligible matmul with CUDA tensors.
    # Multiples of 8 or 16 are common alignment targets, depending on
    # precision, architecture, and library kernel.
    N = 4096
    A = torch.randn((N, N), dtype=torch.float16, device=device)
    B = torch.randn((N, N), dtype=torch.float16, device=device)
    C = A @ B
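The reason the training loop scales the loss before `backward()` can be shown without a GPU. This minimal sketch emulates FP16 rounding with Python's `struct` module. The gradient value is made up for illustration, and the scale matches `GradScaler`'s default initial scale of 2**16.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


grad = 1e-8      # hypothetical small gradient value
scale = 65536.0  # 2**16, GradScaler's default initial scale

unscaled_fp16 = to_fp16(grad)        # below the smallest FP16 subnormal
scaled_fp16 = to_fp16(grad * scale)  # lands in the normal FP16 range
recovered = scaled_fp16 / scale      # unscale in higher precision

print(unscaled_fp16, scaled_fp16, recovered)
```

Stored directly in FP16, the gradient underflows to zero. Scaled first and unscaled afterward in higher precision, it survives with a small relative error. `GradScaler` additionally lowers the scale when it detects overflow, which this sketch does not model.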
Advantages & Limitations
Advantages
- ✓ Large throughput gains for eligible GEMM and convolution workloads
- ✓ Reduced activation memory with mixed precision training
- ✓ Automatic framework integration through PyTorch AMP and other libraries
- ✓ Maintains model accuracy when autocast and gradient scaling are used correctly
- ✓ Enables larger models or batches on the same hardware
- ✓ Strong power efficiency for supported matrix operations
Limitations
- × Requires supported NVIDIA architectures and eligible kernels
- × Best performance depends on dimension alignment and memory layout
- × Not all layers or workloads are matrix-bound
- × Quantized inference requires calibration and accuracy validation
- × Manual dtype casts can reduce stability or bypass AMP decisions
- × Peak hardware numbers are theoretical and workload dependent
Common Pitfalls to Avoid
Misaligned Matrix Dimensions
Matrix and channel sizes that do not align with library kernel requirements can reduce Tensor Core utilization
Manual Casting In Training
Calling model.half() or manually casting many tensors can bypass AMP's per-op precision choices
Assuming Peak Throughput
Theoretical TFLOPS or TOPS numbers require ideal shapes, precision modes, sparsity support, and compute-bound kernels
Skipping Tensor Core Profiling
A workload may run correctly with AMP while still falling back to non-Tensor Core kernels
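For the misaligned-dimensions pitfall, a small helper can show how much padding an alignment target would cost before committing to it. This is a minimal sketch: the helper names are hypothetical and the vocabulary size is only an example value, not a figure from this article.

```python
def round_up(value: int, multiple: int = 8) -> int:
    """Return the smallest multiple of `multiple` that is >= `value`."""
    if value < 0 or multiple <= 0:
        raise ValueError("value must be non-negative and multiple positive")
    return -(-value // multiple) * multiple  # ceiling division, then scale back


def padding_overhead(value: int, multiple: int = 8) -> float:
    """Fraction of extra elements introduced by padding a positive `value`."""
    return (round_up(value, multiple) - value) / value


vocab_size = 50257  # example dimension that is not a multiple of 8
for multiple in (8, 64):
    padded = round_up(vocab_size, multiple)
    print(multiple, padded, f"{padding_overhead(vocab_size, multiple):.4%}")
```

The overhead is well under one percent in both cases here. Whether the padded shape actually selects a faster kernel still has to be confirmed by profiling.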
Best Practices
- Use Automatic Mixed Precision: Start with torch.amp.autocast and GradScaler for training instead of hand-converting the model to FP16
- Align Dimensions Deliberately: Use multiples of 8 or 16 where the target precision and library kernels benefit, and pad only when measurements support it
- Profile Tensor Core Utilization: Use Nsight Compute, PyTorch profiler, or vendor library logs to verify Tensor Core kernels and memory bottlenecks
- Choose Precision By Workload: Use FP16, BF16, TF32, FP8, INT8, or INT4 based on architecture support, accuracy targets, and deployment constraints
- Validate Numerics: Track loss scaling, overflow behavior, and model accuracy whenever changing precision or quantization settings
- Prefer Library Kernels: Use cuBLAS, cuDNN, CUTLASS-backed libraries, and framework integrations before writing custom WMMA kernels
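When choosing between FP16 and BF16, the trade-off between range and precision can be checked with a short CPU emulation. The sketch below rounds values with Python's `struct` module. The helper names are hypothetical, and `to_bf16` handles finite inputs only.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to FP16; values beyond the FP16 range become infinity."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except (OverflowError, struct.error):
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Round a finite value to BF16 by keeping the top 16 bits of its FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (70000.0, 1.001):
    print(value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

FP16 overflows on 70000 but keeps more digits near 1, while BF16 covers the larger range at coarser precision. That difference in range is one reason BF16 training often runs without loss scaling, though accuracy should still be validated per workload.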
