Tensor Cores Explained: Mixed Precision & Matrix Acceleration

Summary
NVIDIA Tensor Cores explained: architecture-, precision-, and workload-dependent matrix acceleration for AI training and inference on CUDA GPUs.

Overview

Tensor Cores are specialized processing units found in modern NVIDIA GPUs that accelerate matrix multiplication and convolution operations, the fundamental building blocks of deep learning. Introduced with the Volta architecture, Tensor Cores can deliver large speedups for eligible AI workloads by executing hardware-accelerated matrix multiply-accumulate operations at lower precision while preserving higher precision where it matters. The actual gain depends on architecture, precision mode, tensor shapes, memory behavior, and kernel implementation. Their reduced-precision paths are central to modern quantization workflows that shrink model footprints while preserving accuracy.

Unlike traditional CUDA cores that process scalar and vector operations, Tensor Cores operate on matrix fragments cooperatively across a warp, making them useful for the dense matrix computations required in neural network training and inference.

Key Concepts

Mixed Precision Computing

Uses lower precision formats such as FP16, BF16, TF32, INT8, or FP8 for eligible math while retaining FP32 accumulation, parameters, or optimizer state where needed for accuracy

Matrix Multiply-Accumulate

Accelerates D = A x B + C on warp-level matrix fragments. Public fragment shapes vary by architecture, precision, and API rather than being one fixed hardware tile size

Warp-Level Operations

Coordinates a warp of 32 threads, with each thread contributing part of the fragment load, multiply-accumulate, and store sequence

Automatic Mixed Precision

Framework support chooses eligible Tensor Core paths while keeping numerically sensitive operations in safer precision
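To see why the accumulator precision matters in D = A x B + C, here is a minimal pure-Python sketch. It is an emulation, not how the hardware is programmed: `to_fp16`, `to_fp32`, and `mma` are hypothetical helpers that round values with the standard `struct` module, and the 1 x 4096 by 4096 x 1 shape is chosen only to make the effect visible.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value (ties to even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mma(a, b, c, round_acc):
    """Emulate D = A x B + C with FP16 inputs; round_acc sets accumulator precision."""
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = round_acc(c[i][j])
            for p in range(inner):
                product = to_fp16(a[i][p]) * to_fp16(b[p][j])
                acc = round_acc(acc + product)  # round after every accumulate step
            d[i][j] = acc
    return d

K = 4096
A = [[1.0] * K]                 # 1 x K matrix of ones
B = [[1.0] for _ in range(K)]   # K x 1 matrix of ones
C = [[0.0]]

d_fp32_acc = mma(A, B, C, to_fp32)[0][0]
d_fp16_acc = mma(A, B, C, to_fp16)[0][0]
print(d_fp32_acc)  # 4096.0
print(d_fp16_acc)  # 2048.0: FP16 cannot represent 2049, so the running sum stalls
```

With identical FP16 inputs, only the FP32 accumulator returns the exact dot product. This is the reason mixed precision keeps accumulation in higher precision.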

Precision and alignment

Two factors decide whether Tensor Cores are used and how fast they run: the precision you compute in, and whether your matrix shapes line up with the tile sizes the kernels expect.
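A quick way to reason about alignment is to compute what padding would cost before deciding to pad. The sketch below uses hypothetical helpers (`round_up`, `padding_report`) and a made-up GEMM shape. The right multiple depends on precision, architecture, and library kernel, so treat 8 as an assumption.

```python
def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

def padding_report(dims, multiple):
    """Return (padded dims, fraction of extra multiply-accumulate work from padding)."""
    padded = tuple(round_up(d, multiple) for d in dims)
    original_work = 1
    padded_work = 1
    for d, p in zip(dims, padded):
        original_work *= d
        padded_work *= p
    return padded, padded_work / original_work - 1.0

# Hypothetical GEMM (M, N, K) with an unaligned N, such as a 50257-entry vocabulary.
dims = (4096, 50257, 768)
padded, overhead = padding_report(dims, multiple=8)
print(padded)                                   # (4096, 50264, 768)
print(f"extra work from padding: {overhead:.4%}")
```

Here the padding adds well under 0.1% extra arithmetic, so the decision comes down to whether the aligned kernel is measurably faster on the target GPU.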

How It Works

1. Fragment Loading

Load matrix fragments from shared or global memory into WMMA fragments.

    wmma::load_matrix_sync(a_frag, tile_ptr, stride);

2. Matrix Multiply-Accumulate

Perform a warp-level multiply-accumulate on the loaded fragments.

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

3. Result Storage

Store the accumulator fragment back to shared or global memory. The store call also takes the memory layout of the destination.

    wmma::store_matrix_sync(out_ptr, c_frag, stride, wmma::mem_row_major);

4. Precision Management

Use framework AMP to choose eligible lower-precision kernels, with GradScaler managing gradient scaling during training.

    with autocast("cuda", dtype=torch.float16):
        output = model(input)

Inside one Tensor Core operation

At the CUDA level, one WMMA operation consists of the load, multiply-accumulate, and store steps, executed cooperatively by a warp. Real kernels overlap fragment loads with computation instead of running the steps serially.
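The load, multiply-accumulate, store sequence can be mimicked on the CPU to show the control flow a tiled kernel follows. This is a sketch only: `TILE = 4` and the helper names are illustrative assumptions, real fragment shapes depend on architecture and dtype, and nothing here overlaps loads with compute the way a real kernel would.

```python
TILE = 4  # illustrative tile size, not a real WMMA fragment shape

def load_fragment(matrix, row0, col0):
    """Copy a TILE x TILE block out of a larger matrix (fragment loading)."""
    return [row[col0:col0 + TILE] for row in matrix[row0:row0 + TILE]]

def mma_fragment(acc, a_frag, b_frag):
    """acc += a_frag @ b_frag on one tile (matrix multiply-accumulate)."""
    for i in range(TILE):
        for j in range(TILE):
            acc[i][j] += sum(a_frag[i][p] * b_frag[p][j] for p in range(TILE))

def store_fragment(out, acc, row0, col0):
    """Write the accumulator tile back to the output matrix (result storage)."""
    for i in range(TILE):
        out[row0 + i][col0:col0 + TILE] = acc[i]

def tiled_matmul(a, b):
    m, k, n = len(a), len(b), len(b[0])
    assert m % TILE == 0 and k % TILE == 0 and n % TILE == 0, "dims must align to TILE"
    out = [[0] * n for _ in range(m)]
    for row0 in range(0, m, TILE):
        for col0 in range(0, n, TILE):
            acc = [[0] * TILE for _ in range(TILE)]   # accumulator stays resident
            for k0 in range(0, k, TILE):              # walk the shared K dimension
                a_frag = load_fragment(a, row0, k0)
                b_frag = load_fragment(b, k0, col0)
                mma_fragment(acc, a_frag, b_frag)
            store_fragment(out, acc, row0, col0)
    return out

a = [[i + j for j in range(8)] for i in range(8)]
b = [[i * j for j in range(8)] for i in range(8)]
reference = [[sum(a[i][p] * b[p][j] for p in range(8)) for j in range(8)]
             for i in range(8)]
result = tiled_matmul(a, b)
print(result == reference)
```

The accumulator tile is loaded once per output tile and reused across the whole K loop. That reuse is what gives tiled kernels their high ratio of arithmetic to memory traffic.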

Real-World Applications

Deep Learning Training

Accelerate eligible forward and backward GEMM/convolution work in neural network training

BERT training: large gains when AMP, shapes, and kernels align

Large Language Models

Enable high-throughput transformer training and inference

Transformer attention and MLP blocks rely heavily on Tensor Core eligible GEMMs

Computer Vision

Speed up convolution and matrix-heavy CNN operations

ResNet-style models often benefit from AMP plus channels_last layout

Scientific Computing

Accelerate dense linear algebra where reduced precision is acceptable

Iterative solvers can use Tensor Cores for selected matrix kernels

Recommendation Systems

Improve dense projection and MLP throughput around embedding-heavy models

DLRM-style ranking models benefit when dense layers dominate runtime

Quantized Inference

Use INT8, INT4, or FP8 paths on supported architectures after calibration or quantization-aware training

Production inference can trade precision for throughput after accuracy validation

Performance Characteristics

Metric | Value | Notes
A100 FP16/BF16 Tensor Core | 312 TFLOPS | Dense theoretical peak; 624 TFLOPS with sparsity
A100 TF32 Tensor Core | 156 TFLOPS | Dense theoretical peak; 312 TFLOPS with sparsity
A100 INT8 Tensor Core | 624 TOPS | Integer inference peak; 1,248 TOPS with sparsity
Memory Bandwidth | 1.6-2.0 TB/s | A100 configuration dependent; memory-bound workloads may not reach compute peak
Observed Speedup | Workload dependent | Requires Tensor Core eligible ops, aligned dimensions, and enough arithmetic intensity
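"Enough arithmetic intensity" can be made concrete with a simple roofline estimate. The sketch below assumes the A100 figures from the table (312 TFLOPS dense FP16, and 2.0 TB/s as the upper bandwidth figure) and an idealized traffic model in which each matrix is read or written exactly once. The function names are hypothetical.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C = A @ B, counting one read of A and B and one write of C."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

def attainable_tflops(intensity, peak_tflops, bandwidth_tb_s):
    """Roofline bound: min(compute peak, intensity x memory bandwidth)."""
    return min(peak_tflops, intensity * bandwidth_tb_s)

PEAK_TFLOPS = 312.0   # assumed: A100 dense FP16/BF16 peak
BANDWIDTH_TB_S = 2.0  # assumed: upper A100 bandwidth figure

ridge = PEAK_TFLOPS / BANDWIDTH_TB_S  # FLOP/byte needed to become compute-bound
print(f"ridge point: {ridge:.0f} FLOP/byte")

square = gemm_arithmetic_intensity(4096, 4096, 4096)
skinny = gemm_arithmetic_intensity(1, 4096, 4096)   # batch-1, GEMV-like shape

square_bound = attainable_tflops(square, PEAK_TFLOPS, BANDWIDTH_TB_S)
skinny_bound = attainable_tflops(skinny, PEAK_TFLOPS, BANDWIDTH_TB_S)
print(f"4096^3 GEMM: {square:.1f} FLOP/byte, bound {square_bound:.1f} TFLOPS")
print(f"1x4096x4096: {skinny:.3f} FLOP/byte, bound {skinny_bound:.3f} TFLOPS")
```

Under these assumptions the large square GEMM is compute-bound, while the batch-1 shape is limited to about 2 TFLOPS by memory bandwidth regardless of Tensor Core peak.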

Using Tensor Cores with PyTorch

    import torch
    from torch.amp import autocast, GradScaler

    # Assumes model, optimizer, criterion, dataloader, and num_epochs
    # are already defined.

    # Keep model parameters in their normal dtype. AMP chooses eligible
    # lower-precision kernels without manually calling model.half().
    device = "cuda"
    model = model.to(device)
    scaler = GradScaler("cuda")

    for epoch in range(num_epochs):
        for batch in dataloader:
            inputs = batch["input"].to(device, non_blocking=True)
            targets = batch["target"].to(device, non_blocking=True)

            optimizer.zero_grad(set_to_none=True)

            with autocast("cuda", dtype=torch.float16):
                outputs = model(inputs)
                loss = criterion(outputs, targets)

            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

    # Explicit Tensor Core eligible matmul with CUDA tensors.
    # Multiples of 8 or 16 are common alignment targets, depending on
    # precision, architecture, and library kernel.
    N = 4096
    A = torch.randn((N, N), dtype=torch.float16, device=device)
    B = torch.randn((N, N), dtype=torch.float16, device=device)
    C = A @ B
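The reason for the gradient scaler can be shown without a GPU. In this sketch, `to_fp16` is a hypothetical helper that rounds through the standard `struct` module, the gradient value `1e-8` is made up, and the scale of 2^16 matches GradScaler's default initial scale.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8          # hypothetical small gradient value
scale = 2.0 ** 16    # GradScaler's default initial scale

unscaled = to_fp16(grad)         # below FP16's smallest subnormal: flushes to 0.0
scaled = to_fp16(grad * scale)   # scaled value lands in FP16's normal range
recovered = scaled / scale       # unscale in higher precision before the update

print(unscaled)                  # 0.0
print(recovered)                 # close to 1e-8
```

Scaling the loss shifts small gradients into FP16's representable range for the backward pass, and unscaling afterwards restores their magnitude. Without it, this gradient would silently become zero.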

Advantages & Limitations

Advantages

  • Large throughput gains for eligible GEMM and convolution workloads
  • Reduced activation memory with mixed precision training
  • Automatic framework integration through PyTorch AMP and other libraries
  • Maintains model accuracy when autocast and gradient scaling are used correctly
  • Enables larger models or batches on the same hardware
  • Strong power efficiency for supported matrix operations

Limitations

  • Requires supported NVIDIA architectures and eligible kernels
  • Best performance depends on dimension alignment and memory layout
  • Not all layers or workloads are matrix-bound
  • Quantized inference requires calibration and accuracy validation
  • Manual dtype casts can reduce stability or bypass AMP decisions
  • Peak hardware numbers are theoretical and workload dependent

Common Pitfalls to Avoid

Misaligned Matrix Dimensions

Matrix and channel sizes that do not align with library kernel requirements can reduce Tensor Core utilization

Solution: Use profiler evidence, pad dimensions where beneficial, and prefer layouts such as channels_last for supported CNN workloads

Manual Casting In Training

Calling model.half() or manually casting many tensors can bypass AMP's per-op precision choices

Solution: Use torch.amp.autocast and GradScaler for training unless profiling and validation justify a manual dtype policy

Assuming Peak Throughput

Theoretical TFLOPS or TOPS numbers require ideal shapes, precision modes, sparsity support, and compute-bound kernels

Solution: Report measured throughput for the actual model and separate dense from sparsity-enabled peak claims

Skipping Tensor Core Profiling

A workload may run correctly with AMP while still falling back to non-Tensor Core kernels

Solution: Use Nsight Compute or framework profiler output to verify Tensor Core utilization and identify bottlenecks
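When reporting measured throughput, it helps to convert a timing into achieved TFLOPS and compare it against the dense peak rather than the sparsity-enabled one. A minimal sketch follows, with hypothetical helper names and a made-up 1.0 ms timing. The peak values are the A100 FP16 figures quoted earlier.

```python
def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of one M x N x K GEMM, counting 2 FLOPs per multiply-accumulate."""
    return 2 * m * n * k / seconds / 1e12

def fraction_of_peak(achieved: float, peak_tflops: float) -> float:
    """Achieved throughput as a fraction of a stated theoretical peak."""
    return achieved / peak_tflops

DENSE_PEAK = 312.0    # A100 FP16/BF16 dense peak, TFLOPS
SPARSE_PEAK = 624.0   # same mode with structured sparsity, TFLOPS

# Hypothetical measurement: a 4096 x 4096 x 4096 FP16 GEMM timed at 1.0 ms.
measured = achieved_tflops(4096, 4096, 4096, 1.0e-3)
dense_fraction = fraction_of_peak(measured, DENSE_PEAK)
sparse_fraction = fraction_of_peak(measured, SPARSE_PEAK)

print(f"{measured:.1f} TFLOPS")
print(f"{dense_fraction:.1%} of dense peak, {sparse_fraction:.1%} of sparse peak")
```

Quoting the same measurement against the sparse peak halves the apparent efficiency, which is why the two peaks should never be mixed in one comparison.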

Best Practices

  • Use Automatic Mixed Precision: Start with torch.amp.autocast and GradScaler for training instead of hand-converting the model to FP16
  • Align Dimensions Deliberately: Use multiples of 8 or 16 where the target precision and library kernels benefit, and pad only when measurements support it
  • Profile Tensor Core Utilization: Use Nsight Compute, PyTorch profiler, or vendor library logs to verify Tensor Core kernels and memory bottlenecks
  • Choose Precision By Workload: Use FP16, BF16, TF32, FP8, INT8, or INT4 based on architecture support, accuracy targets, and deployment constraints
  • Validate Numerics: Track loss scaling, overflow behavior, and model accuracy whenever changing precision or quantization settings
  • Prefer Library Kernels: Use cuBLAS, cuDNN, CUTLASS-backed libraries, and framework integrations before writing custom WMMA kernels

Further Reading

When to enable Tensor Cores in PyTorch (and when to stay in FP32)

At peak, Tensor Cores can deliver roughly 5–16× the matrix-multiply throughput of CUDA cores, but only on the dtypes and shapes they were designed for, and measured gains are usually smaller than the theoretical ratio. The decision is not whether to "turn them on", since they activate automatically when the math matches. It is whether to commit your training or inference path to a dtype the hardware accelerates.

Enable mixed precision (use Tensor Cores) when:

  • You train or fine-tune any model with > 100M parameters on Volta or newer (V100, A100, H100, RTX 20-series and later): wrap the step in torch.amp.autocast('cuda', dtype=torch.float16) with GradScaler, or use dtype=torch.bfloat16 on Ampere or newer, where BF16 Tensor Core support begins.
  • Your matmul shapes have dimensions divisible by 8 (FP16) or 16 (INT8): recent cuBLAS and cuDNN releases can still use Tensor Cores for other shapes, but misaligned dimensions tend to run slower kernels and much of the speedup can disappear.
  • You can use BF16 — it has the dynamic range of FP32 and skips the loss-scaling dance FP16 needs. On A100/H100, default to BF16.
  • You are deploying inference with TensorRT, vLLM, or torch.compile — these tools select Tensor-Core paths automatically; you just need the model in FP16/BF16/INT8.
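The range-versus-precision trade between FP16 and BF16 can be checked with a small emulation. This sketch assumes hypothetical helpers built on the standard `struct` module: `to_bf16` rounds an FP32 bit pattern to its top 16 bits with round-to-nearest-even, and it does not handle NaN inputs.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to FP16, returning +/-inf where the value exceeds FP16's range."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Round an FP32 value to BF16 (top 16 bits of FP32, round to nearest even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

big = 70000.0    # above FP16's maximum finite value of 65504
fine = 1.001     # needs more mantissa bits than BF16 has

print(to_fp16(big), to_bf16(big))     # inf 70144.0
print(to_fp16(fine), to_bf16(fine))   # 1.0009765625 1.0
```

BF16 keeps the large value finite at the cost of coarser rounding, while FP16 keeps finer resolution near 1.0 but overflows. That is why FP16 training relies on loss scaling and BF16 usually does not.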

Stay in FP32 (do not chase Tensor Cores) when:

  • The model is small enough that fixed kernel-launch overhead dominates — at < 10M parameters, FP32 may be just as fast in practice.
  • You have unresolved numerical issues — FP16 overflows in attention logits or layer norms can mask real bugs. Get the FP32 path stable first, then promote.
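  • Your hardware is pre-Volta (Pascal P100, Maxwell, Kepler): there are no Tensor Cores, so FP16 costs you precision without buying Tensor Core speed.
  • You need results that closely match an FP32 reference or are bitwise reproducible: FP16, BF16, and TF32 inputs round differently from FP32 even when accumulation happens in FP32, and a different kernel selection or reduction order can change the low-order bits. Keep torch.backends.cuda.matmul.allow_tf32 = False for full FP32 matmul precision, and use torch.use_deterministic_algorithms(True) if run-to-run determinism outranks speed.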
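One reason low-order bits can differ between otherwise identical matmuls is that floating-point addition is not associative, so a different reduction order gives a different rounded sum. The sketch below emulates FP32 accumulation with the standard `struct` module. The helper names and the three-term input are illustrative assumptions.

```python
import struct

def to_fp32(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def fp32_sum(values):
    """Left-to-right sum with the accumulator rounded to FP32 after every add."""
    acc = 0.0
    for v in values:
        acc = to_fp32(acc + to_fp32(v))
    return acc

in_order = fp32_sum([1.0e8, 1.0, -1.0e8])    # (1e8 + 1) - 1e8
reordered = fp32_sum([1.0e8, -1.0e8, 1.0])   # (1e8 - 1e8) + 1

print(in_order)    # 0.0: the 1.0 is lost when added to 1e8 in FP32
print(reordered)   # 1.0
```

Both orders are valid evaluations of the same mathematical sum. A kernel that splits or reorders its reduction differently can therefore return a different result without any bug being present.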

The pragmatic default on Ampere or Hopper: bf16 with torch.amp.autocast. It gives most of the speed-up, none of the loss-scaling drama, and works across every layer PyTorch wraps for AMP.

Automatic Mixed Precision, end to end

In practice you rarely touch WMMA directly — AMP wires these precision choices into your training loop.

GPU & High-Performance Computing
CUDA Context vs Streams vs MPS: Process Isolation, Concurrency, and Multi-Tenancy

How CUDA contexts, streams, and MPS compare: a context is a per-process container of GPU state, a stream is an in-order queue inside a context, and MPS lets multiple processes share a single GPU concurrently. Three layers, three different problems.

GPU & High-Performance Computing
CUDA Multi-Process Service (MPS): GPU Sharing for Concurrent Workloads

Complete guide to CUDA MPS — architecture, performance benchmarks vs time-slicing and MIG, thread percentage planning, production deployment with systemd and Kubernetes, profiling with nsys, and troubleshooting.

GPU & High-Performance Computing
CUDA Streams: Asynchronous Execution and Concurrency

A CUDA stream is a queue of GPU operations that execute in order. Understanding streams is the difference between a GPU at 30% utilization and one running flat out — they are how kernels and memory copies overlap on real hardware.

GPU & High-Performance Computing
GPU Memory Hierarchy & Optimization

Master GPU memory hierarchy from registers to global memory, understand coalescing patterns, bank conflicts, and optimization strategies for maximum performance

GPU & High-Performance Computing
NVIDIA Device Files in /dev/

Understanding character devices, major/minor numbers, and the device file hierarchy created by NVIDIA drivers for GPU access in Linux.

GPU & High-Performance Computing
GPU Streaming Multiprocessor (SM)

Deep dive into the fundamental processing unit of modern GPUs - the Streaming Multiprocessor architecture, execution model, and memory hierarchy
