Tensor Cores: Accelerating Deep Learning

NVIDIA Tensor Cores explained: architecture-, precision-, and workload-dependent matrix acceleration for AI training and inference on CUDA GPUs.

Overview

Tensor Cores are specialized processing units found in modern NVIDIA GPUs that accelerate matrix multiplication and convolution operations, the fundamental building blocks of deep learning. Introduced with the Volta architecture, Tensor Cores can deliver large speedups for eligible AI workloads by executing hardware-accelerated matrix multiply-accumulate operations at lower precision while preserving higher precision where it matters. The actual gain depends on architecture, precision mode, tensor shapes, memory behavior, and kernel implementation. Their reduced-precision paths are central to modern quantization workflows that shrink model footprints while preserving accuracy.

Unlike traditional CUDA cores that process scalar and vector operations, Tensor Cores operate on matrix fragments cooperatively across a warp, making them useful for the dense matrix computations required in neural network training and inference.

Key Concepts

Mixed Precision Computing

Uses lower precision formats such as FP16, BF16, TF32, INT8, or FP8 for eligible math while retaining FP32 accumulation, parameters, or optimizer state where needed for accuracy

Matrix Multiply-Accumulate

Accelerates D = A × B + C on warp-level matrix fragments. Public fragment shapes vary by architecture, precision, and API rather than being one fixed hardware tile size

Warp-Level Operations

Coordinates a warp of 32 threads, with each thread contributing part of the fragment load, multiply-accumulate, and store sequence

Automatic Mixed Precision

Framework support chooses eligible Tensor Core paths while keeping numerically sensitive operations in safer precision
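
The case for keeping accumulation in FP32 can be seen without a GPU. The sketch below is a CPU-only emulation, not Tensor Core code: it rounds values to half and single precision with Python's `struct` module and compares a dot product whose running sum is rounded to FP16 against one rounded to FP32. The helper names (`to_fp16`, `to_fp32`, `dot_product`) and the input values are illustrative assumptions.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_product(a, b, round_accumulator):
    """Multiply FP16-rounded inputs, rounding the running sum after every add."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(x) * to_fp16(y)  # exact in float64 for FP16 inputs
        acc = round_accumulator(acc + product)
    return acc


n = 10_000
a = [0.01] * n
b = [0.01] * n

# Reference: the same FP16-rounded inputs, summed in float64.
reference = n * to_fp16(0.01) ** 2

fp16_acc = dot_product(a, b, to_fp16)  # low-precision accumulator
fp32_acc = dot_product(a, b, to_fp32)  # mixed precision: FP16 inputs, FP32 sum

print(f"reference={reference:.6f}  fp16_acc={fp16_acc:.6f}  fp32_acc={fp32_acc:.6f}")
```

Once the FP16 running sum grows large enough that each new product is smaller than half the spacing between representable values, further additions are rounded away. The FP32 accumulator keeps tracking the reference.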

How It Works

1. Fragment Loading

Load matrix fragments from shared or global memory into WMMA fragments

wmma::load_matrix_sync(a_frag, tile_ptr, stride);

2. Matrix Multiply-Accumulate

Perform a warp-level multiply-accumulate on the loaded fragments

wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

3. Result Storage

Store the accumulator fragment back to shared or global memory, specifying the memory layout of the destination

wmma::store_matrix_sync(out_ptr, c_frag, stride, wmma::mem_row_major);

4. Precision Management

Use framework AMP to choose eligible lower-precision kernels and manage gradient scaling

with autocast("cuda", dtype=torch.float16):
    output = model(input)
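
The load, multiply-accumulate, and store sequence above can be mimicked in plain Python to show the data flow. This is a CPU-only sketch, not a WMMA kernel: `TILE = 4` is an arbitrary illustrative size, and `load_tile`, `mma_tile`, and `store_tile` are hypothetical helpers that only loosely correspond to the `wmma` calls.

```python
TILE = 4  # illustrative tile edge; real fragment shapes depend on architecture and precision


def load_tile(matrix, row, col):
    """Copy a TILE x TILE block out of a row-major matrix (the load step)."""
    return [r[col:col + TILE] for r in matrix[row:row + TILE]]


def mma_tile(a, b, c):
    """Return a x b + c for TILE x TILE blocks (the multiply-accumulate step)."""
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE)) for j in range(TILE)]
        for i in range(TILE)
    ]


def store_tile(matrix, row, col, tile):
    """Write a TILE x TILE block back into a row-major matrix (the store step)."""
    for i in range(TILE):
        matrix[row + i][col:col + TILE] = tile[i]


def tiled_matmul(A, B):
    """Compute A x B one output tile at a time, accumulating over the K dimension."""
    m, k, n = len(A), len(B), len(B[0])
    assert m % TILE == 0 and k % TILE == 0 and n % TILE == 0, "pad to a TILE multiple"
    D = [[0.0] * n for _ in range(m)]
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            acc = [[0.0] * TILE for _ in range(TILE)]  # accumulator fragment
            for p in range(0, k, TILE):
                acc = mma_tile(load_tile(A, i, p), load_tile(B, p, j), acc)
            store_tile(D, i, j, acc)
    return D


A = [[float(i + j) for j in range(8)] for i in range(8)]
B = [[float((i * j) % 5) for j in range(8)] for i in range(8)]
D = tiled_matmul(A, B)

# Untiled reference for comparison.
naive = [[sum(A[i][k] * B[k][j] for k in range(8)) for j in range(8)] for i in range(8)]
print("tiled result matches naive:", D == naive)
```

On a GPU, the work done here by one Python loop iteration is carried out cooperatively by the 32 threads of a warp, and the accumulator stays in registers until the final store.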

Real-World Applications

Deep Learning Training

Accelerate eligible forward and backward GEMM/convolution work in neural network training

BERT training: large gains when AMP, shapes, and kernels align

Large Language Models

Enable high-throughput transformer training and inference

Transformer attention and MLP blocks rely heavily on Tensor Core eligible GEMMs

Computer Vision

Speed up convolution and matrix-heavy CNN operations

ResNet-style models often benefit from AMP plus channels_last layout

Scientific Computing

Accelerate dense linear algebra where reduced precision is acceptable

Iterative solvers can use Tensor Cores for selected matrix kernels

Recommendation Systems

Improve dense projection and MLP throughput around embedding-heavy models

DLRM-style ranking models benefit when dense layers dominate runtime

Quantized Inference

Use INT8, INT4, or FP8 paths on supported architectures after calibration or quantization-aware training

Production inference can trade precision for throughput after accuracy validation
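
For the quantized inference case, the calibration step can be illustrated with a minimal symmetric INT8 scheme. This sketch assumes per-tensor absolute-maximum calibration, which is only one of several calibration methods. The function names and sample weights are made up for illustration and are not taken from any specific library.

```python
def calibrate_scale(values, qmax=127):
    """Pick a per-tensor scale so the largest magnitude maps to qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round to the nearest integer step and clamp to the symmetric INT8 range."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Map integer codes back to approximate real values."""
    return [q * scale for q in quantized]


weights = [0.42, -1.30, 0.07, 0.95, -0.61, 1.27]  # illustrative values only
scale = calibrate_scale(weights)
q_weights = quantize(weights, scale)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", q_weights)
print(f"scale={scale:.6f}  max round-trip error={max_error:.6f}")
```

The round-trip error is bounded by half a quantization step, which is why accuracy validation on the real model, rather than on individual tensors, decides whether a given precision is acceptable.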

Performance Characteristics

Metric | Value | Notes
A100 FP16/BF16 Tensor Core | 312 TFLOPS | Dense theoretical peak; 624 TFLOPS with sparsity
A100 TF32 Tensor Core | 156 TFLOPS | Dense theoretical peak; 312 TFLOPS with sparsity
A100 INT8 Tensor Core | 624 TOPS | Integer inference peak; 1,248 TOPS with sparsity
Memory Bandwidth | 1.6-2.0 TB/s | A100 configuration dependent; memory-bound workloads may not reach compute peak
Observed Speedup | Workload dependent | Requires Tensor Core eligible ops, aligned dimensions, and enough arithmetic intensity
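
The "enough arithmetic intensity" condition can be made concrete with a simple roofline estimate. The sketch below takes the dense FP16 peak and the low end of the bandwidth range from the table. It assumes an idealized square GEMM in which each of the three matrices crosses the memory bus exactly once, which ignores caching, tiling, and kernel overheads. Treat the output as a rough upper bound, not a prediction.

```python
PEAK_FLOPS = 312e12  # A100 dense FP16/BF16 Tensor Core peak, from the table
BANDWIDTH = 1.6e12   # bytes per second, low end of the table's range


def gemm_intensity(n, bytes_per_element=2):
    """FLOPs per byte for an n x n x n GEMM, assuming A, B, and C each move once."""
    flops = 2 * n ** 3                       # one multiply and one add per inner step
    bytes_moved = 3 * n * n * bytes_per_element
    return flops / bytes_moved


def attainable_flops(intensity):
    """Roofline bound: limited by either compute peak or memory traffic."""
    return min(PEAK_FLOPS, intensity * BANDWIDTH)


# Intensity above which the compute peak, not bandwidth, is the limit.
ridge_point = PEAK_FLOPS / BANDWIDTH

for n in (256, 1024, 4096):
    intensity = gemm_intensity(n)
    bound = attainable_flops(intensity)
    regime = "compute-bound" if intensity >= ridge_point else "memory-bound"
    print(f"N={n:5d}  intensity={intensity:8.1f} FLOP/B  "
          f"bound={bound / 1e12:6.1f} TFLOPS  ({regime})")
```

Under these assumptions, small matrices sit below the ridge point and cannot reach the Tensor Core peak regardless of kernel quality.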

Using Tensor Cores with PyTorch

import torch
from torch.amp import autocast, GradScaler

# model, optimizer, criterion, dataloader, and num_epochs are assumed
# to be defined elsewhere.

# Keep model parameters in their normal dtype. AMP chooses eligible
# lower-precision kernels without manually calling model.half().
device = "cuda"
model = model.to(device)
scaler = GradScaler("cuda")

for epoch in range(num_epochs):
    for batch in dataloader:
        inputs = batch["input"].to(device, non_blocking=True)
        targets = batch["target"].to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)

        with autocast("cuda", dtype=torch.float16):
            outputs = model(inputs)
            loss = criterion(outputs, targets)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

# Explicit Tensor Core eligible matmul with CUDA tensors.
# Multiples of 8 or 16 are common alignment targets, depending on
# precision, architecture, and library kernel.
N = 4096
A = torch.randn((N, N), dtype=torch.float16, device=device)
B = torch.randn((N, N), dtype=torch.float16, device=device)
C = A @ B
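
Why the training loop scales the loss before calling backward can be shown with a small CPU-only emulation. The sketch rounds a hypothetical tiny gradient to FP16 with and without scaling. The scale of 65536 (2**16) is chosen for illustration, and `to_fp16` is a helper defined here, not a PyTorch function.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


LOSS_SCALE = 65536.0  # 2**16, illustrative scale factor
tiny_grad = 1e-8      # hypothetical gradient below the FP16 representable range

# Without scaling, the gradient rounds to zero when stored in FP16.
unscaled = to_fp16(tiny_grad)

# With scaling, the value lands in FP16's normal range and survives.
scaled = to_fp16(tiny_grad * LOSS_SCALE)

# Unscaling happens in higher precision before the optimizer step.
recovered = scaled / LOSS_SCALE

print(f"stored without scaling: {unscaled}")
print(f"stored with scaling:    {scaled}")
print(f"recovered gradient:     {recovered:.3e}")
```

Scaling the loss multiplies every gradient by the same factor, so dividing it back out before the optimizer step leaves the update direction unchanged while avoiding underflow.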

Advantages & Limitations

Advantages

  • Large throughput gains for eligible GEMM and convolution workloads
  • Reduced activation memory with mixed precision training
  • Automatic framework integration through PyTorch AMP and other libraries
  • Typically preserves model accuracy when autocast and gradient scaling are used correctly
  • Enables larger models or batches on the same hardware
  • Strong power efficiency for supported matrix operations

Limitations

  • Requires supported NVIDIA architectures and eligible kernels
  • Best performance depends on dimension alignment and memory layout
  • Not all layers or workloads are matrix-bound
  • Quantized inference requires calibration and accuracy validation
  • Manual dtype casts can reduce stability or bypass AMP decisions
  • Peak hardware numbers are theoretical and workload dependent

Common Pitfalls to Avoid

Misaligned Matrix Dimensions

Matrix and channel sizes that do not align with library kernel requirements can reduce Tensor Core utilization

Solution: Use profiler evidence, pad dimensions where beneficial, and prefer layouts such as channels_last for supported CNN workloads

Manual Casting In Training

Calling model.half() or manually casting many tensors can bypass AMP's per-op precision choices

Solution: Use torch.amp.autocast and GradScaler for training unless profiling and validation justify a manual dtype policy

Assuming Peak Throughput

Theoretical TFLOPS or TOPS numbers require ideal shapes, precision modes, sparsity support, and compute-bound kernels

Solution: Report measured throughput for the actual model and separate dense from sparsity-enabled peak claims

Skipping Tensor Core Profiling

A workload may run correctly with AMP while still falling back to non-Tensor Core kernels

Solution: Use Nsight Compute or framework profiler output to verify Tensor Core utilization and identify bottlenecks
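
For the misaligned-dimension pitfall, a small helper can report how much padding a given alignment target would add before any model code is changed. The dimension names and sizes below are illustrative assumptions, and the multiples of 8 and 64 are example targets, not requirements of any particular kernel.

```python
def round_up(value: int, multiple: int) -> int:
    """Return the smallest multiple of `multiple` that is >= value."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((value + multiple - 1) // multiple) * multiple


def padding_report(dims: dict, multiple: int) -> dict:
    """Map each dimension name to (original, padded, extra elements)."""
    report = {}
    for name, size in dims.items():
        padded = round_up(size, multiple)
        report[name] = (size, padded, padded - size)
    return report


# Hypothetical layer sizes for illustration.
dims = {"vocab_size": 50257, "hidden_size": 768, "ffn_size": 3072}

for multiple in (8, 64):
    print(f"alignment target: multiple of {multiple}")
    for name, (size, padded, extra) in padding_report(dims, multiple).items():
        print(f"  {name:12s} {size:6d} -> {padded:6d}  (+{extra})")
```

Whether the padded shape is actually faster still needs to be confirmed with a profiler, since padding adds work as well as alignment.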

Best Practices

  • Use Automatic Mixed Precision: Start with torch.amp.autocast and GradScaler for training instead of hand-converting the model to FP16
  • Align Dimensions Deliberately: Use multiples of 8 or 16 where the target precision and library kernels benefit, and pad only when measurements support it
  • Profile Tensor Core Utilization: Use Nsight Compute, PyTorch profiler, or vendor library logs to verify Tensor Core kernels and memory bottlenecks
  • Choose Precision By Workload: Use FP16, BF16, TF32, FP8, INT8, or INT4 based on architecture support, accuracy targets, and deployment constraints
  • Validate Numerics: Track loss scaling, overflow behavior, and model accuracy whenever changing precision or quantization settings
  • Prefer Library Kernels: Use cuBLAS, cuDNN, CUTLASS-backed libraries, and framework integrations before writing custom WMMA kernels
