
Understanding CUDA Contexts

Summary
Explore the concept of CUDA contexts, their role in managing GPU resources, and how they enable parallel execution across multiple CPU threads.

What is a CUDA Context?

A CUDA context is a container for all the resources a host (CPU) process needs to interact with a specific GPU device. In effect, it is the GPU's state as seen by that process. Each context is associated with exactly one device and one host process, though a process can hold several contexts, whether for different devices or for the same device.

Think of a CUDA Context as a distributed data structure with:

  • A "control plane" on the CPU that manages and directs operations
  • A "data plane" on the GPU that stores the actual execution state

When you make CUDA API calls, the CPU-side component of the context interprets them and sends the appropriate commands to update or use the GPU-side components. This dual residence is why contexts matter: they keep host and device state synchronized so that the two sides work as one system.
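The split described above can be sketched as a toy model in plain Python. This is only an illustration of the control-plane/data-plane picture, not how the CUDA driver is implemented; `HostContext`, `DeviceState`, `mem_alloc`, and `mem_free` are hypothetical names invented for this sketch, and no GPU is involved.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceState:
    """Stand-in for the GPU-side "data plane" (hypothetical model)."""
    memory: dict = field(default_factory=dict)  # address -> size in bytes
    next_address: int = 0x1000


@dataclass
class HostContext:
    """Stand-in for the CPU-side "control plane" of one context."""
    device: DeviceState = field(default_factory=DeviceState)
    command_log: list = field(default_factory=list)

    def mem_alloc(self, size):
        # The host side interprets the call and records a command...
        self.command_log.append(("alloc", size))
        # ...then the device-side state is updated to match.
        address = self.device.next_address
        self.device.memory[address] = size
        self.device.next_address += size
        return address

    def mem_free(self, address):
        self.command_log.append(("free", address))
        del self.device.memory[address]


ctx = HostContext()
ptr = ctx.mem_alloc(256)
ctx.mem_free(ptr)
print(ctx.command_log)
```

Every call goes through the host-side object, which is the only thing that touches the device-side state. That mirrors the idea that the context keeps both sides consistent.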

Key Aspects:

  • Resource Management: A context manages GPU resources like memory allocations (device pointers), loaded modules (kernels), streams, and events specific to that context's associated device and process.
  • Isolation: Contexts provide isolation. Resources created within one context are generally not directly accessible from another context, even if they target the same physical device.
  • CPU Thread Association: A context belongs to a host process, but CUDA API calls act on whichever context is current for the calling CPU thread. CUDA tracks a current context per CPU thread. The runtime API manages it implicitly, while the driver API sets it explicitly through cuCtxSetCurrent or the per-thread context stack (cuCtxPushCurrent/cuCtxPopCurrent).
  • GPU State: It encapsulates the state of the GPU relevant to the host process, including loaded kernels, allocated memory, and configuration settings.
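The per-thread "current context" and the isolation between contexts can be illustrated with a small Python model. It is a sketch of the semantics described in the list above, not a binding to the CUDA driver: `push_current`, `pop_current`, and `get_current` only mimic the roles of cuCtxPushCurrent, cuCtxPopCurrent, and cuCtxGetCurrent, while `Context`, `alloc`, and `free` are hypothetical helpers. The `ValueError` is this model's own choice and says nothing about the error a real driver would report.

```python
import threading

_tls = threading.local()  # each CPU thread gets its own context stack


class Context:
    def __init__(self, name, device):
        self.name = name
        self.device = device
        self.allocations = set()  # handles owned by this context only


def _stack():
    if not hasattr(_tls, "stack"):
        _tls.stack = []
    return _tls.stack


def push_current(ctx):  # plays the role of cuCtxPushCurrent
    _stack().append(ctx)


def pop_current():  # plays the role of cuCtxPopCurrent
    return _stack().pop()


def get_current():  # plays the role of cuCtxGetCurrent
    stack = _stack()
    return stack[-1] if stack else None


def alloc(handle):
    get_current().allocations.add(handle)


def free(handle):
    ctx = get_current()
    if handle not in ctx.allocations:
        raise ValueError(f"{handle!r} does not belong to context {ctx.name}")
    ctx.allocations.remove(handle)


ctx_a = Context("A", device=0)
ctx_b = Context("B", device=0)  # same device, separate context

push_current(ctx_a)
alloc("buf")

push_current(ctx_b)  # B is now current on this thread
try:
    free("buf")  # "buf" was created in A, so B cannot see it
    isolated = False
except ValueError:
    isolated = True
pop_current()  # A is current again

seen_in_worker = []


def worker():
    seen_in_worker.append(get_current())  # new thread starts with an empty stack
    push_current(ctx_b)
    seen_in_worker.append(get_current().name)


t = threading.Thread(target=worker)
t.start()
t.join()

print(isolated, get_current().name, seen_in_worker)
```

The worker thread pushing context B does not disturb the main thread, whose current context remains A. That is the point of keeping the stack per thread rather than per process.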

The visualization below illustrates the relationship between CPU threads making API calls, the CUDA contexts they interact with (potentially pushed/popped onto a stack per thread), and the underlying GPU device resources managed by those contexts.
