CUDA Context vs Streams vs MPS: Process Isolation, Concurrency, and Multi-Tenancy

CUDA contexts, streams, and MPS solve three different problems at three different layers. A CUDA context is the per-device container of GPU state: VRAM allocations, loaded kernel modules, page tables, and the stream table itself. A CUDA stream is an in-order queue of operations that runs inside a context. MPS (Multi-Process Service) is a userspace daemon that lets multiple host processes share one GPU concurrently by fronting a single shared server context for them; on Volta and newer GPUs, each client's address space is isolated in hardware.

The context answers "which device am I talking to, and what state does it hold for me?" The stream answers "what's the order and concurrency of my work within a process?" MPS answers "how do multiple processes share one GPU without serializing?"

If you have a single GPU and never call cudaSetDevice, you're using a context whether you realize it or not. Streams are the layer you almost always have to think about explicitly. MPS only enters the picture when multiple processes (training jobs, inference workers, notebooks) need to coexist on one device without time-slicing.

At a glance: which one for which problem?

Problem	Layer to touch	Why
GPU is idle between kernel launches	Streams	Overlap copies with kernels, run independent kernels concurrently
Adding a second GPU to the same process	Context	One context per device per process
Multiple processes hammering one GPU (job scheduler, inference fleet, notebooks)	MPS	Single shared context, processes co-execute instead of time-slicing
Unexpected serialization in Nsight Systems	Streams	Stray default-stream call somewhere
`cudaErrorInvalidDevicePointer`	Context	Pointer from one device used under another's context
Kernel-launch latency in a small-batch service	MPS	Kernels no longer wait for another process's time slice; the launch call itself is not made cheaper

The hierarchy

A host process owns one CUDA driver instance. Each device gets its own context (one primary context per device under the runtime API; multiple contexts under the driver API). Each context owns its own table of streams. Streams own the actual queues of operations.

That hierarchy never changes — what changes is how many of each layer you actually create.
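The hierarchy above can be sketched with the runtime API. This is a minimal illustration, assuming at least one CUDA device is visible; the helper name `build_hierarchy` and the stream count are hypothetical choices for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical helper: creates `streams_per_device` streams on every visible
// device and returns the total number of streams created.
int build_hierarchy(int streams_per_device) {
    int device_count = 0;
    CUDA_CHECK(cudaGetDeviceCount(&device_count));

    int total = 0;
    for (int dev = 0; dev < device_count; ++dev) {
        // Selecting a device makes its primary context current for this thread.
        CUDA_CHECK(cudaSetDevice(dev));

        std::vector<cudaStream_t> streams(streams_per_device);
        for (cudaStream_t& s : streams) {
            CUDA_CHECK(cudaStreamCreate(&s));  // lives in device `dev`'s context
            ++total;
        }
        std::printf("device %d: one primary context, %d streams\n",
                    dev, streams_per_device);

        for (cudaStream_t s : streams) CUDA_CHECK(cudaStreamDestroy(s));
    }
    return total;
}

int main() {
    int total = build_hierarchy(4);
    std::printf("created %d streams in total\n", total);
    return 0;
}
```

Compiled with `nvcc`, the program prints one line per device. The number of contexts tracks the number of devices, while the number of streams is whatever the program asks for.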

Side-by-side

Aspect	CUDA Context	CUDA Stream
What it is	Per-device container of GPU state	In-order queue of operations
Scope	Bound to one device	Bound to one context (and therefore one device)
Owns	VRAM allocations, loaded kernels, configuration, the stream table	Pending kernel launches, memcpys, events
Lifetime	As long as the process holds it (primary context: until app exit)	Created and destroyed on demand
API to create	Implicit on first call; explicit via `cudaSetDevice` (runtime) or `cuCtxCreate` (driver)	`cudaStreamCreate(&stream)`
How many you typically have	One per device per process	2–16 per context, depending on workload
Concurrency unit	No — context itself doesn't introduce concurrency	Yes — the unit of asynchrony in CUDA
Cost to create	Heavyweight (allocates device-side state; typically tens to hundreds of ms)	Cheap (microseconds)
Cost of an extra one	Significant (extra VRAM overhead, slower context switching)	Minimal (a small table entry plus per-op bookkeeping)
Cross-process sharing	Default: not shared; with MPS, one MPS server context is shared by many client processes	Never shared across processes
Implicit sync semantics	Without MPS, work from different contexts on one device is time-sliced rather than run concurrently	The legacy default stream synchronizes with every blocking stream in the context; streams created with `cudaStreamNonBlocking` are exempt
Profiler row in Nsight	Each context appears as a logical group	Each non-default stream gets its own row
What you change to overlap work	(nothing — wrong layer)	Move work onto non-default streams
What you change to use multiple GPUs	`cudaSetDevice(N)` (one context per device)	(nothing — streams don't reach across devices)

When to think about which

A useful split when you're trying to figure out which one you need:

You're trying to make the GPU run faster → streams. The context isn't the bottleneck; the lack of concurrent execution is.
You're adding a second GPU → context. Each device gets its own; nothing crosses without P2P or NCCL.
You're sharing a GPU across processes → context, via MPS. Streams stay process-local.
You're seeing an unexpected serialization in Nsight Systems → almost always a stray default-stream call. Stream-level issue.
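You're getting cudaErrorInvalidDevicePointer → context-level. You allocated memory under context A and tried to use it under context B. (The runtime has deprecated this error code since CUDA 10.1, so on newer toolkits the same mistake tends to surface as cudaErrorInvalidValue or an illegal-address fault.)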
You're seeing high memory usage that doesn't match your allocations → potentially context-level. Per-thread default-stream mode plus many host threads can blow up the per-context state.
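For the first case in the list, the usual stream-level fix is to split the work into chunks and give each chunk its own stream, so one chunk's copy can overlap another chunk's kernel. The sketch below assumes a single device and uses pinned host memory, which asynchronous copies need in order to overlap; `scale` and `run_chunked` are hypothetical names.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: doubles n floats using `num_streams` streams and
// returns the number of elements that came back with the wrong value.
int run_chunked(int n, int num_streams) {
    float* h = nullptr;
    float* d = nullptr;
    CUDA_CHECK(cudaMallocHost(&h, n * sizeof(float)));  // pinned host memory
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    std::vector<cudaStream_t> streams(num_streams);
    for (cudaStream_t& s : streams) CUDA_CHECK(cudaStreamCreate(&s));

    const int chunk = (n + num_streams - 1) / num_streams;
    for (int s = 0; s < num_streams; ++s) {
        const int offset = s * chunk;
        const int count = std::min(chunk, n - offset);
        if (count <= 0) break;
        const size_t bytes = count * sizeof(float);
        // Copy in, compute, copy out: ordered within one stream,
        // free to overlap with the other streams' chunks.
        CUDA_CHECK(cudaMemcpyAsync(d + offset, h + offset, bytes,
                                   cudaMemcpyHostToDevice, streams[s]));
        scale<<<(count + 255) / 256, 256, 0, streams[s]>>>(d + offset, count, 2.0f);
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaMemcpyAsync(h + offset, d + offset, bytes,
                                   cudaMemcpyDeviceToHost, streams[s]));
    }
    CUDA_CHECK(cudaDeviceSynchronize());

    int wrong = 0;
    for (int i = 0; i < n; ++i) wrong += (h[i] != 2.0f);

    for (cudaStream_t s : streams) CUDA_CHECK(cudaStreamDestroy(s));
    CUDA_CHECK(cudaFree(d));
    CUDA_CHECK(cudaFreeHost(h));
    return wrong;
}

int main() {
    std::printf("mismatches: %d\n", run_chunked(1 << 20, 4));
    return 0;
}
```

Whether the copies and kernels actually overlap depends on the hardware and chunk sizes; a timeline in Nsight Systems is the way to confirm it.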

The questions people ask in the same breath

Is the default stream the same as the context? No. The context is the container; the default stream is one of (potentially many) streams inside that container. Every context has a default stream — that's how API calls without an explicit stream still work.
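A small sketch of that relationship, assuming one device: a launch with no stream argument goes to the default stream, a launch with an explicit stream goes to that stream, and both write into memory owned by the same context. The names `set_value` and `default_and_explicit_stream` are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

__global__ void set_value(int* out, int value) { *out = value; }

// Hypothetical helper: writes 1 via the default stream and 2 via an explicit
// stream into the same allocation, then returns the sum of the two slots.
int default_and_explicit_stream() {
    int* d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, 2 * sizeof(int)));

    cudaStream_t s;
    // Non-blocking: this stream does not implicitly sync with the legacy default stream.
    CUDA_CHECK(cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking));

    unsigned int flags = 0;
    CUDA_CHECK(cudaStreamGetFlags(s, &flags));
    std::printf("explicit stream is non-blocking: %s\n",
                (flags & cudaStreamNonBlocking) ? "yes" : "no");

    set_value<<<1, 1>>>(d, 1);            // no stream argument: default stream
    CUDA_CHECK(cudaGetLastError());
    set_value<<<1, 1, 0, s>>>(d + 1, 2);  // explicit stream, same context
    CUDA_CHECK(cudaGetLastError());

    // Waits for all streams on the current device, default and explicit alike.
    CUDA_CHECK(cudaDeviceSynchronize());

    int h[2] = {0, 0};
    CUDA_CHECK(cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaStreamDestroy(s));
    CUDA_CHECK(cudaFree(d));
    return h[0] + h[1];
}

int main() {
    std::printf("sum of both slots: %d\n", default_and_explicit_stream());
    return 0;
}
```

The two kernels touch different slots on purpose, so the result does not depend on which stream runs first.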

Does each stream get its own GPU memory? No. All streams within a context share the same VRAM allocations. The context owns the memory; streams just queue operations that touch it. This is why you can't allocate on stream A and free on stream B without thinking about ordering: both streams touch the same VRAM, so a free that runs before the last use is a real race.
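One way to make that ordering explicit is the stream-ordered allocator (`cudaMallocAsync` and `cudaFreeAsync`, available from CUDA 11.2) combined with an event. The sketch below is an illustration under those assumptions, not the only valid pattern; `fill` and `alloc_on_a_free_on_b` are hypothetical names, and the helper returns -1 when the device reports no memory-pool support.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

__global__ void fill(int* data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Hypothetical helper: allocates and uses a buffer on stream A, frees it on
// stream B, and returns the number of wrong elements (or -1 if unsupported).
int alloc_on_a_free_on_b(int n) {
    int device = 0, supported = 0;
    CUDA_CHECK(cudaGetDevice(&device));
    CUDA_CHECK(cudaDeviceGetAttribute(&supported, cudaDevAttrMemoryPoolsSupported, device));
    if (!supported) return -1;

    cudaStream_t a, b;
    cudaEvent_t last_use;
    CUDA_CHECK(cudaStreamCreate(&a));
    CUDA_CHECK(cudaStreamCreate(&b));
    CUDA_CHECK(cudaEventCreateWithFlags(&last_use, cudaEventDisableTiming));

    int* h = nullptr;
    int* d = nullptr;
    CUDA_CHECK(cudaMallocHost(&h, n * sizeof(int)));
    CUDA_CHECK(cudaMallocAsync((void**)&d, n * sizeof(int), a));

    fill<<<(n + 255) / 256, 256, 0, a>>>(d, n, 7);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpyAsync(h, d, n * sizeof(int), cudaMemcpyDeviceToHost, a));

    // Mark the last use on A, and make B wait for it before freeing.
    CUDA_CHECK(cudaEventRecord(last_use, a));
    CUDA_CHECK(cudaStreamWaitEvent(b, last_use, 0));
    CUDA_CHECK(cudaFreeAsync(d, b));

    CUDA_CHECK(cudaStreamSynchronize(a));
    CUDA_CHECK(cudaStreamSynchronize(b));

    int wrong = 0;
    for (int i = 0; i < n; ++i) wrong += (h[i] != 7);

    CUDA_CHECK(cudaFreeHost(h));
    CUDA_CHECK(cudaEventDestroy(last_use));
    CUDA_CHECK(cudaStreamDestroy(a));
    CUDA_CHECK(cudaStreamDestroy(b));
    return wrong;
}

int main() {
    std::printf("result: %d\n", alloc_on_a_free_on_b(1 << 16));
    return 0;
}
```

Without the event, nothing orders the free on B after the copy on A, which is exactly the race the paragraph above warns about.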

Can two contexts share a stream? No. A stream belongs to exactly one context. Contexts are isolation boundaries, even when two of them sit on the same device, and CUDA's execution model doesn't let a stream span them.

If I have two GPUs, do I get two streams or two contexts? Two contexts (one per device), and however many streams you create inside each context. The two devices' streams are completely independent — even with the same name they're different queues running on different hardware.

Is cudaStreamCreate allowed before cudaSetDevice? Yes, but the stream lives in whichever context is current at the time of creation; with no prior cudaSetDevice call, that is device 0's. If you call cudaSetDevice(1) later, the previously created stream is still bound to device 0's context. Set the device first, then create the streams you want on it.
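The binding can be observed on a machine with two GPUs. The CUDA programming guide states that a kernel launch fails when it is issued to a stream that is not associated with the current device; the sketch below tries exactly that. It is an illustration only: `noop` and `launch_on_foreign_stream` are hypothetical names, the specific error code is not assumed, and the helper returns -1 when fewer than two devices are visible.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

__global__ void noop() {}

// Hypothetical helper: returns 1 if the launch on a foreign stream was
// rejected, 0 if it was accepted, -1 if the machine has fewer than two GPUs.
int launch_on_foreign_stream() {
    int device_count = 0;
    CUDA_CHECK(cudaGetDeviceCount(&device_count));
    if (device_count < 2) return -1;

    CUDA_CHECK(cudaSetDevice(0));
    cudaStream_t s0;
    CUDA_CHECK(cudaStreamCreate(&s0));  // bound to device 0's context

    CUDA_CHECK(cudaSetDevice(1));       // device 1 is now current
    noop<<<1, 1, 0, s0>>>();            // stream still belongs to device 0
    cudaError_t err = cudaGetLastError();
    std::printf("launch on device 0's stream while device 1 is current: %s\n",
                cudaGetErrorString(err));

    // Clean up under the device that owns the stream.
    CUDA_CHECK(cudaSetDevice(0));
    CUDA_CHECK(cudaStreamDestroy(s0));
    return err != cudaSuccess ? 1 : 0;
}

int main() {
    std::printf("result: %d\n", launch_on_foreign_stream());
    return 0;
}
```

Memory copies behave differently from kernel launches here, so this sketch only speaks to launches.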

Why is the context "primary" in some docs? The runtime API (the one most people use, cudaMalloc, cudaMemcpy, etc.) lazily creates exactly one primary context per device per process. The driver API (cuCtxCreate, cuCtxPushCurrent, etc.) lets you create multiple non-primary contexts per device, but it's rarely useful and most modern code sticks with primary contexts.
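The lazy creation can be observed by mixing in one driver-API query. The sketch below assumes device 0 exists and that nothing else in the process has touched CUDA yet; `primary_context_active` and `runtime_call_activates_primary_context` are hypothetical names, and `cudaFree(nullptr)` is used only as a common idiom for forcing runtime initialization.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda.h>          // driver API
#include <cuda_runtime.h>  // runtime API

#define DRV_CHECK(call)                                                   \
    do {                                                                  \
        CUresult res_ = (call);                                           \
        if (res_ != CUDA_SUCCESS) {                                       \
            const char* msg_ = nullptr;                                   \
            cuGetErrorString(res_, &msg_);                                \
            std::fprintf(stderr, "driver error %s at %s:%d\n",            \
                         msg_ ? msg_ : "unknown", __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Asks the driver whether the primary context of device `ordinal` is active.
bool primary_context_active(int ordinal) {
    CUdevice dev;
    DRV_CHECK(cuDeviceGet(&dev, ordinal));
    unsigned int flags = 0;
    int active = 0;
    DRV_CHECK(cuDevicePrimaryCtxGetState(dev, &flags, &active));
    return active != 0;
}

// Hypothetical helper: reports the primary-context state before and after the
// first runtime call, and returns the state observed afterwards.
bool runtime_call_activates_primary_context() {
    DRV_CHECK(cuInit(0));  // initializes the driver API only, creates no context
    bool before = primary_context_active(0);

    // First runtime call that needs a context; the pointer itself is ignored.
    cudaError_t err = cudaFree(nullptr);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "runtime error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
    bool after = primary_context_active(0);

    std::printf("primary context active before: %d, after: %d\n", before, after);
    return after;
}

int main() {
    runtime_call_activates_primary_context();
    return 0;
}
```

Because it calls the driver API directly, this one needs the driver library at link time, for example `nvcc example.cu -lcuda`.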

A common confusion: "I added streams but I still have only one context"

That's correct. Adding streams doesn't add contexts. They sit at different layers — adding more streams gives you more concurrency within the same per-device state, which is almost always what you want.

The opposite mistake is rarer but worse: creating multiple non-primary contexts on a single device thinking it'll improve concurrency. It won't. Each context is a heavyweight isolation boundary, and without MPS the GPU time-slices between contexts on the same device rather than running their kernels side by side. Stay on the primary context per device, and scale concurrency via streams.

Where MPS fits in

MPS sits one level above the context — it's not a fourth thing inside the context/stream hierarchy, it's a server that fronts the context for multiple client processes.

Without MPS, each host process owns its own primary context per device. When two processes both target one GPU, the driver serializes their work by time-slicing the GPU — only one process executes kernels at a time, with context-switch overhead between them.

With MPS:

A single MPS server process owns one CUDA context per GPU.
Client processes connect to the server via named pipes and Unix domain sockets. On pre-Volta GPUs the server submits their work through its shared context; on Volta and newer, connected clients submit work directly to the GPU.
All clients run concurrently on the GPU. Their kernels execute simultaneously (subject to SM availability) rather than in alternating time slices.
On Volta and newer GPUs each client still has its own GPU address space, enforced in hardware; on older GPUs, clients share one address space.
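Client code does not change under MPS: the same binary runs with or without the daemon, which NVIDIA's documentation starts with `nvidia-cuda-mps-control -d` and stops with `echo quit | nvidia-cuda-mps-control`. The sketch below is a hypothetical worker meant to be launched as several processes at once, so the wall-clock times with and without the daemon can be compared. The kernel `busy`, the helper `run_worker`, and the launch sizes are illustrative assumptions, and `getpid` assumes a POSIX host.

```cuda
#include <cstdio>
#include <cstdlib>
#include <unistd.h>  // getpid (POSIX)
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Burns some arithmetic so each launch takes measurable time.
__global__ void busy(float* out, int iters) {
    float x = static_cast<float>(threadIdx.x);
    for (int i = 0; i < iters; ++i) x = x * 1.0001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

// Hypothetical worker body: issues `launches` small kernels and returns the
// elapsed GPU time in milliseconds as measured by CUDA events.
float run_worker(int launches) {
    // Deliberately small grid: leaves most SMs free for other clients.
    const int blocks = 4, threads = 128;
    float* d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, blocks * threads * sizeof(float)));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    CUDA_CHECK(cudaEventRecord(start));
    for (int i = 0; i < launches; ++i) {
        busy<<<blocks, threads>>>(d, 100000);
    }
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(d));
    return ms;
}

int main() {
    float ms = run_worker(200);
    std::printf("pid %d: %.2f ms\n", static_cast<int>(getpid()), ms);
    return 0;
}
```

Starting, say, four copies in parallel from a shell gives a rough comparison; any speedup depends on the GPU, the driver, and how much of the device each worker leaves idle.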

When you'd reach for MPS:

Multiple small-batch inference replicas on one GPU (Triton, Ray Serve, k8s GPU sharing).
A notebook fleet where you don't want each user to monopolize a GPU.
Many short-lived processes that would otherwise each hold a separate GPU context and force context switches between them.

When you wouldn't:

A single training job already saturating the GPU. MPS adds no value when one process is compute-bound.
Multi-GPU jobs where each process pins to its own device (no contention to share).
Mixed-precision or kernel-tuning where deterministic isolation per process matters more than throughput.

TL;DR

Context = per-device container of GPU state; you almost always have exactly one per device per process.
Stream = in-order queue inside a context; you usually want several to overlap copies with kernels.
MPS = a server process that fronts a single shared context for multiple client processes so they can co-execute on one GPU.
They're at different layers. Streams nest inside contexts. MPS sits in front of contexts, sharing one across clients.
For concurrency within a process, change the streams. For multiple GPUs, change the contexts. For multiple processes sharing one GPU, use MPS.

For the deeper dives on each: see CUDA contexts, CUDA streams, and CUDA MPS.
