CUDA contexts, streams, and MPS solve three different problems at three different layers. A CUDA context is the per-device container of GPU state: VRAM allocations, loaded kernel modules, page tables, and the stream table itself. A CUDA stream is an in-order queue of operations that runs inside a context. MPS (Multi-Process Service) is a userspace daemon that lets multiple host processes share one GPU concurrently by funneling their work through a single shared context, with per-client address-space isolation on Volta and newer GPUs.
The context answers "which device am I talking to, and what state does it hold for me?" The stream answers "what is the order and concurrency of my work within a process?" MPS answers "how do multiple processes share one GPU without serializing?"
If you have a single GPU and never call cudaSetDevice, you're using a context whether you realize it or not. Streams are the layer you almost always have to think about explicitly. MPS only enters the picture when multiple processes (training jobs, inference workers, notebooks) need to coexist on one device without time-slicing.
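
To make the "you're using a context whether you realize it or not" point concrete, here is a minimal sketch that never calls cudaSetDevice. The `CHECK` macro is a helper introduced only for this example, and the program assumes at least one CUDA-capable device is visible.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    // No cudaSetDevice anywhere: the runtime picks a default current device.
    int device = -1;
    CHECK(cudaGetDevice(&device));
    std::printf("current device: %d\n", device);

    // The first call that needs device state brings up the primary context
    // implicitly; the allocation below lives inside that context.
    void* buf = nullptr;
    CHECK(cudaMalloc(&buf, 1 << 20));

    size_t free_bytes = 0, total_bytes = 0;
    CHECK(cudaMemGetInfo(&free_bytes, &total_bytes));
    std::printf("free %zu MiB of %zu MiB\n",
                free_bytes >> 20, total_bytes >> 20);

    CHECK(cudaFree(buf));
    return 0;
}
```

The free-memory figure typically drops by more than the 1 MiB requested, which is one visible sign of the per-context overhead discussed later.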
At a glance: which one for which problem?
| Problem | Layer to touch | Why |
|---|---|---|
| GPU is idle between kernel launches | Streams | Overlap copies with kernels, run independent kernels concurrently |
| Adding a second GPU to the same process | Context | One context per device per process |
| Multiple processes hammering one GPU (job scheduler, inference fleet, notebooks) | MPS | Single shared context, processes co-execute instead of time-slicing |
| Unexpected serialization in Nsight Systems | Streams | Stray default-stream call somewhere |
| cudaErrorInvalidDevicePointer | Context | Pointer from one device used under another's context |
| Kernel-launch latency in a small-batch service | MPS | Avoids per-process context switching by sharing a context, so small kernels from different processes can overlap |
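
The first row of the table (overlapping copies with kernels) can be sketched as follows. This is an illustrative outline rather than a tuned implementation: the `scale` kernel, the `CHECK` macro, the chunk size, and the stream count are all assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kStreams = 4;
    const int kChunk = 1 << 20;                 // elements per stream
    const int kTotal = kStreams * kChunk;
    const size_t kChunkBytes = kChunk * sizeof(float);

    float* h = nullptr;
    float* d = nullptr;
    // Pinned host memory lets cudaMemcpyAsync overlap with kernels.
    CHECK(cudaMallocHost(&h, kTotal * sizeof(float)));
    CHECK(cudaMalloc(&d, kTotal * sizeof(float)));
    for (int i = 0; i < kTotal; ++i) h[i] = 1.0f;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Each stream runs copy-in, kernel, copy-out in order; different
    // streams are free to overlap with each other.
    for (int s = 0; s < kStreams; ++s) {
        size_t off = static_cast<size_t>(s) * kChunk;
        CHECK(cudaMemcpyAsync(d + off, h + off, kChunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale<<<(kChunk + 255) / 256, 256, 0, streams[s]>>>(d + off, kChunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h + off, d + off, kChunkBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    for (int s = 0; s < kStreams; ++s) {
        CHECK(cudaStreamSynchronize(streams[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }

    bool ok = true;
    for (int i = 0; i < kTotal; ++i) ok = ok && (h[i] == 2.0f);
    std::printf("result %s\n", ok ? "ok" : "mismatch");

    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return ok ? 0 : 1;
}
```

Whether the copies and kernels actually overlap depends on the hardware (copy engines, free SMs), so an Nsight Systems timeline is the place to confirm it.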
The hierarchy
A host process owns one CUDA driver instance. Each device gets its own context (one primary context per device under the runtime API; multiple contexts under the driver API). Each context owns its own table of streams. Streams own the actual queues of operations.
That hierarchy never changes — what changes is how many of each layer you actually create.
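
As a rough illustration of the process → context → stream hierarchy, the sketch below walks every visible device, selects it (which selects that device's primary context), and creates a couple of streams inside it. The stream count and the `CHECK` macro are assumptions for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));

    const int kStreamsPerDevice = 2;  // arbitrary choice for the example
    std::vector<std::vector<cudaStream_t>> streams(count);

    for (int d = 0; d < count; ++d) {
        // One primary context per device; cudaSetDevice makes it current.
        CHECK(cudaSetDevice(d));
        cudaDeviceProp prop;
        CHECK(cudaGetDeviceProperties(&prop, d));

        // Streams created now belong to device d's context.
        streams[d].resize(kStreamsPerDevice);
        for (auto& s : streams[d]) CHECK(cudaStreamCreate(&s));
        std::printf("device %d (%s): %d streams\n",
                    d, prop.name, kStreamsPerDevice);
    }

    // Tear down under the same device each stream was created on.
    for (int d = 0; d < count; ++d) {
        CHECK(cudaSetDevice(d));
        for (auto& s : streams[d]) CHECK(cudaStreamDestroy(s));
    }
    return 0;
}
```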
Side-by-side
| Aspect | CUDA Context | CUDA Stream |
|---|---|---|
| What it is | Per-device container of GPU state | In-order queue of operations |
| Scope | Bound to one device | Bound to one context (and therefore one device) |
| Owns | VRAM allocations, loaded kernels, configuration, the stream table | Pending kernel launches, memcpys, events |
| Lifetime | As long as the process holds it (primary context: until app exit) | Created and destroyed on demand |
| API to create | Implicit on first call; explicit via cudaSetDevice (runtime) or cuCtxCreate (driver) | cudaStreamCreate(&stream) |
| How many you typically have | One per device per process | 2–16 per context, depending on workload |
| Concurrency unit | No — context itself doesn't introduce concurrency | Yes — the unit of asynchrony in CUDA |
| Cost to create | Heavyweight (allocates device-side state, ~tens of ms) | Cheap (microseconds) |
| Cost of an extra one | Significant (extra VRAM overhead, slower context switching) | Minimal (a small table entry plus per-op bookkeeping) |
| Cross-process sharing | Default: not shared; with MPS, one MPS server context is shared by many client processes | Never shared across processes |
| Implicit sync semantics | Switching contexts within a thread synchronizes the previously-current context | The default stream synchronizes against all other streams in the context |
| Profiler row in Nsight | Each context appears as a logical group | Each non-default stream gets its own row |
| What you change to overlap work | (nothing — wrong layer) | Move work onto non-default streams |
| What you change to use multiple GPUs | cudaSetDevice(N) (one context per device) | (nothing — streams don't reach across devices) |
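
The "implicit sync semantics" row for streams can be probed with a small experiment. The sketch below assumes the legacy default stream (referenced explicitly as `cudaStreamLegacy`), a hypothetical `spin` kernel that busy-waits, and a `probe` helper written for this example. It contrasts a stream from cudaStreamCreate with one created using `cudaStreamNonBlocking`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Busy-waits on the device for roughly `cycles` clock ticks.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Launch a long kernel on `side`, a tiny one on the legacy default stream,
// wait for the default stream, then report whether `side` has finished.
static void probe(const char* label, cudaStream_t side) {
    spin<<<1, 1, 0, side>>>(200000000LL);
    spin<<<1, 1, 0, cudaStreamLegacy>>>(1000LL);
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(cudaStreamLegacy));
    cudaError_t state = cudaStreamQuery(side);
    std::printf("%-12s after default-stream sync: %s\n",
                label, cudaGetErrorName(state));
    CHECK(cudaStreamSynchronize(side));
}

int main() {
    cudaStream_t blocking, non_blocking;
    CHECK(cudaStreamCreate(&blocking));
    CHECK(cudaStreamCreateWithFlags(&non_blocking, cudaStreamNonBlocking));

    probe("blocking", blocking);          // default stream waits for it
    probe("non-blocking", non_blocking);  // default stream does not wait

    CHECK(cudaStreamDestroy(blocking));
    CHECK(cudaStreamDestroy(non_blocking));
    return 0;
}
```

For the blocking stream, the default-stream kernel cannot start until the long kernel finishes, so the query should report completion. For the non-blocking stream the long kernel is usually still running at that point, though the exact timing depends on the device clock.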
When to think about which
A useful split when you're trying to figure out which one you need:
- You're trying to make the GPU run faster → streams. The context isn't the bottleneck; the lack of concurrent execution is.
- You're adding a second GPU → context. Each device gets its own; nothing crosses without P2P or NCCL.
- You're sharing a GPU across processes → context, via MPS. Streams stay process-local.
- You're seeing an unexpected serialization in Nsight Systems → almost always a stray default-stream call. Stream-level issue.
- You're getting cudaErrorInvalidDevicePointer → context-level. You allocated memory under context A and tried to use it under context B.
- You're seeing high memory usage that doesn't match your allocations → potentially context-level. Per-thread default-stream mode plus many host threads can blow up the per-context state.
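
When chasing a pointer that may belong to another device's context, one option is to ask the runtime who owns it. The sketch below uses cudaPointerGetAttributes; the `owned_by_current_device` helper and the `CHECK` macro are hypothetical names introduced for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Returns true if `p` is device memory owned by the current device.
static bool owned_by_current_device(const void* p) {
    cudaPointerAttributes attr{};
    CHECK(cudaPointerGetAttributes(&attr, p));
    int current = -1;
    CHECK(cudaGetDevice(&current));
    std::printf("pointer %p: type=%d owner=%d current=%d\n",
                p, static_cast<int>(attr.type), attr.device, current);
    return attr.type == cudaMemoryTypeDevice && attr.device == current;
}

int main() {
    float* d = nullptr;
    CHECK(cudaMalloc(&d, 1024 * sizeof(float)));

    // Allocated under the current device, so this should report ownership.
    bool same = owned_by_current_device(d);
    std::printf("owned by current device: %s\n", same ? "yes" : "no");

    CHECK(cudaFree(d));
    return same ? 0 : 1;
}
```

Calling the same helper after switching to a different device with cudaSetDevice is a quick way to see the owner and the current device diverge.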
The questions people ask in the same breath
Is the default stream the same as the context? No. The context is the container; the default stream is one of (potentially many) streams inside that container. Every context has a default stream — that's how API calls without an explicit stream still work.
Does each stream get its own GPU memory? No. All streams within a context share the same VRAM allocations. The context owns the memory; streams just queue operations that touch it. This is why you can't allocate on stream A and free on stream B without thinking about ordering: both streams touch the same VRAM, so race conditions are real.
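
One way to handle the "allocate on stream A, free on stream B" case is to order the two streams with an event. The following is a sketch under stated assumptions: it uses the stream-ordered allocator (cudaMallocAsync and cudaFreeAsync, available from CUDA 11.2 on devices that support memory pools), and the `fill` kernel and `CHECK` macro are made up for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

__global__ void fill(int* p, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = value;
}

int main() {
    int device = 0, pools = 0;
    CHECK(cudaGetDevice(&device));
    CHECK(cudaDeviceGetAttribute(&pools, cudaDevAttrMemoryPoolsSupported, device));
    if (!pools) {
        std::printf("stream-ordered allocation not supported on this device\n");
        return 0;
    }

    cudaStream_t a, b;
    cudaEvent_t done;
    CHECK(cudaStreamCreate(&a));
    CHECK(cudaStreamCreate(&b));
    CHECK(cudaEventCreateWithFlags(&done, cudaEventDisableTiming));

    const int n = 1 << 16;
    void* raw = nullptr;
    CHECK(cudaMallocAsync(&raw, n * sizeof(int), a));   // allocate on stream A
    int* buf = static_cast<int*>(raw);

    fill<<<(n + 255) / 256, 256, 0, a>>>(buf, n, 7);    // use on stream A
    CHECK(cudaGetLastError());

    // Make stream B wait for A's work before it frees the shared memory.
    CHECK(cudaEventRecord(done, a));
    CHECK(cudaStreamWaitEvent(b, done, 0));
    CHECK(cudaFreeAsync(buf, b));                       // free on stream B

    CHECK(cudaStreamSynchronize(a));
    CHECK(cudaStreamSynchronize(b));
    CHECK(cudaEventDestroy(done));
    CHECK(cudaStreamDestroy(a));
    CHECK(cudaStreamDestroy(b));
    std::printf("done\n");
    return 0;
}
```

Without the event, the free on stream B could be ordered before the kernel on stream A finishes, which is exactly the race described above.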
Can two contexts share a stream? No. A stream belongs to exactly one context. Sharing a stream between two contexts would cross the context boundary (and, for contexts on different devices, the device boundary), which CUDA's execution model doesn't allow.
If I have two GPUs, do I get two streams or two contexts? Two contexts (one per device), and however many streams you create inside each context. The two devices' streams are completely independent: even if you store them in identically named variables, they are different queues running on different hardware.
Is cudaStreamCreate allowed before cudaSetDevice? Yes, but the stream lives in the current context at the time of creation. If you cudaSetDevice(1) later, the previously-created stream is still bound to device 0's context. Set the device first, then create the streams you want on it.
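
The ordering pitfall can be reproduced on a machine with at least two GPUs. In this sketch (the `noop` kernel and `CHECK` macro are assumptions for the example) a stream is created before any cudaSetDevice call, then used after switching devices. The first launch is expected to fail; the exact error code is printed rather than assumed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

__global__ void noop() {}

int main() {
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) {
        std::printf("needs two GPUs, found %d\n", count);
        return 0;
    }

    // Created before any cudaSetDevice: bound to the default device (0).
    cudaStream_t early;
    CHECK(cudaStreamCreate(&early));

    CHECK(cudaSetDevice(1));
    noop<<<1, 1, 0, early>>>();            // stream from device 0, device 1 current
    cudaError_t err = cudaGetLastError();  // inspect instead of aborting
    std::printf("launch on device-0 stream with device 1 current: %s\n",
                cudaGetErrorName(err));

    // Correct order: set the device first, then create the stream.
    cudaStream_t late;
    CHECK(cudaStreamCreate(&late));
    noop<<<1, 1, 0, late>>>();
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(late));
    CHECK(cudaStreamDestroy(late));

    CHECK(cudaSetDevice(0));
    CHECK(cudaStreamDestroy(early));
    return 0;
}
```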
Why is the context "primary" in some docs? The runtime API (the one most people use, cudaMalloc, cudaMemcpy, etc.) lazily creates exactly one primary context per device per process. The driver API (cuCtxCreate, cuCtxPushCurrent, etc.) lets you create multiple non-primary contexts per device, but it's rarely useful and most modern code sticks with primary contexts.
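
The relationship between the runtime API and the primary context can be observed from the driver API. The sketch below assumes device 0 and linking against the driver library (for example `-lcuda`); the `CHECK_RT` and `CHECK_DRV` macros are helpers invented for this example. It checks whether the primary context is active before and after the first runtime call, then compares the current context handle with the retained primary context.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda.h>
#include <cuda_runtime.h>

// Hypothetical helpers for this sketch: abort on runtime or driver errors.
#define CHECK_RT(call)                                                \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "runtime error: %s\n",               \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

#define CHECK_DRV(call)                                               \
    do {                                                              \
        CUresult res_ = (call);                                       \
        if (res_ != CUDA_SUCCESS) {                                   \
            const char* msg_ = nullptr;                               \
            cuGetErrorString(res_, &msg_);                            \
            std::fprintf(stderr, "driver error: %s\n",                \
                         msg_ ? msg_ : "unknown");                    \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    CHECK_DRV(cuInit(0));
    CUdevice dev;
    CHECK_DRV(cuDeviceGet(&dev, 0));

    unsigned int flags = 0;
    int active = 0;
    CHECK_DRV(cuDevicePrimaryCtxGetState(dev, &flags, &active));
    std::printf("before runtime use: primary context active = %d\n", active);

    // Any runtime call that needs device state creates the primary context.
    CHECK_RT(cudaFree(0));
    CHECK_DRV(cuDevicePrimaryCtxGetState(dev, &flags, &active));
    std::printf("after runtime use:  primary context active = %d\n", active);

    // The context the runtime made current should be the primary context.
    CUcontext current = nullptr, primary = nullptr;
    CHECK_DRV(cuCtxGetCurrent(&current));
    CHECK_DRV(cuDevicePrimaryCtxRetain(&primary, dev));
    std::printf("current context is the primary context: %s\n",
                current == primary ? "yes" : "no");
    CHECK_DRV(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```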
A common confusion: "I added streams but I still have only one context"
That's correct. Adding streams doesn't add contexts. They sit at different layers — adding more streams gives you more concurrency within the same per-device state, which is almost always what you want.
The opposite mistake is rarer but worse: creating multiple non-primary contexts on a single device thinking it'll improve concurrency. It won't. Each context is a heavyweight isolation boundary; switching between them on the same thread is a synchronization point and a performance trap. Stay on the primary context per device, scale concurrency via streams.
Where MPS fits in
MPS sits one level above the context — it's not a fourth thing inside the context/stream hierarchy, it's a server that fronts the context for multiple client processes.
Without MPS, each host process owns its own primary context per device. When two processes both target one GPU, the driver serializes their work by time-slicing the GPU — only one process executes kernels at a time, with context-switch overhead between them.
With MPS:
- A single MPS server process owns one CUDA context per GPU.
- Client processes connect to the server via a Unix socket and submit work that the server queues onto its shared context.
- All clients run concurrently on the GPU. Their kernels execute simultaneously (subject to SM availability) rather than in time slices.
- Each client still has its own GPU address space. On Volta and newer GPUs, MPS enforces this isolation in hardware.
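
To see the difference between time-slicing and MPS for yourself, a small client workload like the one below can be launched as several concurrent processes, first without the daemon and then with it. The daemon is typically started with `nvidia-cuda-mps-control -d` and stopped with `echo quit | nvidia-cuda-mps-control`. The kernel, sizes, and iteration count are arbitrary assumptions for this sketch; it is a probe, not a benchmark.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
#include <cuda_runtime.h>

// Hypothetical helper for this sketch: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Deliberately small kernel: one process alone cannot fill the GPU with it.
__global__ void small_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.0001f + 1.0f;
}

int main() {
    const int n = 1 << 14;
    const int iterations = 2000;

    float* d = nullptr;
    CHECK(cudaMalloc(&d, n * sizeof(float)));
    CHECK(cudaMemset(d, 0, n * sizeof(float)));

    // Warm-up launch so context creation is not part of the timing.
    small_kernel<<<(n + 255) / 256, 256>>>(d, n);
    CHECK(cudaDeviceSynchronize());

    auto t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it) {
        small_kernel<<<(n + 255) / 256, 256>>>(d, n);
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("pid %d: %d launches in %.2f ms\n",
                static_cast<int>(getpid()), iterations, ms);

    CHECK(cudaFree(d));
    return 0;
}
```

Running, say, four copies at once and comparing per-process times with and without the daemon gives a rough sense of whether your workload benefits; results vary widely with kernel size and GPU model.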
When you'd reach for MPS:
- Multiple small-batch inference replicas on one GPU (Triton, Ray Serve, k8s GPU sharing).
- A notebook fleet where you don't want each user to monopolize a GPU.
- Many short-lived processes paying per-process kernel-launch and CUDA-init overhead.
When you wouldn't:
- A single training job already saturating the GPU. MPS adds no value when one process is compute-bound.
- Multi-GPU jobs where each process pins to its own device (no contention to share).
- Mixed-precision experiments or kernel tuning, where deterministic per-process isolation matters more than throughput.
TL;DR
- Context = per-device container of GPU state; you almost always have exactly one per device per process.
- Stream = in-order queue inside a context; you usually want several to overlap copies with kernels.
- MPS = a server process that fronts a single shared context for multiple client processes so they can co-execute on one GPU.
- They're at different layers. Streams nest inside contexts. MPS sits in front of contexts, sharing one across clients.
- For concurrency within a process, change the streams. For multiple GPUs, change the contexts. For multiple processes sharing one GPU, use MPS.
For the deeper dives on each: see CUDA contexts, CUDA streams, and CUDA MPS.
