CUDA Context vs CUDA Stream

A CUDA context is a per-device container of GPU state; a CUDA stream is an in-order execution queue inside a context. They answer different questions and have different lifetimes, and confusing them is one of the most common ways CUDA code ends up slow or unsafe.

Concretely, the context holds the VRAM allocations, the loaded kernel modules, the page tables, and the stream table itself; a stream is one of the queues in that table, executing its operations in the order they were enqueued. The context answers "which device am I talking to, and what state does it hold for me?" The stream answers "what is the order and concurrency of my work?"

If you have a single GPU and never call cudaSetDevice, you're using a context whether you realize it or not. Streams are the layer you almost always have to think about explicitly.
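
A minimal sketch of that implicit context using the runtime API. Nothing below names a context, but one is created all the same:

```cpp
#include <cuda_runtime.h>

int main() {
    float *d = nullptr;
    cudaMalloc(&d, 1024);     // first runtime call: device 0's primary
                              // context is created lazily right here
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);    // both streams are created inside that
    cudaStreamCreate(&s2);    // same primary context

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d);
    return 0;                 // the primary context is torn down at process exit
}
```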

The hierarchy

A host process owns one CUDA driver instance. Each device gets its own context (one primary context per device under the runtime API; multiple contexts under the driver API). Each context owns its own table of streams. Streams own the actual queues of operations.

That hierarchy never changes — what changes is how many of each layer you actually create.
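
A sketch of the hierarchy in runtime-API terms; the stream names (copyStream, computeStream) are illustrative, not required:

```cpp
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);   // one driver instance for the process

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);             // make device dev's primary context current

        cudaStream_t copyStream, computeStream;
        cudaStreamCreate(&copyStream);      // these queues belong to device
        cudaStreamCreate(&computeStream);   // dev's context and no other

        cudaStreamDestroy(copyStream);
        cudaStreamDestroy(computeStream);
    }
    return 0;
}
```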

Side-by-side

Aspect | CUDA Context | CUDA Stream
--- | --- | ---
What it is | Per-device container of GPU state | In-order queue of operations
Scope | Bound to one device | Bound to one context (and therefore one device)
Owns | VRAM allocations, loaded kernels, configuration, the stream table | Pending kernel launches, memcpys, events
Lifetime | As long as the process holds it (primary context: until app exit) | Created and destroyed on demand
API to create | Implicit on first call; explicit via cudaSetDevice (runtime) or cuCtxCreate (driver) | cudaStreamCreate(&stream)
How many you typically have | One per device per process | 2–16 per context, depending on workload
Concurrency unit | No — the context itself doesn't introduce concurrency | Yes — the unit of asynchrony in CUDA
Cost to create | Heavyweight (allocates device-side state, ~tens of ms) | Cheap (microseconds)
Cost of an extra one | Significant (extra VRAM overhead, slower context switching) | Minimal (a small table entry plus per-op bookkeeping)
Cross-process sharing | Not by default; with MPS, one server context is shared by many client processes | Never shared across processes
Implicit sync semantics | Switching contexts within a thread synchronizes the previously current context | The default stream synchronizes against all other streams in the context
Profiler row in Nsight | Each context appears as a logical group | Each non-default stream gets its own row
What you change to overlap work | Nothing (wrong layer) | Move work onto non-default streams
What you change to use multiple GPUs | cudaSetDevice(N) (one context per device) | Nothing (streams don't reach across devices)
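
To make the "overlap work" row concrete, here is a sketch of the standard pattern: split a buffer into chunks, give each chunk its own stream, and let one chunk's copy overlap another chunk's kernel. Pinned host memory (cudaMallocHost) is what makes the copies truly asynchronous. The kernel and sizes are illustrative:

```cpp
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20, CHUNK = N / 2;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned host memory: required
    cudaMalloc(&d, N * sizeof(float));       // for truly async copies

    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < 2; ++i) {
        float *hp = h + i * CHUNK, *dp = d + i * CHUNK;
        cudaMemcpyAsync(dp, hp, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);      // H2D for chunk i
        scale<<<(CHUNK + 255) / 256, 256, 0, s[i]>>>(dp, CHUNK);
        cudaMemcpyAsync(hp, dp, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);      // D2H for chunk i
    }
    cudaDeviceSynchronize();   // both streams drain; the chunks overlapped

    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```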

When to think about which

A useful split when you're trying to figure out which one you need:

  • You're trying to make the GPU run faster → streams. The context isn't the bottleneck; the lack of concurrent execution is.
  • You're adding a second GPU → context. Each device gets its own; nothing crosses without P2P or NCCL.
  • You're sharing a GPU across processes → context, via MPS. Streams stay process-local.
  • You're seeing unexpected serialization in Nsight Systems → almost always a stray default-stream call; a stream-level issue (see the sketch after this list).
  • You're getting cudaErrorInvalidDevicePointer → context-level. You allocated memory under context A and tried to use it under context B.
  • You're seeing high memory usage that doesn't match your allocations → potentially context-level. Per-thread default-stream mode plus many host threads can blow up the per-context state.
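
The default-stream item is worth seeing in code. A sketch of the pitfall (kernel name is illustrative): three launches that look independent but execute one after another.

```cpp
#include <cuda_runtime.h>

__global__ void work(float *x) { x[threadIdx.x] += 1.0f; }

int main() {
    float *a, *b, *c;
    cudaMalloc(&a, 256 * sizeof(float));
    cudaMalloc(&b, 256 * sizeof(float));
    cudaMalloc(&c, 256 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    work<<<1, 256, 0, s1>>>(a);   // intended to overlap with the s2 launch...
    work<<<1, 256>>>(b);          // ...but this stray launch lands on the legacy
                                  // default stream, which waits for s1 and
                                  // blocks s2, serializing all three
    work<<<1, 256, 0, s2>>>(c);

    // Fix: pass an explicit stream to the middle launch, or build with
    // nvcc --default-stream per-thread so the default stream stops blocking.
    cudaDeviceSynchronize();
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```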

The questions people ask in the same breath

Is the default stream the same as the context? No. The context is the container; the default stream is one of (potentially many) streams inside that container. Every context has a default stream — that's how API calls without an explicit stream still work.

Does each stream get its own GPU memory? No. All streams within a context share the same VRAM allocations. The context owns the memory; streams just queue operations that touch it. This is why you can't allocate on stream A and free on stream B without thinking about ordering: both streams touch the same VRAM, and the races are real.
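
A sketch of safe cross-stream ordering using the stream-ordered allocator (cudaMallocAsync, available since CUDA 11.2): an event marks the last use on stream a, and stream b waits on it before freeing. The kernel is illustrative:

```cpp
#include <cuda_runtime.h>

__global__ void touch(float *x) { x[threadIdx.x] = 1.0f; }

int main() {
    cudaStream_t a, b;
    cudaStreamCreate(&a);
    cudaStreamCreate(&b);

    float *p;
    cudaMallocAsync(&p, 256 * sizeof(float), a);  // allocated in stream a's order
    touch<<<1, 256, 0, a>>>(p);                   // last use is on stream a

    cudaEvent_t done;
    cudaEventCreate(&done);
    cudaEventRecord(done, a);         // mark the point where a is finished with p
    cudaStreamWaitEvent(b, done, 0);  // make b wait for that point...
    cudaFreeAsync(p, b);              // ...so the free on b can't race the kernel

    cudaDeviceSynchronize();
    cudaEventDestroy(done);
    cudaStreamDestroy(a); cudaStreamDestroy(b);
    return 0;
}
```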

Can two contexts share a stream? No. A stream belongs to exactly one context. Even two contexts on the same device are separate isolation boundaries; a stream handle is only meaningful inside the context that created it.

If I have two GPUs, do I get two streams or two contexts? Two contexts (one per device), and however many streams you create inside each context. The two devices' streams are completely independent — even with the same name they're different queues running on different hardware.

Is cudaStreamCreate allowed before cudaSetDevice? Yes, but the stream lives in the current context at the time of creation. If you cudaSetDevice(1) later, the previously-created stream is still bound to device 0's context. Set the device first, then create the streams you want on it.
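A sketch of that ordering pitfall, assuming a machine with at least two GPUs (error handling omitted, kernel name illustrative):

```cpp
#include <cuda_runtime.h>

__global__ void k() {}

int main() {
    cudaStream_t s0;
    cudaStreamCreate(&s0);     // current context is device 0's primary
                               // context, so s0 belongs to device 0

    cudaSetDevice(1);          // switch to device 1's primary context
    // k<<<1, 1, 0, s0>>>();   // WRONG: s0 still belongs to device 0;
                               // launching on it from device 1 fails

    cudaStream_t s1;
    cudaStreamCreate(&s1);     // correct order: set the device first,
    k<<<1, 1, 0, s1>>>();      // then create and use its streams

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaSetDevice(0);
    cudaStreamDestroy(s0);
    return 0;
}
```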

Why is the context "primary" in some docs? The runtime API (the one most people use, cudaMalloc, cudaMemcpy, etc.) lazily creates exactly one primary context per device per process. The driver API (cuCtxCreate, cuCtxPushCurrent, etc.) lets you create multiple non-primary contexts per device, but it's rarely useful and most modern code sticks with primary contexts.
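For completeness, a driver-API sketch of retaining the same primary context the runtime uses (compile against cuda.h and link with -lcuda):

```cpp
#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);

    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext ctx;
    cuDevicePrimaryCtxRetain(&ctx, dev);  // same context the runtime API
    cuCtxSetCurrent(ctx);                 // creates lazily for device 0

    CUstream s;
    cuStreamCreate(&s, CU_STREAM_DEFAULT);  // a stream inside that context
    printf("stream created inside device 0's primary context\n");
    cuStreamDestroy(s);

    cuDevicePrimaryCtxRelease(dev);       // release, don't destroy: the
    return 0;                             // runtime may still be using it
}
```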

A common confusion: "I added streams but I still have only one context"

That's correct. Adding streams doesn't add contexts. They sit at different layers — adding more streams gives you more concurrency within the same per-device state, which is almost always what you want.

The opposite mistake is rarer but worse: creating multiple non-primary contexts on a single device thinking it'll improve concurrency. It won't. Each context is a heavyweight isolation boundary; switching between them on the same thread is a synchronization point and a performance trap. Stay on the primary context per device, scale concurrency via streams.

TL;DR

  • Context = per-device container of GPU state; you almost always have exactly one per device per process.
  • Stream = in-order queue inside a context; you usually want several to overlap copies with kernels.
  • They're at different layers of the hierarchy. Streams nest inside contexts.
  • For concurrency, change the streams. For multiple GPUs, change the contexts.

For the deeper dives on each: see CUDA contexts and CUDA streams.

If you found this explanation helpful, consider sharing it with others.
