A CUDA context is the per-device container of GPU state — VRAM allocations, loaded kernel modules, page tables, the stream table itself. A CUDA stream is an in-order queue of operations that runs inside a context. The context answers "which device am I talking to, and what state does it hold for me?" The stream answers "what's the order and concurrency of my work?"
If you have a single GPU and never call cudaSetDevice, you're using a context whether you realize it or not. Streams are the layer you almost always have to think about explicitly.
The hierarchy
A host process owns one CUDA driver instance. Each device gets its own context (one primary context per device under the runtime API; multiple contexts under the driver API). Each context owns its own table of streams. Streams own the actual queues of operations.
That hierarchy never changes — what changes is how many of each layer you actually create.
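The hierarchy is easiest to see in runtime-API terms. A minimal sketch (error checking omitted): the first runtime call lazily creates and makes current the device's primary context, and streams are then created inside it.

```cuda
#include <cuda_runtime.h>

int main() {
    float *d;                  // the first runtime call below lazily creates
    cudaMalloc(&d, 1 << 20);   // device 0's primary context and makes it current

    cudaStream_t a, b;         // two independent queues inside that one context,
    cudaStreamCreate(&a);      // both able to touch the allocation above
    cudaStreamCreate(&b);

    cudaStreamDestroy(a);      // streams come and go on demand...
    cudaStreamDestroy(b);
    cudaFree(d);               // ...the primary context outlives them,
    return 0;                  // persisting until process exit
}
```

One context holding the state, any number of streams queuing work against it — that's the whole shape of the hierarchy.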
Side-by-side
| Aspect | CUDA Context | CUDA Stream |
|---|---|---|
| What it is | Per-device container of GPU state | In-order queue of operations |
| Scope | Bound to one device | Bound to one context (and therefore one device) |
| Owns | VRAM allocations, loaded kernels, configuration, the stream table | Pending kernel launches, memcpys, events |
| Lifetime | As long as the process holds it (primary context: until app exit) | Created and destroyed on demand |
| API to create | Implicit on first runtime call; cudaSetDevice selects which device's primary context; explicit via cuCtxCreate (driver) | cudaStreamCreate(&stream) |
| How many you typically have | One per device per process | 2–16 per context, depending on workload |
| Concurrency unit | No — context itself doesn't introduce concurrency | Yes — the unit of asynchrony in CUDA |
| Cost to create | Heavyweight (allocates device-side state, ~tens of ms) | Cheap (microseconds) |
| Cost of an extra one | Significant (extra VRAM overhead, slower context switching) | Minimal (a small table entry plus per-op bookkeeping) |
| Cross-process sharing | Default: not shared; with MPS, one MPS server context is shared by many client processes | Never shared across processes |
| Implicit sync semantics | Switching contexts within a thread synchronizes the previously current context | The legacy default stream synchronizes against all other blocking streams in the context |
| Profiler row in Nsight | Each context appears as a logical group | Each non-default stream gets its own row |
| What you change to overlap work | (nothing — wrong layer) | Move work onto non-default streams |
| What you change to use multiple GPUs | cudaSetDevice(N) (one context per device) | (nothing — streams don't reach across devices) |
When to think about which
A useful split, for "I'm trying to figure out which one I need":
- You're trying to make the GPU run faster → streams. The context isn't the bottleneck; the lack of concurrent execution is.
- You're adding a second GPU → context. Each device gets its own; nothing crosses without P2P or NCCL.
- You're sharing a GPU across processes → context, via MPS. Streams stay process-local.
- You're seeing an unexpected serialization in Nsight Systems → almost always a stray default-stream call. Stream-level issue.
- You're getting cudaErrorInvalidDevicePointer → context-level. You allocated memory under context A and tried to use it under context B.
- You're seeing high memory usage that doesn't match your allocations → potentially context-level. Per-thread default-stream mode plus many host threads can blow up the per-context state.
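For the first bullet — making the GPU faster via streams — a common pattern is chunking one big transfer-plus-kernel into pieces that overlap across streams. A hedged sketch (the process kernel and pipeline function are illustrative; the host buffer is assumed to be pinned via cudaMallocHost, since async copies from pageable memory won't actually overlap):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n) {   // placeholder workload
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void pipeline(float *h, float *d, int n, int chunks) {
    cudaStream_t streams[4];
    for (int i = 0; i < 4; ++i) cudaStreamCreate(&streams[i]);

    int chunk = n / chunks;                  // assumes chunks divides n evenly
    for (int c = 0; c < chunks; ++c) {
        cudaStream_t s = streams[c % 4];     // round-robin over 4 streams
        int off = c * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s);           // H2D for chunk c
        process<<<(chunk + 255) / 256, 256, 0, s>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s);           // D2H for chunk c
    }
    for (int i = 0; i < 4; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```

Chunk c's copy can overlap chunk c-1's kernel because they sit in different streams — same context, same memory, different queues.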
The questions people ask in the same breath
Is the default stream the same as the context? No. The context is the container; the default stream is one of (potentially many) streams inside that container. Every context has a default stream — that's how API calls without an explicit stream still work.
Does each stream get its own GPU memory? No. All streams within a context share the same VRAM allocations. The context owns the memory; streams just queue operations that touch it. This is why you can't allocate on stream A and free on stream B without thinking about ordering — it's the same VRAM, and race conditions are real.
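The standard tool for imposing that ordering is an event. A sketch, assuming two streams already created in the same context (writer and reader are placeholder kernels):

```cuda
__global__ void writer(float *buf, int n) { /* fills buf */ }
__global__ void reader(float *buf, int n) { /* consumes buf */ }

void ordered(float *d_buf, int n, cudaStream_t streamA, cudaStream_t streamB) {
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    writer<<<(n + 255) / 256, 256, 0, streamA>>>(d_buf, n);  // write on A
    cudaEventRecord(done, streamA);                          // mark the write

    cudaStreamWaitEvent(streamB, done, 0);                   // B waits for A
    reader<<<(n + 255) / 256, 256, 0, streamB>>>(d_buf, n);  // safe read on B

    cudaEventDestroy(done);
}
```

The wait is enqueued, not blocking: the host thread continues immediately, and the GPU enforces the ordering between the two queues.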
Can two contexts share a stream? No. A stream belongs to exactly one context and queues work against that context's state, so even two contexts on the same device can't share one — and contexts on different devices certainly can't.
If I have two GPUs, do I get two streams or two contexts? Two contexts (one per device), and however many streams you create inside each context. The two devices' streams are completely independent — even with the same name they're different queues running on different hardware.
Is cudaStreamCreate allowed before cudaSetDevice? Yes, but the stream lives in the current context at the time of creation. If you cudaSetDevice(1) later, the previously-created stream is still bound to device 0's context. Set the device first, then create the streams you want on it.
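The pitfall in code, as a sketch (assumes two devices; the kernel launch is elided):

```cuda
cudaStream_t s;
cudaSetDevice(0);
cudaStreamCreate(&s);    // s is bound to device 0's primary context, permanently

cudaSetDevice(1);
// A kernel launched on s here still targets device 0's context — the
// stream handle is not valid for work intended to run on device 1.

// Correct pattern: select the device first, then create its streams.
cudaStream_t s1;
cudaStreamCreate(&s1);   // created while device 1 is current → device 1's context
```

The binding happens at creation time and never moves, which is why "set device, then create streams" is the rule of thumb.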
Why is the context "primary" in some docs? The runtime API (the one most people use: cudaMalloc, cudaMemcpy, etc.) lazily creates exactly one primary context per device per process. The driver API (cuCtxCreate, cuCtxPushCurrent, etc.) lets you create multiple non-primary contexts per device, but that's rarely useful, and most modern code sticks with primary contexts.
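If you do need the driver API, the usual interop pattern is to retain the same primary context the runtime uses rather than create a second one. A sketch, assuming a CUDA-capable device 0:

```cuda
#include <cuda.h>

CUdevice dev;
CUcontext ctx;
cuInit(0);
cuDeviceGet(&dev, 0);
cuDevicePrimaryCtxRetain(&ctx, dev);  // same context the runtime API uses
cuCtxSetCurrent(ctx);
// ... driver-API calls now share allocations, modules, and streams
//     with any runtime-API code in this process ...
cuDevicePrimaryCtxRelease(dev);
```

This avoids the heavyweight second context entirely — driver-API and runtime-API code see one shared set of per-device state.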
A common confusion: "I added streams but I still have only one context"
That's correct. Adding streams doesn't add contexts. They sit at different layers — adding more streams gives you more concurrency within the same per-device state, which is almost always what you want.
The opposite mistake is rarer but worse: creating multiple non-primary contexts on a single device thinking it'll improve concurrency. It won't. Each context is a heavyweight isolation boundary; switching between them on the same thread is a synchronization point and a performance trap. Stay on the primary context per device, scale concurrency via streams.
TL;DR
- Context = per-device container of GPU state; you almost always have exactly one per device per process.
- Stream = in-order queue inside a context; you usually want several to overlap copies with kernels.
- They're at different layers of the hierarchy. Streams nest inside contexts.
- For concurrency, change the streams. For multiple GPUs, change the contexts.
For the deeper dives on each: see CUDA contexts and CUDA streams.
