What is a CUDA Stream?
A CUDA stream is an in-order queue of GPU operations — kernel launches, memory copies, events, and synchronization markers. Operations queued on the same stream execute serially, in the order they were issued. Operations queued on different streams can run in parallel, subject to the GPU's available hardware engines.
That's the entire abstraction. The implications are not.
A stream is the unit of asynchrony in CUDA. If you want a kernel and a memory copy to overlap, they must be on different streams. If you want two kernels to run concurrently on the same device, they must be on different streams. If you want the CPU to keep working while the GPU is busy, you need an async API call — and async API calls are queued on a stream.
The default stream is a trap
Every CUDA process starts with one stream pre-created: the default stream (sometimes called the legacy default stream or stream 0). Any call you make without specifying a stream — cudaMemcpy, myKernel<<<grid, block>>>(...), cublasSgemm — lands there.
The default stream has implicit synchronization semantics: it serializes against every other blocking stream in the process. That's the trap. Even if you carefully create non-default streams for some kernels, a single stray cudaMemcpy call in the default stream stalls every concurrent stream until it finishes. Tools like Nsight Systems make this immediately visible: rows of streams running flat out, then one wide gap where everything waits on the default stream.
The fix: never use the default stream for anything that needs to overlap. Either pass an explicit non-default stream to every async call, or compile with --default-stream per-thread (CUDA 7+), which gives each host thread its own non-blocking default stream.
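To make that concrete, here is a minimal sketch of the explicit-stream pattern; myKernel, grid, block, bytes, and the h_data / d_data buffers are stand-ins for whatever your code already has:

```cpp
cudaStream_t stream;
cudaStreamCreate(&stream);                     // explicit non-default stream

// Everything below is queued on `stream`; nothing touches the default stream.
cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
myKernel<<<grid, block, 0, stream>>>(d_data);
cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);                 // wait on this stream only, not the whole device
cudaStreamDestroy(stream);
```

For the copies to actually run asynchronously, h_data needs to be pinned, which is the first constraint discussed below.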
Async semantics
Memory copies and kernel launches have synchronous and asynchronous variants:
```cpp
// Synchronous — blocks the host until the copy finishes
cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);

// Asynchronous — returns to the host immediately, runs on `stream`
cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream);

// Kernel launches are always async w.r.t. the host;
// pass the stream as the fourth chevron argument.
myKernel<<<grid, block, 0, stream>>>(args);
```
The async variants are what make stream concurrency possible. Two important constraints:
- cudaMemcpyAsync requires pinned (page-locked) host memory to actually overlap with kernels. If you pass pageable host memory, CUDA quietly falls back to a synchronous copy through a staging buffer. This is the single most common reason "I added streams and nothing got faster." (See the sketch after this list.)
- Kernel launch is itself async even on the default stream — but the next CPU API call may block on it. Use cudaStreamQuery(stream) or events to check progress without stalling.
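A sketch of the pinned-allocation side of that first constraint; h_staging, d_buf, bytes, and stream are placeholders:

```cpp
float *h_staging = nullptr;
cudaMallocHost(&h_staging, bytes);             // pinned (page-locked) host memory

// This copy can now genuinely overlap with work on other streams.
cudaMemcpyAsync(d_buf, h_staging, bytes, cudaMemcpyHostToDevice, stream);

// For host memory you already own (malloc/new), pin it in place instead:
// cudaHostRegister(existing_ptr, bytes, cudaHostRegisterDefault);

cudaStreamSynchronize(stream);                 // don't free while the copy is in flight
cudaFreeHost(h_staging);                       // pinned memory has its own free call
```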
Watching streams overlap
The visualization below schedules four work units (each a copy-in / kernel / copy-out) across one, two, or four non-default streams. Switch modes to see how the GPU's separate H→D copy engine, compute SMs, and D→H copy engine fill in when operations live on different streams.
The 50% saving in the 4-stream case isn't a fluke or a hand-wave — it's the standard pipelined-throughput formula. With three independent hardware resources (H→D engine, SM, D→H engine) and the work balanced across them, the pipeline finishes one chunk every T in steady state instead of every 3T, capped by the longest phase. For four chunks that is (4 + 2) × T = 6T of total time instead of 4 × 3T = 12T, exactly the 50% shown.
Synchronization
Streams are async, so eventually the host needs to know "is it done yet?" CUDA gives you four building blocks, from coarsest to finest:
- cudaDeviceSynchronize() — block the calling host thread until everything on the device finishes. Heavy hammer; rarely the right answer in production code.
- cudaStreamSynchronize(stream) — block until that specific stream's queue is empty. Cleaner than device-wide sync.
- cudaStreamQuery(stream) — non-blocking; returns cudaSuccess if idle, cudaErrorNotReady otherwise. Use this in a polling loop when the host has other work it can do.
- Events — record a marker on stream A with cudaEventRecord(event, streamA), then make stream B wait on it with cudaStreamWaitEvent(streamB, event, 0). Cross-stream dependencies without involving the host.
```cpp
cudaEvent_t kernel_done;
cudaEventCreate(&kernel_done);

myKernel<<<grid, block, 0, streamA>>>(d_data);
cudaEventRecord(kernel_done, streamA);

// streamB will not start the consumer kernel until streamA's
// myKernel finishes — the host never blocks.
cudaStreamWaitEvent(streamB, kernel_done, 0);
consumerKernel<<<grid, block, 0, streamB>>>(d_data);
```
Events are also how you measure time on the GPU. Record one before the work, one after, then cudaEventElapsedTime gives you the milliseconds — measured on-device, no CPU-clock skew.
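A sketch of that timing pattern, reusing the myKernel / stream placeholders from above:

```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
myKernel<<<grid, block, 0, stream>>>(d_data);
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);                    // block until `stop` has actually been reached
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);        // GPU-measured milliseconds between the two records

cudaEventDestroy(start);
cudaEventDestroy(stop);
```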
What can actually run concurrently
A common misconception is that putting two kernels on different streams guarantees they run in parallel. The hardware enforces a more nuanced rule:
- H→D copy uses the host-to-device DMA engine.
- D→H copy uses the device-to-host DMA engine (a separate engine on Tesla and most data-center GPUs; some consumer GPUs share it with H→D).
- Kernels run on the streaming multiprocessors.
Two operations can run concurrently only if they need different hardware resources — or if they're both kernels and the GPU has spare SM capacity. Two H→D copies on different streams do not overlap with each other; they queue on the single H→D engine. Two large kernels may not actually overlap if the first one already saturates the SMs.
The practical rule: a 3-way pipeline (H→D ∥ K ∥ D→H) is reliable. Concurrent kernels are best-effort.
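As a rough sketch of that pipeline, here is one common shape: chunk a large buffer and issue copy-in / kernel / copy-out per chunk, each chunk on its own stream. myKernel, n, h_buf, and d_buf are placeholders, and h_buf is assumed pinned:

```cpp
const int NSTREAMS = 4;
cudaStream_t streams[NSTREAMS];
for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&streams[i]);

const size_t chunk = n / NSTREAMS;             // assume n divides evenly
for (int i = 0; i < NSTREAMS; ++i) {
  const size_t off = i * chunk;
  cudaMemcpyAsync(d_buf + off, h_buf + off, chunk * sizeof(float),
                  cudaMemcpyHostToDevice, streams[i]);
  myKernel<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d_buf + off, chunk);
  cudaMemcpyAsync(h_buf + off, d_buf + off, chunk * sizeof(float),
                  cudaMemcpyDeviceToHost, streams[i]);
}

for (int i = 0; i < NSTREAMS; ++i) cudaStreamSynchronize(streams[i]);
```

In steady state, chunk i's copy-in overlaps chunk i−1's kernel and chunk i−2's copy-out, which is exactly the 3-way pipeline.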
Stream priorities
CUDA exposes stream priorities through cudaStreamCreateWithPriority; on most GPUs the range works out to two levels, high and normal. High-priority streams aren't preemptive — they don't kill mid-flight kernels — but they do go to the front of the line at the next launch boundary. Useful for latency-sensitive paths (an inference request) running alongside a low-priority background workload (a gradient backfill).
```cpp
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);

cudaStream_t hot;
cudaStreamCreateWithPriority(&hot, cudaStreamNonBlocking, priority_high);
```
The cudaStreamNonBlocking flag is also worth knowing: it disables the implicit synchronization with the default stream. Combine with per-thread default stream and you get a much more predictable scheduler.
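If you want the flag without the priority machinery, the plain creation call takes it directly:

```cpp
cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);  // never implicitly syncs with the legacy default stream
```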
Common pitfalls
A short list of failure modes I've seen ship to production:
- Pageable host memory. cudaMemcpyAsync silently degrades to sync if the source isn't pinned. Use cudaMallocHost / cudaHostRegister.
- Implicit default-stream sync. One legacy cudaMemcpy (no Async) anywhere in your code path serializes every other stream. Grep for it.
- Allocation in the hot path. cudaMalloc and cudaFree are synchronizing by default — they're cheap to call from setup but disastrous inside a streaming inference loop. Use cudaMallocAsync (CUDA 11.2+) or a memory pool (see the sketch after this list).
- Too many streams. Each stream has bookkeeping overhead. Past ~8–16 streams the marginal overlap gain is dwarfed by launch latency. Start with 2–4, profile, then decide.
- Kernel launch ordering across streams. The CUDA driver issues launches to the GPU sequentially — if stream A enqueues 1000 tiny kernels before stream B's first launch reaches the driver, you'll see serial behavior in the profile that you didn't write in code. Interleave issues from the host side.
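The allocation pitfall has a direct fix; a sketch of the stream-ordered allocator calls (CUDA 11.2+), with myKernel, bytes, and stream as placeholders:

```cpp
float *d_tmp = nullptr;
cudaMallocAsync(&d_tmp, bytes, stream);        // allocation is queued on `stream`, no device-wide sync
myKernel<<<grid, block, 0, stream>>>(d_tmp);
cudaFreeAsync(d_tmp, stream);                  // freed in stream order, after the kernel finishes
```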
Streams in PyTorch
PyTorch wraps CUDA streams in torch.cuda.Stream. The same model: per-device default stream, async ops queued via context manager, events for cross-stream dependencies.
```python
import torch

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

# Producer on s1
with torch.cuda.stream(s1):
    a = torch.randn(1024, 1024, device='cuda')
    b = a @ a.T

# Consumer on s2 must wait for the producer
s2.wait_stream(s1)
with torch.cuda.stream(s2):
    c = b.softmax(dim=-1)

torch.cuda.synchronize()  # block host until both streams drain
```
The PyTorch CUDA caching allocator tracks allocations per stream and inserts the right wait events behind the scenes — that's why tensor_on_s1.to(device_on_s2) "just works" without you ever calling cudaStreamWaitEvent. It's also why mixing raw CUDA C++ kernels into PyTorch via custom ops needs care: either launch on the current PyTorch stream, or record an event and make PyTorch's stream wait on it; otherwise downstream PyTorch ops won't wait for your kernel.
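One way to handle that from the C++ side is to launch the custom kernel on whatever stream PyTorch is currently using; a sketch under that assumption, with my_kernel as a hypothetical elementwise kernel:

```cpp
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>             // at::cuda::getCurrentCUDAStream

__global__ void my_kernel(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] * 2.0f;            // placeholder work
}

torch::Tensor my_op(torch::Tensor x) {
  auto stream = at::cuda::getCurrentCUDAStream();   // the stream PyTorch is using right now
  auto out = torch::empty_like(x);
  const int n = static_cast<int>(x.numel());
  const int block = 256, grid = (n + block - 1) / block;
  my_kernel<<<grid, block, 0, stream>>>(x.data_ptr<float>(), out.data_ptr<float>(), n);
  // Because the launch is on PyTorch's current stream, downstream PyTorch ops
  // queued on that stream wait for it automatically.
  return out;
}
```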
When streams are the wrong tool
Streams give you concurrency within a single CUDA context on a single device. They don't help when:
- You need multi-process GPU sharing. That's CUDA MPS — a separate mechanism for letting multiple host processes share an SM partition.
- You need cross-device parallelism. Each device has its own context; streams on device 0 don't reach device 1. Use NCCL or cudaMemcpyPeerAsync (see the sketch after this list).
- The kernel itself is the bottleneck. If a single kernel takes 50 ms and there's nothing else to overlap with, streams won't help. Profile, then optimize the kernel itself.
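For the cross-device case, the explicit peer copy looks like this; d_on_dev0, d_on_dev1, bytes, and stream are placeholders, and stream is assumed to have been created on device 0:

```cpp
// Copy from device 0's buffer to device 1's buffer, ordered on `stream`.
cudaMemcpyPeerAsync(d_on_dev1, /*dstDevice=*/1,
                    d_on_dev0, /*srcDevice=*/0,
                    bytes, stream);
```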
Streams are how you keep the GPU busy. Get the kernel right first, then use streams to stop wasting time between kernels.
