CUDA Multi-Process Service (MPS): GPU Sharing for Concurrent Workloads

Complete guide to CUDA MPS — architecture, performance benchmarks vs time-slicing and MIG, thread percentage planning, production deployment with systemd and Kubernetes, profiling with nsys, and troubleshooting.

What is CUDA Multi-Process Service (MPS)?

CUDA Multi-Process Service (MPS) is a client-server architecture that enables multiple CUDA processes to share a single GPU context, allowing them to submit work concurrently to the GPU and achieve better utilization. Without MPS, CUDA contexts from different processes are time-sliced sequentially, leading to GPU underutilization when individual processes launch small kernels.

MPS eliminates this overhead by multiplexing work from multiple clients through a single server process that manages a shared GPU context.

The Problem: GPU Underutilization

Modern NVIDIA GPUs contain thousands of CUDA cores capable of executing work from multiple kernels simultaneously. However, the default CUDA execution model creates isolation between processes by giving each its own exclusive GPU context.

When multiple processes try to use the GPU, the driver time-slices these contexts—meaning only one process can submit work at a time, and context switches incur significant overhead.

Time-Slicing Issues

Consider a scenario where you have multiple small inference services running—each launches CUDA kernels that use only 20% of the GPU's streaming multiprocessors (SMs):

  • Process A runs its kernel using 20% of GPU → 80% of SMs idle
  • Context switch overhead (~10-100 microseconds)
  • Process B runs its kernel using 20% of GPU → 80% of SMs idle
  • Context switch overhead
  • Process C runs → more idle time

The GPU spends most of its time either idle or switching contexts. With MPS, all three processes submit work concurrently through a shared context, and the GPU scheduler assigns them to different SMs simultaneously—achieving 60% utilization instead of 20%.
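
That arithmetic can be sketched as a toy model (pure Python, no GPU required; the kernel duration and switch cost below are illustrative assumptions, not measurements):

```python
# Toy model: average GPU utilization for N small kernels,
# time-sliced vs concurrent. All numbers are illustrative.

def timesliced_utilization(n_procs, sm_fraction, kernel_us, switch_us):
    """Each process runs alone: the GPU is busy at sm_fraction while a
    kernel runs, and fully idle during each context switch."""
    busy = n_procs * kernel_us * sm_fraction
    total = n_procs * (kernel_us + switch_us)
    return busy / total

def concurrent_utilization(n_procs, sm_fraction):
    """Under MPS the kernels overlap on different SMs (capped at 100%)."""
    return min(1.0, n_procs * sm_fraction)

# Three services, each using 20% of SMs, 100 us kernels, 50 us switches:
print(timesliced_utilization(3, 0.20, 100, 50))  # ≈ 0.13
print(concurrent_utilization(3, 0.20))           # ≈ 0.60
```

The time-sliced average lands below even the 20% instantaneous figure, because switch time is pure idle time.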

MPS Architecture

MPS operates through a client-server model with three key components:

1. MPS Control Daemon

  • Binary: nvidia-cuda-mps-control
  • Role: Management interface
  • Functions:
    • Start/stop MPS servers
    • Configure per-device settings
    • Handle client connections
    • Manage pipe directories

2. MPS Server

  • Binary: nvidia-cuda-mps-server
  • Role: GPU context owner
  • Functions:
    • Create shared GPU context
    • Multiplex CUDA calls from multiple clients
    • Submit kernels to GPU
    • Manage device memory

3. Client Library

  • Library: libcuda.so (MPS-aware)
  • Role: Transparent interception
  • Functions:
    • Intercept CUDA API calls
    • Route to MPS server via named pipes
    • Handle synchronization
    • Manage client state

How MPS Works

When a CUDA application runs under MPS, the execution flow changes fundamentally:

1. Application Launch

Client process starts and loads the CUDA driver library (libcuda.so). If MPS environment variables are set, the library detects MPS mode.

2. MPS Connection

The CUDA library connects to the MPS control daemon via named pipes in /tmp/nvidia-mps/ (or path specified by CUDA_MPS_PIPE_DIRECTORY). Control daemon authenticates the client and provides a connection to the appropriate MPS server.

3. Context Initialization

Instead of creating its own GPU context, the client receives a handle to the shared context managed by the MPS server. This is transparent to the application—it still uses standard CUDA API calls.

4. Kernel Launch

When the application calls cudaLaunchKernel(), the CUDA library serializes the kernel parameters and sends them through the pipe to the MPS server. The server queues the work and submits it to the GPU using its shared context.

5. Concurrent Execution

The GPU's hardware scheduler receives kernels from multiple clients (via the single MPS server) and distributes them across available SMs. Kernels from different clients can execute simultaneously if resources permit.

6. Synchronization

When a client calls cudaDeviceSynchronize(), it waits on its own submitted work. The MPS server tracks which kernels belong to which client and signals completion appropriately.
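
The flow above can be caricatured in pure Python: many clients funnel work to one server that owns the shared context and signals per-client completion. This is a toy sketch only; real MPS uses named pipes, a real GPU context, and truly concurrent execution across SMs rather than a single worker thread.

```python
import queue
import threading

class ToyMPSServer:
    """One worker thread stands in for the single shared GPU context."""
    def __init__(self):
        self.work = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, done = self.work.get()
            fn()          # "execute the kernel" on the shared context
            done.set()    # signal completion back to the owning client

    def launch(self, fn):
        """Client side: serialize a kernel launch to the server (step 4)."""
        done = threading.Event()
        self.work.put((fn, done))
        return done

server = ToyMPSServer()
results = []
events = [server.launch(lambda i=i: results.append(i * i)) for i in range(3)]
for e in events:          # cudaDeviceSynchronize analogue (step 6)
    e.wait()
print(sorted(results))    # [0, 1, 4]
```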

Benefits of MPS

Reduced Overhead

A single shared GPU context eliminates context-switch costs (~10-100 µs per switch)

Better GPU Utilization

Multiple small kernels can fill the GPU instead of leaving it mostly idle

Transparency

Applications require no code changes—MPS operates at the CUDA driver level

Simplified Management

Control daemon provides centralized administration

Improved Isolation (Volta+)

Hardware improvements on Volta and newer architectures provide better isolation between clients with Address Space Isolation (ASI)

Memory Management

Pre-Volta (Pascal and Earlier): Limited Isolation

On GPUs before Volta (e.g., GTX 1080, Tesla P100), MPS provides minimal process isolation:

  • All clients share the same virtual address space
  • No hardware memory protection
  • A buggy client can corrupt another client's GPU memory
  • Error propagation affects all clients

Recommendation: Only use pre-Volta MPS for trusted workloads or development environments, not multi-tenant production systems.

Volta and Later: Improved Isolation

Starting with the Volta architecture (Tesla V100) and continuing on later generations (Turing GPUs such as the RTX 2080, Ampere, Hopper), NVIDIA introduced hardware-level improvements:

  • Address Space Isolation (ASI): Each client gets its own GPU virtual address space
  • Memory Protection: Hardware prevents clients from accessing each other's memory
  • Fault Isolation: GPU faults in one client don't crash others (with caveats)
  • Better QoS: Improved scheduling fairness between clients
  • Compute Preemption: Long-running kernels can be preempted to improve responsiveness

Starting MPS

Basic Setup

```bash
# Ensure NVIDIA driver is loaded
nvidia-smi

# Set environment variables (optional, has defaults)
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps

# Start MPS control daemon
nvidia-cuda-mps-control -d

# Verify MPS is running
ps aux | grep mps
# You should see nvidia-cuda-mps-control immediately;
# nvidia-cuda-mps-server appears once the first client connects.
```

Configuration Options

Key Environment Variables:

  • CUDA_VISIBLE_DEVICES: Which GPUs are available to MPS
  • CUDA_MPS_PIPE_DIRECTORY: Location of communication pipes (default: /tmp/nvidia-mps)
  • CUDA_MPS_LOG_DIRECTORY: Location of log files (default: /var/log/nvidia-mps)
  • CUDA_MPS_ACTIVE_THREAD_PERCENTAGE: Max % of device threads per client (0-100, default: 100)
  • CUDA_DEVICE_MAX_CONNECTIONS: Number of hardware work queues (host-to-device connections) available for concurrent streams (default: 8)

Planning Thread Allocation

When running multiple MPS clients, use CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to limit each client’s SM access. Without limits, a single greedy client can monopolize the GPU.
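
One rough way to plan this is to divide the device among clients with mild oversubscription, since clients rarely peak simultaneously. This is a sketch; the 1.2 factor is an assumption to tune per workload, not an NVIDIA recommendation.

```python
# Sketch: pick a CUDA_MPS_ACTIVE_THREAD_PERCENTAGE for N equal clients.

def thread_percentage(n_clients, oversubscription=1.2):
    """Percentage of device threads each client may use (1-100)."""
    if n_clients < 1:
        raise ValueError("need at least one client")
    pct = (100.0 / n_clients) * oversubscription
    return max(1, min(100, round(pct)))

print(thread_percentage(4))   # 30: four clients at 30% each, mild overlap
print(thread_percentage(1))   # 100: a lone client keeps the whole device
```

Each client would then export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE with the computed value before starting.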

Interactive Control

```bash
# Interactive mode
nvidia-cuda-mps-control
nvidia-mps> get_server_list
nvidia-mps> get_device_client_list 0
nvidia-mps> set_default_active_thread_percentage 50
nvidia-mps> quit

# Non-interactive mode
echo "get_server_list" | nvidia-cuda-mps-control

# Gracefully stop MPS
echo "quit" | nvidia-cuda-mps-control
```
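
For automation, the echo-pipe pattern can be wrapped in Python. This is a sketch: it assumes nvidia-cuda-mps-control is on PATH and the daemon is running, and the `runner` parameter exists only to make the wrapper testable without a GPU.

```python
import subprocess

def mps_control(command: str, runner=subprocess.run) -> str:
    """Send one command to the MPS control daemon and return its output.
    Equivalent to: echo "<command>" | nvidia-cuda-mps-control"""
    result = runner(
        ["nvidia-cuda-mps-control"],
        input=command + "\n",
        capture_output=True,
        text=True,
        timeout=10,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip() or "mps-control failed")
    return result.stdout.strip()

# Example calls (require a running daemon):
#   servers = mps_control("get_server_list")
#   mps_control("set_default_active_thread_percentage 50")
```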

MPS vs Time-Slicing vs MIG

Time-Slicing (Default)

  • Isolation: Full process isolation
  • Utilization: Poor for small workloads
  • Overhead: High context switch cost
  • Use Case: Single large process or strict isolation needs

MPS (Multi-Process Service)

  • Isolation: Limited on pre-Volta, good on Volta+
  • Utilization: Excellent for small concurrent workloads
  • Overhead: Minimal
  • Use Case: Multiple small inference services, trusted multi-tenant

MIG (Multi-Instance GPU)

  • Isolation: Hardware-enforced partitioning
  • Utilization: Good but partitioned
  • Overhead: None
  • Use Case: Strict multi-tenant isolation (Ampere/Hopper only)
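
The trade-offs above can be condensed into a rough decision helper (a sketch encoding only these rules of thumb; real choices also depend on memory footprints and latency SLOs):

```python
def gpu_sharing_strategy(strict_isolation: bool,
                         arch: str,
                         many_small_kernels: bool) -> str:
    """Rule-of-thumb mapping from requirements to a sharing mode.
    `arch` is one of 'pre-volta', 'volta', 'ampere', 'hopper'."""
    if strict_isolation:
        # MIG needs Ampere or newer; otherwise fall back to time-slicing,
        # which keeps full process isolation at the cost of utilization.
        return "MIG" if arch in ("ampere", "hopper") else "time-slicing"
    if many_small_kernels:
        return "MPS"
    return "time-slicing"  # one big process already fills the GPU

print(gpu_sharing_strategy(True, "hopper", True))     # MIG
print(gpu_sharing_strategy(False, "volta", True))     # MPS
print(gpu_sharing_strategy(True, "pre-volta", True))  # time-slicing
```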

When to Use MPS

Ideal Use Cases

  • Multiple small inference services sharing a GPU
  • MPI applications with multiple ranks sharing a GPU
  • Microservices architectures with GPU workloads
  • Development environments with multiple users
  • Container orchestration (Kubernetes with the NVIDIA device plugin)

Poor Fits

  • A single large training job (no benefit: it already saturates the GPU)
  • Untrusted multi-tenant workloads on pre-Volta GPUs
  • Applications requiring strict QoS guarantees (use MIG instead)
  • Workloads with large memory allocations competing for space

Best Practices

  1. Use Volta+ GPUs for production multi-tenant scenarios
  2. Set resource limits via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to prevent monopolization
  3. Monitor via logs in CUDA_MPS_LOG_DIRECTORY
  4. Use systemd service for production deployments
  5. Test thoroughly before production—behavior varies by workload
  6. Consider MIG for strict isolation requirements on Ampere+

Production Deployment

systemd Service

```ini
# /etc/systemd/system/nvidia-mps.service
[Unit]
Description=NVIDIA CUDA MPS Control Daemon
After=nvidia-persistenced.service

[Service]
Type=forking
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps"
Environment="CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps"
ExecStartPre=/bin/mkdir -p /var/log/nvidia-mps
ExecStart=/usr/bin/nvidia-cuda-mps-control -d
ExecStop=/bin/bash -c 'echo quit | /usr/bin/nvidia-cuda-mps-control'
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl enable nvidia-mps
sudo systemctl start nvidia-mps
sudo systemctl status nvidia-mps
```

Kubernetes with NVIDIA Device Plugin

```yaml
# DaemonSet for MPS on GPU nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-mps
spec:
  selector:
    matchLabels:
      app: nvidia-mps
  template:
    metadata:
      labels:
        app: nvidia-mps
    spec:
      nodeSelector:
        nvidia.com/gpu.present: 'true'
      containers:
        - name: mps-daemon
          image: nvidia/cuda:12.1.0-base-ubuntu22.04
          command: ['nvidia-cuda-mps-control', '-f']
          env:
            - name: CUDA_MPS_PIPE_DIRECTORY
              value: /tmp/nvidia-mps
          volumeMounts:
            - name: mps-pipes
              mountPath: /tmp/nvidia-mps
          securityContext:
            privileged: true
      volumes:
        - name: mps-pipes
          hostPath:
            path: /tmp/nvidia-mps
```

Container Setup

```bash
# Docker: mount the MPS pipe directory
docker run --gpus all \
  -v /tmp/nvidia-mps:/tmp/nvidia-mps \
  -e CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps \
  my-inference-server
```

Profiling MPS Performance

nsys (NVIDIA Nsight Systems)

```bash
# Profile without MPS (baseline)
echo quit | nvidia-cuda-mps-control 2>/dev/null
nsys profile -o baseline ./inference_server &
nsys profile -o baseline2 ./inference_server2 &
wait

# Profile with MPS
nvidia-cuda-mps-control -d
nsys profile -o with_mps ./inference_server &
nsys profile -o with_mps2 ./inference_server2 &
wait
```

In the nsys timeline: without MPS, kernels appear sequentially with gaps. With MPS, kernels overlap.

PyTorch Inference Example

MPS is transparent — no code changes needed:

```python
import torch

# Just set env vars before launch:
#   export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
#   export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25

model = torch.jit.load('model.pt').cuda().eval()

with torch.no_grad():
    for batch in data_loader:
        output = model(batch.cuda())
        # Kernels are submitted through the MPS server automatically
```
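
To run several such workers under MPS with fair limits, a small launcher can set the cap per process. This is a sketch; worker.py and the worker count are hypothetical placeholders.

```python
import os
import subprocess

def worker_env(thread_pct: int, pipe_dir: str = "/tmp/nvidia-mps") -> dict:
    """Environment for one MPS client, capped at thread_pct% of SM threads."""
    env = dict(os.environ)
    env["CUDA_MPS_PIPE_DIRECTORY"] = pipe_dir
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(thread_pct)
    return env

def launch_workers(n: int, thread_pct: int, script: str = "worker.py"):
    """Spawn n MPS client processes, each with its own capped environment."""
    return [
        subprocess.Popen(["python", script], env=worker_env(thread_pct))
        for _ in range(n)
    ]

# Four workers at 25% of device threads each:
#   for p in launch_workers(4, 25):
#       p.wait()
```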

Troubleshooting

Common issues and first checks:

  • Clients hang at startup: verify that CUDA_MPS_PIPE_DIRECTORY matches between the daemon and clients, and that the directory exists and is writable.
  • Connection failures: by default the daemon only serves clients running as the same user that started it; start it as root to serve all users.
  • Stale state after a crash: remove the pipe directory contents and restart the control daemon.
  • Unexplained behavior: check control.log and server.log in CUDA_MPS_LOG_DIRECTORY.

Key Takeaways

  1. MPS enables concurrent GPU execution — multiple processes share one context, eliminating the ~10-100 µs cost of each context switch.

  2. Use Volta+ for production — Address Space Isolation provides memory protection between clients.

  3. CUDA_MPS_ACTIVE_THREAD_PERCENTAGE prevents monopolization — limit each client’s SM access for fair sharing.

  4. No code changes required — MPS operates at the CUDA driver level transparently.

  5. MPS for small kernels, MIG for isolation — MPS maximizes utilization; MIG provides hardware partitioning.

Further Reading

  • NVIDIA's Multi-Process Service documentation — the authoritative reference for control commands and environment variables
  • NVIDIA Nsight Systems user guide — profiling concurrent GPU workloads
  • NVIDIA Multi-Instance GPU (MIG) user guide — hardware partitioning on Ampere and later
