CUDA Multi-Process Service (MPS): GPU Sharing for Concurrent Workloads

Complete guide to CUDA MPS — architecture, performance benchmarks vs time-slicing and MIG, thread percentage planning, production deployment with systemd and Kubernetes, profiling with nsys, and troubleshooting.

What is CUDA Multi-Process Service (MPS)?

CUDA Multi-Process Service (MPS) is a client-server architecture that enables multiple CUDA processes to share a single GPU context, allowing them to submit work concurrently to the GPU and achieve better utilization. Without MPS, CUDA contexts from different processes are time-sliced sequentially, leading to GPU underutilization when individual processes launch small kernels.

MPS eliminates this overhead by multiplexing work from multiple clients through a single server process that manages a shared GPU context.

The Problem: GPU Underutilization

Modern NVIDIA GPUs contain thousands of CUDA cores capable of executing work from multiple kernels simultaneously. However, the default CUDA execution model creates isolation between processes by giving each its own exclusive GPU context.

When multiple processes try to use the GPU, the driver time-slices these contexts—meaning only one process can submit work at a time, and context switches incur significant overhead.

Time-Slicing Issues

Consider a scenario where you have multiple small inference services running—each launches CUDA kernels that use only 20% of the GPU's streaming multiprocessors (SMs):

  • Process A runs its kernel using 20% of GPU → 80% of SMs idle
  • Context switch overhead (~10-100 microseconds)
  • Process B runs its kernel using 20% of GPU → 80% of SMs idle
  • Context switch overhead
  • Process C runs → more idle time

The GPU spends most of its time either idle or switching contexts. With MPS, all three processes submit work concurrently through a shared context, and the GPU scheduler assigns them to different SMs simultaneously—achieving 60% utilization instead of 20%.
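
That arithmetic can be sketched as a toy model (pure Python, no GPU required; the kernel duration and switch cost below are illustrative assumptions, not measurements):

```python
# Toy model: average GPU utilization for N small kernels,
# time-sliced vs concurrent. All numbers are illustrative.

def timesliced_utilization(n_procs, sm_fraction, kernel_us, switch_us):
    """Each process runs alone: the GPU is busy at sm_fraction while a
    kernel runs, and fully idle during each context switch."""
    busy = n_procs * kernel_us * sm_fraction
    total = n_procs * (kernel_us + switch_us)
    return busy / total

def concurrent_utilization(n_procs, sm_fraction):
    """Under MPS the kernels overlap on different SMs (capped at 100%)."""
    return min(1.0, n_procs * sm_fraction)

# Three services, each using 20% of SMs, 100 us kernels, 50 us switches:
print(timesliced_utilization(3, 0.20, 100, 50))  # ≈ 0.13
print(concurrent_utilization(3, 0.20))           # ≈ 0.60
```

The time-sliced average lands below even the 20% instantaneous figure, because switch time is pure idle time.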

MPS Architecture

MPS operates through a client-server model with three key components:

1. MPS Control Daemon

  • Binary: nvidia-cuda-mps-control
  • Role: Management interface
  • Functions:
    • Start/stop MPS servers
    • Configure per-device settings
    • Handle client connections
    • Manage pipe directories

2. MPS Server

  • Binary: nvidia-cuda-mps-server
  • Role: GPU context owner
  • Functions:
    • Create shared GPU context
    • Multiplex CUDA calls from multiple clients
    • Submit kernels to GPU
    • Manage device memory

3. Client Library

  • Library: libcuda.so (MPS-aware)
  • Role: Transparent interception
  • Functions:
    • Intercept CUDA API calls
    • Route to MPS server via named pipes
    • Handle synchronization
    • Manage client state

How MPS Works

When a CUDA application runs under MPS, the execution flow changes fundamentally:

1. Application Launch

Client process starts and loads the CUDA driver library (libcuda.so). If MPS environment variables are set, the library detects MPS mode.

2. MPS Connection

The CUDA library connects to the MPS control daemon via named pipes in /tmp/nvidia-mps/ (or path specified by CUDA_MPS_PIPE_DIRECTORY). Control daemon authenticates the client and provides a connection to the appropriate MPS server.

3. Context Initialization

Instead of creating its own GPU context, the client receives a handle to the shared context managed by the MPS server. This is transparent to the application—it still uses standard CUDA API calls.

4. Kernel Launch

When the application calls cudaLaunchKernel(), the CUDA library serializes the kernel parameters and sends them through the pipe to the MPS server. The server queues the work and submits it to the GPU using its shared context.

5. Concurrent Execution

The GPU's hardware scheduler receives kernels from multiple clients (via the single MPS server) and distributes them across available SMs. Kernels from different clients can execute simultaneously if resources permit.

6. Synchronization

When a client calls cudaDeviceSynchronize(), it waits on its own submitted work. The MPS server tracks which kernels belong to which client and signals completion appropriately.
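
The flow above can be caricatured in pure Python: many clients funnel work to one server that owns the shared context and signals per-client completion. This is a toy sketch only; real MPS uses named pipes, a real GPU context, and truly concurrent execution across SMs rather than a single worker thread.

```python
import queue
import threading

class ToyMPSServer:
    """One worker thread stands in for the single shared GPU context."""
    def __init__(self):
        self.work = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, done = self.work.get()
            fn()          # "execute the kernel" on the shared context
            done.set()    # signal completion back to the owning client

    def launch(self, fn):
        """Client side: serialize a kernel launch to the server (step 4)."""
        done = threading.Event()
        self.work.put((fn, done))
        return done

server = ToyMPSServer()
results = []
events = [server.launch(lambda i=i: results.append(i * i)) for i in range(3)]
for e in events:          # cudaDeviceSynchronize analogue (step 6)
    e.wait()
print(sorted(results))    # [0, 1, 4]
```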

Benefits of MPS

Reduced Overhead

A single shared GPU context eliminates context-switch costs (~10-100 µs per switch)

Better GPU Utilization

Multiple small kernels can fill the GPU instead of leaving it mostly idle

Transparency

Applications require no code changes—MPS operates at the CUDA driver level

Simplified Management

Control daemon provides centralized administration

Improved Isolation (Volta+)

Hardware improvements on Volta and newer architectures provide better isolation between clients with Address Space Isolation (ASI)

Memory Management

Pre-Volta (Pascal and Earlier): Limited Isolation

On GPUs before Volta (e.g., GTX 1080, Tesla P100), MPS provides minimal process isolation:

  • All clients share the same virtual address space
  • No hardware memory protection
  • A buggy client can corrupt another client's GPU memory
  • Error propagation affects all clients

Recommendation: Only use pre-Volta MPS for trusted workloads or development environments, not multi-tenant production systems.

Volta and Later: Improved Isolation

Starting with the Volta architecture (Tesla V100) and continuing on later generations (Turing GPUs such as the RTX 2080, Ampere, Hopper), NVIDIA introduced hardware-level improvements:

  • Address Space Isolation (ASI): Each client gets its own GPU virtual address space
  • Memory Protection: Hardware prevents clients from accessing each other's memory
  • Fault Isolation: GPU faults in one client don't crash others (with caveats)
  • Better QoS: Improved scheduling fairness between clients
  • Compute Preemption: Long-running kernels can be preempted to improve responsiveness

Starting MPS

Basic Setup

```bash
# Ensure NVIDIA driver is loaded
nvidia-smi

# Set environment variables (optional, has defaults)
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps

# Start MPS control daemon
nvidia-cuda-mps-control -d

# Verify MPS is running
ps aux | grep mps
# You should see nvidia-cuda-mps-control immediately;
# nvidia-cuda-mps-server appears once the first client connects.
```

Configuration Options

Key Environment Variables:

  • CUDA_VISIBLE_DEVICES: Which GPUs are available to MPS
  • CUDA_MPS_PIPE_DIRECTORY: Location of communication pipes (default: /tmp/nvidia-mps)
  • CUDA_MPS_LOG_DIRECTORY: Location of log files (default: /var/log/nvidia-mps)
  • CUDA_MPS_ACTIVE_THREAD_PERCENTAGE: Max % of device threads per client (0-100, default: 100)
  • CUDA_DEVICE_MAX_CONNECTIONS: Number of hardware work queues (host-to-device connections) available for concurrent streams (default: 8)

Planning Thread Allocation

When running multiple MPS clients, use CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to limit each client’s SM access. Without limits, a single greedy client can monopolize the GPU.
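
One rough way to plan this is to divide the device among clients with mild oversubscription, since clients rarely peak simultaneously. This is a sketch; the 1.2 factor is an assumption to tune per workload, not an NVIDIA recommendation.

```python
# Sketch: pick a CUDA_MPS_ACTIVE_THREAD_PERCENTAGE for N equal clients.

def thread_percentage(n_clients, oversubscription=1.2):
    """Percentage of device threads each client may use (1-100)."""
    if n_clients < 1:
        raise ValueError("need at least one client")
    pct = (100.0 / n_clients) * oversubscription
    return max(1, min(100, round(pct)))

print(thread_percentage(4))   # 30: four clients at 30% each, mild overlap
print(thread_percentage(1))   # 100: a lone client keeps the whole device
```

Each client would then export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE with the computed value before starting.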

Interactive Control

```bash
# Interactive mode
nvidia-cuda-mps-control
nvidia-mps> get_server_list
nvidia-mps> get_device_client_list 0
nvidia-mps> set_default_active_thread_percentage 50
nvidia-mps> quit

# Non-interactive mode
echo "get_server_list" | nvidia-cuda-mps-control

# Gracefully stop MPS
echo "quit" | nvidia-cuda-mps-control
```
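
For automation, the echo-pipe pattern can be wrapped in Python. This is a sketch: it assumes nvidia-cuda-mps-control is on PATH and the daemon is running, and the `runner` parameter exists only to make the wrapper testable without a GPU.

```python
import subprocess

def mps_control(command: str, runner=subprocess.run) -> str:
    """Send one command to the MPS control daemon and return its output.
    Equivalent to: echo "<command>" | nvidia-cuda-mps-control"""
    result = runner(
        ["nvidia-cuda-mps-control"],
        input=command + "\n",
        capture_output=True,
        text=True,
        timeout=10,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip() or "mps-control failed")
    return result.stdout.strip()

# Example calls (require a running daemon):
#   servers = mps_control("get_server_list")
#   mps_control("set_default_active_thread_percentage 50")
```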

MPS vs Time-Slicing vs MIG

Time-Slicing (Default)

  • Isolation: Full process isolation
  • Utilization: Poor for small workloads
  • Overhead: High context switch cost
  • Use Case: Single large process or strict isolation needs

MPS (Multi-Process Service)

  • Isolation: Limited on pre-Volta, good on Volta+
  • Utilization: Excellent for small concurrent workloads
  • Overhead: Minimal
  • Use Case: Multiple small inference services, trusted multi-tenant

MIG (Multi-Instance GPU)

  • Isolation: Hardware-enforced partitioning
  • Utilization: Good but partitioned
  • Overhead: None
  • Use Case: Strict multi-tenant isolation (Ampere/Hopper only)
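
The trade-offs above can be condensed into a rough decision helper (a sketch encoding only these rules of thumb; real choices also depend on memory footprints and latency SLOs):

```python
def gpu_sharing_strategy(strict_isolation: bool,
                         arch: str,
                         many_small_kernels: bool) -> str:
    """Rule-of-thumb mapping from requirements to a sharing mode.
    `arch` is one of 'pre-volta', 'volta', 'ampere', 'hopper'."""
    if strict_isolation:
        # MIG needs Ampere or newer; otherwise fall back to time-slicing,
        # which keeps full process isolation at the cost of utilization.
        return "MIG" if arch in ("ampere", "hopper") else "time-slicing"
    if many_small_kernels:
        return "MPS"
    return "time-slicing"  # one big process already fills the GPU

print(gpu_sharing_strategy(True, "hopper", True))     # MIG
print(gpu_sharing_strategy(False, "volta", True))     # MPS
print(gpu_sharing_strategy(True, "pre-volta", True))  # time-slicing
```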

When to Use MPS

Ideal Use Cases

  • Multiple small inference services sharing a GPU
  • MPI applications with multiple ranks sharing a GPU
  • Microservices architectures with GPU workloads
  • Development environments with multiple users
  • Container orchestration (Kubernetes with the NVIDIA device plugin)

Poor Fits

  • A single large training job (no benefit: it already saturates the GPU)
  • Untrusted multi-tenant workloads on pre-Volta GPUs
  • Applications requiring strict QoS guarantees (use MIG instead)
  • Workloads with large memory allocations competing for space

Best Practices

  1. Use Volta+ GPUs for production multi-tenant scenarios
  2. Set resource limits via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to prevent monopolization
  3. Monitor via logs in CUDA_MPS_LOG_DIRECTORY
  4. Use systemd service for production deployments
  5. Test thoroughly before production—behavior varies by workload
  6. Consider MIG for strict isolation requirements on Ampere+

Production Deployment

systemd Service

```ini
# /etc/systemd/system/nvidia-mps.service
[Unit]
Description=NVIDIA CUDA MPS Control Daemon
After=nvidia-persistenced.service

[Service]
Type=forking
Environment="CUDA_VISIBLE_DEVICES=0"
Environment="CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps"
Environment="CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps"
ExecStartPre=/bin/mkdir -p /var/log/nvidia-mps
ExecStart=/usr/bin/nvidia-cuda-mps-control -d
ExecStop=/bin/bash -c 'echo quit | /usr/bin/nvidia-cuda-mps-control'
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl enable nvidia-mps
sudo systemctl start nvidia-mps
sudo systemctl status nvidia-mps
```

Kubernetes with NVIDIA Device Plugin

```yaml
# DaemonSet for MPS on GPU nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-mps
spec:
  selector:
    matchLabels:
      app: nvidia-mps
  template:
    metadata:
      labels:
        app: nvidia-mps
    spec:
      nodeSelector:
        nvidia.com/gpu.present: 'true'
      containers:
        - name: mps-daemon
          image: nvidia/cuda:12.1.0-base-ubuntu22.04
          command: ['nvidia-cuda-mps-control', '-f']
          env:
            - name: CUDA_MPS_PIPE_DIRECTORY
              value: /tmp/nvidia-mps
          volumeMounts:
            - name: mps-pipes
              mountPath: /tmp/nvidia-mps
          securityContext:
            privileged: true
      volumes:
        - name: mps-pipes
          hostPath:
            path: /tmp/nvidia-mps
```

Container Setup

```bash
# Docker: mount the MPS pipe directory
docker run --gpus all \
  -v /tmp/nvidia-mps:/tmp/nvidia-mps \
  -e CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps \
  my-inference-server
```

Profiling MPS Performance

nsys (NVIDIA Nsight Systems)

```bash
# Profile without MPS (baseline)
echo quit | nvidia-cuda-mps-control 2>/dev/null
nsys profile -o baseline ./inference_server &
nsys profile -o baseline2 ./inference_server2 &
wait

# Profile with MPS
nvidia-cuda-mps-control -d
nsys profile -o with_mps ./inference_server &
nsys profile -o with_mps2 ./inference_server2 &
wait
```

In the nsys timeline: without MPS, kernels appear sequentially with gaps. With MPS, kernels overlap.

PyTorch Inference Example

MPS is transparent — no code changes needed:

```python
import torch

# Just set env vars before launch:
#   export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
#   export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25

model = torch.jit.load('model.pt').cuda().eval()

with torch.no_grad():
    for batch in data_loader:
        output = model(batch.cuda())
        # Kernels are submitted through the MPS server automatically
```
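
To run several such workers under MPS with fair limits, a small launcher can set the cap per process. This is a sketch; worker.py and the worker count are hypothetical placeholders.

```python
import os
import subprocess

def worker_env(thread_pct: int, pipe_dir: str = "/tmp/nvidia-mps") -> dict:
    """Environment for one MPS client, capped at thread_pct% of SM threads."""
    env = dict(os.environ)
    env["CUDA_MPS_PIPE_DIRECTORY"] = pipe_dir
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(thread_pct)
    return env

def launch_workers(n: int, thread_pct: int, script: str = "worker.py"):
    """Spawn n MPS client processes, each with its own capped environment."""
    return [
        subprocess.Popen(["python", script], env=worker_env(thread_pct))
        for _ in range(n)
    ]

# Four workers at 25% of device threads each:
#   for p in launch_workers(4, 25):
#       p.wait()
```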

Troubleshooting

Common issues and first checks:

  • Clients hang at startup: verify that CUDA_MPS_PIPE_DIRECTORY matches between the daemon and clients, and that the directory exists and is writable.
  • Connection failures: by default the daemon only serves clients running as the same user that started it; start it as root to serve all users.
  • Stale state after a crash: remove the pipe directory contents and restart the control daemon.
  • Unexplained behavior: check control.log and server.log in CUDA_MPS_LOG_DIRECTORY.

Key Takeaways

  1. MPS enables concurrent GPU execution — multiple processes share one context, eliminating the ~10-100 µs cost of each context switch.

  2. Use Volta+ for production — Address Space Isolation provides memory protection between clients.

  3. CUDA_MPS_ACTIVE_THREAD_PERCENTAGE prevents monopolization — limit each client’s SM access for fair sharing.

  4. No code changes required — MPS operates at the CUDA driver level transparently.

  5. MPS for small kernels, MIG for isolation — MPS maximizes utilization; MIG provides hardware partitioning.

Further Reading

  • NVIDIA's Multi-Process Service documentation — the authoritative reference for control commands and environment variables
  • NVIDIA Nsight Systems user guide — profiling concurrent GPU workloads
  • NVIDIA Multi-Instance GPU (MIG) user guide — hardware partitioning on Ampere and later
