Your ML training container needs GPU access. But containers are supposed to be isolated — they have their own filesystem, their own process tree, their own network. How does a containerized process talk to physical GPU hardware?
The answer is surprisingly simple once you understand mount namespaces. GPU access is fundamentally a filesystem problem: applications talk to GPUs through device files, and the container runtime makes those files visible by bind-mounting them into the container’s mount namespace.
How Linux Exposes GPUs
Device Files Are the Interface
GPUs don’t have a special API. They appear to userspace as device files in /dev/, just like disks, terminals, and random number generators. The NVIDIA kernel driver (nvidia.ko) creates these character devices when it loads:
```
$ ls -la /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Mar 14 10:00 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Mar 14 10:00 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Mar 14 10:00 /dev/nvidiactl
crw-rw-rw- 1 root root 510,   0 Mar 14 10:00 /dev/nvidia-uvm
crw-rw-rw- 1 root root 510,   1 Mar 14 10:00 /dev/nvidia-uvm-tools
```
When a CUDA program runs, it doesn’t talk to the GPU directly. It opens /dev/nvidia0 (or whichever GPU) and issues ioctl() syscalls through that file descriptor. The kernel routes those calls to the registered NVIDIA kernel driver, which actually communicates with the hardware. So GPU access = file access.
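Because GPU access is file access, you can inspect a device node with ordinary file APIs. A minimal sketch in Python, using `/dev/null` as a stand-in that exists on every Linux machine (on a GPU host, `/dev/nvidia0` would report major number 195, matching the `ls` output above):

```python
import os
import stat

def describe_device(path):
    """Return (is_char_device, major, minor) for a device file."""
    st = os.stat(path)
    return (
        stat.S_ISCHR(st.st_mode),   # character device?
        os.major(st.st_rdev),       # major number: identifies the driver
        os.minor(st.st_rdev),       # minor number: identifies the device instance
    )

# /dev/null is a character device with major 1, minor 3 on Linux.
print(describe_device("/dev/null"))  # (True, 1, 3)
```

The major number is how the kernel routes `ioctl()` calls on that file descriptor to the right driver.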
Figure: GPU device files — the host's /dev/ contains the NVIDIA device nodes, while a bare container's /dev/ has none, so the container cannot see the GPU hardware until those files appear in its mount namespace.
What NVIDIA Container Runtime Does
A bare container has no GPU access. Its /dev/ directory contains only standard devices (null, zero, pts). The NVIDIA container runtime (nvidia-container-runtime) solves this by injecting three things into the container before the application starts:
- Device nodes — bind-mounts `/dev/nvidia0`, `/dev/nvidiactl`, and `/dev/nvidia-uvm` from the host
- Driver libraries — bind-mounts `libcuda.so`, `libnvidia-ml.so`, and other driver-matched libraries
- Device permissions — configures the cgroup device controller to allow access to the NVIDIA device major numbers
All three are required. Device files alone aren’t enough (CUDA needs libcuda.so). Libraries alone aren’t enough (they need the device files to issue ioctl() calls). And even with both, the cgroup device controller must explicitly allow access.
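The first two requirements can be checked mechanically from inside a container. A hedged sketch — the library names and search paths are illustrative defaults, not an official API — that reports which pieces are missing:

```python
import glob
import os

def missing_gpu_pieces(
    dev_glob="/dev/nvidia*",
    libs=("libcuda.so.1", "libnvidia-ml.so.1"),
    lib_dirs=("/usr/lib/x86_64-linux-gnu", "/usr/lib64"),
):
    """Report which GPU prerequisites are absent in this environment.

    Note: cgroup device permissions can't be verified from paths alone;
    the definitive test for the third piece is whether opening
    /dev/nvidia0 actually succeeds.
    """
    missing = []
    if not glob.glob(dev_glob):
        missing.append("device nodes")
    for lib in libs:
        if not any(os.path.exists(os.path.join(d, lib)) for d in lib_dirs):
            missing.append(lib)
    return missing
```

In a correctly configured GPU container this returns an empty list; in a bare container it names every missing piece.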
Figure: the GPU bind-mount process, step by step — starting from a bare container that the OCI runtime creates with only standard devices and no GPU access.
Why Bind Mounts?
Bind mounts are mount namespace operations. They make a file or directory from one location appear at another location — even across namespace boundaries. The NVIDIA runtime bind-mounts host files into the container’s mount namespace, so the container sees them as if they were part of its own filesystem. No copying, no overhead — it’s the same file, just visible from a different mount table.
The Two Things Called “NVIDIA Driver”
This is where most confusion happens. People say “NVIDIA driver” to mean two completely different things.
Kernel Driver (nvidia.ko)
The real driver. It’s a kernel module loaded on the host via modprobe nvidia. It runs in kernel space, talks directly to the GPU over PCIe, manages hardware resources, and handles memory allocation. Only one version can exist per kernel — you can’t run two different versions of nvidia.ko simultaneously.
```
$ lsmod | grep nvidia
nvidia_uvm           1503232  0
nvidia_drm             77824  0
nvidia_modeset       1306624  1 nvidia_drm
nvidia              56446976  2 nvidia_uvm,nvidia_modeset
```
All containers share this kernel driver because all containers share the host kernel. This is a fundamental property of Linux containers.
User-Space Libraries (libcuda.so)
These are not drivers. They’re client libraries that talk to the kernel driver through device files. libcuda.so is the CUDA driver API — it opens /dev/nvidia0 and issues ioctl() calls. libcudart.so is the CUDA runtime API that most applications use. libcublas.so, libcudnn.so, and others are higher-level libraries built on top.
Different containers can ship different versions of these libraries. The full call stack looks like:
```
PyTorch
  ↓
libcudart.so   (CUDA Runtime — lives in container)
  ↓
libcuda.so     (CUDA Driver API — bind-mounted from host)
  ↓
/dev/nvidia0   (device file — bind-mounted from host)
  ↓
ioctl() syscall — crosses user/kernel boundary
  ↓
nvidia.ko      (kernel driver — host kernel)
  ↓
GPU Hardware
```
Figure: kernel driver vs user-space libraries — how the GPU software stack is split between container and host.
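Because the stack is layered, you can probe each layer separately. A hedged sketch that checks whether the driver API library is even loadable — in a container without the NVIDIA runtime's bind mounts, loading fails even if a full CUDA toolkit is installed in the image:

```python
import ctypes

def cuda_driver_available():
    """True if libcuda.so.1 loads and the driver API initializes.

    cuInit(0) returns 0 (CUDA_SUCCESS) only when the kernel driver
    and a GPU are actually reachable through /dev/nvidia*.
    """
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return False  # driver library not bind-mounted into this filesystem
    return libcuda.cuInit(0) == 0

print(cuda_driver_available())
```

This distinguishes "the library is missing" (the `OSError` path) from "the library is present but can't reach the driver" (`cuInit` returning nonzero).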
CUDA Version Compatibility
Because the kernel driver is shared but CUDA libraries can differ per container, version compatibility matters. The rule: a newer host driver supports containers with older CUDA toolkits, but an older driver cannot serve a newer toolkit.
A host running NVIDIA driver 550 can serve containers using CUDA 11.8, 12.0, 12.1, or 12.4. But a host running driver 535 cannot serve a container using CUDA 12.4 — the container’s libcuda.so would try to call kernel driver APIs that don’t exist in the older driver.
CUDA Version Compatibility Matrix
| Host Driver ↓ / Container CUDA → | CUDA 11.8 | CUDA 12.0 | CUDA 12.1 | CUDA 12.4 |
|---|---|---|---|---|
| Driver 535 | ✅ | ✅ | ✅ | ❌ |
| Driver 545 | ✅ | ✅ | ✅ | ❌ |
| Driver 550 | ✅ | ✅ | ✅ | ✅ |
| Driver 555 | ✅ | ✅ | ✅ | ✅ |
Rule of thumb: upgrade the host driver, not the container's CUDA. A newer host driver is backward-compatible with all older CUDA toolkit versions.
The Golden Rule
You cannot run a newer CUDA toolkit than your host driver supports. The container’s CUDA libraries call into the host kernel driver — if the driver is too old, those calls fail with “CUDA driver version is insufficient.” Upgrade the host driver, not the container.
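The golden rule can be encoded as a lookup of minimum required driver versions. A sketch — the minimums below reflect NVIDIA's published Linux requirements per toolkit, but treat them as illustrative and check the CUDA release notes for your exact versions:

```python
# Approximate minimum Linux driver major version per CUDA toolkit
# (illustrative — consult NVIDIA's CUDA release notes for exact values).
MIN_DRIVER = {
    "11.8": 520,
    "12.0": 525,
    "12.1": 530,
    "12.4": 550,
}

def driver_supports(driver_major, cuda_version):
    """Can a host driver of this major version serve a container
    shipping this CUDA toolkit?"""
    return driver_major >= MIN_DRIVER[cuda_version]

print(driver_supports(550, "12.4"))  # True  — newer driver, older-or-equal toolkit
print(driver_supports(535, "12.4"))  # False — container CUDA too new for host driver
```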
GPU Access Methods Compared
There are several ways to give containers GPU access. They differ significantly in security and production-readiness.
GPU Access Methods Compared
Comparing container GPU passthrough approaches across isolation, security, usability, and production readiness.
| Method | Isolation | Security | Ease of Use | Production Ready |
|---|---|---|---|---|
| NVIDIA Container Toolkit (`--gpus all`) | Good — only requested GPUs exposed via resource flags | Good — minimal permissions, cgroup device controller enforced | Good — automatic device + library mounting, just add `--gpus` | Yes |
| `--device /dev/nvidia0` | Moderate — manual device selection, must specify each device | Moderate — device exposed, but libraries must be manually mounted | Poor — must know device paths, manually mount driver libraries | No |
| `--privileged` | None — full access to ALL host devices | None — all capabilities granted, all devices accessible, effectively root on host | Good — everything just works, no configuration needed | No |
| Kubernetes Device Plugin | Good — resource requests allocate specific GPUs per pod | Good — plugin manages device permissions and node scheduling | Good — declarative: `resources.limits."nvidia.com/gpu": 1` | Yes |
Use the NVIDIA Container Toolkit with Docker and the Device Plugin with Kubernetes; both handle device mounting, library injection, and cgroup permissions automatically. Avoid --privileged, which gives the container full host access, and bare --device, which misses library mounting and is fragile across driver updates.
Common Pitfalls
nvidia-smi works but CUDA fails
nvidia-smi uses libnvidia-ml.so to query the management interface. CUDA uses libcuda.so to submit compute work. They’re different code paths. If nvidia-smi shows your GPU but CUDA programs fail, the CUDA libraries are missing or mismatched — check that libcuda.so is properly mounted and its version matches the host kernel driver.
“No CUDA-capable device is detected”
The device files aren’t visible in the container. Either the runtime didn’t inject them (missing --gpus flag) or the cgroup device controller is blocking access. Check ls /dev/nvidia* inside the container.
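This check is easy to script; a minimal sketch, equivalent to running `ls /dev/nvidia*` inside the container:

```python
import glob

def visible_nvidia_devices():
    """Device nodes this mount namespace can actually see.

    An empty list means the runtime never injected them —
    typically a missing --gpus flag.
    """
    return sorted(glob.glob("/dev/nvidia*"))

print(visible_nvidia_devices())  # [] in a container without GPU injection
```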
Driver version mismatch
The container expects a newer CUDA version than the host driver supports. Check nvidia-smi on the host for the driver version, then verify it meets the minimum requirement for the container’s CUDA toolkit.
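Driver version strings like 550.54.14 compare component-wise, not as strings or floats. A small sketch for checking a host driver against a toolkit's minimum (the version numbers shown are illustrative):

```python
def parse_version(v):
    """'550.54.14' -> (550, 54, 14), so comparisons are numeric per component."""
    return tuple(int(part) for part in v.split("."))

def meets_minimum(host_driver, required_minimum):
    """True if the host driver is at least the toolkit's minimum version."""
    return parse_version(host_driver) >= parse_version(required_minimum)

print(meets_minimum("550.54.14", "550.54.14"))   # True
print(meets_minimum("535.161.07", "550.54.14"))  # False — driver too old for this toolkit
```

Tuple comparison handles the case where a naive string compare would get it wrong (e.g. "535.161.07" > "550.54.14" lexicographically on the second component).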
Container can’t see GPU after host driver update
After updating the host kernel driver, existing containers still have the old libcuda.so bind-mounted. Restart the container to pick up the new library version.
Key Takeaways
- GPUs are device files — applications access them by opening /dev/nvidia0 and issuing ioctl() calls through the file descriptor.
- GPU in containers = bind mounts — the NVIDIA runtime mounts device files and driver libraries, and sets cgroup permissions.
- Kernel driver is shared — all containers use the same nvidia.ko because they share the host kernel. One version per host.
- User-space libraries can differ — each container can ship its own CUDA toolkit version as long as the host driver supports it.
- Compatibility is one-way — newer drivers support older CUDA, never the reverse. Upgrade the host driver, not the container.
- Three things are needed — device nodes, driver libraries, and cgroup device permissions. Missing any one of them breaks GPU access.
Related Concepts
- Linux Namespaces: Mount namespaces enable the bind-mount mechanism that exposes GPUs to containers
- Containers Under the Hood: How namespaces + cgroups combine to create containers
- Kernel Modules: How nvidia.ko is loaded and managed
- cgroups: Device controller that permits or blocks GPU access
