GPU Resources in Slurm
GPUs in Slurm are managed as Generic Resources (GRES). Unlike CPUs and memory which Slurm tracks natively, GPUs must be explicitly requested. If you don’t ask for GPUs, your job won’t see any — even if the node has eight A100s sitting idle.
The --gres Flag
Basic GPU Request
Request N GPUs of any type:
```bash
#SBATCH --gres=gpu:2
```
Specific GPU Types
If your cluster has mixed GPU hardware, specify the type:
```bash
#SBATCH --gres=gpu:a100:4   # 4x A100
#SBATCH --gres=gpu:v100:2   # 2x V100
```
Slurm matches your request to nodes that have the right GPU type. If no matching nodes have capacity, the job stays PENDING.
Check Available GPUs
Run sinfo -o "%N %G" to see which nodes have which GPU types. Don't guess — requesting a GPU type that doesn't exist queues your job forever with no error.
CUDA_VISIBLE_DEVICES Mapping
When Slurm allocates GPUs, it sets CUDA_VISIBLE_DEVICES to control which GPUs each task can see. This is the critical connection between Slurm’s allocation and your CUDA code.
If a node has GPUs 0-7 and Slurm gives your task GPUs 2 and 5, your process sees CUDA_VISIBLE_DEVICES=2,5. PyTorch’s torch.cuda.device(0) maps to physical GPU 2, and torch.cuda.device(1) maps to physical GPU 5.
The Remapping Trap
The most confusing aspect of CUDA_VISIBLE_DEVICES is that it remaps physical GPU IDs to logical device indices. Your code always sees devices starting from 0, regardless of which physical GPUs were allocated.
This remapping means:
- torch.cuda.device(0) always refers to the first allocated GPU, not physical GPU 0
- Setting CUDA_VISIBLE_DEVICES manually in your script overrides Slurm's allocation — don't do this
- If Slurm gives you GPUs 2,5,7 on a node, nvidia-smi inside your job shows them as devices 0,1,2
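The remapping can be made concrete with a few lines of Python. Here physical_gpu is a hypothetical helper, not part of any library; it simply parses the variable Slurm set:

```python
import os

def physical_gpu(logical_index: int) -> str:
    """Map a logical CUDA device index back to the physical GPU ID
    Slurm allocated, by parsing CUDA_VISIBLE_DEVICES (hypothetical helper)."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    ids = [v.strip() for v in visible.split(",") if v.strip()]
    if not ids:
        raise RuntimeError("No GPUs visible: did you request --gres=gpu:N?")
    return ids[logical_index]

# Suppose Slurm allocated physical GPUs 2 and 5 to this task
os.environ["CUDA_VISIBLE_DEVICES"] = "2,5"
print(physical_gpu(0))  # 2  (cuda:0 is physical GPU 2)
print(physical_gpu(1))  # 5  (cuda:1 is physical GPU 5)
```

Your training code should never need this mapping for correctness; it is mainly useful when correlating framework device indices with nvidia-smi output on the host.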
GPU Topology and Binding
On multi-GPU nodes, not all GPU pairs communicate at the same speed. GPUs connected via NVLink exchange data at 600 GB/s, while GPUs on separate PCIe switches communicate at 32 GB/s. This 18x difference makes GPU placement critical for distributed training.
The --gpu-bind Flag
Slurm can pin tasks to GPUs with optimal CPU affinity:
```bash
#SBATCH --gpu-bind=closest          # Bind each task to GPUs on the same NUMA node
#SBATCH --gpu-bind=map_gpu:0,1,2,3  # Explicit GPU-to-task mapping
```
closest is the right default for most workloads — it ensures the CPU cores running your code are on the same NUMA node as the GPUs, avoiding cross-socket memory access penalties.
Checking Topology
```bash
# See GPU interconnect topology
nvidia-smi topo -m
# Output:
#        GPU0  GPU1  GPU2  GPU3
# GPU0    X    NV12  NV12  NV12
# GPU1   NV12   X    NV12  NV12
# GPU2   NV12  NV12   X    NV12
# GPU3   NV12  NV12  NV12   X
#
# NV12 = NVLink, 12 lanes (bidirectional)
```
Multi-Node GPU Training
Two dominant patterns exist for launching distributed training on Slurm. The choice depends on your training framework.
Pattern 1: One Task Per GPU (srun)
Each task is a separate process with exactly one GPU. Slurm handles process placement. Works with PyTorch DDP where each process initializes torch.distributed itself (the legacy torch.distributed.launch style).
```bash
#!/bin/bash
#SBATCH --job-name=ddp-training
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:a100:4
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=128G
#SBATCH --time=24:00:00
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err

set -euo pipefail

module load cuda/12.1
source activate myenv

export MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n1)
export MASTER_PORT=$((29500 + RANDOM % 100))
export NCCL_SOCKET_IFNAME=ib0

# SLURM_PROCID is set per task by srun, so it must be expanded inside
# each task (single quotes), not by the batch shell on the first node.
srun bash -c 'python train.py \
    --world_size $SLURM_NTASKS \
    --rank $SLURM_PROCID \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT'
```
Each of the 8 tasks (4 per node × 2 nodes) gets one GPU with CUDA_VISIBLE_DEVICES set automatically.
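Inside train.py, the per-task environment Slurm exports is all you need to configure torch.distributed. The sketch below shows the mapping; slurm_dist_config is a hypothetical helper, and the simulated environment stands in for what srun would set:

```python
import os

def slurm_dist_config() -> dict:
    """Collect torch.distributed settings from the environment that
    Slurm sets for each task (hypothetical helper for Pattern 1)."""
    return {
        "rank": int(os.environ["SLURM_PROCID"]),         # global rank, 0..world_size-1
        "world_size": int(os.environ["SLURM_NTASKS"]),   # total number of tasks
        "local_rank": int(os.environ["SLURM_LOCALID"]),  # rank within this node
        "init_method": "tcp://{}:{}".format(
            os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"]),
    }

# Simulated environment for task 5 of the 2-node x 4-GPU job above
os.environ.update({"SLURM_PROCID": "5", "SLURM_NTASKS": "8",
                   "SLURM_LOCALID": "1", "MASTER_ADDR": "gpu-node-1",
                   "MASTER_PORT": "29500"})
cfg = slurm_dist_config()
print(cfg["rank"], cfg["world_size"], cfg["init_method"])
# 5 8 tcp://gpu-node-1:29500

# In a real train.py you would then call (assuming PyTorch):
#   torch.distributed.init_process_group("nccl", init_method=cfg["init_method"],
#       rank=cfg["rank"], world_size=cfg["world_size"])
#   torch.cuda.set_device(0)  # with --gpus-per-task=1, each task sees one GPU as cuda:0
```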
Pattern 2: One Task Per Node (torchrun)
Each node runs one task that internally spawns GPU workers using torchrun. The node task sees all allocated GPUs.
```bash
#!/bin/bash
#SBATCH --job-name=fsdp-training
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:a100:4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --time=48:00:00
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err

set -euo pipefail

module load cuda/12.1
source activate myenv

export MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n1)
export MASTER_PORT=$((29500 + RANDOM % 100))
export NCCL_SOCKET_IFNAME=ib0

# SLURM_NODEID differs per node, so expand it inside each task
# (single quotes) rather than in the batch shell.
srun bash -c 'torchrun \
    --nproc_per_node=$SLURM_GPUS_ON_NODE \
    --nnodes=$SLURM_NNODES \
    --node_rank=$SLURM_NODEID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py'
```
Don't Mix the Patterns
Using --gpus-per-task=1 with torchrun causes each spawned worker to see only one GPU. Using --gpus-per-node with a direct srun launch means every task sees all GPUs and competes for them. Match your flag to your launch pattern.
NCCL Configuration
NCCL (NVIDIA Collective Communications Library) handles GPU-to-GPU communication in distributed training. On Slurm clusters, you often need to set environment variables for NCCL to find the right network interface:
```bash
# Use InfiniBand if available
export NCCL_IB_DISABLE=0
export NCCL_NET_GDR_LEVEL=5

# Or specify the network interface explicitly
export NCCL_SOCKET_IFNAME=ib0

# Enable debug logging for troubleshooting
export NCCL_DEBUG=INFO
```
The MASTER_ADDR for distributed training should resolve to the first node in your allocation. Slurm provides this via SLURM_NODELIST:
```bash
export MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n1)
export MASTER_PORT=29500
```
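For the common prefix[a-b] nodelist form, what scontrol show hostname does can be illustrated with a rough Python sketch. first_host is a hypothetical helper that handles only the simplest cases; real scripts should keep using scontrol:

```python
import re

def first_host(nodelist: str) -> str:
    """Return the first hostname from a Slurm nodelist like
    'gpu[03-06]' or 'gpu03'. Handles only the simple single-range
    form; real code should shell out to `scontrol show hostname`."""
    m = re.match(r"([^\[]+)\[(\d+)", nodelist)
    if not m:
        # Plain hostname, or a comma-separated list of plain names
        return nodelist.split(",")[0]
    prefix, start = m.group(1), m.group(2)
    return prefix + start  # zero-padding is preserved, e.g. gpu03

print(first_host("gpu[03-06]"))  # gpu03
print(first_host("gpu07"))       # gpu07
```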
MIG: Multi-Instance GPU
NVIDIA A100 and H100 GPUs support Multi-Instance GPU (MIG), which partitions a single physical GPU into up to 7 independent instances. Each instance has dedicated compute, memory, and cache — complete hardware isolation.
Requesting MIG Profiles
```bash
# Request 1 MIG instance with 1 compute slice and 5GB memory
#SBATCH --gres=gpu:a100_1g.5gb:1

# Request 2 MIG instances with 2 compute slices and 10GB each
#SBATCH --gres=gpu:a100_2g.10gb:2

# Available MIG profiles:
#
#   A100-40GB:               A100-80GB / H100:
#   1g.5gb  — 1/7 of GPU     1g.10gb — 1/7 of GPU
#   2g.10gb — 2/7 of GPU     2g.20gb — 2/7 of GPU
#   3g.20gb — 3/7 of GPU     3g.40gb — 3/7 of GPU
#   4g.20gb — 4/7 of GPU     4g.40gb — 4/7 of GPU
#   7g.40gb — Full GPU       7g.80gb — Full GPU
```
MIG and CUDA_VISIBLE_DEVICES
With MIG, CUDA_VISIBLE_DEVICES uses a different format: MIG UUIDs instead of integer IDs. Your CUDA code doesn't need changes — the driver handles the translation. But nvidia-smi commands need the -i flag with the MIG UUID.
When to Use MIG
- Inference serving: Pack multiple models on one GPU with guaranteed isolation
- Interactive development: Give each user a slice instead of a whole GPU
- Small experiments: Hyperparameter sweeps where each run fits in 5-10GB
- DO NOT use for: Distributed training that needs GPU-to-GPU communication (MIG instances can’t use NVLink with each other)
GPU Memory and Resource Flags
CPU and Memory Per GPU
Slurm provides flags to tie CPU and memory allocation to GPU count:
```bash
#SBATCH --cpus-per-gpu=8   # 8 CPU cores per GPU (for data loading)
#SBATCH --mem-per-gpu=32G  # 32GB CPU memory per GPU
```
These are often better than --cpus-per-task and --mem because they scale automatically — request 4 GPUs and you get 32 cores and 128GB without calculating.
VRAM Is Not Managed by Slurm
Slurm allocates physical GPUs but does not track GPU memory (VRAM). If your model needs 40GB and you get an A100-40GB, there’s no Slurm error — you get an OOM at runtime.
```bash
# Check GPU memory before submitting
sinfo -o "%N %G %m" -p gpu

# Monitor VRAM during a running job (run from a login node;
# --overlap lets the step share the job's allocated GPUs)
srun --jobid=<jobid> --overlap nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5
```
PyTorch Memory Settings
```bash
# Combine multiple settings with commas (separate exports overwrite each other!)
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# expandable_segments:True — reduces fragmentation for large models
# max_split_size_mb:512   — limits allocation block size, leaves room for NCCL buffers
```
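If you build the value programmatically, compose it once instead of exporting twice. A minimal sketch in plain Python (no PyTorch required; the variable must be set before torch first touches CUDA):

```python
import os

# Build the comma-separated value in one place so a later setting
# doesn't silently overwrite an earlier export.
alloc_conf = {
    "expandable_segments": "True",  # reduce fragmentation for large models
    "max_split_size_mb": "512",     # cap allocation block size
}
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = ",".join(
    f"{k}:{v}" for k, v in alloc_conf.items()
)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
# expandable_segments:True,max_split_size_mb:512
```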
Common Pitfalls
1. Not requesting GPUs at all
Your job lands on a GPU node but CUDA_VISIBLE_DEVICES is empty. PyTorch falls back to CPU. Always include --gres=gpu:N.
2. Mismatching tasks and GPUs
With --ntasks-per-node=8 and --gres=gpu:4, half your tasks have no GPU. Match tasks to available GPUs or use --gpus-per-task=1.
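A cheap guard at the top of a training script catches this mismatch before NCCL hangs. check_task_gpu_match is a hypothetical helper built on the SLURM_NTASKS_PER_NODE and SLURM_GPUS_ON_NODE variables Slurm exports:

```python
import os

def check_task_gpu_match() -> None:
    """Fail fast when tasks per node outnumber GPUs per node
    (a hypothetical guard for the top of a training script)."""
    tasks = int(os.environ.get("SLURM_NTASKS_PER_NODE", "1"))
    gpus = int(os.environ.get("SLURM_GPUS_ON_NODE", "0"))
    if gpus and tasks > gpus:
        raise RuntimeError(
            f"{tasks} tasks per node but only {gpus} GPUs: "
            f"{tasks - gpus} tasks will have no GPU")

# Simulate the mismatched request from the pitfall above
os.environ.update({"SLURM_NTASKS_PER_NODE": "8", "SLURM_GPUS_ON_NODE": "4"})
try:
    check_task_gpu_match()
except RuntimeError as e:
    print(e)  # 8 tasks per node but only 4 GPUs: 4 tasks will have no GPU
```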
3. Hardcoding GPU count in training code
If your script hardcodes 4 GPU workers but Slurm allocated only 2, spawning a worker for device index 3 fails with a CUDA error. Derive the count from SLURM_GPUS_ON_NODE or torch.cuda.device_count(), or let the framework auto-detect.
4. Forgetting MASTER_ADDR for multi-node
Without a master address, distributed training hangs. Extract it from SLURM_NODELIST in your job script.
Debugging GPU Allocation
Verify Your Allocation
```bash
# Check what GPUs your job got
scontrol show job $SLURM_JOB_ID | grep -i gres

# See CUDA_VISIBLE_DEVICES inside the job
echo $CUDA_VISIBLE_DEVICES

# Full GPU status
nvidia-smi
```
Common Error Messages
"CUDA error: no kernel image is available for execution on the device"
You compiled for the wrong GPU architecture. Check nvidia-smi for the actual GPU, then set TORCH_CUDA_ARCH_LIST correctly.
"NCCL error: unhandled system error / timeout"
Network issue between nodes. Check: (1) NCCL_SOCKET_IFNAME matches your cluster’s network interface, (2) firewall isn’t blocking the master port, (3) MASTER_ADDR resolves correctly.
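A quick preflight on item (3) can save a 30-minute NCCL timeout. check_master is a hypothetical helper that only verifies DNS resolution of the master address, not port reachability:

```python
import socket

def check_master(addr: str, port: int) -> str:
    """Verify that MASTER_ADDR resolves before launching distributed
    training (hypothetical preflight helper)."""
    try:
        ip = socket.gethostbyname(addr)
    except socket.gaierror:
        return f"{addr} does not resolve: check your SLURM_NODELIST extraction"
    return f"{addr} -> {ip}, will rendezvous on port {port}"

print(check_master("localhost", 29500))
```

Run it on every node of the allocation; a node that cannot resolve the master is the usual culprit behind "unhandled system error" at startup.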
"RuntimeError: CUDA out of memory"
Your model exceeds the GPU's VRAM. Solutions: reduce batch size, enable gradient checkpointing, use FSDP/DeepSpeed for model sharding, or request a GPU with more memory.
Job stuck in PENDING with reason "ReqNodeNotAvail"
The GPU type you requested doesn’t exist or all matching nodes are down. Run sinfo -o "%N %G %T" -p gpu to check available GPU nodes.
CUDA_VISIBLE_DEVICES is empty
You forgot --gres=gpu:N. Without it, Slurm doesn’t allocate GPUs even on GPU nodes.
Key Takeaways
- Always use --gres — GPUs are not allocated by default. No --gres = no GPUs, even on GPU nodes.
- Two launch patterns — --gpus-per-task + srun (1 process per GPU) or --gpus-per-node + torchrun (1 process per node).
- CUDA_VISIBLE_DEVICES is set by Slurm — don't set it manually in your script. Let Slurm handle the mapping.
- Configure NCCL for your network — set NCCL_SOCKET_IFNAME and MASTER_ADDR for multi-node communication to work.
Related Concepts
- Slurm Fundamentals: Core commands and job lifecycle
- Slurm Resource Management: Priority, fair-share, and monitoring
- Distributed Parallelism: Data parallel and model parallel strategies
- Multi-GPU Communication: NCCL collectives (AllReduce, AllGather)
- NCCL Communication: Deep dive into GPU-to-GPU communication primitives
Further Reading
- Slurm Generic Resource (GRES) Scheduling - Official documentation on GPU and other generic resource allocation
- Multi-GPU Training with Slurm - PyTorch's guide to distributed training on Slurm clusters
- NCCL Environment Variables - NVIDIA's reference for NCCL configuration used in multi-node training
