
Slurm GPU Allocation for Distributed Training

Complete guide to GPU allocation on Slurm — --gres flags, CUDA_VISIBLE_DEVICES remapping, GPU topology and NVLink binding, MIG partitioning, production job scripts, and debugging common GPU errors.


GPU Resources in Slurm

GPUs in Slurm are managed as Generic Resources (GRES). Unlike CPUs and memory which Slurm tracks natively, GPUs must be explicitly requested. If you don’t ask for GPUs, your job won’t see any — even if the node has eight A100s sitting idle.

The --gres Flag

Basic GPU Request

Request N GPUs of any type:

#SBATCH --gres=gpu:2

Specific GPU Types

If your cluster has mixed GPU hardware, specify the type:

#SBATCH --gres=gpu:a100:4   # 4x A100
#SBATCH --gres=gpu:v100:2   # 2x V100

Slurm matches your request to nodes that have the right GPU type. If no matching nodes have capacity, the job stays PENDING.

Check Available GPUs

Run sinfo -o "%N %G" to see which nodes have which GPU types. Don’t guess — requesting a GPU type that doesn’t exist queues your job forever with no error.

CUDA_VISIBLE_DEVICES Mapping

When Slurm allocates GPUs, it sets CUDA_VISIBLE_DEVICES to control which GPUs each task can see. This is the critical connection between Slurm’s allocation and your CUDA code.

If a node has GPUs 0-7 and Slurm gives your task GPUs 2 and 5, your process sees CUDA_VISIBLE_DEVICES=2,5. PyTorch’s torch.cuda.device(0) maps to physical GPU 2, and torch.cuda.device(1) maps to physical GPU 5.
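
The mapping can be made concrete in a few lines of plain Python. This is a sketch (the helper name is ours; real training code normally works only with logical indices and never needs the physical ID):

```python
import os

def logical_to_physical(logical_idx, visible=None):
    """Map a logical CUDA device index to the physical GPU ID,
    based on the CUDA_VISIBLE_DEVICES environment variable."""
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    ids = [v.strip() for v in visible.split(",") if v.strip()]
    if not ids:
        raise RuntimeError("No GPUs visible; was the job submitted with --gres=gpu:N?")
    return int(ids[logical_idx])

# With CUDA_VISIBLE_DEVICES=2,5 as in the example above:
print(logical_to_physical(0, visible="2,5"))  # → 2 (physical GPU 2)
print(logical_to_physical(1, visible="2,5"))  # → 5 (physical GPU 5)
```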

The Remapping Trap

The most confusing aspect of CUDA_VISIBLE_DEVICES is that it remaps physical GPU IDs to logical device indices. Your code always sees devices starting from 0, regardless of which physical GPUs were allocated.

This remapping means:

  • torch.cuda.device(0) always refers to the first allocated GPU, not physical GPU 0
  • Setting CUDA_VISIBLE_DEVICES manually in your script overrides Slurm’s allocation — don’t do this
  • If Slurm gives you GPUs 2,5,7 on a node, nvidia-smi inside your job shows them as devices 0,1,2 on clusters that enforce cgroup device isolation (nvidia-smi itself ignores CUDA_VISIBLE_DEVICES, so without cgroup constraints it still lists every physical GPU)

GPU Topology and Binding

On multi-GPU nodes, not all GPU pairs communicate at the same speed. GPUs connected via NVLink exchange data at 600 GB/s, while GPUs on separate PCIe switches communicate at 32 GB/s. This 18x difference makes GPU placement critical for distributed training.

The --gpu-bind Flag

Slurm can pin tasks to GPUs with optimal CPU affinity:

#SBATCH --gpu-bind=closest          # Bind each task to GPUs on the same NUMA node
#SBATCH --gpu-bind=map_gpu:0,1,2,3  # Explicit GPU-to-task mapping

closest is the right default for most workloads — it ensures the CPU cores running your code are on the same NUMA node as the GPUs, avoiding cross-socket memory access penalties.
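
One way to sanity-check the binding from inside an allocation (a sketch; assumes taskset and nvidia-smi are available on the compute nodes) is to print each task's CPU affinity alongside its visible GPU:

```shell
# Print, for each task: its rank, the GPU it was bound to, and its CPU affinity.
# Single quotes matter: the variables must expand per task, not in the batch shell.
srun --ntasks-per-node=4 --gpus-per-task=1 --gpu-bind=closest bash -c \
    'echo "task $SLURM_PROCID: GPU=$CUDA_VISIBLE_DEVICES $(taskset -cp $$)"'
```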

Checking Topology

# See GPU interconnect topology
nvidia-smi topo -m

# Output:
#       GPU0  GPU1  GPU2  GPU3
# GPU0   X    NV12  NV12  NV12
# GPU1  NV12   X    NV12  NV12
# GPU2  NV12  NV12   X    NV12
# GPU3  NV12  NV12  NV12   X
#
# NV12 = NVLink 12 lanes (bidirectional)

Multi-Node GPU Training

Two dominant patterns exist for launching distributed training on Slurm. The choice depends on your training framework.

Pattern 1: One Task Per GPU (srun)

Each task is a separate process with exactly one GPU. Slurm handles process placement, playing the launcher role that torch.distributed.launch or torchrun would otherwise fill; the training code reads its rank from the environment.

#!/bin/bash
#SBATCH --job-name=ddp-training
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:a100:4
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=128G
#SBATCH --time=24:00:00
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err

set -euo pipefail

module load cuda/12.1
source activate myenv

export MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n1)
export MASTER_PORT=$((29500 + RANDOM % 100))
export NCCL_SOCKET_IFNAME=ib0

# NOTE: per-task variables like SLURM_PROCID must be read inside train.py
# (e.g. os.environ["SLURM_PROCID"]); expanding them here in the batch shell
# would give every task the same value.
srun python train.py \
    --world_size $SLURM_NTASKS \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT

Each of the 8 tasks (4 per node × 2 nodes) gets one GPU with CUDA_VISIBLE_DEVICES set automatically.
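
Inside train.py, the rank and world size come from the environment Slurm sets per task. A minimal sketch of that lookup (the helper name is ours, and the init_process_group call in the comment is the usual PyTorch follow-up, not something Slurm provides):

```python
import os

def slurm_dist_env(env=None):
    """Derive torch.distributed init parameters from the variables
    Slurm exports to each srun task."""
    if env is None:
        env = os.environ
    return {
        "rank": int(env["SLURM_PROCID"]),        # global task index, 0..NTASKS-1
        "world_size": int(env["SLURM_NTASKS"]),  # total tasks across all nodes
        "local_rank": int(env["SLURM_LOCALID"]), # task index within this node
    }

# Typical use in train.py:
#   cfg = slurm_dist_env()
#   dist.init_process_group("nccl", rank=cfg["rank"], world_size=cfg["world_size"])
#   torch.cuda.set_device(0)  # each task sees exactly one GPU, as device 0
```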

Pattern 2: One Task Per Node (torchrun)

Each node runs one task that internally spawns GPU workers using torchrun. The node task sees all allocated GPUs.

#!/bin/bash
#SBATCH --job-name=fsdp-training
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:a100:4
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --time=48:00:00
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err

set -euo pipefail

module load cuda/12.1
source activate myenv

export MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n1)
export MASTER_PORT=$((29500 + RANDOM % 100))
export NCCL_SOCKET_IFNAME=ib0

# With the c10d rendezvous backend, torchrun assigns node ranks at
# rendezvous time, so no --node_rank flag is needed (passing $SLURM_NODEID
# here would expand in the batch shell and give every node the same rank).
srun torchrun \
    --nproc_per_node=$SLURM_GPUS_ON_NODE \
    --nnodes=$SLURM_NNODES \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    --rdzv_id=$SLURM_JOB_ID \
    train.py

Don't Mix the Patterns

Using --gpus-per-task=1 with torchrun causes each spawned worker to only see one GPU. Using --gpus-per-node with srun direct launch means every task sees all GPUs and competes for them. Match your flag to your launch pattern.


NCCL Configuration

NCCL (NVIDIA Collective Communications Library) handles GPU-to-GPU communication in distributed training. On Slurm clusters, you often need to set environment variables for NCCL to find the right network interface:

# Use InfiniBand if available
export NCCL_IB_DISABLE=0
export NCCL_NET_GDR_LEVEL=5

# Or specify the network interface explicitly
export NCCL_SOCKET_IFNAME=ib0

# Enable debug logging for troubleshooting
export NCCL_DEBUG=INFO

The MASTER_ADDR for distributed training should resolve to the first node in your allocation. Slurm provides this via SLURM_NODELIST:

export MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n1)
export MASTER_PORT=29500
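
The expansion scontrol performs here can be approximated in plain Python when shelling out is inconvenient. This is a simplified sketch (the function is ours and handles a single bracketed group, not the full Slurm nodelist grammar):

```python
import re

def expand_nodelist(nodelist):
    """Simplified expansion of a Slurm nodelist like 'gpu[01-03,07]',
    mimicking what `scontrol show hostname` does."""
    m = re.fullmatch(r"([^\[]+)\[([^\]]+)\]", nodelist)
    if not m:
        return [nodelist]  # single hostname, e.g. 'gpu01'
    prefix, spec = m.groups()
    hosts = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # preserve zero-padding: '01' stays two digits
            hosts.extend(f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1))
        else:
            hosts.append(prefix + part)
    return hosts

# The first node in the allocation becomes MASTER_ADDR:
print(expand_nodelist("gpu[01-03,07]")[0])  # → gpu01
```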

MIG: Multi-Instance GPU

NVIDIA A100 and H100 GPUs support Multi-Instance GPU (MIG), which partitions a single physical GPU into up to 7 independent instances. Each instance has dedicated compute, memory, and cache — complete hardware isolation.

Requesting MIG Profiles

# Request 1 MIG instance with 1 compute slice and 5GB memory
#SBATCH --gres=gpu:a100_1g.5gb:1

# Request 2 MIG instances with 2 compute slices and 10GB each
#SBATCH --gres=gpu:a100_2g.10gb:2

# Available MIG profiles:
#
#   A100-40GB:              A100-80GB / H100:
#   1g.5gb  — 1/7 of GPU    1g.10gb — 1/7 of GPU
#   2g.10gb — 2/7 of GPU    2g.20gb — 2/7 of GPU
#   3g.20gb — 3/7 of GPU    3g.40gb — 3/7 of GPU
#   4g.20gb — 4/7 of GPU    4g.40gb — 4/7 of GPU
#   7g.40gb — Full GPU      7g.80gb — Full GPU

MIG and CUDA_VISIBLE_DEVICES

With MIG, CUDA_VISIBLE_DEVICES uses a different format: MIG-UUID instead of integer IDs. Your CUDA code doesn’t need changes — the driver handles the translation. But nvidia-smi commands need the -i MIG-UUID flag.
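
To see the identifiers that will appear in CUDA_VISIBLE_DEVICES, list the devices on a MIG-enabled node. The output shape below is illustrative; actual UUIDs vary:

```shell
# List GPUs and their MIG instances with UUIDs on the current node
nvidia-smi -L
# GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-...)
#   MIG 1g.5gb Device 0: (UUID: MIG-...)
#   MIG 1g.5gb Device 1: (UUID: MIG-...)
```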

When to Use MIG

  • Inference serving: Pack multiple models on one GPU with guaranteed isolation
  • Interactive development: Give each user a slice instead of a whole GPU
  • Small experiments: Hyperparameter sweeps where each run fits in 5-10GB
  • DO NOT use for: Distributed training that needs GPU-to-GPU communication (MIG instances can’t use NVLink with each other)

GPU Memory and Resource Flags

CPU and Memory Per GPU

Slurm provides flags to tie CPU and memory allocation to GPU count:

#SBATCH --cpus-per-gpu=8   # 8 CPU cores per GPU (for data loading)
#SBATCH --mem-per-gpu=32G  # 32GB CPU memory per GPU

These are often better than --cpus-per-task and --mem because they scale automatically — request 4 GPUs and you get 32 cores and 128GB without calculating.

VRAM Is Not Managed by Slurm

Slurm allocates physical GPUs but does not track GPU memory (VRAM). If your model needs 40GB and you get an A100-40GB, there’s no Slurm error — you get an OOM at runtime.

# Check GPU memory before submitting
sinfo -o "%N %G %m" -p gpu

# Monitor VRAM during job
srun --jobid=$SLURM_JOB_ID nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5
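
A back-of-envelope estimate before submitting can save a wasted job. The sketch below uses the common rough heuristic of ~16 bytes per parameter for mixed-precision Adam (fp16 weights and gradients, fp32 master weights, two fp32 optimizer moments); activations come on top and depend on batch size:

```python
def estimate_training_vram_gb(n_params, bytes_per_param=16):
    """Rough VRAM needed for weights + gradients + optimizer state
    under mixed-precision Adam. A heuristic, not an exact figure:
    2 (fp16 weights) + 2 (fp16 grads) + 4 (fp32 master weights)
    + 8 (fp32 Adam moments) = 16 bytes per parameter."""
    return n_params * bytes_per_param / 1e9

# A 7B-parameter model needs ~112 GB before activations, so it cannot
# train on a single A100-40GB without sharding (FSDP/DeepSpeed).
print(round(estimate_training_vram_gb(7e9)))  # → 112
```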

PyTorch Memory Settings

# Combine multiple settings with commas (separate exports overwrite each other!)
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# expandable_segments:True — reduces fragmentation for large models
# max_split_size_mb:512 — limits allocation block size, leaves room for NCCL buffers

Common Pitfalls

1. Not requesting GPUs at all

Your job lands on a GPU node but CUDA_VISIBLE_DEVICES is empty. PyTorch falls back to CPU. Always include --gres=gpu:N.

2. Mismatching tasks and GPUs

With --ntasks-per-node=8 and --gres=gpu:4, half your tasks have no GPU. Match tasks to available GPUs or use --gpus-per-task=1.

3. Hardcoding GPU count in training code

If your code hardcodes a world size of 4 but Slurm only allocated 2 GPUs, you get invalid device ordinal errors. Derive the count from SLURM_GPUS_ON_NODE, or let the framework auto-detect it via torch.cuda.device_count(), which respects CUDA_VISIBLE_DEVICES.
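
A defensive sketch of that lookup (the helper name is ours; SLURM_GPUS_ON_NODE is set by Slurm inside jobs that requested GPUs):

```python
import os

def allocated_gpu_count(env=None):
    """Number of GPUs allocated to this node's tasks: prefer Slurm's
    own count, fall back to parsing CUDA_VISIBLE_DEVICES."""
    if env is None:
        env = os.environ
    if "SLURM_GPUS_ON_NODE" in env:
        return int(env["SLURM_GPUS_ON_NODE"])
    visible = env.get("CUDA_VISIBLE_DEVICES", "")
    return len([v for v in visible.split(",") if v.strip()])
```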

4. Forgetting MASTER_ADDR for multi-node

Without a master address, distributed training hangs. Extract it from SLURM_NODELIST in your job script.

Debugging GPU Allocation

Verify Your Allocation

# Check what GPUs your job got
scontrol show job $SLURM_JOB_ID | grep -i gres

# See CUDA_VISIBLE_DEVICES inside the job
echo $CUDA_VISIBLE_DEVICES

# Full GPU status
nvidia-smi

Common Error Messages

"CUDA error: no kernel image is available for execution on the device" You compiled for the wrong GPU architecture. Check nvidia-smi for the actual GPU, then set TORCH_CUDA_ARCH_LIST correctly.

"NCCL error: unhandled system error / timeout" Network issue between nodes. Check: (1) NCCL_SOCKET_IFNAME matches your cluster’s network interface, (2) firewall isn’t blocking the master port, (3) MASTER_ADDR resolves correctly.

"RuntimeError: CUDA out of memory" Your model exceeds the GPU’s VRAM. Solutions: reduce batch size, enable gradient checkpointing, use FSDP/DeepSpeed for model sharding, or request a GPU with more memory.

Job stuck in PENDING with reason "ReqNodeNotAvail" The GPU type you requested doesn’t exist or all matching nodes are down. Run sinfo -o "%N %G %T" -p gpu to check available GPU nodes.

CUDA_VISIBLE_DEVICES is empty You forgot --gres=gpu:N. Without it, Slurm doesn’t allocate GPUs even on GPU nodes.

Key Takeaways

  1. Always use --gres — GPUs are not allocated by default. No --gres = no GPUs, even on GPU nodes.

  2. Two launch patterns — gpus-per-task + srun (1 process per GPU) or gpus-per-node + torchrun (1 process per node).

  3. CUDA_VISIBLE_DEVICES is set by Slurm — don’t set it manually in your script. Let Slurm handle the mapping.

  4. Configure NCCL for your network — set NCCL_SOCKET_IFNAME and MASTER_ADDR for multi-node communication to work.
