Slurm Fundamentals: Job Scheduling on HPC Clusters
Complete guide to Slurm — architecture, core commands, job lifecycle, job scripts, array jobs, dependencies, monitoring with squeue/sacct, and troubleshooting failed jobs on HPC clusters.
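As a taste of the job-script material, a minimal batch script might look like the sketch below (the job name, time limit, and resource requests are illustrative placeholders, not recommendations):

```shell
#!/bin/bash
#SBATCH --job-name=demo          # illustrative job name
#SBATCH --time=00:10:00          # an accurate time limit helps the backfill scheduler
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G

# The body is an ordinary shell script; Slurm reads the #SBATCH
# comment directives when the file is submitted with sbatch.
echo "Running on $(hostname)"
```

Submitted with `sbatch script.sh`; the queued job can then be watched with `squeue -u $USER` and, after completion, inspected with `sacct`.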
Complete guide to GPU allocation on Slurm — --gres flags, CUDA_VISIBLE_DEVICES remapping, GPU topology and NVLink binding, MIG partitioning, production job scripts, and debugging common GPU errors.
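To illustrate the `--gres` request mentioned above, here is a hedged sketch of a GPU batch script (the GPU count and script body are placeholders; what is available depends on the cluster's configuration):

```shell
#!/bin/bash
#SBATCH --job-name=gpu-demo      # illustrative job name
#SBATCH --gres=gpu:2             # request 2 GPUs on one node via the generic-resource syntax
#SBATCH --time=00:30:00

# Inside the allocation, Slurm restricts GPU visibility to the granted
# devices, typically by setting CUDA_VISIBLE_DEVICES (remapped to start at 0).
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
```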
How Slurm decides which jobs run first — priority factors, fair-share scheduling, backfill, and monitoring commands (squeue, sinfo, sacct).
How Slurm tracks resource consumption through account hierarchies, TRES billing, and resource limits — sacctmgr, sreport, and the association model explained.
How sched/backfill works — the algorithm that lets small jobs run in gaps while large jobs wait, why accurate time limits matter, and the key tuning parameters (bf_interval, bf_window, bf_max_job_test).
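The tuning parameters named above live in `slurm.conf`. A hypothetical excerpt (the values shown are examples only, not recommendations for any particular cluster):

```
# slurm.conf excerpt — enable the backfill scheduler and tune it (example values)
SchedulerType=sched/backfill
SchedulerParameters=bf_interval=30,bf_window=2880,bf_max_job_test=500
```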