Slurm Backfill Scheduling: How Small Jobs Fill the Gaps

How sched/backfill works — the algorithm that lets small jobs run in gaps while large jobs wait, why accurate time limits matter, and the key tuning parameters (bf_interval, bf_window, bf_max_job_test).

The Problem Backfill Solves

Strict priority scheduling has a fatal flaw. If the highest-priority job needs 64 nodes and only 4 are free, those 4 nodes sit completely idle until 60 more become available. On a busy cluster, this can mean hours of wasted capacity.

Backfill scheduling fixes this by asking a simple question: can a lower-priority job start now without delaying the expected start time of any higher-priority job? If yes, start the small job now. Utilization goes up without delaying the important work.

This is why sched/backfill is Slurm's default SchedulerType. The alternative, sched/builtin, starts jobs in strict priority order (FIFO when priorities are equal) and leaves nodes idle on any cluster with heterogeneous workloads.

How the Algorithm Works

The backfill scheduler runs periodically (every bf_interval seconds) and follows this sequence:

1. Build a timeline of when each running job will complete, based on its --time limit
2. Find the earliest start for the top-priority pending job: the first moment when enough nodes are simultaneously free
3. Reserve those slots so no backfill job can delay the priority job
4. Iterate through lower-priority jobs and check if each can fit in the remaining gaps (enough free nodes for long enough) without extending past the priority job's reserved start

A job that fits gets started immediately. A job that doesn’t fit stays PENDING and waits for the next backfill cycle.
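The sequence above can be sketched as a toy simulation. This is a deliberately simplified model, not Slurm's implementation: it tracks only node counts, protects a single top-priority job, and only backfills jobs that finish before the reserved start. The class and function names (Running, Pending, earliest_start, backfill) are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Running:
    nodes: int
    end: int  # minutes from now, derived from the job's --time limit


@dataclass
class Pending:
    name: str
    nodes: int
    limit: int  # requested time limit in minutes


def earliest_start(total_nodes, running, needed):
    """Return the first time (minutes from now) at which `needed` nodes are free."""
    free = total_nodes - sum(r.nodes for r in running)
    if free >= needed:
        return 0
    # Walk the timeline: each job that ends releases its nodes.
    for r in sorted(running, key=lambda r: r.end):
        free += r.nodes
        if free >= needed:
            return r.end
    raise ValueError("job requests more nodes than the cluster has")


def backfill(total_nodes, running, pending):
    """Pick lower-priority jobs that can start now without delaying pending[0]."""
    top, rest = pending[0], pending[1:]
    reserved_start = earliest_start(total_nodes, running, top.nodes)
    if reserved_start == 0:
        # The top job fits right now, so it simply starts.
        return 0, [top.name]
    free_now = total_nodes - sum(r.nodes for r in running)
    started = []
    for job in rest:  # already sorted by priority
        fits_nodes = job.nodes <= free_now
        ends_in_time = job.limit <= reserved_start
        if fits_nodes and ends_in_time:
            started.append(job.name)
            free_now -= job.nodes
    return reserved_start, started


# 64-node cluster with 4 nodes free, as in the example earlier in the article.
running = [Running(nodes=40, end=120), Running(nodes=20, end=300)]
pending = [
    Pending("big", nodes=64, limit=600),
    Pending("small-a", nodes=2, limit=240),
    Pending("small-b", nodes=2, limit=360),  # would run past the reservation
    Pending("small-c", nodes=2, limit=60),
]
reserved_start, started = backfill(64, running, pending)
print(reserved_start, started)
```

Here the 64-node job is reserved at minute 300, and the two small jobs whose limits end before that point start immediately. small-b is skipped only because its requested limit is too long, which is the effect the next section describes.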

Why Time Limits Are Critical

The backfill scheduler uses your --time limit as the contract for when your job will finish. It doesn’t know your actual runtime — it only knows what you requested.

If your job actually runs for 2 hours but you requested --time=7-00:00:00, the scheduler treats it as a 7-day job. Those 7 days of node-hours are blocked from backfill consideration, even though the job will finish in 2 hours.

# Bad: blocks 168 hours of backfill scheduling per node
#SBATCH --time=7-00:00:00

# Good: leaves the remaining 165.5 hours open for backfill
#SBATCH --time=02:30:00

Overestimation Hurts Everyone

A 2-hour job with a 7-day time limit doesn’t just hurt your queue position — it blocks the backfill scheduler from filling gaps with other users’ jobs. Multiply this by hundreds of jobs and cluster utilization drops significantly.
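One way to pick a tighter limit is to pad the longest runtime you have actually observed and round it up. The sketch below is a hypothetical helper, not a Slurm tool; the 25% margin and 15-minute rounding are arbitrary assumptions you would tune for your own jobs.

```python
import math


def slurm_time(seconds):
    """Format seconds as a Slurm time string: D-HH:MM:SS, or HH:MM:SS under a day."""
    days, rem = divmod(int(seconds), 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    hms = f"{hours:02d}:{minutes:02d}:{secs:02d}"
    return f"{days}-{hms}" if days else hms


def suggest_limit_seconds(observed_seconds, margin=0.25, round_to=900):
    """Longest observed runtime plus a safety margin, rounded up to `round_to` seconds."""
    padded = max(observed_seconds) * (1 + margin)
    return math.ceil(padded / round_to) * round_to


# Three past runs of roughly two hours each (made-up numbers).
past_runs = [6900, 7200, 7050]
limit = suggest_limit_seconds(past_runs)
print("#SBATCH --time=" + slurm_time(limit))

# Hours per node that a 7-day request would have hidden from backfill.
freed_hours = (7 * 86400 - limit) / 3600
print(freed_hours)
```

Past runtimes can be read from accounting, for example with sacct and its Elapsed field, if job accounting is enabled on your cluster.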

Tuning Parameters

These parameters live in slurm.conf and control how aggressively the backfill scheduler operates:

bf_interval

How often the backfill scheduler runs, in seconds. Default: 30.

SchedulerParameters=bf_interval=30

Lower values improve responsiveness but increase controller CPU load. On clusters with 10,000+ pending jobs, 60–120 seconds may be necessary.

bf_window

How far into the future (in minutes) the scheduler looks for gaps. Default: 1440 (1 day).

SchedulerParameters=bf_window=2880  # 2 days

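A larger window finds more backfill opportunities but increases computation time. If your jobs typically run for hours, a 1–2 day window is sufficient. If jobs run for days, extend it to at least the longest time limit you allow, and consider raising bf_resolution alongside it to keep the timeline manageable.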

bf_max_job_test

Maximum number of pending jobs evaluated per backfill cycle. Default: 500 in current Slurm releases (check the slurm.conf man page for your version).

SchedulerParameters=bf_max_job_test=1000

On clusters with thousands of pending jobs, only the top bf_max_job_test jobs (by priority) are considered for backfill each cycle. Increasing this finds more opportunities but makes each cycle slower. The key tradeoff: thoroughness vs scheduler overhead.

bf_resolution

Time granularity for scheduling decisions, in seconds. Default: 60.

SchedulerParameters=bf_resolution=300  # 5-minute slots

Coarser resolution (larger values) makes the scheduler faster but may miss tight gaps. For GPU clusters where jobs run for hours, 300–600 seconds works well.
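A rough way to see how bf_window and bf_resolution interact is to count the time slots the planning window is divided into. Note that bf_window is given in minutes in slurm.conf while bf_resolution is in seconds. The slot count is only a proxy for scheduler work, not Slurm's internal cost model, and timeline_slots is a hypothetical helper.

```python
def timeline_slots(bf_window_minutes, bf_resolution_seconds):
    """Number of bf_resolution-sized slots that fit in the bf_window."""
    return (bf_window_minutes * 60) // bf_resolution_seconds


defaults = timeline_slots(1440, 60)        # 1 day at 60 s
wide_fine = timeline_slots(2880, 60)       # 2 days at 60 s
wide_coarse = timeline_slots(2880, 300)    # 2 days at 5 min
print(defaults, wide_fine, wide_coarse)
```

Doubling the window at the default resolution doubles the slot count, while moving to 5-minute slots brings a 2-day window below the default's count.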

Combining Parameters

# Typical production configuration
SchedulerType=sched/backfill
SchedulerParameters=bf_interval=30,bf_window=2880,bf_max_job_test=1000,bf_resolution=300
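When reviewing a configuration like this, it can help to split the comma-separated line into individual settings before comparing clusters or checking units. The parser below is a minimal sketch with a hypothetical name; it handles key=value pairs and bare flags such as bf_continue, and does not validate that option names are real.

```python
def parse_scheduler_parameters(line):
    """Parse 'SchedulerParameters=a=1,b=2,flag' into a dict; bare flags map to True."""
    line = line.split("#", 1)[0].strip()  # drop a trailing comment
    key, _, value = line.partition("=")
    if key != "SchedulerParameters":
        raise ValueError(f"not a SchedulerParameters line: {line!r}")
    params = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, raw = item.partition("=")
        if not sep:
            params[name] = True          # bare flag
        elif raw.isdigit():
            params[name] = int(raw)      # numeric option
        else:
            params[name] = raw           # anything else stays a string
    return params


params = parse_scheduler_parameters(
    "SchedulerParameters=bf_interval=30,bf_window=2880,bf_continue  # example"
)
print(params)
print("window in days:", params["bf_window"] / 1440)
```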

Backfill vs FIFO

Aspect	sched/builtin (FIFO)	sched/backfill
Scheduling	Strict priority order	Priority + gap filling
Utilization	Low (nodes idle waiting)	High (gaps filled)
Fairness	Simple — first come, first served	Complex — small jobs jump ahead
Overhead	Minimal	Moderate (timeline computation)
Time limits	Irrelevant to scheduling	Critical for gap identification
Best for	Homogeneous, few-job clusters	Production clusters with mixed workloads

Almost Everyone Uses Backfill

In practice, sched/backfill is the standard for any cluster with more than a handful of users. The utilization gains far outweigh the slight scheduling overhead. FIFO is mainly used for testing or single-user setups.

Monitoring Backfill Behavior

Check the scheduler’s activity

# See backfill scheduling statistics
sdiag

# Key metrics to watch (under "Backfilling stats"):
# - Total cycles
# - Last cycle (microseconds)
# - Mean cycle (microseconds)
# - Total backfilled jobs (since last slurm start)
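To track these numbers over time, the backfill section of the sdiag output can be pulled into a dictionary. The sketch below parses a text sample rather than calling sdiag; the sample values are invented, the field labels are assumed from typical sdiag output and may differ between Slurm versions, and backfill_stats is a hypothetical helper.

```python
def backfill_stats(sdiag_text):
    """Collect 'label: value' pairs from the 'Backfilling stats' section."""
    stats, in_section = {}, False
    for line in sdiag_text.splitlines():
        if line.startswith("Backfilling stats"):
            in_section = True
            continue
        if in_section:
            if not line.strip():
                break  # a blank line ends the section
            key, sep, value = line.strip().partition(":")
            if sep:
                value = value.strip()
                stats[key] = int(value) if value.isdigit() else value
    return stats


# Invented sample, trimmed to a few fields for illustration.
SAMPLE = """\
Main schedule statistics (microseconds):
    Last cycle:   1200

Backfilling stats
    Total backfilled jobs (since last slurm start): 1532
    Total cycles: 480
    Last cycle: 2450000
    Mean cycle: 1980000

Remote Procedure Call statistics by message type
"""

stats = backfill_stats(SAMPLE)
last_cycle_seconds = stats["Last cycle"] / 1e6  # sdiag reports microseconds
print(stats["Total cycles"], last_cycle_seconds)
```

If the last or mean cycle time approaches bf_interval, the scheduler is spending most of its time in backfill, which is a signal to revisit the tuning parameters above.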

Check estimated start times

# See when the scheduler expects your job to start
squeue --start -u $USER
# JOBID  PARTITION  NAME       USER   ST  START_TIME
# 12345  gpu        train-llm  abhik  PD  2026-03-16T08:00:00

If START_TIME shows N/A, the backfill scheduler has not computed a start time for the job. Common causes: the job requests resources that do not exist, it is held or waiting on a dependency, it sits beyond the bf_max_job_test cutoff, or its expected start falls outside bf_window.

Identify backfilled jobs

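# squeue's reason field does not show how a job was started; "None" only means nothing is blocking it.
# On Slurm versions whose sacct supports the Flags field, backfilled jobs carry SchedBackfill.
sacct -X -S today -o JobID,JobName%20,Partition,State,Flags%30 | grep SchedBackfill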