The Problem Backfill Solves
Strict priority scheduling has a fatal flaw. If the highest-priority job needs 64 nodes and only 4 are free, those 4 nodes sit completely idle until 60 more become available. On a busy cluster, this can mean hours of wasted capacity.
Backfill scheduling fixes this by asking a simple question: can a lower-priority job run and finish before the highest-priority job’s resources become available? If yes, start the small job now. Utilization goes up without delaying the important work.
This is why sched/backfill is Slurm’s default SchedulerType and the norm on production clusters. The alternative, sched/builtin, starts jobs in strict priority order (plain FIFO when all priorities are equal) and wastes enormous capacity on any cluster with heterogeneous workloads.
How the Algorithm Works
The backfill scheduler runs periodically (every bf_interval seconds) and follows this sequence:
- Build a timeline of when each running job will complete, based on its --time limit
- Find the earliest start for the top-priority pending job: the first moment when enough nodes are simultaneously free
- Reserve those nodes so no backfill job can delay the priority job
- Iterate through lower-priority jobs and check whether each fits in the remaining gaps (enough free nodes for long enough) without running past the priority job’s reserved start
A job that fits gets started immediately. A job that doesn’t fit stays PENDING and waits for the next backfill cycle.
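The sequence above can be sketched in a few lines of Python. This is a simplified illustration, not Slurm's implementation: it assumes interchangeable nodes, a single blocked top-priority job, and times in minutes from "now". The `Job` class, `backfill_pass` function, and the sample jobs are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    limit: int  # requested time limit, in minutes


def backfill_pass(total_nodes, running, pending):
    """Run one simplified backfill pass at time 0.

    running: list of (Job, minutes_left) pairs, where minutes_left is
             measured against the time limit, not the real runtime.
    pending: list of Job, highest priority first.
    Returns (reserved_start, names_of_jobs_started_now).
    """
    top, lower = pending[0], pending[1:]
    if top.nodes > total_nodes:
        raise ValueError("top job can never fit on this cluster")

    free = total_nodes - sum(job.nodes for job, _ in running)

    # Walk the completion timeline until the top job fits.
    available, reserved_start = free, 0
    for job, minutes_left in sorted(running, key=lambda pair: pair[1]):
        if available >= top.nodes:
            break
        available += job.nodes
        reserved_start = minutes_left

    # Start lower-priority jobs that fit in the currently free nodes
    # and would hit their time limit no later than reserved_start.
    started = []
    for job in lower:
        if job.nodes <= free and job.limit <= reserved_start:
            started.append(job.name)
            free -= job.nodes
    return reserved_start, started


# 64-node cluster, 4 nodes free, top job needs all 64 (as in the intro).
running = [(Job("A", 40, 600), 120), (Job("B", 20, 600), 300)]
pending = [
    Job("big", 64, 720),
    Job("small", 2, 180),   # fits: 2 nodes, done before minute 300
    Job("long", 2, 600),    # too long: would overlap the reservation
    Job("wide", 4, 60),     # too wide once "small" has taken 2 nodes
]
reserved_start, started = backfill_pass(64, running, pending)
print(reserved_start, started)  # 300 ['small']
```

If the top job can start right away, this sketch returns a reserved start of 0 and backfills nothing; a real scheduler would start it and continue down the queue.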
Why Time Limits Are Critical
The backfill scheduler uses your --time limit as the contract for when your job will finish. It doesn’t know your actual runtime — it only knows what you requested.
If your job actually runs for 2 hours but you requested --time=7-00:00:00, the scheduler treats it as a 7-day job. Those 7 days of node-hours are blocked from backfill consideration, even though the job will finish in 2 hours.
# Bad: blocks 168 hours of backfill scheduling per node
#SBATCH --time=7-00:00:00
# Good: frees the other 165.5 hours for backfill
#SBATCH --time=02:30:00
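To make the arithmetic concrete, here is a small sketch that converts --time strings to hours and compares the two requests. The helper name `slurm_time_to_minutes` is hypothetical; it assumes the time formats sbatch accepts (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, days-hours:minutes:seconds).

```python
def slurm_time_to_minutes(value):
    """Convert a --time string to minutes.

    Handles M, M:S, H:M:S, D-H, D-H:M and D-H:M:S.
    """
    days = 0
    if "-" in value:
        # With a day prefix, the remaining fields start at hours.
        day_part, value = value.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in value.split(":")] + [0, 0]
        hours, minutes, seconds = fields[:3]
    else:
        fields = [int(f) for f in value.split(":")]
        if len(fields) == 1:
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:
            hours = 0
            minutes, seconds = fields
        else:
            hours, minutes, seconds = fields
    return days * 1440 + hours * 60 + minutes + seconds / 60


requested_h = slurm_time_to_minutes("7-00:00:00") / 60
right_sized_h = slurm_time_to_minutes("02:30:00") / 60
print(requested_h, right_sized_h, requested_h - right_sized_h)  # 168.0 2.5 165.5
```

The difference, 165.5 hours per node, is time the scheduler must treat as occupied even though the job will be long gone.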
Overestimation Hurts Everyone
A 2-hour job with a 7-day time limit doesn’t just hurt your queue position — it blocks the backfill scheduler from filling gaps with other users’ jobs. Multiply this by hundreds of jobs and cluster utilization drops significantly.
Tuning Parameters
These parameters live in slurm.conf and control how aggressively the backfill scheduler operates:
bf_interval
How often the backfill scheduler runs, in seconds. Default: 30.
SchedulerParameters=bf_interval=30
Lower values improve responsiveness but increase controller CPU load. On clusters with 10,000+ pending jobs, 60–120 seconds may be necessary.
bf_window
How far into the future (in minutes) the scheduler looks for gaps. Default: 1440 (1 day).
SchedulerParameters=bf_window=2880 # 2 days
A larger window finds more backfill opportunities but increases computation time. If your jobs typically run for hours, a 1–2 day window is sufficient. If jobs run for days, extend it.
bf_max_job_test
Maximum number of pending jobs evaluated per backfill cycle. Default: 500 in current Slurm releases (older releases used 100).
SchedulerParameters=bf_max_job_test=1000
On clusters with thousands of pending jobs, only the top bf_max_job_test jobs (by priority) are considered for backfill each cycle. Increasing this finds more opportunities but makes each cycle slower. The key tradeoff: thoroughness vs scheduler overhead.
bf_resolution
Time granularity for scheduling decisions, in seconds. Default: 60.
SchedulerParameters=bf_resolution=300 # 5-minute slots
Coarser resolution (larger values) makes the scheduler faster but may miss tight gaps. For GPU clusters where jobs run for hours, 300–600 seconds works well.
Combining Parameters
# Typical production configuration
SchedulerType=sched/backfill
SchedulerParameters=bf_interval=30,bf_window=2880,bf_max_job_test=1000,bf_resolution=300
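Because the parameters use different units (bf_window in minutes, bf_interval and bf_resolution in seconds), it can help to generate the line from human-friendly inputs. The sketch below is one way to do that; `backfill_parameters` is a hypothetical helper, and the "slots" figure is only a rough proxy for how much timeline the scheduler has to track, not a number Slurm reports.

```python
def backfill_parameters(interval_s, window_days, max_job_test, resolution_s):
    """Return a SchedulerParameters line and a rough time-slot count."""
    window_min = window_days * 24 * 60       # bf_window is given in minutes
    slots = window_min * 60 // resolution_s  # bf_resolution is in seconds
    options = {
        "bf_interval": interval_s,
        "bf_window": window_min,
        "bf_max_job_test": max_job_test,
        "bf_resolution": resolution_s,
    }
    line = "SchedulerParameters=" + ",".join(
        f"{key}={value}" for key, value in options.items()
    )
    return line, slots


line, slots = backfill_parameters(
    interval_s=30, window_days=2, max_job_test=1000, resolution_s=300
)
print(line)
print(slots)  # 576
```

With a 2-day window, a 300-second resolution gives 576 slots, while the default 60-second resolution would give 2880. This is why the Slurm documentation suggests raising bf_resolution whenever bf_window is increased.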
Backfill vs FIFO
| Aspect | sched/builtin (FIFO) | sched/backfill |
|---|---|---|
| Scheduling | Strict priority order | Priority + gap filling |
| Utilization | Low (nodes idle waiting) | High (gaps filled) |
| Fairness | Simple — first come, first served | Complex — small jobs jump ahead |
| Overhead | Minimal | Moderate (timeline computation) |
| Time limits | Irrelevant to scheduling | Critical for gap identification |
| Best for | Homogeneous, few-job clusters | Production clusters with mixed workloads |
Almost Everyone Uses Backfill
In practice, sched/backfill is the standard for any cluster with more than a
handful of users. The utilization gains far outweigh the slight scheduling
overhead. FIFO is mainly used for testing or single-user setups.
Monitoring Backfill Behavior
Check the scheduler’s activity
# See backfill scheduling statistics
sdiag
# Key metrics in the "Backfilling stats" section:
# - Total cycles
# - Last cycle (microseconds)
# - Mean cycle (microseconds)
# - Total backfilled jobs (since last slurm start)
Check estimated start times
# See when the scheduler expects your job to start
squeue --start -u $USER
# JOBID PARTITION NAME USER ST START_TIME
# 12345 gpu train-llm abhik PD 2026-03-16T08:00:00
If START_TIME shows N/A, the scheduler has no estimate for the job yet. Common causes: the job sits beyond bf_max_job_test in the queue and has not been evaluated, it requests resources that don’t exist, or the backfill window is too short to reach its expected start.
Identify backfilled jobs
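# Reason "None" in squeue only means a job is no longer waiting; it does not show which scheduler started it.
# With accounting enabled, recent Slurm releases flag backfilled jobs as SchedBackfill:
sacct -X -u $USER --format=JobID,JobName,Start,Flags | grep SchedBackfill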
Common Pitfalls
1. Leaving SchedulerType at sched/builtin
If slurm.conf sets SchedulerType=sched/builtin, no backfill happens. Nodes sit idle while large jobs wait. This is the single biggest utilization killer on misconfigured clusters.
2. Wildly Overestimating Time Limits
Users request the partition maximum “just to be safe.” This hides most of those jobs’ reserved node-hours from the backfill scheduler, even though the nodes will free up far sooner. Encourage accurate time estimates with sacct data from past jobs.
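One way to produce those estimates is to compare Elapsed against Timelimit for finished jobs. The sketch below parses pipe-separated text in the shape produced by `sacct -X -n -P -o JobID,Elapsed,Timelimit`; the sample rows are made up, and `to_seconds` and `limit_usage` are hypothetical helper names.

```python
import re

# Matches [D-]HH:MM:SS as printed by sacct for Elapsed and Timelimit.
TIME_RE = re.compile(r"^(?:(\d+)-)?(\d+):(\d+):(\d+)$")


def to_seconds(text):
    """Return seconds for a [D-]HH:MM:SS string, or None if it is not a time."""
    match = TIME_RE.match(text)
    if match is None:
        return None  # e.g. UNLIMITED or Partition_Limit
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def limit_usage(sacct_text):
    """Map each job ID to the fraction of its time limit it actually used."""
    usage = {}
    for line in sacct_text.splitlines():
        job_id, elapsed, limit = line.split("|")
        elapsed_s, limit_s = to_seconds(elapsed), to_seconds(limit)
        if elapsed_s is None or not limit_s:
            continue  # skip rows without a usable numeric limit
        usage[job_id] = elapsed_s / limit_s
    return usage


# Hypothetical sacct output: JobID|Elapsed|Timelimit
SAMPLE = """\
1001|02:03:11|7-00:00:00
1002|00:55:00|01:00:00
1003|00:10:00|UNLIMITED
"""

usage = limit_usage(SAMPLE)
for job_id, fraction in usage.items():
    print(job_id, f"{fraction:.1%}")
```

Jobs that use only a few percent of their limit, like job 1001 here, are the ones worth following up on.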
3. Ignoring sdiag
If the backfill Last cycle time in sdiag (reported in microseconds) approaches or exceeds bf_interval (set in seconds), the scheduler can’t keep up. Reduce bf_max_job_test or increase bf_resolution to speed up cycles.
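The unit mismatch is easy to get wrong by hand, so a small check can help. The sketch below parses sdiag-style text and converts the backfill cycle times to seconds. The sample text is hypothetical and trimmed; the exact layout of real sdiag output may vary between Slurm versions, and `backfill_cycle_seconds` is a name chosen for this example.

```python
import re

# Hypothetical, trimmed sdiag output. Both sections contain "Last cycle",
# so only the text after "Backfilling stats" is searched.
SAMPLE_SDIAG = """\
Main schedule statistics (microseconds):
    Last cycle:   1800
    Mean cycle:   2100
Backfilling stats
    Total backfilled jobs (since last slurm start): 1520
    Total cycles: 840
    Last cycle when: Mon Mar 16 08:00:00 2026
    Last cycle: 41250000
    Mean cycle: 12400000
"""


def backfill_cycle_seconds(sdiag_text):
    """Return backfill 'Last cycle' and 'Mean cycle' converted to seconds."""
    _, _, backfill = sdiag_text.partition("Backfilling stats")
    cycles = {}
    for key in ("Last cycle", "Mean cycle"):
        match = re.search(rf"^\s*{key}:\s*(\d+)\s*$", backfill, re.MULTILINE)
        if match:
            cycles[key] = int(match.group(1)) / 1_000_000  # microseconds -> s
    return cycles


bf_interval = 30  # seconds, as configured in SchedulerParameters
cycles = backfill_cycle_seconds(SAMPLE_SDIAG)
too_slow = cycles["Last cycle"] > bf_interval
print(cycles, too_slow)
```

In this sample the last cycle took 41.25 seconds against a 30-second interval, so the check flags the scheduler as falling behind.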
4. Setting bf_window Too Small
A 4-hour backfill window on a cluster where jobs run for 24+ hours means the scheduler barely looks ahead. Most gaps beyond 4 hours are invisible.
Key Takeaways
- Backfill fills idle gaps: small jobs run in spaces between running jobs without delaying the highest-priority pending job.
- Accurate time limits are essential: the scheduler uses your --time value to compute gaps. Overestimation blocks backfill.
- Four key parameters: bf_interval (frequency), bf_window (lookahead), bf_max_job_test (breadth), bf_resolution (granularity).
- Monitor with sdiag: track backfill cycle time and job count to ensure the scheduler can keep up with demand.
Related Concepts
- Slurm Fundamentals: Core commands, job lifecycle, and cluster architecture
- Slurm Resource Management: Priority formula and fair-share that determine which jobs backfill evaluates first
- Slurm Accounting: Resource limits (GrpTRES) that constrain what backfill can schedule
- HPC Performance Optimization: Utilization analysis and scaling strategies
Further Reading
- Slurm Scheduling Configuration Guide - Official documentation on SchedulerType and SchedulerParameters
- sdiag Man Page - Scheduler diagnostics including backfill cycle statistics
- Slurm Backfill Plugin - SchedMD's reference on configuring SchedulerType and backfill parameters
