Slurm Fundamentals: Job Scheduling on HPC Clusters
Complete guide to Slurm — architecture, core commands, job lifecycle, job scripts, array jobs, dependencies, monitoring with squeue/sacct, and troubleshooting failed jobs on HPC clusters.
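A job script like the one the guide covers can be sketched as follows; the job name, resource values, and output pattern are illustrative assumptions, and the script falls back to placeholder values so it also runs under plain bash outside an allocation:

```shell
#!/bin/bash
# Minimal Slurm batch script sketch (resource values are illustrative).
#SBATCH --job-name=demo
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=%x-%j.out   # %x = job name, %j = job ID

# Inside an allocation Slurm exports these variables; outside,
# fall back so the script can be dry-run with plain bash.
echo "job=${SLURM_JOB_ID:-none} nodes=${SLURM_JOB_NODELIST:-none}"
echo "cpus=${SLURM_CPUS_PER_TASK:-unset}"
```

Submit with `sbatch script.sh`; the `#SBATCH` lines are comments to bash but directives to Slurm.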
Complete guide to GPU allocation on Slurm — --gres flags, CUDA_VISIBLE_DEVICES remapping, GPU topology and NVLink binding, MIG partitioning, production job scripts, and debugging common GPU errors.
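A hedged sketch of the GPU request described above; the GPU count is an illustrative assumption (typed requests such as `--gres=gpu:a100:2` depend on the cluster's gres configuration). Note the remapping behavior the guide mentions: inside the job, `CUDA_VISIBLE_DEVICES` is renumbered starting at 0 regardless of which physical GPUs were granted:

```shell
#!/bin/bash
# GPU job script sketch (GPU count is an illustrative assumption).
#SBATCH --gres=gpu:2        # request 2 GPUs on one node
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00

# Slurm remaps allocated devices: within the job this prints
# logical IDs starting at 0, not the physical GPU indices.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<not set>}"
```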
How Slurm decides which jobs run first — priority factors, fair-share scheduling, backfill, and monitoring commands (squeue, sinfo, sacct).
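The monitoring commands named above can be combined into a quick status check; the format fields are standard Slurm options, and the sketch is guarded with `command -v` so it also runs on machines without Slurm installed (`squeue --me` needs a reasonably recent Slurm; older versions use `-u $USER`):

```shell
#!/bin/bash
# Quick cluster/job status sketch, guarded for non-Slurm machines.
if command -v squeue >/dev/null 2>&1; then
    # my queued/running jobs: id, partition, name, state, time, nodes, reason
    squeue --me --format="%.10i %.9P %.20j %.2t %.10M %.6D %R"
    # partition and node availability at a glance
    sinfo --summarize
    # accounting for today's jobs
    sacct --starttime=today --format=JobID,JobName,State,Elapsed,MaxRSS
else
    echo "Slurm commands not found on this machine"
fi
```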
Complete MPI guide — point-to-point and collective communication with real C and mpi4py code, deadlock simulation, performance benchmarking, communicator splitting, and debugging on HPC clusters.
OpenMP parallel programming: fork-join model, scheduling, data races, false sharing, NUMA thread affinity, and GPU offloading.
How Slurm tracks resource consumption through account hierarchies, TRES billing, and resource limits — sacctmgr, sreport, and the association model explained.
Mastering HPC performance — Amdahl's Law, Gustafson's Law, strong vs weak scaling, roofline model, communication-computation overlap, load balancing, and profiling with Nsight and VTune.
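The two scaling laws named above can be stated compactly; writing p for the parallel fraction of the work and N for the number of workers:

```latex
% Amdahl's law (fixed problem size, strong scaling):
% speedup is capped by the serial fraction (1 - p).
S_{\text{Amdahl}}(N) = \frac{1}{(1 - p) + p/N},
\qquad
\lim_{N \to \infty} S_{\text{Amdahl}}(N) = \frac{1}{1 - p}

% Gustafson's law (problem size grows with N, weak scaling):
% scaled speedup grows nearly linearly in N.
S_{\text{Gustafson}}(N) = (1 - p) + pN
```

Amdahl's limit explains why a 95%-parallel code can never exceed a 20x speedup at fixed problem size, while Gustafson's view shows near-linear scaling when the workload grows with the machine.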
How sched/backfill works — the algorithm that lets small jobs run in gaps while large jobs wait, why accurate time limits matter, and the key tuning parameters (bf_interval, bf_window, bf_max_job_test).
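The three tuning knobs named above live in slurm.conf; the values below are illustrative assumptions, not recommendations (defaults vary by Slurm version):

```
# slurm.conf fragment (values are illustrative)
SchedulerType=sched/backfill
SchedulerParameters=bf_interval=30,bf_window=1440,bf_max_job_test=500
# bf_interval     - seconds between backfill scheduling passes
# bf_window       - how far ahead (minutes) backfill plans; should cover
#                   the longest time limit allowed on the cluster
# bf_max_job_test - max number of queued jobs each pass will consider
```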