
Slurm HPC

SLURM is an open-source workload management and job scheduling system. The Turing HPC machine in the department contains 64 cores (4 x 16) and 192 GB of RAM.

The topology of the cluster:

  • The server slurm(.ceng.metu.edu.tr) is for user management and for submitting jobs to turing(.ceng.metu.edu.tr). Since the only purpose of the slurm machine is submitting jobs to the Turing HPC, it is installed with a minimal set of compilers.
  • Users should edit and use the attached sample scripts to submit a job (a minimal sketch is shown after this list). Otherwise, the slurm machine may not respond due to its restricted hardware capacity.
  • Turing HPC contains 2 queues (called partitions in SLURM terminology), halley and supernova. The machine is divided into identical nodes, and every node has 1 core and 3 GB of RAM.
  • The halley partition contains 40 nodes and is limited to a maximum running time of 12 hours, while the supernova partition contains 20 nodes and is limited to a maximum running time of 24 hours.
  • Slurm scheduling is FIFO; that is, a submitted job waits in its partition until the previously submitted jobs have completed.
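The following is a minimal sketch of a batch script for this cluster, assuming a hypothetical single-core executable named ./my_program; the attached slurmscript1 and slurmscript2 below are the templates you should actually edit and submit.

#!/bin/bash
#SBATCH --job-name=slurm_test        # name shown in squeue
#SBATCH --partition=halley           # halley (12-hour limit) or supernova (24-hour limit)
#SBATCH --ntasks=1                   # number of tasks; each node has 1 core and 3 GB RAM
#SBATCH --time=01:00:00              # requested wall time, must fit the partition limit
#SBATCH --output=slurm_test.%j.out   # output file, %j expands to the job id

./my_program                         # hypothetical executable; replace with your own command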
1) sbatch : to submit a job
user@slurm:~$ sbatch slurmscript.sh 
Submitted batch job 2

Reference: sbatch documentation
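Options such as the partition and time limit can also be passed on the command line instead of inside the script; the values below are illustrative and use the standard --partition and --time options:

user@slurm:~$ sbatch --partition=supernova --time=24:00:00 slurmscript.sh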

2) squeue : to see the status of a submitted job
user@slurm:~$ squeue 
JOBID    PARTITION    NAME        USER     ST      TIME     NODES    NODELIST(REASON)
2        halley       slurm_test  user     R       0:00     20       node-[1-20]

Status Codes (ST):

CA CANCELLED Job was explicitly cancelled by the user or system administrator.
CD COMPLETED Job has terminated all processes on all nodes.
CF CONFIGURING Job has been allocated resources, but is waiting for them to become ready.
CG COMPLETING Job is in the process of completing.
F FAILED Job terminated with non-zero exit code or other failure condition.
NF NODE_FAIL Job terminated due to failure of one or more allocated nodes.
PD PENDING Job is awaiting resource allocation.
PR PREEMPTED Job terminated due to preemption.
R RUNNING Job currently has an allocation.
S SUSPENDED Job has an allocation, but execution has been suspended.
TO TIMEOUT Job terminated upon reaching its time limit.

Reference: squeue documentation
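squeue can also be restricted to your own jobs or to a specific job with the standard -u and -j options:

user@slurm:~$ squeue -u $USER        # show only your jobs
user@slurm:~$ squeue -j 2            # show only job 2 from the example above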

3) sinfo : to view the partition structure / sinfo -lNe : to view the partitions with hardware details
user@slurm:~$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES    STATE NODELIST
halley      up     12:00:00   40       idle  node-[1-40]
supernova*  up     1-00:00:00 20       idle  node-[41-60]

user@slurm:~$ sinfo -lNe
NODELIST     NODES  PARTITION   STATE  CPUS  S:C:T  MEMORY  TMP_DISK WEIGHT FEATURES REASON
node-[1-40]  40     halley      idle   1     1:1:1  3072    0        1      (null)   none
node-[41-60] 20     supernova*  idle   1     1:1:1  3072    0        1      (null)   none

Reference: sinfo documentation
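To query a single partition, sinfo accepts a partition filter via the standard -p option:

user@slurm:~$ sinfo -p halley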

4) scancel : to cancel a submitted job

Usage: scancel <job_id>

Reference: scancel documentation
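For example, the job submitted above can be cancelled by its job id, or all of your own jobs can be cancelled at once with the standard -u option:

user@slurm:~$ scancel 2
user@slurm:~$ scancel -u $USER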

You can download slurmscript1 and slurmscript2 to edit and submit a job. The scripts are self-explanatory thanks to their comments; if you have questions, you can send an email to admin [at] ceng.metu.edu.tr.

Installed software:

  • slurm 15.08.3
  • Python 2.7.9 / Python 3.4.2
  • openmpi-1.10.1
  • Java 1.8.0_65
  • tensorflow-0.11.0
  • Hadoop 2.6.3
  • Apache Maven 3.3.9
  • octave 3.8.2
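Since openmpi is installed, a parallel job can span several of the single-core nodes. Below is a minimal sketch, assuming a hypothetical MPI executable ./mpi_program built against the installed openmpi-1.10.1; adapt the attached sample scripts rather than using this verbatim.

#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --partition=halley        # 40 single-core nodes, 12-hour limit
#SBATCH --ntasks=8                # 8 tasks -> 8 nodes on this cluster
#SBATCH --time=02:00:00

mpirun -np 8 ./mpi_program        # ./mpi_program is a hypothetical MPI executable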