Slurm HPC

SLURM is an open-source workload management and job scheduling system. The department's Turing HPC machine contains 64 cores (4 x 16) and 192 GB of RAM.

About the System

- The topology of the cluster:
  • The server slurm(.ceng.metu.edu.tr) is for user management and for submitting jobs to turing(.ceng.metu.edu.tr). Since the sole purpose of the slurm machine is submitting jobs to the Turing HPC, it is installed with a minimal set of compilers.
  • Users should edit and use the attached sample scripts to submit a job; otherwise, the slurm machine may become unresponsive due to its restricted hardware capacity.
  • Turing HPC provides two queues (called partitions in SLURM terminology): halley and supernova. The machine is divided into identical nodes, each with 1 core and 3 GB of RAM.
  • The halley partition contains 40 nodes and is limited to a maximum running time of 12 hours, while the supernova partition contains 20 nodes and is limited to a maximum running time of 24 hours (a submission example follows this list).
  • Scheduling is FIFO (first in, first out): a submitted job waits in its partition's queue until the jobs ahead of it have completed.
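
For example, a job can be pointed at a partition, with a running time inside that partition's limit, at submission time. The script name and values below are illustrative; both flags are standard sbatch options:

user@slurm:~$ sbatch --partition=halley --time=12:00:00 slurmscript.sh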

Useful Commands

1) sbatch : to submit a job

user@slurm:~$ sbatch slurmscript.sh
Submitted batch job 2

[1] https://computing.llnl.gov/linux/slurm/sbatch.html

2) squeue : to see the status of a submitted job

user@slurm:~$ squeue
JOBID  PARTITION  NAME        USER  ST  TIME  NODES  NODELIST(REASON)
    2  halley     slurm_test  user  R   0:00     20  node-[1-20]

The possible ST (status) codes are:

CA CANCELLED
Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD COMPLETED
Job has terminated all processes on all nodes.
CF CONFIGURING
Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
CG COMPLETING
Job is in the process of completing. Some processes on some nodes may still be active.
F FAILED
Job terminated with non-zero exit code or other failure condition.
NF NODE_FAIL
Job terminated due to failure of one or more allocated nodes.
PD PENDING
Job is awaiting resource allocation.
PR PREEMPTED
Job terminated due to preemption.
R RUNNING
Job currently has an allocation.
S SUSPENDED
Job has an allocation, but execution has been suspended.
TO TIMEOUT
Job terminated upon reaching its time limit.

[2] https://computing.llnl.gov/linux/slurm/squeue.html
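
squeue also accepts filters, which is handy when the queue is long. For example (the user name and job ID here are illustrative):

user@slurm:~$ squeue -u user     # show only jobs belonging to the user "user"
user@slurm:~$ squeue -j 2        # show only the job with JOBID 2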

3) sinfo : to view the partition structure

user@slurm:~$ sinfo
PARTITION   AVAIL  TIMELIMIT   NODES  STATE  NODELIST
halley      up     12:00:00    40     idle   node-[1-40]
supernova*  up     1-00:00:00  20     idle   node-[41-60]

sinfo -lNe : to view the partitions with hardware details

user@slurm:~$ sinfo -lNe
Wed Nov 25 18:46:23 2015
NODELIST      NODES  PARTITION   STATE  CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  FEATURES  REASON
node-[1-40]   40     halley      idle   1     1:1:1  3072    0         1       (null)    none
node-[41-60]  20     supernova*  idle   1     1:1:1  3072    0         1       (null)    none

[3] https://computing.llnl.gov/linux/slurm/sinfo.html

4) scancel <job_id> : to cancel the submitted job with ID <job_id>

[4] https://computing.llnl.gov/linux/slurm/scancel.html
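
For example, cancelling the job submitted in the sbatch example above (job ID 2); scancel prints nothing on success:

user@slurm:~$ scancel 2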

Sample Scripts

You can download slurmscript1 and slurmscript2 to edit and submit a job. The scripts are self-explanatory, with comments; if you have questions, you can send an email to admin [at] ceng.metu.edu.tr.
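
As a rough outline, a Slurm batch script has the following shape. This is a minimal sketch with illustrative job parameters, not the contents of slurmscript1 or slurmscript2:

#!/bin/bash
#SBATCH --job-name=slurm_test     # name shown in the squeue NAME column
#SBATCH --partition=halley        # halley (12-hour limit) or supernova (24-hour limit)
#SBATCH --time=12:00:00           # must not exceed the partition's time limit
#SBATCH --ntasks=20               # 20 tasks, i.e. 20 single-core nodes
#SBATCH --output=slurm-%j.out     # %j expands to the job ID

srun hostname                     # replace hostname with your program; srun launches one copy per task

Submitting such a file with sbatch, as shown above, produces a job like the one in the squeue output.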

Installed Packages

slurm 15.08.3
Python 2.7.9
Python 3.4.2
openmpi-1.10.1
Java 1.8.0_65
tensorflow-0.11.0
Hadoop 2.6.3
Apache Maven 3.3.9
Octave 3.8.2 (octave-control, octave-image, octave-io, octave-optim, octave-signal, octave-statistics)
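
To verify the versions available to your jobs, the standard version flags can be used (a quick sanity check; run them from a job script if the packages live only on the Turing nodes):

user@slurm:~$ python3 --version
user@slurm:~$ java -version
user@slurm:~$ mpirun --version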
