Slurm HPC

Slurm is an open-source workload manager and job scheduler. The department's Turing HPC machine has 64 cores (4 x 16) and 192 GB of RAM.

About System

- The topology of the cluster:

The server slurm(.ceng.metu.edu.tr) handles user management and job submission to turing(.ceng.metu.edu.tr). Because the slurm machine exists only to submit jobs to the Turing HPC, it has a minimal set of compilers installed.
Use the attached sample scripts (edited as needed) to submit jobs; otherwise the slurm machine may become unresponsive because of its limited hardware capacity.
Turing HPC has two queues (called partitions in Slurm): halley and supernova. It is divided into identical nodes, each with 1 core and 3 GB of RAM.
The halley partition has 40 nodes and a maximum run time of 12 hours; the supernova partition has 20 nodes and a maximum run time of 24 hours.
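
The partition limits above can be turned into a small helper that picks a partition from the requested run time. This is only a sketch: the function name pick_partition is hypothetical, and preferring halley whenever the job fits within 12 hours is an assumed policy, not a rule of the cluster.

```shell
# Hypothetical helper: choose a partition from the requested run time in hours.
# The 12 and 24 hour limits are the ones stated for halley and supernova.
pick_partition() {
    local hours=$1
    if [ "$hours" -le 12 ]; then
        echo "halley"
    elif [ "$hours" -le 24 ]; then
        echo "supernova"
    else
        # Neither partition allows runs longer than 24 hours.
        echo "no partition allows ${hours}h; split the work into shorter jobs" >&2
        return 1
    fi
}

pick_partition 8     # prints: halley
pick_partition 20    # prints: supernova
```

On the slurm machine the result could be passed on the command line, for example sbatch --partition="$(pick_partition 8)" slurmscript.sh, since sbatch command-line options take precedence over the directives inside the script.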

Slurm schedules jobs FIFO here: a submitted job waits in its partition until the jobs ahead of it have completed.

Useful Commands

1) sbatch : to submit a job

user@slurm:~$ sbatch slurmscript.sh
Submitted batch job 2

[1] https://computing.llnl.gov/linux/slurm/sbatch.html
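
Later commands such as squeue and scancel need the job ID, so it can help to capture it from the confirmation line that sbatch prints. The sketch below assumes the message has the form shown above ("Submitted batch job 2"); parse_job_id is a hypothetical helper name.

```shell
# Extract the numeric job ID from sbatch's confirmation line.
parse_job_id() {
    # Keep only the last whitespace-separated field, e.g. "2".
    echo "$1" | awk '{ print $NF }'
}

# On the cluster this would be: output=$(sbatch slurmscript.sh)
output="Submitted batch job 2"
job_id=$(parse_job_id "$output")
echo "job id: $job_id"    # prints: job id: 2
```

Slurm releases that support sbatch --parsable can print the bare ID directly; check man sbatch on the slurm machine before relying on it.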

2) squeue : to see the status of a submitted job

user@slurm:~$ squeue
  JOBID PARTITION       NAME USER ST  TIME NODES NODELIST(REASON)
      2    halley slurm_test user  R  0:00    20 node-[1-20]

The ST column holds the job state code: PD (pending), R (running), S (suspended), CA (cancelled), CF (configuring), CG (completing), CD (completed), F (failed), TO (timeout), PR (preempted), and NF (node failure). The codes are defined as follows:

CA CANCELLED

Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.

CD COMPLETED

Job has terminated all processes on all nodes.

CF CONFIGURING

Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).

CG COMPLETING

Job is in the process of completing. Some processes on some nodes may still be active.

F FAILED

Job terminated with non-zero exit code or other failure condition.

NF NODE_FAIL

Job terminated due to failure of one or more allocated nodes.

PD PENDING

Job is awaiting resource allocation.

PR PREEMPTED

Job terminated due to preemption.

R RUNNING

Job currently has an allocation.

S SUSPENDED

Job has an allocation, but execution has been suspended.

TO TIMEOUT

Job terminated upon reaching its time limit.

[2] https://computing.llnl.gov/linux/slurm/squeue.html
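
When a script needs to decide whether a job is still in the system, the state codes above can be grouped into two classes. The grouping into "active" and "finished" below is this sketch's own convention, and state_kind is a hypothetical helper name.

```shell
# Classify a compact squeue state code (the ST column).
state_kind() {
    case "$1" in
        PD|CF|R|S|CG)      echo "active" ;;    # waiting, running, or winding down
        CD|CA|F|TO|NF|PR)  echo "finished" ;;  # terminated for any reason
        *)                 echo "unknown" ;;
    esac
}

state_kind R     # prints: active
state_kind TO    # prints: finished
```

On the cluster, the code for one job can be obtained with squeue -h -j <job_id> -o %t, where -h suppresses the header and %t selects the compact state, and then passed to state_kind.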

3) sinfo : to view the partition structure

user@slurm:~$ sinfo
PARTITION  AVAIL  TIMELIMIT NODES STATE NODELIST
halley        up   12:00:00    40  idle node-[1-40]
supernova*    up 1-00:00:00    20  idle node-[41-60]

sinfo -lNe : to view the partitions with hardware details

user@slurm:~$ sinfo -lNe
Wed Nov 25 18:46:23 2015
NODELIST      NODES PARTITION   STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
node-[1-40]      40 halley       idle    1 1:1:1   3072        0      1 (null)   none
node-[41-60]     20 supernova*   idle    1 1:1:1   3072        0      1 (null)   none

[3] https://computing.llnl.gov/linux/slurm/sinfo.html
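
The sinfo -lNe listing can be summed to see how many nodes, cores, and megabytes of memory the partitions expose in total. The sketch below assumes the column layout of the listing above (NODES in column 2, CPUS in column 5, MEMORY in MB in column 7); summarize_nodes is a hypothetical helper name.

```shell
# Sum nodes, cores, and memory from `sinfo -lNe` style output on stdin.
# Lines whose NODES or MEMORY column is not a number (the date line and
# the header) are skipped.
summarize_nodes() {
    awk '$2 ~ /^[0-9]+$/ && $7 ~ /^[0-9]+$/ {
             nodes += $2; cores += $2 * $5; mem += $2 * $7
         }
         END { printf "nodes=%d cores=%d mem_mb=%d\n", nodes, cores, mem }'
}

# Sample input copied from the listing above; on the cluster, pipe in
# the real output instead: sinfo -lNe | summarize_nodes
summarize_nodes <<'EOF'
Wed Nov 25 18:46:23 2015
NODELIST      NODES PARTITION   STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
node-[1-40]      40 halley       idle    1 1:1:1   3072        0      1 (null)   none
node-[41-60]     20 supernova*   idle    1 1:1:1   3072        0      1 (null)   none
EOF
# prints: nodes=60 cores=60 mem_mb=184320
```

The asterisk after supernova marks it as the default partition, which is where a job lands when no partition is requested.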

4) scancel <job_id> : to cancel the submitted job whose ID is <job_id>

[4] https://computing.llnl.gov/linux/slurm/scancel.html
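
Because scancel acts immediately, a thin wrapper that checks the argument first can prevent mistakes in scripts. This is a sketch, not part of Slurm: cancel_job and the SCANCEL override variable are hypothetical names introduced here.

```shell
# Hypothetical wrapper: refuse anything that is not a plain numeric job ID
# before handing it to scancel. SCANCEL can be overridden (e.g. with echo)
# to preview the call without cancelling anything.
SCANCEL=${SCANCEL:-scancel}

cancel_job() {
    case "$1" in
        ''|*[!0-9]*)
            echo "not a job ID: '$1'" >&2
            return 1 ;;
    esac
    "$SCANCEL" "$1"
}

# Dry run: prints the ID that would be passed to scancel.
SCANCEL=echo cancel_job 2    # prints: 2
```

Calling cancel_job 2 without the override runs scancel 2 on the slurm machine.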

Sample Scripts

Download slurmscript1 and slurmscript2, edit them, and submit them as jobs. The scripts are self-explanatory, with comments throughout; if you have questions, email admin [at] ceng.metu.edu.tr.
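
The attached scripts are not reproduced on this page. As a rough idea of their general shape, a minimal batch script might look like the sketch below. The job name, task count, time limit, and output file name are illustrative assumptions, not values taken from slurmscript1 or slurmscript2, and job_summary is a hypothetical helper.

```shell
#!/bin/bash
# Name shown in the NAME column of squeue.
#SBATCH --job-name=slurm_test
# Target partition: halley (up to 12 hours) or supernova (up to 24 hours).
#SBATCH --partition=halley
# Number of tasks; every node offers a single core.
#SBATCH --ntasks=4
# Requested run time, which must stay within the partition's limit.
#SBATCH --time=01:00:00
# File for standard output; %j expands to the job ID.
#SBATCH --output=slurm_test-%j.out

# Report where the job landed. SLURM_NTASKS is set by Slurm inside a job
# and falls back to 1 when the script is run by hand.
job_summary() {
    echo "tasks=${SLURM_NTASKS:-1} host=$(uname -n)"
}

job_summary
```

Saved as, say, myjob.sh (a placeholder name), it would be submitted with sbatch myjob.sh. The #SBATCH lines are ordinary comments to bash, so the script can also be run directly for a quick syntax check.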

Installed Packages

slurm 15.08.3

Python 2.7.9

Python 3.4.2

openmpi-1.10.1

Java 1.8.0_65

tensorflow-0.11.0

Hadoop 2.6.3

Apache Maven 3.3.9

octave 3.8.2 (octave-control octave-image octave-io octave-optim octave-signal octave-statistics)

References

[5] https://computing.llnl.gov/linux/slurm/slurm.conf.html
[6] https://computing.llnl.gov/linux/slurm/configurator.html
[7] https://www.schedmd.com/