SLURM is an open source workload management and job scheduling system. The department's Turing HPC machine contains 64 cores (4 x 16) and 192 GB of RAM.
About System
- The server slurm(.ceng.metu.edu.tr) handles user management and job submission to turing(.ceng.metu.edu.tr). Since its only purpose is submitting jobs to the Turing HPC, it is installed with a minimal set of compilers.
- Users should edit and use the attached sample scripts to submit jobs. Otherwise, the slurm machine may stop responding because of its limited hardware capacity.
- Turing HPC contains 2 queues (called partitions in SLURM terminology), halley and supernova. It is divided into identical nodes, each with 1 core and 3 GB of RAM.
- The halley partition contains 40 nodes and limits jobs to a maximum running time of 12 hours, while the supernova partition contains 20 nodes and limits jobs to a maximum running time of 24 hours.
- Slurm schedules jobs in FIFO order: a submitted job waits in its partition until the previously submitted jobs have completed.
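The partition limits above map onto a handful of `#SBATCH` directives at the top of a job script. The following is a minimal sketch and not one of the attached sample scripts: the job name, task count, running time, and output file name are illustrative assumptions, and it assumes one task per node because every node has 1 core.

```shell
#!/bin/bash
# Hypothetical job script: every value below is an illustrative assumption.
# Name shown in the NAME column of squeue.
#SBATCH --job-name=slurm_test
# Partition to run in: halley (max 12 hours) or supernova (max 24 hours).
#SBATCH --partition=halley
# Number of tasks; with 1 core per node, 4 tasks are expected to occupy 4 nodes.
#SBATCH --ntasks=4
# Requested running time (HH:MM:SS); it must not exceed the partition limit.
#SBATCH --time=02:00:00
# File that receives the job's output.
#SBATCH --output=slurm_test.out

# SLURM_JOB_ID and SLURM_NTASKS are set by Slurm inside a job; the fallbacks
# only exist so that the script can also be run by hand as a quick check.
job_summary="job ${SLURM_JOB_ID:-none}: ${SLURM_NTASKS:-1} task(s) on $(uname -n)"
echo "$job_summary"
```

The `#SBATCH` lines are ordinary comments to the shell, so the same file can be checked locally before it is handed to `sbatch`.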
Useful Commands
1) sbatch <script> : to submit a job script
user@slurm:~$ sbatch slurmscript.sh
Submitted batch job 2
[1] https://computing.llnl.gov/linux/slurm/sbatch.html
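When a later command needs the job number, it can be cut out of the "Submitted batch job" message. This is a small sketch; `parse_job_id` is a hypothetical helper name, and the message is hard-coded here so the example runs without access to the slurm machine.

```shell
#!/bin/bash
# Extract the numeric job id from sbatch's "Submitted batch job <id>" message.
parse_job_id() {
    # The id is the fourth whitespace-separated field of the message.
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# On the slurm machine the real output would be captured instead, e.g.:
#   job_id=$(parse_job_id "$(sbatch slurmscript.sh)")
job_id=$(parse_job_id "Submitted batch job 2")
echo "$job_id"   # prints 2
```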
2) squeue : to see the status of a submitted job
user@slurm:~$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 halley slurm_test user R 0:00 20 node-[1-20]
The ST column shows the job state. All state codes:
PD (pending), R (running), S (suspended), CA (cancelled), CF (configuring), CG (completing), CD (completed), F (failed), TO (timeout), PR (preempted), and NF (node failure):
- CA (CANCELLED): Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
- CD (COMPLETED): Job has terminated all processes on all nodes.
- CF (CONFIGURING): Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
- CG (COMPLETING): Job is in the process of completing. Some processes on some nodes may still be active.
- F (FAILED): Job terminated with non-zero exit code or other failure condition.
- NF (NODE_FAIL): Job terminated due to failure of one or more allocated nodes.
- PD (PENDING): Job is awaiting resource allocation.
- PR (PREEMPTED): Job terminated due to preemption.
- R (RUNNING): Job currently has an allocation.
- S (SUSPENDED): Job has an allocation, but execution has been suspended.
- TO (TIMEOUT): Job terminated upon reaching its time limit.
[2] https://computing.llnl.gov/linux/slurm/squeue.html
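A script that watches a job usually needs to turn the ST code into something readable and to know whether the job is over. The sketch below follows the state list above; `state_name` and `is_finished` are hypothetical helper names, and treating CA, CD, F, NF, PR, and TO as final states is an assumption based on their descriptions.

```shell
#!/bin/bash
# Map a squeue ST code to its long state name, following the list above.
state_name() {
    case "$1" in
        CA) echo "CANCELLED" ;;
        CD) echo "COMPLETED" ;;
        CF) echo "CONFIGURING" ;;
        CG) echo "COMPLETING" ;;
        F)  echo "FAILED" ;;
        NF) echo "NODE_FAIL" ;;
        PD) echo "PENDING" ;;
        PR) echo "PREEMPTED" ;;
        R)  echo "RUNNING" ;;
        S)  echo "SUSPENDED" ;;
        TO) echo "TIMEOUT" ;;
        *)  echo "UNKNOWN" ;;
    esac
}

# Succeeds when the code describes a job that will not run any further.
is_finished() {
    case "$1" in
        CA|CD|F|NF|PR|TO) return 0 ;;
        *) return 1 ;;
    esac
}

state_name PD   # prints PENDING
```

On the slurm machine, `squeue -h -j <job_id> -o %t` prints only the compact state code of one job, which can be passed directly to these helpers.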
3) sinfo : to view the partition structure
user@slurm:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
halley up 12:00:00 40 idle node-[1-40]
supernova* up 1-00:00:00 20 idle node-[41-60]
sinfo -lNe : to view the partitions with hardware details
user@slurm:~$ sinfo -lNe
Wed Nov 25 18:46:23 2015
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
node-[1-40] 40 halley idle 1 1:1:1 3072 0 1 (null) none
node-[41-60] 20 supernova* idle 1 1:1:1 3072 0 1 (null) none
[3] https://computing.llnl.gov/linux/slurm/sinfo.html
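The TIMELIMIT column mixes two notations, `12:00:00` and `1-00:00:00` (days-hours:minutes:seconds). To compare limits in a script it helps to convert both to seconds. This is a sketch under the assumption that only these two forms occur; `timelimit_to_seconds` is a hypothetical helper name.

```shell
#!/bin/bash
# Convert a sinfo TIMELIMIT value to seconds.
# Assumption: only the [days-]hours:minutes:seconds forms shown above are handled;
# any other form makes the function fail with a non-zero status.
timelimit_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4; next }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }
        { exit 1 }'
}

timelimit_to_seconds "12:00:00"     # halley limit, prints 43200
timelimit_to_seconds "1-00:00:00"   # supernova limit, prints 86400
```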
4) scancel <job_id> : to cancel the submitted job with the given <job_id>
[4] https://computing.llnl.gov/linux/slurm/scancel.html
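Because a cancelled job cannot be resumed, a script may want to check the job id before calling scancel. The wrapper below is only a sketch: `cancel_job` is a hypothetical helper, and `SCANCEL` is an illustrative override variable (not a Slurm setting) that lets the function be tried as a dry run without a Slurm installation.

```shell
#!/bin/bash
# Hypothetical wrapper: accept only a plain job number, then call scancel.
# SCANCEL is an illustrative override, not something Slurm itself reads.
cancel_job() {
    case "$1" in
        ''|*[!0-9]*)
            echo "cancel_job: '$1' is not a numeric job id" >&2
            return 1
            ;;
    esac
    # Left unquoted on purpose so that "echo scancel" splits into two words.
    ${SCANCEL:-scancel} "$1"
}

# Dry run: shows the command instead of executing it.
SCANCEL="echo scancel" cancel_job 2   # prints: scancel 2
```

On the slurm machine, `cancel_job 2` without the override runs `scancel 2`. To cancel all of one's own jobs at once, scancel also accepts `-u <username>`.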