Slurm (Simple Linux Utility for Resource Management) is a highly configurable, open-source workload manager and job scheduler for HPC clusters. It is backed by a large community and is installed on many of the TOP500 supercomputers.
Commonly Used Slurm Commands
Access a compute node interactively:
[abc123@Arc ~]$ srun --pty bash
Access a compute node in a partition (the equivalent of a queue in SGE) interactively, for example a GPU node:
[abc123@Arc ~]$ srun -p gpu --gres=gpu:k80:1 --pty bash
Note: --gres=gpu:k80:1 is essential for accessing a GPU device on the node. Without it, you can log onto a GPU node, but you cannot use a GPU device on that node. Use the following command to access a V100 GPU node:
[abc123@Arc ~]$ srun -p gpu-100 --gres=gpu:v100:1 --pty bash
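You can also request specific resources for an interactive session. The flags below are standard srun options; the partition name compute1 is taken from the sinfo output shown later on this page and may differ on your cluster:
[abc123@Arc ~]$ srun -p compute1 --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash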
Submit a Slurm job: (must be done on a login node)
abc123@Arc ~]$ sbatch jobscript
Display the partition information:
[abc123@login003 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
bigmem up 3-00:00:00 2 idle b[001-002]
softmatter up 3-00:00:00 20 idle c[079-098]
compute1* up 3-00:01:00 65 idle c[001-065]
compute2 up 3-00:01:00 25 idle c[066-074,099-114]
compute3 up 3-00:01:00 0 n/a
computedev up 2:00:00 5 idle c[115-119]
gpu1v100 up 3-00:01:00 30 idle gpu[001-030]
gpu1vector up 3-00:01:00 0 n/a
gpudev up 2:00:00 2 idle gpu[029-030]
gpu2a100 up 3-00:00:00 0 n/a
gpu2v100 up 3-00:00:00 5 idle gpu[031-035]
testing up infinite 1 drain c077
testing up infinite 3 idle c[075-076,078]
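To show only one partition, pass -p to sinfo (the partition name below is one from the listing above and is only an example):
[abc123@login003 ~]$ sinfo -p gpu1v100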
Check the status of the nodes:
[abc123@Arc ~]$sinfo -N -l
NODELIST NODES PARTITION STATE
compute001 1 defq* alloc
compute002 1 defq* alloc
compute003 1 defq* idle
.......
Show jobs status:
[abc123@Arc ~]$squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12544 softmatte test yqs327 PD 0:00 1 (Resources)
12545 softmatte test yqs327 PD 0:00 1 (Priority)
12502_1 defq Snap_new pod105 R 4:25:43 3 compute[021-023]
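If the default columns truncate the information you need (for example, "softmatter" is cut off as "softmatte" above), squeue accepts a custom output format. The format string below uses standard squeue format specifiers and is only an example:
[abc123@Arc ~]$ squeue -o "%.10i %.12P %.20j %.10u %.2t %.10M %.4D %R"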
Show jobs status of specified user:
[abc123@Arc ~]$squeue -u abc123
Show jobs in waiting:
[abc123@Arc ~]$squeue --start
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
12544 softmatte test yqs327 PD 2020-07-16T13:36:47 1 (null) (Resources)
12545 softmatte test yqs327 PD 2020-07-16T13:36:47 1 (null) (Priority)
Show jobs on a specified node:
[abc123@Arc ~]$squeue --nodelist=gpu03
Show jobs on a specified partition:
[abc123@Arc ~]$ squeue -p gpu
Show the status of a specified job. Job states:
R - the job is running on compute nodes; PD - the job is pending, waiting for compute nodes; CG - the job is completing
[abc123@Arc ~]$ squeue -j jobID
To check the details of a job (even if the job has completed):
[abc123@Arc ~]$sacct -j jobID
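sacct can also report selected fields; the field list below is an example built from standard sacct field names:
[abc123@Arc ~]$ sacct -j jobID --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS,NodeList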
Cancel a job:
[abc123@Arc ~]$ scancel 12345
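scancel also accepts filters instead of a single job ID; the following are standard scancel options shown as examples:
[abc123@Arc ~]$ scancel -u abc123                    # cancel all of your jobs
[abc123@Arc ~]$ scancel --name=my_job                # cancel jobs by job name
[abc123@Arc ~]$ scancel -u abc123 --state=PENDING    # cancel only your pending jobs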
Test a job when the partition is full: The following command does NOT actually submit the job; however, it shows when the job would be expected to start if it were submitted. You still have to submit the job to put it in the waiting queue.
[abc123@Arc ~]$sbatch --test-only test.job
sbatch: Job 21255 to start at 2020-08-14T16:36:51 using 80 processors on nodes compute001 in partition defq
Single and Multithreaded Sample Script
#!/bin/bash
#
#SBATCH --job-name=my_job
#SBATCH --output=my_output_file.txt # Delete this line if you want the output file in slurm-jobID.out format. It will be different every time you submit the job.
#SBATCH --partition=defq # defq is the default queue as the all.q in SGE scripts
#SBATCH --time=01:05:00 # Time limit hrs:min:sec. It is an estimation about how long it will take to complete the job. 72:00:00 is the maximum
#SBATCH --nodes=1 # It should be 1 for all non-mpi jobs.
#SBATCH --cpus-per-task=4 # Number of CPU cores per task. Change it to 1 if it is not a multiple-thread job
#SBATCH --ntasks=1 # It should be 1 for all non-MPI jobs. Otherwise, the same application will run multiple times simultaneously
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my-email@utsa.edu # your email address for receiving notices about your job status
. /etc/profile.d/modules.sh
module load your-modules
./your-application
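If your application is a multithreaded (for example, OpenMP) program, it is common to set the thread count from the allocated cores rather than hardcoding it. A minimal sketch, assuming an OpenMP application (the program name is a placeholder):
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # match the thread count to --cpus-per-task
./your-openmp-application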
Parallel or OpenMPI Sample Script
#!/bin/bash
#
#SBATCH --job-name=my_job
#SBATCH --output=my_output_file.txt # Delete this line if you want the output file in slurm-jobID.out format. It will be different every time you submit the job.
#SBATCH --partition=defq # defq is the default queue as the all.q in SGE scripts
#SBATCH --time=01:01:00 # Time limit hrs:min:sec. It is an estimation about how long it will take to complete the job.
#SBATCH --ntasks=40 # The number of processes of your parallel job
#SBATCH --nodes=2 # The minimum number of nodes your tasks (specified in the line above) will run on. Each node on Arc can accommodate
# at least 32 tasks. Please use a small number to conserve computing resources. 20 is the maximum number allowed on Arc
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my-email@utsa.edu
. /etc/profile.d/modules.sh
module load openmpi/3.0.1
mpirun -n $SLURM_NTASKS ./your_mpi_program
Slurm groups the standard output (the text printed on the screen in interactive mode) for each MPI process, so that the output from one MPI process does not interfere with the output from the other running MPI processes. However, for multithreaded jobs, Slurm does not have such a mechanism to coordinate the standard output of each thread. If the application does not coordinate the standard output among its threads, the content of the output file specified in the Slurm script will be the same as what you would see on the screen when the job is run interactively. It is the programmer's responsibility to coordinate the output among the threads.
Some important Slurm environment variables:
$SLURM_JOB_NAME --- Name of the job, specified by --job-name=
$SLURM_NTASKS --- Number of tasks, specified by -n, --ntasks=
$SLURM_JOB_ID --- The job ID assigned by Slurm
$SLURM_SUBMIT_DIR --- The directory from which sbatch was invoked
$SLURMD_NODENAME --- Name of the node running the job script
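A short sketch of how these variables can be used inside a job script, for example to run in a per-job working directory (the directory layout is only an assumption; adapt it to your workflow):
cd $SLURM_SUBMIT_DIR
WORKDIR=run_$SLURM_JOB_ID                    # per-job working directory under the submission directory
mkdir -p $WORKDIR && cd $WORKDIR
echo "Job $SLURM_JOB_NAME ($SLURM_JOB_ID) running on $SLURMD_NODENAME with $SLURM_NTASKS task(s)"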
Tips on the output file name
The output file name can be hardcoded, such as the following:
#SBATCH --output=my_output_file.txt
Or simply remove the line from the script; Slurm will then use the format "slurm-<jobID>.out". For example, slurm-79877.out, where 79877 is the job ID assigned by Slurm when the job was submitted. Or you can customize the output file name as below:
#SBATCH --output=something_%j.txt
Environment variables are not supported in #SBATCH lines; for example, the following does NOT work:
#SBATCH -e $JOB_NAME-$JOB_ID.log
The $JOB_ID part can be replaced by %j, but there is no replacement for $JOB_NAME.
Here are the supported symbols:
%A Job array's master job allocation number.
%a Job array ID (index) number.
%j Job allocation number.
%N Node name. (Only one file is created, so %N will be replaced by the name of the first node in the job, which is the one that runs the script)
%u User name.
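The symbols can be combined; either of the patterns below is only illustrative:
#SBATCH --output=%u_%j_%N.out       # e.g. abc123_79877_compute001.out
#SBATCH --output=array_%A_%a.out    # for a job array: master job ID and array index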
Please note: Slurm variables, such as $SLURM_JOB_ID, are not supported in naming the output files.
Tips on MPI task distribution
Both --ntasks and --nodes are required in your job script. The Slurm scheduler uses this information to determine how many MPI tasks will be executed on each compute node assigned to your job. For example, if you have --ntasks=40 and --nodes=2, 20 tasks will be executed on each of the two assigned nodes. In general, it is not good practice to run too many tasks on a single compute node. The following benchmarks demonstrate the scalability of MPI jobs on an 80-core (with hyperthreading on) compute node.
As shown in the figures above, the jobs do not scale well once --ntasks is larger than 30 in all cases. That is caused by resource contention on a single node. We recommend running fewer than 30 MPI tasks on a single node.
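If you want to control the distribution explicitly rather than letting the scheduler divide the tasks, the standard --ntasks-per-node option can be added; a sketch for the 40-task example above:
#SBATCH --nodes=2
#SBATCH --ntasks=40
#SBATCH --ntasks-per-node=20        # place exactly 20 MPI tasks on each of the 2 nodes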
GPU Resource Sample Script
Please refer to the page below for a job script that uses a GPU node:
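In the meantime, the sketch below shows the general shape of a GPU batch script. The partition name gpu1v100 and the GRES name v100 are taken from the sinfo listing above and may need to be adjusted; the module and application names are placeholders:
#!/bin/bash
#SBATCH --job-name=my_gpu_job
#SBATCH --partition=gpu1v100          # GPU partition (adjust to your cluster)
#SBATCH --gres=gpu:v100:1             # request one V100 GPU; without --gres the GPU cannot be used
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00
. /etc/profile.d/modules.sh
module load your-cuda-module          # placeholder module name
./your-gpu-application                # placeholder application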
Slurm Tutorials
https://slurm.schedmd.com/tutorials.html
SGEtoSLURMconversion.pdf