Slurm (Simple Linux Utility for Resource Management) is a highly configurable workload manager and job scheduler for HPC clusters. It is open-source software backed by a large community and is installed on many of the Top 500 supercomputers. Slurm is now the sole cluster management software on Shamu.

Commonly Used Slurm Commands

Access a compute node interactively:
[abc123@shamu ~]$ srun --pty bash

Access a compute node in a specific partition (the Slurm equivalent of an SGE queue) interactively, for example a GPU node:
[abc123@shamu ~]$ srun -p gpu --gres=gpu:k80:1 --pty bash

Note: --gres=gpu:k80:1 is required to access a GPU device on the node. Without it, you can log onto a GPU node, but you cannot use a GPU device on that node. Use the following command to access a V100 GPU node:
[abc123@shamu ~]$ srun -p gpu-v100 --gres=gpu:v100:1 --pty bash

Submit a Slurm job (must be done on a login node):
[abc123@shamu ~]$ sbatch jobscript

Display the partition information:
[abc123@shamu ~]$ sinfo
defq* up 3-00:00:00 10 alloc compute[001-002,004,012-014,021-024]
defq* up 3-00:00:00 42 idle compute[003,006-008,015-020,028-031,033-036,038-057,088-091]
bigmem up 3-00:00:00 1 idle compute009
gpu up 3-00:00:00 2 idle gpu[01-02]
plasmon up infinite 1 idle compute025
ids up infinite 2 idle compute[010-011]
millwater up infinite 1 idle compute005
softmatter up infinite 20 alloc compute[092-111]
gpu-v100 up 3-00:00:00 2 idle gpu[03-04]
testing up 15:00 2 idle compute[032,037]
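
To find a partition with free nodes, the sinfo listing can be summarized with a small filter. The following is only a sketch: idle_nodes is a hypothetical helper name, not a Slurm command, and it assumes the column order shown above (partition, availability, time limit, node count, state, node list).

```shell
#!/bin/bash
# Hypothetical helper: sum the idle nodes per partition from sinfo-style
# output read on stdin. Assumes the column order shown above:
# PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
idle_nodes() {
    awk '$5 == "idle" { sub(/\*$/, "", $1); idle[$1] += $4 }
         END { for (p in idle) print p, idle[p] }' | sort
}

# Demonstration on lines abbreviated from the listing above.
printf '%s\n' \
    'defq* up 3-00:00:00 10 alloc compute[001-002]' \
    'defq* up 3-00:00:00 42 idle compute[003,006-008]' \
    'gpu up 3-00:00:00 2 idle gpu[01-02]' | idle_nodes
```

On the cluster, the live listing would be piped in with: sinfo | idle_nodes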

Check the status of the nodes:
[abc123@shamu ~]$ sinfo -N -l
NODELIST NODES PARTITION STATE
compute001 1 defq* alloc
compute002 1 defq* alloc
compute003 1 defq* idle
.......

Show job status:
[abc123@shamu ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12544 softmatte test yqs327 PD 0:00 1 (Resources)
12545 softmatte test yqs327 PD 0:00 1 (Priority)
12502_1 defq Snap_new pod105 R 4:25:43 3 compute[021-023]

Show the jobs of a specified user:
[abc123@shamu ~]$ squeue -u abc123

Show pending jobs with their expected start times:
[abc123@shamu ~]$ squeue --start
JOBID PARTITION NAME USER ST START_TIME NODES SCHEDNODES NODELIST(REASON)
12544 softmatte test yqs327 PD 2020-07-16T13:36:47 1 (null) (Resources)
12545 softmatte test yqs327 PD 2020-07-16T13:36:47 1 (null) (Priority)

Show jobs on a specified node:
[abc123@shamu ~]$ squeue --nodelist=gpu03

Show jobs on a specified partition:
[abc123@shamu ~]$ squeue -p gpu

Show the status of a specified job:
[abc123@shamu ~]$ squeue -j jobID

Job states:
R - Job is running on compute nodes
PD - Job is pending, waiting for compute nodes
CG - Job is completing
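
When squeue output is processed in a script, the state codes can be turned into readable text. A minimal sketch, covering only the three codes named on this page; describe_state is a hypothetical helper name, and Slurm defines more states than these.

```shell
#!/bin/bash
# Hypothetical helper: translate the state codes described on this page
# into words. Any other code is reported as "other".
describe_state() {
    case "$1" in
        R)  echo "running" ;;
        PD) echo "pending" ;;
        CG) echo "completing" ;;
        *)  echo "other ($1)" ;;
    esac
}

describe_state PD
```

On the cluster, the code for one job can be obtained with squeue -h -j jobID -o %t and passed to the helper as its argument.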

Check the details of a job (even after the job has completed):
[abc123@shamu ~]$ sacct -j jobID

Cancel a job:
[abc123@shamu ~]$ scancel 12345

Test a job when the partition is full:

The following command does NOT actually submit the job; it only shows when the job would be expected to start if it were submitted. You still have to submit the job with sbatch to put it in the waiting queue.
[abc123@shamu ~]$ sbatch --test-only test.job
sbatch: Job 21255 to start at 2020-08-14T16:36:51 using 80 processors on nodes compute001 in partition defq
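
If only the estimated start time is needed, it can be extracted from that message. This is a sketch that assumes the message format shown above; estimated_start is a hypothetical helper name, not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper: pull the estimated start time out of a message in
# the format shown above ("... to start at <timestamp> using ...").
estimated_start() {
    sed -n 's/.* to start at \([^ ]*\) using .*/\1/p'
}

# Demonstration on the sample message from this page.
echo 'sbatch: Job 21255 to start at 2020-08-14T16:36:51 using 80 processors on nodes compute001 in partition defq' | estimated_start
```

On the cluster this could be used as: sbatch --test-only test.job 2>&1 | estimated_start. The 2>&1 redirection is there because the message may be written to standard error rather than standard output.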

Single and Multithreaded Sample Script

#!/bin/bash
#
#SBATCH --job-name=my_job
#SBATCH --output=my_output_file.txt    # Delete this line to get the default slurm-jobID.out file name, which is different for every submission.
#SBATCH --partition=defq               # defq is the default partition, like all.q in SGE scripts
#SBATCH --time=01:05:00                # Time limit hrs:min:sec. An estimate of how long the job will take to complete. 72:00:00 is the maximum
#SBATCH --nodes=1                      # It should be 1 for all non-MPI jobs.
#SBATCH --cpus-per-task=4              # Number of CPU cores per task. Change it to 1 if it is not a multithreaded job
#SBATCH --ntasks=1                     # It should be 1 for all non-MPI jobs. Otherwise, the same application will run multiple times simultaneously
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my-email@utsa.edu  # your email address for receiving notices about your job status

. /etc/profile.d/modules.sh
module load your-modules
./your-application
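
For a multithreaded program, the number of threads started by the application usually needs to match --cpus-per-task. A minimal sketch, assuming the application is OpenMP-based and reads OMP_NUM_THREADS (an assumption; this page does not say which threading library your application uses). Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is given, hence the fallback.

```shell
#!/bin/bash
# Match the thread count to the cores Slurm allocated per task.
# SLURM_CPUS_PER_TASK is only set when --cpus-per-task is given,
# so fall back to 1 thread otherwise (for example outside a job).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with $OMP_NUM_THREADS thread(s)"
```

In the script above, these lines would go after the module load line and before ./your-application.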

Parallel or OpenMPI Sample Script

#!/bin/bash
#
#SBATCH --job-name=my_job
#SBATCH --output=my_output_file.txt    # Delete this line to get the default slurm-jobID.out file name, which is different for every submission.
#SBATCH --partition=defq               # defq is the default partition, like all.q in SGE scripts
#SBATCH --time=01:01:00                # Time limit hrs:min:sec. An estimate of how long the job will take to complete.
#SBATCH --ntasks=40                    # The number of processes of your parallel job
#SBATCH --nodes=2                      # The minimum number of nodes your processes (specified by --ntasks above) will run on. Each node on Shamu can accommodate
                                       # at least 32 tasks. Please use a small number to conserve computing resources. 20 is the maximum number allowed on Shamu
                                        
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my-email@utsa.edu
. /etc/profile.d/modules.sh
module load openmpi/3.0.1

mpirun -n $SLURM_NTASKS ./your_mpi_program

Slurm groups the standard output (the text printed on the screen in interactive mode) of each MPI process, so that the output from one MPI process does not interfere with the output from the other running MPI processes. For multithreaded jobs, however, Slurm has no such mechanism to coordinate the standard output of each thread. If the application does not coordinate the standard output among its threads, the content of the output file specified in the Slurm script will be the same as what you see on the screen when the job is run interactively. It is the programmer's responsibility to coordinate the output among the threads.

Some important Slurm Environment Variables:
$SLURM_JOB_NAME   --- Name of the job, specified by --job-name=
$SLURM_NTASKS     --- Number of tasks, specified by -n, --ntasks=
$SLURM_JOB_ID     --- The job ID assigned by Slurm
$SLURM_SUBMIT_DIR --- The directory from which sbatch was invoked
$SLURMD_NODENAME  --- Name of the node running the job script
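
These variables can be used inside a job script, for example to log where and how a job ran. A minimal sketch; job_summary is a hypothetical helper name, and the fallbacks only exist so that the function also runs outside a Slurm job.

```shell
#!/bin/bash
# Print a one-line job summary from the variables listed above.
# The ${VAR:-unset} fallbacks keep the function usable outside a job.
job_summary() {
    echo "job=${SLURM_JOB_NAME:-unset} id=${SLURM_JOB_ID:-unset} ntasks=${SLURM_NTASKS:-unset} node=${SLURMD_NODENAME:-unset} dir=${SLURM_SUBMIT_DIR:-unset}"
}

job_summary
```

Calling job_summary near the top of a job script puts this line at the start of the job's output file.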

Tips on MPI task distribution

Both --ntasks and --nodes are required in your job script. The Slurm scheduler uses them to determine how many MPI tasks will be executed on each compute node assigned to your job. For example, with --ntasks=40 and --nodes=2, 20 tasks will be executed on each of the two assigned nodes. In general, it is not good practice to run too many tasks on a single compute node. The following benchmarks demonstrate the scalability of MPI jobs on an 80-core (with hyperthreading on) compute node.


As shown in the figures above, the jobs do not scale well once --ntasks is larger than 30 in any of the cases. This is caused by resource contention on a single node. We recommend executing fewer than 30 MPI tasks on a single node.
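
The arithmetic behind this recommendation can be checked before submitting. The following is a sketch that assumes tasks are split evenly across nodes, as in the --ntasks=40, --nodes=2 example; tasks_per_node and check_layout are hypothetical helper names, not Slurm commands.

```shell
#!/bin/bash
# Hypothetical helper: estimate how many MPI tasks land on each node when
# tasks are spread evenly, rounding up when the division is not exact.
tasks_per_node() {
    local ntasks=$1 nodes=$2
    echo $(( (ntasks + nodes - 1) / nodes ))
}

# Hypothetical check against the guideline of fewer than 30 tasks per node.
check_layout() {
    local per_node
    per_node=$(tasks_per_node "$1" "$2")
    if [ "$per_node" -lt 30 ]; then
        echo "ok: $per_node tasks per node"
    else
        echo "warning: $per_node tasks per node; consider more nodes"
    fi
}

check_layout 40 2   # the --ntasks=40, --nodes=2 example from the text
```

The first argument is the --ntasks value and the second is the --nodes value.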

GPU Resource Sample Script

Please refer to the page below for a job script that uses a GPU node:
Using the GPU resources

Slurm Tutorials

https://slurm.schedmd.com/tutorials.html

SGEtoSLURMconversion.pdf

-- Zhiwei - 14 Jul 2020
Topic revision: r8 - 23 Sep 2020, AdminUser