UseSlurmOnShamu < Main


Slurm (Simple Linux Utility for Resource Management) is a highly configurable workload manager and job scheduler for HPC clusters. It is open-source software backed by a large community and is installed on many of the Top 500 supercomputers. Slurm is now the sole cluster management software on Shamu.

Commonly Used Slurm Commands

Access a compute node interactively:

[abc123@shamu ~]$ srun --pty bash

Access a compute node in a specific partition (the Slurm equivalent of an SGE queue) interactively, for example a GPU node:

[abc123@shamu ~]$ srun -p gpu --gres=gpu:k80:1 --pty bash

Note: --gres=gpu:k80:1 is required to access a GPU device on the node. Without it, you can log onto a GPU node, but you cannot use its GPU devices. Use the following command to access a V100 GPU node:

[abc123@shamu ~]$ srun -p gpu-v100 --gres=gpu:v100:1 --pty bash
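The pairing of partition and --gres value is easy to get wrong, so here is a minimal sketch of a helper that prints the matching srun command for a GPU model. The function name gpu_srun_cmd is hypothetical, and the partition names (gpu for K80, gpu-v100 for V100) are an assumption based on the partitions reported by sinfo on this page. It only prints the command; it does not run it.

```shell
# Hypothetical helper: print the interactive srun command for a GPU model.
# Assumed mapping: k80 -> partition "gpu", v100 -> partition "gpu-v100".
gpu_srun_cmd() {
    model=$1
    count=${2:-1}          # number of GPU devices, default 1
    case "$model" in
        k80)  partition=gpu ;;
        v100) partition=gpu-v100 ;;
        *)    echo "unknown GPU model: $model" >&2; return 1 ;;
    esac
    echo "srun -p $partition --gres=gpu:$model:$count --pty bash"
}

gpu_srun_cmd v100
```

Copy the printed line to the prompt on a login node once it looks right.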

Submit a Slurm job (must be done on a login node):

[abc123@shamu ~]$ sbatch jobscript

Display the partition information:

[abc123@shamu ~]$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*         up 3-00:00:00     10  alloc compute[001-002,004,012-014,021-024]
defq*         up 3-00:00:00     42   idle compute[003,006-008,015-020,028-031,033-036,038-057,088-091]
bigmem        up 3-00:00:00      1   idle compute009
gpu           up 3-00:00:00      2   idle gpu[01-02]
plasmon       up   infinite      1   idle compute025
ids           up   infinite      2   idle compute[010-011]
millwater     up   infinite      1   idle compute005
softmatter    up   infinite     20  alloc compute[092-111]
gpu-v100      up 3-00:00:00      2   idle gpu[03-04]
testing       up      15:00      2   idle compute[032,037]
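To see at a glance where free nodes are, the listing above can be summarized per partition. The following is a sketch only: the function name idle_nodes is hypothetical, and it assumes the column order shown above (partition, availability, time limit, node count, state, node list).

```shell
# Sum the idle nodes per partition from sinfo-style lines read on stdin.
# Assumed columns: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST.
idle_nodes() {
    awk '$5 == "idle" { sub(/\*$/, "", $1); idle[$1] += $4 }
         END { for (p in idle) print p, idle[p] }' | sort
}

# Sample lines copied from the listing above.
idle_nodes <<'EOF'
defq*        up 3-00:00:00    10 alloc compute[001-002,004,012-014,021-024]
defq*        up 3-00:00:00    42  idle compute[003,006-008,015-020,028-031,033-036,038-057,088-091]
gpu          up 3-00:00:00     2  idle gpu[01-02]
softmatter   up  infinite    20 alloc compute[092-111]
EOF
```

On Shamu you would pipe the output of sinfo -h (no header line) into idle_nodes instead of the sample text.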

Check the status of the nodes:

[abc123@shamu ~]$ sinfo -N -l
NODELIST   NODES PARTITION STATE 
compute001     1     defq* alloc 
compute002     1     defq* alloc 
compute003     1     defq* idle 
.......

Show job status:

[abc123@shamu ~]$ squeue
JOBID PARTITION    NAME    USER ST      TIME NODES NODELIST(REASON)
12544 softmatte    test  yqs327 PD      0:00     1 (Resources)
12545 softmatte    test  yqs327 PD      0:00     1 (Priority)
12502_1     defq Snap_new  pod105 R   4:25:43     3 compute[021-023]

Show the job status for a specified user:

[abc123@shamu ~]$ squeue -u abc123

Show waiting jobs and their expected start times:

[abc123@shamu ~]$ squeue --start
JOBID PARTITION    NAME    USER ST         START_TIME NODES SCHEDNODES          NODELIST(REASON)
12544 softmatte    test  yqs327 PD 2020-07-16T13:36:47     1 (null)              (Resources)
12545 softmatte    test  yqs327 PD 2020-07-16T13:36:47     1 (null)              (Priority)

Show jobs on a specified node:

[abc123@shamu ~]$ squeue --nodelist=gpu03

Show jobs on a specified partition:

[abc123@shamu ~]$ squeue -p gpu

Show the status of a specific job:

[abc123@shamu ~]$ squeue -j jobID

Job states:
R - Job is running on compute nodes
PD - Job is pending (waiting for compute nodes)
CG - Job is completing
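The two-letter codes in the ST column of squeue can be turned into readable text with a small lookup. This is a sketch under stated assumptions: describe_state is a hypothetical helper, and it covers only the three states listed on this page.

```shell
# Hypothetical helper: translate the ST column of squeue into words.
# Only the three states described on this page are handled explicitly.
describe_state() {
    case "$1" in
        R)  echo "running on compute nodes" ;;
        PD) echo "pending, waiting for compute nodes" ;;
        CG) echo "completing" ;;
        *)  echo "other state ($1), see the squeue man page" ;;
    esac
}

# Annotate squeue-style lines (sample rows taken from this page).
while read -r jobid partition name user st rest; do
    echo "$jobid: $(describe_state "$st")"
done <<'EOF'
12544 softmatte    test  yqs327 PD      0:00     1 (Resources)
12502_1     defq Snap_new  pod105 R   4:25:43     3 compute[021-023]
EOF
```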

Check the details of a job (even after the job has completed):

[abc123@shamu ~]$ sacct -j jobID

Cancel a job:

[abc123@shamu ~]$ scancel 12345

Test a job when the partition is full:

The following command does NOT actually submit the job. Instead, it shows when the job would be expected to start if it were submitted now. You still have to submit the job with a regular sbatch command to place it in the waiting queue.

[abc123@shamu ~]$ sbatch --test-only test.job
sbatch: Job 21255 to start at 2020-08-14T16:36:51 using 80 processors on nodes compute001 in partition defq
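If you only need the predicted start time, it can be pulled out of that message with sed. This is a minimal sketch: predicted_start is a hypothetical helper, and it assumes the message keeps the wording of the sample line above.

```shell
# Extract the predicted start time from "sbatch --test-only" style output
# read on stdin. Assumes the "... to start at <time> using ..." wording.
predicted_start() {
    sed -n 's/.* to start at \([^ ]*\) using .*/\1/p'
}

# Sample message copied from this page.
echo "sbatch: Job 21255 to start at 2020-08-14T16:36:51 using 80 processors on nodes compute001 in partition defq" | predicted_start
```

On the cluster the message may arrive on standard error rather than standard output, so a likely form is sbatch --test-only test.job 2>&1 | predicted_start.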

Single and Multithreaded Sample Script

#!/bin/bash
#
#SBATCH --job-name=my_job
#SBATCH --output=my_output_file.txt    # Delete this line if you want the output file in slurm-jobID.out format, which is different every time you submit the job.
#SBATCH --partition=defq               # defq is the default partition, equivalent to all.q in SGE scripts
#SBATCH --time=01:05:00                # Time limit hrs:min:sec. It is an estimate of how long the job will take to complete. 72:00:00 is the maximum
#SBATCH --nodes=1                      # It should be 1 for all non-MPI jobs.
#SBATCH --cpus-per-task=4              # Number of CPU cores per task. Change it to 1 if it is not a multithreaded job
#SBATCH --ntasks=1                     # It should be 1 for all non-MPI jobs. Otherwise, the same application will run multiple times simultaneously
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my-email@utsa.edu  # Your email address for receiving notices about your job status

. /etc/profile.d/modules.sh
module load your-modules
./your-application
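Before submitting, it can help to confirm that the --time value stays within the 72:00:00 maximum mentioned in the script comments. The sketch below uses two hypothetical helpers, to_seconds and within_limit, and handles only the hrs:min:sec form used on this page (Slurm accepts other time formats that are not covered here).

```shell
# Convert an hrs:min:sec string to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed only if the requested limit does not exceed 72:00:00.
within_limit() {
    [ "$(to_seconds "$1")" -le "$(to_seconds 72:00:00)" ]
}

within_limit 01:05:00 && echo "01:05:00 is within the limit"
within_limit 80:00:00 || echo "80:00:00 exceeds the limit"
```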

Parallel or OpenMPI Sample Script

#!/bin/bash
#
#SBATCH --job-name=my_job
#SBATCH --output=my_output_file.txt    # Delete this line if you want the output file in slurm-jobID.out format, which is different every time you submit the job.
#SBATCH --partition=defq               # defq is the default partition, equivalent to all.q in SGE scripts
#SBATCH --time=01:01:00                # Time limit hrs:min:sec. It is an estimate of how long the job will take to complete.
#SBATCH --ntasks=40                    # The number of processes of your parallel job
#SBATCH --nodes=2                      # The minimum number of nodes your processes (specified by --ntasks above) will run on. Each node on Shamu can accommodate
                                       # at least 32 tasks. Please use a small number to conserve computing resources. 20 is the maximum number allowed on Shamu
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my-email@utsa.edu
. /etc/profile.d/modules.sh
module load openmpi/3.0.1

mpirun -n $SLURM_NTASKS ./your_mpi_program

Slurm groups the standard output (the text printed on the screen in interactive mode) of each MPI process, so that output from one MPI process does not interfere with output from the other running MPI processes. For multithreaded jobs, however, Slurm has no such mechanism to coordinate the standard output of each thread. If the application does not coordinate output among its threads, the output file specified in the Slurm script will contain the same text that you would see on the screen when running the job interactively. It is the programmer's responsibility to coordinate output among the threads.

Some important Slurm environment variables:

$SLURM_JOB_NAME      --- Name of the job, specified by --job-name=
$SLURM_NTASKS        --- Number of tasks, specified by -n or --ntasks=
$SLURM_JOB_ID        --- The job ID assigned by Slurm
$SLURM_SUBMIT_DIR    --- The directory from which sbatch was invoked
$SLURMD_NODENAME     --- Name of the node running the job script
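These variables are typically read inside the job script itself. As a minimal sketch, the hypothetical function job_summary below prints them on one line and falls back to the placeholder "unset" when it is run outside a Slurm job, so it can be tried on a login node as well.

```shell
# Print a one-line summary of the Slurm variables listed above.
# Outside a job the variables are empty, so a placeholder is shown instead.
job_summary() {
    echo "job=${SLURM_JOB_NAME:-unset} id=${SLURM_JOB_ID:-unset} ntasks=${SLURM_NTASKS:-unset} node=${SLURMD_NODENAME:-unset} dir=${SLURM_SUBMIT_DIR:-unset}"
}

job_summary
```

Placing a call like this near the top of a job script records in the output file which node and directory the job ran from.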

Tips on the output file name

The output file name can be hardcoded, as follows:

#SBATCH --output=my_output_file.txt

Or simply remove the line from the script. Slurm will then use the default format "slurm-jobID.out", for example slurm-79877.out, where 79877 is the job ID assigned by Slurm when the job was submitted.

Or you can customize the output file name as below:

#SBATCH --output=something_%j.txt

Shell-style variables are not supported in #SBATCH lines, so the following does not work:
#SBATCH -e $JOB_NAME-$JOB_ID.log

$JOB_ID can be replaced by %j, but none of the symbols listed below replaces $JOB_NAME.

Here are the supported symbols:
%A Job array's master job allocation number.
%a Job array ID (index) number.
%j Job allocation number.
%N Node name. (Only one file is created, so %N will be replaced by the name of the first node in the job, which is the one that runs the script)
%u User name.

Please note: Slurm variables, such as $SLURM_JOB_ID, are not supported in naming the output files.
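To preview what a pattern will produce before submitting, the substitution can be imitated in the shell. This is an illustration only, not Slurm's own code: expand_pattern is a hypothetical helper that handles just %j, %u and %N, with the job ID, user name and node name passed in by hand.

```shell
# Hypothetical preview of an output file name pattern.
# Arguments: pattern, job ID, user name, node name.
expand_pattern() {
    pattern=$1
    jobid=$2
    user=$3
    node=$4
    echo "$pattern" | sed -e "s/%j/$jobid/g" -e "s/%u/$user/g" -e "s/%N/$node/g"
}

expand_pattern "something_%j.txt" 79877 abc123 compute001
```

With the sample values, the pattern something_%j.txt previews as something_79877.txt.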

Tips on MPI task distribution

Both --ntasks and --nodes are required in your job script. The Slurm scheduler uses them to determine how many MPI tasks will be executed on each compute node assigned to your job. For example, with --ntasks=40 and --nodes=2, 20 tasks will be executed on each of the two assigned nodes. In general, it is not good practice to run too many tasks on a single compute node. The following benchmarks demonstrate the scalability of MPI jobs on an 80-core compute node (with hyperthreading on).

As shown in the benchmark figures, the jobs do not scale well once --ntasks exceeds 30 in any of the cases. This is caused by resource contention on a single node. We recommend running fewer than 30 MPI tasks on a single node.
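The arithmetic behind this recommendation can be checked before submitting. The sketch below assumes the even split described above (tasks divided across nodes, rounded up); tasks_per_node and check_distribution are hypothetical helper names, and actual placement is decided by the scheduler.

```shell
# Tasks per node when --ntasks is split evenly over --nodes (rounded up).
# Arguments: ntasks, nodes.
tasks_per_node() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Flag layouts that put 30 or more MPI tasks on one node.
check_distribution() {
    per_node=$(tasks_per_node "$1" "$2")
    if [ "$per_node" -ge 30 ]; then
        echo "$per_node tasks per node: consider more nodes"
    else
        echo "$per_node tasks per node: ok"
    fi
}

check_distribution 40 2
check_distribution 40 1
```

With --ntasks=40, two nodes give 20 tasks per node, while a single node would carry all 40 and is flagged.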

GPU Resource Sample Script

Please refer to the page below for the job script to use a GPU node:

Using the GPU resources

Slurm Tutorials

https://slurm.schedmd.com/tutorials.html

SGEtoSLURMconversion.pdf

-- Zhiwei - 14 Jul 2020

Topic revision: r9 - 23 Mar 2021, AdminUser
