Managing submitted batch jobs on Arc involves monitoring, controlling, and troubleshooting them with Slurm commands. The steps below cover the most common tasks:

1. Check Your Jobs

View all your active and queued jobs:
squeue -u $USER

or, giving your username explicitly (abc123 is a placeholder):
squeue -u abc123

This displays job IDs, names, partitions, statuses (e.g., R = running, PD = pending), and nodes assigned.
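
When the queue is long, a per-state count can be easier to read than the full listing. The following is a minimal sketch rather than an Arc-provided tool: the function name count_jobs_by_state is hypothetical, and it assumes the standard squeue options -h (no header) and -o "%T" (full state name) behave on Arc as they do in stock Slurm.

```shell
# Count one user's jobs per state (e.g. PENDING, RUNNING).
# count_jobs_by_state is a hypothetical helper name, not a Slurm command.
count_jobs_by_state() {
    # $1 = username; falls back to $USER when omitted
    squeue -h -u "${1:-$USER}" -o "%T" |
        awk '{ n[$1]++ } END { for (s in n) print s, n[s] }' |
        sort
}
```

Calling count_jobs_by_state with no argument summarizes your own jobs, one line per state.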

2. Check Job Details

To see detailed information about a specific job:
scontrol show job <jobid>

Example:
scontrol show job 123456

This displays the job's requested resources, node allocation, start time, and, for a pending job, the reason it is still waiting (the Reason field).
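
The scontrol output is long, so it can help to pull out a single field. The sketch below assumes the usual Key=Value layout of scontrol show job and only works for values without spaces (such as JobState or Reason); job_field is a hypothetical helper name.

```shell
# Print one field from "scontrol show job" output.
# job_field is a hypothetical helper; it assumes space-separated Key=Value
# tokens and does not handle values that themselves contain spaces.
job_field() {
    # $1 = job ID, $2 = field name, e.g. JobState or Reason
    scontrol show job "$1" |
        tr ' ' '\n' |
        awk -F= -v key="$2" '$1 == key { print substr($0, length(key) + 2); exit }'
}
```

For example, job_field 123456 Reason prints only the pending reason for job 123456.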

3. Check Job Output

Slurm writes a job's standard output and standard error to the files specified in your job script. The files are created when the job starts and can be read while it runs or after it finishes:
#SBATCH --output=output.log
#SBATCH --error=error.log

You can view them using:
cat output.log
less error.log

If you specify neither #SBATCH --output nor #SBATCH --error, Slurm writes both standard output and standard error to a single file named slurm-<jobid>.out.
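
Fixed names such as output.log are overwritten by every new submission. A common way around this is the %j filename pattern, which Slurm replaces with the job ID. The job script below is an illustrative sketch only: the job name, file names, and time limit are assumptions, not Arc defaults.

```shell
#!/bin/bash
# Hypothetical job name, used here only for illustration
#SBATCH --job-name=demo
# %j expands to the job ID, so each run gets its own log files
#SBATCH --output=demo-%j.out
#SBATCH --error=demo-%j.err
# Assumed 5-minute limit for the example
#SBATCH --time=00:05:00

# SLURM_JOB_ID is set by Slurm inside a job; fall back when run by hand
job_id="${SLURM_JOB_ID:-unknown}"
echo "Job ${job_id} running on $(uname -n)"
```

While such a job is running, tail -f demo-<jobid>.out follows its output as it is written.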

4. Cancel a Job

If you need to stop a running or queued job:
scancel <jobid>

To cancel all your jobs:
scancel -u $USER
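
Cancelling everything is often more than intended. As a hedged sketch, scancel can be limited to queued jobs with its --state option, which leaves running jobs untouched; cancel_pending is a hypothetical wrapper name.

```shell
# Cancel only the queued (PENDING) jobs of one user.
# cancel_pending is a hypothetical wrapper, not a Slurm command.
cancel_pending() {
    # $1 = username; falls back to $USER when omitted
    scancel --user="${1:-$USER}" --state=PENDING
}
```

Checking squeue -u $USER first shows which jobs would be affected.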

5. Monitor Job Efficiency

After completion, check how much CPU, memory, and time your job used:
sacct -j <jobid> --format=JobID,JobName,Elapsed,State,AllocCPUs,MaxRSS

This helps optimize future job submissions.
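
One way to turn these numbers into a single figure is CPU efficiency: CPU time actually used divided by elapsed time multiplied by allocated CPUs. The sketch below is an assumption-laden illustration, not an Arc tool: it adds the sacct field TotalCPU (not in the command above), reads only the first accounting line of the job, and uses the hypothetical helper names to_seconds and cpu_efficiency.

```shell
# Convert a Slurm duration ([DD-][HH:]MM:SS[.mmm]) to whole seconds.
to_seconds() {
    echo "$1" | awk '{
        days = 0; t = $1
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, p, ":")
        if (n == 3)      s = p[1] * 3600 + p[2] * 60 + p[3]
        else if (n == 2) s = p[1] * 60 + p[2]
        else             s = p[1]
        printf "%d\n", days * 86400 + s
    }'
}

# Print CPU efficiency in percent for a finished job ("n/a" if unknown).
# Runs in a subshell so its variables do not leak into the caller.
cpu_efficiency() (
    # -n drops the header, -P prints fields separated by "|"
    line=$(sacct -j "$1" -n -P --format=Elapsed,TotalCPU,AllocCPUS | head -n 1)
    elapsed=$(to_seconds "$(echo "$line" | cut -d'|' -f1)")
    used=$(to_seconds "$(echo "$line" | cut -d'|' -f2)")
    ncpus=$(echo "$line" | cut -d'|' -f3)
    awk -v e="$elapsed" -v u="$used" -v n="$ncpus" \
        'BEGIN { if (e * n > 0) printf "%.1f\n", 100 * u / (e * n); else print "n/a" }'
)
```

A low percentage suggests the job requested more CPUs than it could keep busy, so a smaller request may start sooner.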

6. Requeue or Hold Jobs

  • Requeue a failed job:

     scontrol requeue <jobid>
  • Hold or release a job:

     scontrol hold <jobid>
     scontrol release <jobid>

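Holding jobs one at a time is tedious when many are queued. The following sketch combines squeue and scontrol hold to hold every pending job of one user; hold_pending is a hypothetical helper name, and the handling of job-array IDs is not covered here.

```shell
# Hold all of one user's pending jobs.
# hold_pending is a hypothetical helper; it runs in a subshell so the
# loop variable does not leak into the caller.
hold_pending() (
    # -h: no header, -t PENDING: queued jobs only, -o "%i": job ID only
    for id in $(squeue -h -u "${1:-$USER}" -t PENDING -o "%i"); do
        scontrol hold "$id"
    done
)
```

The held jobs stay in the queue and can be released again one by one with scontrol release.
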
7. Check System Partitions

To view the status of all partitions:
sinfo

This shows available partitions, node states, and time limits.
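
To see where a job might start soonest, it can help to count idle nodes per partition. This is a minimal sketch using standard sinfo options (-h, -t idle, and the %P and %D format fields); idle_nodes_by_partition is a hypothetical helper name, and Arc's partition names are not assumed.

```shell
# Print "<partition> <idle node count>", one line per partition.
# idle_nodes_by_partition is a hypothetical helper, not a Slurm command.
idle_nodes_by_partition() {
    sinfo -h -t idle -o "%P %D" |
        awk '{
            gsub(/\*$/, "", $1)   # sinfo marks the default partition with "*"
            n[$1] += $2
        } END { for (p in n) print p, n[p] }' |
        sort
}
```

Partitions with no idle nodes do not appear in the output at all.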

For a complete guide to Slurm, please refer to the official documentation at https://slurm.schedmd.com/documentation.html.
Topic revision: r1 - 08 Oct 2025, ZhiweiWang