If you have a Python script that is expected to run for more than 72 hours on Arc, we suggest breaking it into a few smaller tasks, so that each task runs for less than 72 hours. You can submit the tasks as dependency jobs so that they run in a pre-defined order. This not only works around the 72-hour run time limit, but also lets your script save intermediate results and resume execution from the point where those results were saved, which is particularly useful in case of system failures or other unexpected events that interrupt your job.
As an example, let us consider a simple Python script that reads a NumPy matrix from a file, makes changes to the matrix in a loop with 6000 iterations, and writes the updated matrix back to the same file.
import numpy
import sys

# Read the matrix from the CSV file named on the command line.
if len(sys.argv) == 2:
    m = numpy.loadtxt(open(sys.argv[1], "rb"), delimiter=",", skiprows=0).astype("float")
    print(m)
else:
    print("Wrong command-line arguments")
    sys.exit(1)

# Update the matrix in a loop with 6000 iterations (the simple addition
# stands in for a much more time-consuming operation).
for i in range(6000):
    m = m + 1
print(m)

# Write the updated matrix back to the same file.
numpy.savetxt(sys.argv[1], m, delimiter=",")
In the above example, suppose each iteration performs a very time-consuming operation on the matrix rather than a simple addition; with 6000 iterations, the script may then take more than 72 hours to run on Arc. You can simply change the 6000 in the code above to a smaller number (2000, for example) and run the script three times to achieve the same result. In interactive mode, you would have to wait for the previous run to complete before starting the next one. Slurm, however, allows you to create dependent batch jobs so that you can submit all three jobs at the same time.
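One way to avoid editing the script between chunks is to pass the per-chunk iteration count on the command line. The sketch below is only an illustration and not part of the original example: it assumes an optional second argument for test.py that defaults to 2000 iterations.

import numpy
import sys

# Assumed usage: python test.py <matrix.csv> [iterations]
if len(sys.argv) < 2:
    print("Usage: python test.py <matrix.csv> [iterations]")
    sys.exit(1)

# Per-chunk iteration count; the default of 2000 means three chunks
# reproduce a single 6000-iteration run.
iterations = int(sys.argv[2]) if len(sys.argv) > 2 else 2000

m = numpy.loadtxt(open(sys.argv[1], "rb"), delimiter=",", skiprows=0).astype("float")
for i in range(iterations):
    m = m + 1  # placeholder for the real, time-consuming update
print(m)
numpy.savetxt(sys.argv[1], m, delimiter=",")

Because the matrix is written back to the same file at the end of every run, each chunk automatically continues from where the previous one stopped.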
Suppose we have a file named data.csv with the following contents:
$ cat data.csv
1,1,1,1
1,1,1,1
1,1,1,1
1,1,1,1
The Slurm job script, saved in a file named test.job, is as follows:
#!/bin/bash
#SBATCH --job-name=abc # Change to your job name
#SBATCH --partition=compute1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00 # change to a correct time estimation
module load anaconda3
python test.py data.csv
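If you adopt the parametrized test.py sketched earlier, the same job script can be reused for every chunk. The variant below is an assumption rather than part of the original example; it relies on the fact that any arguments given after the script name on the sbatch command line (for example, sbatch test.job 2000) are passed to the batch script as $1, $2, and so on.

#!/bin/bash
#SBATCH --job-name=abc           # Change to your job name
#SBATCH --partition=compute1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00          # change to a correct time estimation

module load anaconda3
# $1 is the per-chunk iteration count; falls back to 2000 if not given
python test.py data.csv "${1:-2000}"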
You can submit the job from a login node. The job ID printed on the screen is used when submitting the next job, which will start only after this first job has completed:
$ sbatch test.job
Submitted batch job 10151
While the first job is still running, you can submit the second job and make it dependent upon the completion of the first:
$ sbatch --dependency=afterok:10151 test.job
Submitted batch job 10152
"--dependency=afterok:10151" means that the current job with ID 10152 will not run until job 10151 has been completed. It will show PD (Pending) if you check the job status with the following command:
$ squeue -u abc123
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             10152  compute1      abc   abc123 PD       0:00      1 (Dependency)
             10151  compute1      abc   abc123  R       0:08      1 c036
The third job can be submitted with a dependency on the job with ID 10152:
$ sbatch --dependency=afterok:10152 test.job
Submitted batch job 10153
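If you split the work into more than a few chunks, building the dependency chain by hand becomes tedious. The short shell sketch below is hypothetical and not part of the original example; it uses sbatch's --parsable option, which prints just the job ID (on multi-cluster setups it may append the cluster name after a semicolon), to chain an arbitrary number of jobs:

#!/bin/bash
# Submit NCHUNKS copies of test.job, each one dependent on the previous.
NCHUNKS=3
jobid=$(sbatch --parsable test.job)
echo "Submitted chunk 1 as job $jobid"
for i in $(seq 2 $NCHUNKS); do
    jobid=$(sbatch --parsable --dependency=afterok:$jobid test.job)
    echo "Submitted chunk $i as job $jobid (runs after the previous chunk succeeds)"
done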
The final result of the above example after all three jobs have completed is shown below:
$ cat data.csv
6.001000000000000000e+03,6.001000000000000000e+03,6.001000000000000000e+03,6.001000000000000000e+03
6.001000000000000000e+03,6.001000000000000000e+03,6.001000000000000000e+03,6.001000000000000000e+03
6.001000000000000000e+03,6.001000000000000000e+03,6.001000000000000000e+03,6.001000000000000000e+03
6.001000000000000000e+03,6.001000000000000000e+03,6.001000000000000000e+03,6.001000000000000000e+03
This is the same result you would get by running all 6000 iterations in a single job.