Here is a simple example of a checkpointing program run with a Slurm job script that automatically generates a restart script, so the job can be resumed from its checkpoint when its wall time expires. The program is a small Python script that increments a counter and stores a checkpoint of the current value in a JSON file.
#!/usr/bin/env python3
import time
import json
import argparse

checkpoint_file = "checkpoint.json"

def counting_sheep(c):
    while True:
        print("%i sheep" % c['sheep'])
        time.sleep(5)
        c['sheep'] += 1
        # Save the current count so a restarted run can pick up from here
        with open(checkpoint_file, 'w') as f:
            json.dump(c, f)
        if c['sheep'] >= 500:
            break
    print("Wake up!")

#----- execution code goes after here -----------
parser = argparse.ArgumentParser()
parser.add_argument('-c', '--checkpoint', type=str,
                    help="Path to my checkpoint file to restart")
args = parser.parse_args()
my_checkpoint_file = args.checkpoint

if my_checkpoint_file is not None:
    # Resume from the values stored in the checkpoint file
    with open(my_checkpoint_file, 'r') as f:
        count = json.load(f)
else:
    # No checkpoint supplied, start counting from scratch
    count = {'sheep': 0}

counting_sheep(count)
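One caveat with rewriting the checkpoint in place: if the job is killed exactly while checkpoint.json is being written (for example, when the wall time expires), the file can be left truncated. A common safeguard, sketched here as an optional variation rather than part of the example above, is to write to a temporary file and then rename it over the real checkpoint, which is an atomic operation on POSIX filesystems:

import json
import os

def save_checkpoint(c, checkpoint_file="checkpoint.json"):
    # Write to a temporary file first, then atomically swap it into place,
    # so an interrupted write never leaves a half-written checkpoint behind.
    tmp_file = checkpoint_file + ".tmp"
    with open(tmp_file, 'w') as f:
        json.dump(c, f)
    os.replace(tmp_file, checkpoint_file)

The loop above could call save_checkpoint(c) in place of the direct json.dump.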
The program is run directly from the command line like this:
python counting_sheep.py
If the program is terminated, it can be restarted by passing the checkpoint JSON file as an argument so it continues where it left off.
python counting_sheep.py -c checkpoint.json
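While it runs, the program rewrites checkpoint.json on every iteration, so the file always holds the most recent count. After 42 iterations, for example, it would contain something like:

{"sheep": 42}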
The Slurm script that auto-generates the restart script looks like this:
#!/bin/bash
#SBATCH --job-name=counting_sheep
#SBATCH --error=logs/slurm-%j.err # Error File
#SBATCH --output=logs/slurm-%j.out # Output File
#SBATCH --requeue
#SBATCH --open-mode=append
#SBATCH --partition=testing
#SBATCH --time=0-00:05:00 ### Wall clock time limit in Days-HH:MM:SS
CHECKPOINT_FILE="checkpoint.json"
CHECKPOINT_COMMAND="python counting_sheep.py -c $CHECKPOINT_FILE"
RESUBMIT_SCRIPT="resubmit_${SLURM_JOBID}_$SLURM_JOB_NAME.sh"
echo "This job id ${SLURM_JOBID}"
echo "Creating resubmit script"
echo "#!/bin/bash" > $RESUBMIT_SCRIPT
echo "#SBATCH --job-name=counting_sheep" >> $RESUBMIT_SCRIPT
echo "#SBATCH --error=logs/slurm-%j.err # Error File" >> $RESUBMIT_SCRIPT
echo "#SBATCH --output=logs/slurm-%j.out # Output File" >> $RESUBMIT_SCRIPT
echo "#SBATCH --requeue" >> $RESUBMIT_SCRIPT
echo "#SBATCH --open-mode=append" >> $RESUBMIT_SCRIPT
echo "#SBATCH --partition=testing" >> $RESUBMIT_SCRIPT
echo "#SBATCH --ntasks=1" >> $RESUBMIT_SCRIPT
echo "#SBATCH --dependency=afterany:${SLURM_JOBID}" >> $RESUBMIT_SCRIPT
echo "#SBATCH --time=0-00:05:00 ### Wall clock time limit in Days-HH:MM:SS" >> $RESUBMIT_SCRIPT
echo "" >> $RESUBMIT_SCRIPT
echo "$CHECKPOINT_COMMAND" >> $RESUBMIT_SCRIPT
echo "submitting restart script: $RESUBMIT_SCRIPT"
# Restart submitted here
sbatch $RESUBMIT_SCRIPT
# Run my initial script here.
python counting_sheep.py
As you can see, the script essentially copies most of its own Slurm directives into a new script, with the addition of the '--dependency' flag. That flag embeds the current job ID, so the restart job knows to wait until the current job has stopped before starting. For ease of testing, the wall time (--time flag) on this job and on the restart job has been set to 5 minutes. The restart command also passes the checkpoint file to the program. Submitting the script therefore automatically submits the restart script and instructs it to wait until the first job has finished. In the session below you can see the restart job pending, dependent on the first job finishing; an expanded example of the generated restart script follows the output.
[abc123@login01 checkpoint]$ sbatch test.sh
Submitted batch job 12514
[abc123@login01 checkpoint]$ squeue -u abc123
 JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
 12515   testing counting   abc123 PD  0:00     1 (Dependency)
 12514   testing counting   abc123  R  0:07     1 compute032
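For reference, with job ID 12514 from the session above, the generated restart script resubmit_12514_counting_sheep.sh would contain roughly the following (this is simply the output of the echo lines in the submission script, with the variables expanded):

#!/bin/bash
#SBATCH --job-name=counting_sheep
#SBATCH --error=logs/slurm-%j.err # Error File
#SBATCH --output=logs/slurm-%j.out # Output File
#SBATCH --requeue
#SBATCH --open-mode=append
#SBATCH --partition=testing
#SBATCH --ntasks=1
#SBATCH --dependency=afterany:12514
#SBATCH --time=0-00:05:00 ### Wall clock time limit in Days-HH:MM:SS

python counting_sheep.py -c checkpoint.json

Note that 'afterany' releases the restart job whenever the first job terminates, whether it hit its wall time or completed normally. If the work finishes within the first job and the restart is no longer needed, the queued job can be cancelled with scancel, for example scancel 12515 in the session above.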