Here's a simple example of a checkpointing program run with a Slurm job script that automatically generates a restart script, so the job can be resumed from its checkpoint when its wall time expires. The program is a small Python script that increments a counter and stores a checkpoint of the current value in a JSON file.
#!/usr/bin/env python3

import time
import json
import argparse

count = {}
checkpoint_file = "checkpoint.json"

def counting_sheep(c):

    while True:
        print("%i sheep" % c['sheep'])
        time.sleep(5)
        c['sheep'] += 1

        # Save the current count to the checkpoint file after every increment.
        with open(checkpoint_file, 'w') as f:
            json.dump(c, f)

        # Stop at 500 sheep (>= so a job restarted at exactly 500 also exits).
        if c['sheep'] >= 500:
            break

    print("Wake up!")


#----- execution code goes after here -----------
parser = argparse.ArgumentParser()

parser.add_argument('-c', '--checkpoint', type=str,
                    help="Path to my checkpoint file to restart")

args = parser.parse_args()

my_checkpoint_file = args.checkpoint

if my_checkpoint_file is not None:
    # Resume from the supplied checkpoint file.
    with open(my_checkpoint_file, 'r') as f:
        count = json.load(f)
else:
    # No checkpoint given: start counting from zero.
    count = {'sheep': 0}

counting_sheep(count)
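
One caveat with the pattern above (not handled in the example): if the job is killed while json.dump is part-way through writing, the checkpoint file can be left truncated and unreadable. A common hardening, sketched here with a hypothetical write_checkpoint helper that is not part of the program above, is to write to a temporary file first and then atomically rename it over the old checkpoint:

import json
import os

def write_checkpoint(state, checkpoint_file):
    # Write to a temporary file first so an interrupted write never
    # clobbers the last good checkpoint.
    tmp_file = checkpoint_file + ".tmp"
    with open(tmp_file, 'w') as f:
        json.dump(state, f)
    # os.replace() renames atomically on the same filesystem, so the
    # checkpoint on disk is always either the old or the new complete state.
    os.replace(tmp_file, checkpoint_file)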

The program is run directly from the command line like this:
python counting_sheep.py

If the program is terminated, it can be restarted by passing it the checkpoint JSON file as an argument so that it continues where it left off.
python counting_sheep.py -c checkpoint.json
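
The checkpoint file itself is just the counter dictionary serialised by json.dump; after, say, 42 iterations it would contain the following (the number shown is purely illustrative):
{"sheep": 42}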

The Slurm script that auto-generates the restart script looks like this:
#!/bin/bash                                                                                                                                                        
#SBATCH --job-name=counting_sheep                                                                                                                                  
#SBATCH --error=logs/slurm-%j.err # Error File                                                                                                                     
#SBATCH --output=logs/slurm-%j.out # Output File                                                                                                                   
#SBATCH --requeue                                                                                                                                                  
#SBATCH --open-mode=append                                                                                                                                         
#SBATCH --partition=testing                                                                                                                                        
#SBATCH --time=0-00:05:00 ### Wall clock time limit in Days-HH:MM:SS                                                                                               

CHECKPOINT_FILE="checkpoint.json"
CHECKPOINT_COMMAND="python counting_sheep.py -c $CHECKPOINT_FILE"
RESUBMIT_SCRIPT="resubmit_${SLURM_JOBID}_$SLURM_JOB_NAME.sh"


echo "This job id ${SLURM_JOBID}"
echo "Creating resubmit script"
echo "#!/bin/bash" > $RESUBMIT_SCRIPT
echo "#SBATCH --job-name=counting_sheep" >> $RESUBMIT_SCRIPT
echo "#SBATCH --error=logs/slurm-%j.err # Error File" >> $RESUBMIT_SCRIPT
echo "#SBATCH --output=logs/slurm-%j.out # Output File" >> $RESUBMIT_SCRIPT
echo "#SBATCH --requeue" >> $RESUBMIT_SCRIPT
echo "#SBATCH --open-mode=append" >> $RESUBMIT_SCRIPT
echo "#SBATCH --partition=testing" >> $RESUBMIT_SCRIPT
echo "#SBATCH --ntasks=1" >> $RESUBMIT_SCRIPT echo "#SBATCH --dependency=afterany:${SLURM_JOBID}" >> $RESUBMIT_SCRIPT echo "#SBATCH --time=0-00:05:00 ### Wall clock time limit in Days-HH:MM:SS" >> $RESUBMIT_SCRIPT echo "" >> $RESUBMIT_SCRIPT echo "$CHECKPOINT_COMMAND" >> $RESUBMIT_SCRIPT echo "submitting restart script: $RESUBMIT_SCRIPT" # Restart submitted here sbatch $RESUBMIT_SCRIPT # Run my initial script here. python counting_sheep.py

As you can see, the job script copies most of its own #SBATCH directives into a new script, adding a '--dependency=afterany' flag that references the current job id. This tells Slurm that the restart job must wait until the current job has stopped before it can start. For ease of testing, the wall time (--time flag) on both this job and the restart job has been set to 5 minutes. The restart command also passes the checkpoint file to the program. Submitting the script therefore automatically submits the restart script and holds it until the first job has finished; in the queue you can see it waiting on the dependency:
[abc123@login01 checkpoint]$ sbatch test.sh
Submitted batch job 12514
[abc123@login01 checkpoint]$ squeue -u abc123
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             12515   testing counting   abc123 PD       0:00      1 (Dependency)
             12514   testing counting   abc123  R       0:07      1 compute032
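
Note that --dependency=afterany releases the restart job whenever the first job terminates, including when it finishes normally within its wall time. If no restart is actually needed, the queued resubmit job can simply be cancelled (using the job id from the example above):
scancel 12515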
