This is a simple example of a program that checkpoints itself using Python and the pickle module. A full run takes 15 minutes: the counter goes up to 180, one step every 5 seconds. On startup the script looks for a file called "counting_sheep.pickle". If it finds the file, it reads it in automatically and resumes; otherwise it starts a new run.

Here we are checkpointing only one variable, self.limit, which is an instance attribute of a class called CountingSheep. You can checkpoint any number of variables this way, so long as they are attributes of the object and can be pickled.

At the tail end of the script, past the comment that says 'execution code goes after here', is where we use the class to create the object (allocate memory, data structures, etc.) and run the code. This is also where the checkpoint file is detected and read in if it is present. The "work" method is called run(self), and any work code can go there. For simplicity's sake, the only code it runs is a loop that increments self.limit once every 5 seconds and writes the checkpoint after each increment. You can test this by running the program like this:
python counting_sheep.py
Run it from the directory that contains the script and watch the output. After it has incremented a few times, hit CTRL+C to kill the program. Then start the program again as before and you'll see that the count picks up where it left off. If you don't want it to continue, delete the counting_sheep.pickle file before starting it, and it will start again from 0.
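To illustrate the point that more than one variable can be checkpointed, here is a minimal sketch that is separate from the script itself. The class name `Flock` and its attributes are hypothetical and exist only for this illustration; the only library calls used are `pickle.dumps` and `pickle.loads`.

```python
import pickle


class Flock:
    """Hypothetical class with several attributes to checkpoint."""

    def __init__(self):
        self.count = 0   # simple counter
        self.names = []  # list state
        self.stats = {}  # dict state


flock = Flock()
flock.count = 3
flock.names = ["Dolly", "Shaun", "Bo"]
flock.stats = {"black": 1, "white": 2}

# Serialize the whole object to bytes, then rebuild a copy from those bytes.
blob = pickle.dumps(flock)
restored = pickle.loads(blob)

print(restored.count, restored.names, restored.stats)
```

Every attribute stored on the object comes back in the restored copy. Attributes that pickle cannot serialize, such as open file handles or sockets, would need to be dropped before saving and recreated after loading.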
#!/bin/python
import time
import os
import pickle
import sys
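checkpoint_file = "counting_sheep.pickle"

class CountingSheep:
    def __init__(self, checkpoint_file):
        self.checkpoint = checkpoint_file  # where this object saves itself
        self.limit = 0                     # number of sheep counted so far

    def run(self):
        # Count up to 180 sheep, one every 5 seconds (15 minutes in total).
        # Testing the limit in the loop condition means a restart from a
        # finished checkpoint exits right away instead of counting past 180.
        while self.limit < 180:
            print("%i sheep" % self.limit)
            sys.stdout.flush()
            time.sleep(5)
            self.limit += 1
            # Save the whole object after every increment.
            with open(self.checkpoint, 'wb') as f:
                pickle.dump(self, f)
        print("Wake up!")

#----- execution code goes after here -----------
if os.path.exists(checkpoint_file):
    # A checkpoint exists: restore the saved object and resume.
    with open(checkpoint_file, 'rb') as f:
        cs = pickle.load(f)
else:
    # No checkpoint: start counting from 0.
    cs = CountingSheep(checkpoint_file)

# now we run it
cs.run()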
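One possible refinement, which is not part of the original script: if the program is killed while `pickle.dump` is in the middle of writing, the checkpoint file can be left truncated and unreadable on the next start. A common way around this is to write to a temporary file first and then rename it over the real one with `os.replace`, which is atomic on POSIX systems when both paths are on the same filesystem. The helper names `save_checkpoint` and `load_checkpoint` below are hypothetical, and this is only a sketch under those assumptions.

```python
import os
import pickle
import tempfile


def save_checkpoint(obj, path):
    """Write obj to path so that path never holds a half-written pickle."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swap the finished file into place
    except BaseException:
        # Something went wrong (including CTRL+C): remove the temp file, re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default=None):
    """Return the unpickled object, or default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)
```

In the script above, the two lines in run() that open the checkpoint file and call `pickle.dump` could then be replaced by a single call such as `save_checkpoint(self, self.checkpoint)`.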
The Slurm job script for this example is shown below.
#!/bin/bash
#SBATCH --job-name=counting_sheep
#SBATCH --error=slurm-%j.err
#SBATCH --output=slurm-%j.out
#SBATCH --requeue
#SBATCH --partition=compute1
#SBATCH --checkpoint=1
#SBATCH --checkpoint-dir=check
scontrol show job $SLURM_JOBID > info.$SLURM_JOBID
python counting_sheep.py
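Because the job script asks for --requeue, the same script may be started more than once under the same job ID. As a rough sketch, the Python side could log whether it is on its first start or a later one by reading the environment variables Slurm sets for batch jobs. This assumes the site's Slurm exports `SLURM_JOB_ID` and `SLURM_RESTART_COUNT` in the usual way; the function name `describe_run` is hypothetical.

```python
import os


def describe_run(env=None):
    """Summarize whether this is a first start or a requeued run."""
    if env is None:
        env = os.environ  # default to the real process environment
    job_id = env.get("SLURM_JOB_ID", "unknown")
    # Assumption: SLURM_RESTART_COUNT is absent until the job has been restarted.
    restarts = int(env.get("SLURM_RESTART_COUNT", "0"))
    if restarts == 0:
        return "job %s: first start" % job_id
    return "job %s: restart %d" % (job_id, restarts)


print(describe_run())
```

Printing this line at the top of counting_sheep.py would put a marker in the slurm-%j.out file each time the job starts, which makes it easier to match the sheep count in the output against requeue events.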