This is a simple example of a program that checkpoints its state using Python and the pickle module. It will run for about 15 minutes.

The script checks for a file called "counting_sheep.pickle". If it finds the file, it reads it in automatically; otherwise it starts a new execution.

Here we are checkpointing only one variable, self.limit, which is an instance attribute of a class called CountingSheep. You can checkpoint any number of variables, so long as they are members of the class, because the whole object is written to the pickle file.
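As a minimal sketch of that point, the example below pickles an object with several attributes and restores all of them in one step. The class name Flock and its attributes are hypothetical and only serve as an illustration; they are not part of the counting sheep script.

```python
import pickle


class Flock:
    """Hypothetical class with several attributes to checkpoint."""

    def __init__(self):
        self.count = 0
        self.names = []
        self.settings = {"interval": 5}


flock = Flock()
flock.count = 3
flock.names.append("Dolly")

# Serialize the whole object to bytes, then rebuild it from those bytes.
blob = pickle.dumps(flock)
restored = pickle.loads(blob)

# Every instance attribute comes back with the object.
print(restored.count, restored.names, restored.settings)
```

Anything stored on the object this way is saved and restored together, provided each attribute is itself picklable (open files and sockets, for example, are not).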

At the tail end, past the comment that says 'execution code goes after here', is where we use our class to create the object (allocate memory, data structures, etc.) and run the code. If the checkpoint file is present, it is detected and read in here; otherwise a new object is created.

The "work" method is called run(self), and any work code can go in here. For simplicity's sake, the only code running is a loop that increments self.limit once every 5 seconds and writes a checkpoint after each increment.
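The 15 minute run time follows from the loop: 180 increments at 5 seconds each is 900 seconds. A small sketch of that arithmetic, which also estimates how long a resumed run still has to go. The function name remaining_seconds is hypothetical, and the result is approximate because it ignores the time spent printing and writing the checkpoint.

```python
def remaining_seconds(count, limit=180, interval=5):
    """Approximate seconds left when the loop resumes at `count`."""
    return max(limit - count, 0) * interval


print(remaining_seconds(0))    # fresh start: 900 seconds (15 minutes)
print(remaining_seconds(120))  # resumed at 120 sheep: 300 seconds (5 minutes)
```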

You can test this by running the program like this.


Do this from the same directory as the script and watch the output. Then press CTRL+C to kill the program after it has incremented a few times.

Then start the program again as before and you'll see that the count picks up where it left off. If you don't want it to continue, delete the counting_sheep.pickle file and start the program; the count will begin again from 0.
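One caveat with killing the program: if the interrupt arrives while the checkpoint is being written, the file can end up truncated and fail to load on the next start. A common way to guard against this, shown here only as an optional sketch and not as part of the original script, is to write to a temporary file and then rename it over the real checkpoint. The helper name save_checkpoint_atomically is hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint_atomically(obj, path):
    """Write obj to path, keeping the old file intact until the new one is complete."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())
        # Swap the finished file into place in a single rename.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything went wrong, then re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Usage: save a small state dictionary in a scratch directory and read it back.
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "counting_sheep.pickle")
    save_checkpoint_atomically({"limit": 42}, path)
    with open(path, "rb") as f:
        print(pickle.load(f))
```

In the counting sheep script, a helper like this could take the place of the plain open and pickle.dump calls inside run.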


import time
import os
import pickle

checkpoint_file = "counting_sheep.pickle"

class CountingSheep:

    def __init__(self, checkpoint_file):
        self.checkpoint = checkpoint_file            
        self.limit = 0

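    def run(self):

        while True:
            print("%i sheep" % self.limit)
            self.limit += 1

            # checkpoint the whole object after every increment
            with open(self.checkpoint, 'wb') as f:
                pickle.dump(self, f)

            # stop after 180 increments (about 15 minutes at 5 seconds each)
            if self.limit == 180:
                break

            # wait 5 seconds before counting the next sheep
            time.sleep(5)

        print("Wake up!")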

#----- execution code goes after here -----------

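if os.path.exists(checkpoint_file):
    # a checkpoint exists, so restore the saved object
    with open(checkpoint_file, 'rb') as f:
        cs = pickle.load(f)
else:
    # no checkpoint found, so start a new count
    cs = CountingSheep(checkpoint_file)

# now we run it
cs.run()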

The Slurm job script for this example is shown here.

#!/bin/bash
#SBATCH --job-name=counting_sheep
#SBATCH --error=slurm-%j.err
#SBATCH --output=slurm-%j.out
#SBATCH --requeue
#SBATCH --partition=testing
#SBATCH --checkpoint=1
#SBATCH --checkpoint-dir=check

scontrol show job $SLURM_JOBID > info.$SLURM_JOBID


-- Mando - 12 Jul 2020
Topic revision: r2 - 13 Jul 2020, AdminUser