This is a simple example of a program that checkpoints using Python and the pickle module. It runs for 15 minutes (180 increments, one every 5 seconds). The script checks for a file called "counting_sheep.pickle"; if it finds the file it reads it in automatically, otherwise it starts a new execution.

Here we are checkpointing only one variable, self.limit, which is an attribute of a class called CountingSheep. Any number of variables can be checkpointed this way, so long as they are attributes of the object. At the tail end of the script, past the comment that says 'execution code goes after here', is where we use the class to create the object (allocate memory, data structures, etc.) and run the code. If the checkpoint file is present, it is detected and read in at that point; otherwise a new object is created.

The "work" method is called run(self), and any work code can go there. For simplicity's sake, the only code running is a loop that increments self.limit once every 5 seconds and then saves the object. You can test this by running the program like this:
python counting_sheep.py

Run it from the directory that contains counting_sheep.py and watch the output. After the count has incremented a few times, press CTRL+C to kill the program. Start the program again the same way and you'll see that the count picks up where it left off. If you don't want it to continue, delete the counting_sheep.pickle file before starting the program, and the count will start again from 0.
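The description above says that any number of variables can be checkpointed as long as they belong to the object. The following is a minimal sketch of that idea, separate from the script itself. The names SheepState, save, load_or_new and demo are hypothetical and exist only for this illustration; the extra attributes (names, settings) are assumptions, not part of the original program.

```python
import os
import pickle
import tempfile


class SheepState:
    """Hypothetical stand-in for CountingSheep with several attributes."""

    def __init__(self):
        self.limit = 0
        self.names = []
        self.settings = {"interval": 5}


def save(obj, path):
    # pickle.dump stores every attribute of the instance
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_or_new(path):
    # Same pattern as the script: restore if a checkpoint exists, else start new
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return SheepState()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.pickle")
        first = load_or_new(path)    # no file yet, so this is a fresh object
        first.limit = 3
        first.names.append("Dolly")
        save(first, path)
        second = load_or_new(path)   # file exists, so the state is restored
        return second.limit, second.names, second.settings
```

Calling demo() saves an object with three attributes and reads it back, and all three come back with the values they had when saved. Attributes that pickle cannot serialize (open files, sockets, and similar) would need separate handling.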

#!/bin/python

import time
import os
import pickle
import sys

checkpoint_file = "counting_sheep.pickle"


class CountingSheep:
    def __init__(self, checkpoint_file):
        self.checkpoint = checkpoint_file
        self.limit = 0

    def run(self):
        while True:
            print("%i sheep" % self.limit)
            sys.stdout.flush()
            time.sleep(5)
            self.limit += 1
            # Checkpoint: save the whole object after every increment
            with open(self.checkpoint, 'wb') as f:
                pickle.dump(self, f)
            # >= so that restarting from a finished checkpoint still stops
            if self.limit >= 180:
                break
        print("Wake up!")


#----- execution code goes after here -----------
if os.path.exists(checkpoint_file):
    # A checkpoint exists: restore the saved object
    with open(checkpoint_file, 'rb') as f:
        cs = pickle.load(f)
else:
    cs = CountingSheep(checkpoint_file)

# now we run it
cs.run()
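One thing to be aware of: if the program is killed while pickle.dump is still writing, the checkpoint file can be left incomplete and the next start may fail to load it. A common way to reduce that risk is to write to a temporary file and then rename it over the real one. The sketch below assumes a POSIX-style filesystem where os.replace within one directory is atomic; save_checkpoint_atomically is a hypothetical helper, not part of the original script.

```python
import os
import pickle
import tempfile


def save_checkpoint_atomically(obj, path):
    """Pickle obj to a temporary file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())   # push the data to disk before renaming
        os.replace(tmp_path, path)  # old checkpoint stays valid until this step
    except BaseException:
        # Also covers KeyboardInterrupt (CTRL+C): remove the partial file
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In the script, the open/pickle.dump pair inside run() could be swapped for a call such as save_checkpoint_atomically(self, self.checkpoint). The previous checkpoint is only replaced once the new one has been fully written.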

The Slurm job script for this example is shown here.

#!/bin/bash
#SBATCH --job-name=counting_sheep
#SBATCH --error=slurm-%j.err
#SBATCH --output=slurm-%j.out
#SBATCH --requeue
#SBATCH --partition=compute1
#SBATCH --checkpoint=1

#SBATCH --checkpoint-dir=check

scontrol show job $SLURM_JOBID > info.$SLURM_JOBID

python counting_sheep.py

Topic revision: 17 Sep 2021, AdminUser