Checkpointing is the process of saving the execution state of an application so that the saved state can be used to continue execution at a later time. Typically, the execution state is written to a file. Restart is the complementary step: the application reads the saved state and resumes from that point instead of starting over.
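As an illustration of the idea, here is a minimal Python sketch of a loop that periodically writes its state to a file and picks up from that file when it is run again. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `stop_after` argument used to simulate an interruption are all assumptions made for this example and are not taken from the pages listed on this topic.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the checkpoint.

    The rename is atomic on the same filesystem, so an interruption during
    the write does not leave a half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or a copy of default if no checkpoint exists."""
    if not os.path.exists(path):
        return dict(default)
    with open(path) as handle:
        return json.load(handle)


def run(path, total_steps, checkpoint_every=10, stop_after=None):
    """Sum the integers 0..total_steps-1, checkpointing every few steps.

    stop_after is a hypothetical knob that simulates a failure: the function
    returns early WITHOUT saving, so progress since the last checkpoint is lost.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulated interruption
        state["total"] += state["step"]  # the "work" of one step
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final state once all steps are done
    return state
```

Calling `run` a second time with the same checkpoint path is the restart: it repeats only the steps taken after the last saved checkpoint. A smaller `checkpoint_every` loses less work on failure at the cost of more frequent writes.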

Checkpointing saves time by allowing an application to resume after a hardware failure rather than rerun from the beginning. It also helps in working around the time limits associated with the different job queues/partitions: a long computation can be split across several jobs, each resuming from the checkpoint written by the previous one. The following are some of the approaches by which an application can be made to write checkpoints:
Checkpoint-and-Restart for a C Program
Checkpoint-and-Restart for Python Using a Class
Checkpoint-and-Restart for Python Script Using Slurm Dependent Job
Checkpoint-and-Restart for R Script Using Slurm Dependent Job
Restart Script Generation Example
Checkpointing-and-Restart for Deep Learning Models with TensorFlow
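
The time-limit use case mentioned above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method used in any of the approaches listed: the function `run_within_budget`, its parameters, and the JSON state layout are hypothetical. Each invocation works until a self-imposed time budget (chosen to be shorter than the queue's limit) is spent, saves its state, and reports whether another run is needed; a follow-up job, such as a Slurm dependent job, would then call it again with the same checkpoint file.

```python
import json
import os
import time


def run_within_budget(checkpoint, total_items, budget_seconds, work,
                      clock=time.monotonic):
    """Process items until all are done or the time budget is used up.

    Returns True when every item has been processed, and False when the
    budget ran out and a later run must continue from the checkpoint.
    The clock argument is injectable only to make the sketch easy to test.
    """
    start = clock()

    # Restart: load earlier progress if a checkpoint file already exists.
    if os.path.exists(checkpoint):
        with open(checkpoint) as handle:
            state = json.load(handle)
    else:
        state = {"next": 0, "results": []}

    while state["next"] < total_items:
        if clock() - start >= budget_seconds:
            break  # stop early, leaving time to write the checkpoint
        state["results"].append(work(state["next"]))
        state["next"] += 1

    # Checkpoint: write to a temporary name, then rename atomically.
    tmp_path = checkpoint + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, checkpoint)

    return state["next"] >= total_items
```

A job script could exit with one status when the function returns True and a different one when it returns False, so that the scheduler or a wrapper script can decide whether to submit a continuation. The budget should leave enough margin before the real limit for the checkpoint to be written.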

For further guidance on implementing checkpointing-and-restart in your code, please contact the Research Computing Support Group.

Topic revision: 20 Oct 2021, AdminUser