Application Checkpointing and Restart on ARC

Checkpointing is the process of saving the execution state of an application so that the saved state can be used to continue the execution at a later time. Typically, the execution state is written to a file. Restart is the complementary step: the application reads the saved state and resumes from that point instead of starting over.
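The checkpoint-then-restart cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended implementation for any particular application: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `stop_after` parameter used to simulate an interruption are all assumptions made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over `path`.

    The rename means a crash during the write cannot leave a
    half-written checkpoint under the final name.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing after every step.

    `stop_after` (hypothetical, for demonstration only) ends the run
    early after that many steps to imitate an interrupted job.
    """
    state = load_checkpoint(path)  # restart: resume from saved state
    done_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulated interruption
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        save_checkpoint(path, state)  # checkpoint: persist progress
    return state
```

Calling `run("state.json", 10, stop_after=4)` and then `run("state.json", 10)` completes the same computation as a single uninterrupted run, because the second call starts from the step recorded in the file. In a real application, checkpointing after every step is usually too expensive, and the interval would be chosen to balance I/O cost against the amount of work lost on failure.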

Checkpointing saves time by letting an application resume after a hardware failure rather than start from the beginning. It also helps overcome the time limits associated with the different job queues/partitions: a computation that needs more time than a single job allows can be split across several jobs, each continuing from the checkpoint written by the previous one. The examples listed on this page show some of the ways in which an application can be made to write checkpoints.
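One way to work within a job time limit is for the application to track its own elapsed time and write a checkpoint shortly before the limit is reached. The sketch below assumes the time budget is passed in by the caller; the function name `run_until_deadline`, the `margin_s` safety margin, and the injectable `clock` argument are hypothetical names chosen for this illustration and are not part of any scheduler or library API.

```python
import json
import time


def run_until_deadline(path, state, n_steps, budget_s,
                       margin_s=60.0, clock=time.monotonic):
    """Advance `state` until it is done or the time budget is nearly spent.

    The state is written to `path` before returning in either case.
    Returns True if all steps finished, False if the run stopped early
    and should be continued by a later job.
    """
    # Stop `margin_s` seconds before the budget ends so there is
    # time left to write the checkpoint.
    deadline = clock() + budget_s - margin_s
    finished = True
    while state["step"] < n_steps:
        if clock() >= deadline:
            finished = False
            break
        state["total"] += state["step"]  # placeholder for real work
        state["step"] += 1
    with open(path, "w") as f:
        json.dump(state, f)
    return finished
```

A follow-up job could load the JSON file and call the function again with the loaded state. The return value tells the calling script whether another job is needed, for example by mapping it to the process exit code.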

For further guidance on implementing checkpointing-and-restart in your code, please contact the Research Computing Support Group.

Topic revision: 20 Oct 2021, AdminUser


Checkpoint-and-Restart for a C Program

Checkpoint-and-Restart for Python Using a Class

Checkpoint-and-Restart for Python Script Using Slurm Dependent Job

Checkpoint-and-Restart for R Script Using Slurm Dependent Job

Restart Script Generation Example

Checkpointing-and-Restart for Deep Learning Models with TensorFlow