Checkpoint and Restart

Checkpointing saves the state of a running process to a checkpoint image file. Restart resumes the checkpointed application from that saved state.

Besides letting a job resume after a system failure without losing much CPU time, checkpointing also enables process migration, process replication, extended sessions, debugging, and fast startup.

Checkpoint and Restart with DMTCP

DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing tool for distributed applications. It is available on Shamu through the module dmtcp/2.6.0. The tool is still being tested on Shamu. It currently works for sequential and multithreaded jobs, both in interactive sessions and in batch jobs. It does not yet work for MPI-based applications.
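As a rough illustration of the basic workflow, the sketch below builds the two command lines most users start with: dmtcp_launch to run a program under checkpoint control, and dmtcp_restart to resume from checkpoint image files. The helper names launch_command and restart_command, the program ./my_app, its argument, and the image file name are hypothetical placeholders, and the sketch assumes the DMTCP module has already been loaded in the shell. It only prints the commands and does not run them.

```python
def launch_command(program_argv, interval_seconds=None):
    """Build a dmtcp_launch command line for an ordinary (non-MPI) program."""
    cmd = ["dmtcp_launch"]
    if interval_seconds is not None:
        # --interval asks DMTCP to checkpoint automatically every N seconds.
        cmd += ["--interval", str(interval_seconds)]
    return cmd + list(program_argv)


def restart_command(image_files):
    """Build a dmtcp_restart command line from checkpoint image files."""
    return ["dmtcp_restart"] + list(image_files)


if __name__ == "__main__":
    # "./my_app", "input.dat", and the image name are hypothetical placeholders.
    print(" ".join(launch_command(["./my_app", "input.dat"], interval_seconds=300)))
    print(" ".join(restart_command(["ckpt_my_app_example.dmtcp"])))
```

On a login or compute node, the printed commands would be typed after loading the module (for example with module load dmtcp/2.6.0). The interval value of 300 seconds is an arbitrary choice for the example.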

Checkpoint and Restart Sequential and Multithreaded Applications Interactively (Non-Batch)

Checkpoint and Restart Sequential and Multithreaded Batch Jobs

Embed DMTCP Checkpoint and Restart in C Code

Checkpoint and Restart without DMTCP

Many applications have a built-in checkpoint-and-restart feature. Users can also implement checkpoint-and-restart in their own code without a third-party system.
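A minimal sketch of self-defined checkpoint-and-restart, assuming the program's state fits in a small dictionary: the loop periodically writes its state to a file and, when started again, picks up from the last saved state instead of from zero. The function names, the state layout, and the stop_after parameter (used only to simulate an interruption) are illustrative choices, not part of any tool described on this page.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic on the same filesystem, so a crash during
    # saving never leaves a half-written checkpoint behind.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or default when no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every=100, stop_after=None):
    """Sum 0..total_steps-1, resuming from the checkpoint if one exists."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated interruption; unsaved work is lost
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "state.pkl")
        run(ckpt, 1000, stop_after=250)  # stops early; last save was at step 200
        print(run(ckpt, 1000))           # resumes from step 200 and finishes
```

The checkpoint interval is a trade-off: saving more often loses less work after a failure but spends more time on I/O.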

C Programming With Self-Defined Checkpoint-and-Restart

Checkpoint-and-Restart for Deep Learning Models with TensorFlow

Simple checkpoint and restart for Python using a class
Restart script generation example

-- Zhiwei - 08 Jul 2020
Topic revision: r9 - 14 Jul 2020, AdminUser