Checkpointing is the action of saving the state of a running process to a checkpoint image file. Restart is the action of resuming the checkpointed application from the saved state.
Checkpointing not only lets a job resume after a system failure without losing much valuable CPU time; it also enables process migration, process replication, extended sessions, debugging, and fast startup.
Checkpoint and Restart with DMTCP
DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing tool for distributed applications. DMTCP is available on Shamu via a module named dmtcp/2.6.0. We are still testing the tool on Shamu. It currently works for sequential and multi-threaded jobs, both in interactive sessions and in batch jobs. It does not work for MPI-based applications at this moment.
Many applications have a built-in checkpoint-and-restart feature. Users can also implement their own checkpoint-and-restart logic in their code without a third-party system.
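As an illustration of user-defined checkpoint-and-restart, the following is a minimal Python sketch, not a recommended or official implementation. It assumes the application state fits in a small dictionary that can be pickled. The function names (save_checkpoint, load_checkpoint, run), the state layout, and the stop_at parameter used to simulate an interruption are all hypothetical and chosen only for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: dump to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace is atomic on the same filesystem, so a crash during the
    # write never leaves a half-written checkpoint at `path`.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, every=100, stop_at=None):
    """Sum the integers 0..n_steps-1, saving a checkpoint every `every` steps.

    `stop_at` is a hypothetical knob that simulates an interruption at that step.
    """
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    while state["step"] < n_steps:
        if stop_at is not None and state["step"] == stop_at:
            return None  # simulated failure: work since the last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling run("state.ckpt", 1000, stop_at=550) stops early and leaves a checkpoint taken at step 500. A later call to run("state.ckpt", 1000) picks up from that checkpoint and repeats only steps 500 to 549 instead of starting from zero. Writing to a temporary file and renaming it is one common way to avoid a corrupt checkpoint if the job is killed in the middle of a save.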