To checkpoint and restart an interactive job, follow the steps below:

- Log on to a compute node from the login node.

srun --pty bash

- Load the dmtcp module

module load dmtcp/2.6.0

- Start the DMTCP coordinator

dmtcp_coordinator -i 600 --exit-on-last --daemon

where 600 means a checkpoint is taken every 600 seconds. You can change it to a number that suits your application. The --daemon option tells the coordinator to run in the background so that you do not need to start another shell. The --exit-on-last option tells the coordinator to exit when the application exits.

- Launch your application with dmtcp_launch

dmtcp_launch -i 600 your-app

Your application will save a checkpoint image every 600 seconds; change the value to one that suits your application. If something goes wrong before the application finishes its work, you can resume execution from the last checkpoint.
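The steps above can be collected into a small shell function. This is only a sketch: run_with_checkpoints is a hypothetical helper name (not part of DMTCP), and it assumes you are already on a compute node and that the module name and options are the ones shown on this page.

```shell
#!/bin/bash
# Sketch: start a DMTCP coordinator, then launch an application under it.
# run_with_checkpoints is a hypothetical helper, not a DMTCP command.
# Assumes it is called on a compute node (after `srun --pty bash`).
run_with_checkpoints() {
    local interval="$1"    # seconds between checkpoints
    shift                  # remaining arguments: the application and its options

    # Reject an empty or non-numeric interval before starting anything.
    case "$interval" in
        ''|*[!0-9]*)
            echo "interval must be a whole number of seconds" >&2
            return 1 ;;
    esac
    if [ "$#" -eq 0 ]; then
        echo "usage: run_with_checkpoints SECONDS your-app [args...]" >&2
        return 1
    fi

    module load dmtcp/2.6.0 || return 1
    # The coordinator runs in the background and exits with the application.
    dmtcp_coordinator -i "$interval" --exit-on-last --daemon || return 1
    dmtcp_launch -i "$interval" "$@"
}
```

For example, `run_with_checkpoints 600 your-app` would issue the same three commands as the steps above, using one interval value for both the coordinator and the launcher.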

To restart a checkpointed application, follow the steps below:

- Start the DMTCP coordinator with the same options you used when you first launched the application

dmtcp_coordinator -i 600 --exit-on-last --daemon

- Restart the application from the same directory where you ran it the first time. This is essential because all the checkpoint images and the restart script are saved there.

dmtcp_restart -i 600 ckp*
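The restart steps can be wrapped in a function that first checks that checkpoint images are actually present. This is a sketch under stated assumptions: restart_from_checkpoints is a hypothetical helper name, and the pattern ckpt_*.dmtcp is an assumption about how the checkpoint images are named (the command above uses the broader ckp*). Adjust the pattern if your files are named differently.

```shell
#!/bin/bash
# Sketch: restart a checkpointed application from its original run directory.
# restart_from_checkpoints is a hypothetical helper, not a DMTCP command.
# The body is wrapped in ( ... ) so the cd does not affect the calling shell.
restart_from_checkpoints() (
    dir="${1:-.}"          # directory where the application was first run
    interval="${2:-600}"   # seconds between checkpoints after the restart

    cd "$dir" || exit 1

    # Assumed naming of checkpoint images: ckpt_*.dmtcp
    set -- ckpt_*.dmtcp
    # If nothing matched, the pattern is left unexpanded and names no file.
    if [ ! -e "$1" ]; then
        echo "no checkpoint images found in $PWD" >&2
        exit 1
    fi

    dmtcp_coordinator -i "$interval" --exit-on-last --daemon || exit 1
    dmtcp_restart -i "$interval" "$@"
)
```

Calling `restart_from_checkpoints /path/to/run-dir 600` would then start the coordinator and restart from the images found in that directory, where /path/to/run-dir is a placeholder for your own run directory.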



-- Zhiwei -- 12 Jul 2020
Topic revision: r2 - 13 Jul 2020, AdminUser