To checkpoint and restart an interactive job, follow the steps below:
- Log on to a compute node from the login node.
srun --pty bash
- Load the dmtcp module
module load dmtcp/2.6.0
- Start the dmtcp coordinator
dmtcp_coordinator -i 600 --exit-on-last --daemon
where 600 means a checkpoint occurs every 600 seconds. You can change it to a number that suits your application. The --daemon option tells the coordinator to run in the background so that you do not need to start another shell. The --exit-on-last option tells the coordinator to exit when the application exits.
- Launch your application with dmtcp_launch
dmtcp_launch -i 600 your-app
The -i 600 option sets the checkpoint interval, so your application saves a checkpoint image every 600 seconds. If something goes wrong before the application finishes its work, you can resume execution from the last checkpoint.
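The launch sequence (load the module, start the coordinator, launch the application) can be collected into a small helper. This is a minimal sketch, not a site-provided script: INTERVAL, DRY_RUN, run and checkpoint_session are hypothetical names, your-app is the same placeholder used above, and the sketch assumes the module command is available in the shell that runs it. With DRY_RUN left at 1 it only prints the commands, so it can be tried safely.

```shell
#!/bin/bash
# Sketch of the checkpointing steps, meant to be run on a compute node.
# INTERVAL and DRY_RUN are illustrative variables, not part of DMTCP.
INTERVAL="${INTERVAL:-600}"   # checkpoint interval in seconds
DRY_RUN="${DRY_RUN:-1}"       # 1 = only print the commands, 0 = run them

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

checkpoint_session() {
    # Arguments: the application and its own arguments.
    run module load dmtcp/2.6.0
    run dmtcp_coordinator -i "$INTERVAL" --exit-on-last --daemon
    run dmtcp_launch -i "$INTERVAL" "$@"
}

# Dry run: prints the three commands without executing them.
checkpoint_session your-app
```

Setting DRY_RUN=0 makes the helper execute the commands instead of printing them only, and INTERVAL can be changed to suit the application.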
To restart a checkpointed application, follow the steps below:
- Start the dmtcp coordinator with the same options you used when you first launched the application
dmtcp_coordinator -i 600 --exit-on-last --daemon
- Restart the application from the same directory where you ran it the first time. This is essential because all the checkpoint images and the restart script are saved there.
dmtcp_restart -i 600 ckpt_*.dmtcp
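Before restarting, it can help to confirm that checkpoint images actually exist in the working directory. The sketch below assumes DMTCP's usual image naming (files matching ckpt_*.dmtcp); count_images and next_command are hypothetical helper names, and your-app is the placeholder used above. It only prints the command it would suggest and does not run DMTCP itself.

```shell
#!/bin/bash
# Sketch: suggest a restart when checkpoint images exist, a fresh launch otherwise.
# Assumes checkpoint images are named ckpt_*.dmtcp.

count_images() {
    # Print the number of checkpoint images directly inside directory $1
    # (default: the current directory).
    local dir="${1:-.}"
    find "$dir" -maxdepth 1 -type f -name 'ckpt_*.dmtcp' | wc -l | tr -d ' '
}

next_command() {
    # $1: working directory, $2: application name (placeholder).
    local dir="$1" app="$2"
    if [ "$(count_images "$dir")" -gt 0 ]; then
        echo "dmtcp_restart -i 600 ckpt_*.dmtcp"
    else
        echo "dmtcp_launch -i 600 $app"
    fi
}

# Print the suggested command for the current directory.
next_command . your-app
```

Run it from the directory where the application was first started, since that is where the images are written.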
-- Zhiwei -- 12 Jul 2020