
Checkpoint and Restart

Checkpointing is the action of saving the state of a running process to a checkpoint image file. Restart is the action of resuming the checkpointed application from the saved state.

Besides letting a job resume after a system failure without losing much valuable CPU time, checkpointing also enables process migration, process replication, extended sessions, debugging, and fast startup.

DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing tool for distributed applications. DMTCP is available on Shamu via a module named dmtcp/2.6.0. We are still testing the tool on Shamu. It works for sequential and multithreaded jobs, both in interactive sessions and in batch jobs. It does not currently work for MPI-based applications.

Checkpoint and Restart Sequential and Multithreaded Applications Interactively (Non-Batch)

To checkpoint and restart an interactive job, follow the steps below:

- Log on to a compute node from the login node:
srun --pty bash

- Load the dmtcp module:
module load dmtcp/2.6.0

- On the compute node, start the dmtcp coordinator as a daemon (running in the background). The coordinator is essential to the whole process:
dmtcp_coordinator -i 600 --exit-on-last --daemon

Here, -i 600 makes a checkpoint occur every 600 seconds; change it to an interval that suits your application. The --daemon option runs the coordinator in the background, so you do not need to start another shell. The --exit-on-last option tells the coordinator to exit when the application exits.
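
Because the interval is a bare number of seconds, it can help to validate it before starting the coordinator. The following is a minimal sketch, not part of DMTCP: start_coordinator_every is a hypothetical helper name, and it assumes the dmtcp module is already loaded so that dmtcp_coordinator is on the PATH.

```bash
#!/bin/bash
# Hypothetical helper (not part of DMTCP): start the coordinator with a
# validated checkpoint interval, given in whole seconds.
start_coordinator_every() {
    ckpt_interval=$1
    case "$ckpt_interval" in
        ''|*[!0-9]*)
            echo "usage: start_coordinator_every <seconds>" >&2
            return 2
            ;;
    esac
    # Same options as in the step above; only the interval changes.
    dmtcp_coordinator -i "$ckpt_interval" --exit-on-last --daemon
}

# Example (commented out so that sourcing this file starts nothing):
# start_coordinator_every 300    # checkpoint every 5 minutes
```

A shorter interval loses less work after a failure but spends more time writing checkpoint images, so the right value depends on how large your application's memory footprint is.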

- Launch your application with dmtcp_launch:
dmtcp_launch your-app

Your application will save a checkpoint image every 600 seconds. If something goes wrong before the application finishes its work, you can resume execution from the last checkpoint.
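
If you do not want to wait for the next periodic checkpoint, the coordinator can also be asked for one on demand with dmtcp_command, which ships with DMTCP. A minimal sketch, assuming the coordinator was started as above on the same node and on its default port; checkpoint_now is a hypothetical wrapper name:

```bash
#!/bin/bash
# Hypothetical wrapper: request an immediate checkpoint from the running
# coordinator, in addition to the periodic ones.
checkpoint_now() {
    if ! command -v dmtcp_command >/dev/null 2>&1; then
        echo "dmtcp_command not found; load the dmtcp module first" >&2
        return 127
    fi
    # --checkpoint asks the coordinator to checkpoint all connected processes.
    dmtcp_command --checkpoint
}

# Example, from another shell on the same compute node:
# checkpoint_now
```

Running dmtcp_command --status first is a quick way to see whether a coordinator is reachable at all.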

To restart a checkpointed application, follow the steps below:

- Start the dmtcp coordinator as you did when starting the application:
dmtcp_coordinator -i 600 --exit-on-last --daemon

- Restart the application from the same directory where you first ran it. This is essential because all the checkpoint images and the restart script are saved there:
./dmtcp_restart_script.sh
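
Because the restart only works where the restart script exists, a small guard avoids a confusing failure when you are in the wrong directory. This is a sketch under that assumption; restart_in_dir is a hypothetical helper, and the coordinator must already be running as in the previous step:

```bash
#!/bin/bash
# Hypothetical helper: restart from the given directory (default: current
# directory), but only if DMTCP left a restart script there.
restart_in_dir() {
    ckpt_dir=${1:-.}
    if [ ! -e "$ckpt_dir/dmtcp_restart_script.sh" ]; then
        echo "no dmtcp_restart_script.sh in $ckpt_dir; nothing to restart" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$ckpt_dir" && /bin/bash ./dmtcp_restart_script.sh )
}

# Example:
# restart_in_dir /path/to/run-directory
```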

Checkpoint and Restart Sequential and Multithreaded Batch Jobs

Here is a Slurm job script for submitting a job with checkpointing enabled:
#!/bin/bash
# Put your SLURM options here
#SBATCH --partition=defq # change to proper partition name or remove
#SBATCH --time=00:15:00 # put proper time of reservation here
#SBATCH --nodes=1 # number of nodes
##SBATCH --ntasks-per-node=4 # processes per node
##SBATCH --mem=24000 # memory resource
#SBATCH --job-name="dmtcp_job" # change to your job name
#SBATCH --output=dmtcp.out # change to proper file name or remove for defaults
# ? Any other batch options ?


export DMTCP_DL_PLUGIN=0


#----------------------------- Set up DMTCP environment for a job ------------#


start_coordinator()
{
    fname=dmtcp_command.$SLURM_JOBID
    h=`hostname`

    check_coordinator=`which dmtcp_coordinator`
    if [ -z "$check_coordinator" ]; then
        echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
        exit 1    # fail the job: nothing can be checkpointed without a coordinator
    fi

    # -p 0 lets the coordinator pick a free port, which it writes to $fname
    dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname "$@" 1>/dev/null 2>&1

    # Wait until the coordinator has written its port number
    while true; do
        if [ -f "$fname" ]; then
            p=`cat $fname`
            if [ -n "$p" ]; then
                # try to communicate ? dmtcp_command -p $p l
                break
            fi
        fi
    done

    # Create dmtcp_command wrapper for easy communication with coordinator
    p=`cat $fname`
    chmod +x $fname
    echo "#!/bin/bash" > $fname
    echo >> $fname
    echo "export PATH=$PATH" >> $fname
    echo "export DMTCP_COORD_HOST=$h" >> $fname
    echo "export DMTCP_COORD_PORT=$p" >> $fname
    echo "dmtcp_command \$@" >> $fname

    # Set up local environment for DMTCP
    export DMTCP_COORD_HOST=$h
    export DMTCP_COORD_PORT=$p
}


# changedir to workdir
cd $SLURM_SUBMIT_DIR


#----------------------------------- Set up job environment ------------------#


. /etc/profile.d/modules.sh
module load dmtcp/2.6.0
#load your other modules here




#------------------------------------- Launch application ---------------------#

################################################################################
# 1. Start DMTCP coordinator
################################################################################

start_coordinator -i 600 # -i 120 ... <put dmtcp coordinator options here>

dmtcp_launch --ckpt-signal 10 ./<app-binary>
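
Once the job is running, the dmtcp_command.<jobid> wrapper that start_coordinator writes into the submit directory can be used to talk to that job's coordinator. A minimal sketch of that workflow follows. The file name dmtcp_job.sh and the helper jobid_from_sbatch are hypothetical, and the sketch assumes that sbatch prints its usual "Submitted batch job <id>" line and that the compute node is reachable from where you run the wrapper:

```bash
#!/bin/bash
# Hypothetical helper: extract the job ID from sbatch's confirmation line,
# e.g. "Submitted batch job 12345" -> 12345.
jobid_from_sbatch() {
    echo "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Typical use (commented out; requires Slurm and a running job):
# out=$(sbatch dmtcp_job.sh)
# jobid=$(jobid_from_sbatch "$out")
# ./dmtcp_command."$jobid" --status       # is the coordinator up?
# ./dmtcp_command."$jobid" --checkpoint   # take a checkpoint now
```

The wrapper file only appears after the job has started and the coordinator has written its port, so it will not exist while the job is still pending in the queue.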

Here is a Slurm job script for restarting a checkpointed job:
#!/bin/bash
# Put your SLURM options here
#SBATCH --partition=gpu # change to proper partition name or remove
#SBATCH --time=00:15:00 # put proper time of reservation here
#SBATCH --nodes=1 # number of nodes
##SBATCH --ntasks-per-node=4 # processes per node
##SBATCH --mem=24000 # memory resource
#SBATCH --job-name="dmtcp_job" # change to your job name
#SBATCH --output=dmtcp.out # change to proper file name or remove for defaults
# ? Any other batch options ?

#SBATCH --ntasks=5

#----------------------------- Set up DMTCP environment for a job ------------#

###############################################################################
# Start DMTCP coordinator on the launching node. Free TCP port is automatically
# allocated. This function creates a dmtcp_command.$JOBID script, which serves
# as a wrapper around dmtcp_command. The script tunes dmtcp_command for the
# exact dmtcp_coordinator (its hostname and port). Instead of typing
# "dmtcp_command -h <coordinator hostname> -p <coordinator port> <command>",
# you just type "dmtcp_command.$JOBID <command>" and talk to the coordinator
# for JOBID job.
###############################################################################

start_coordinator()
{
    ############################################################
    # For debugging when launching a custom coordinator, uncomment
    # the following lines and provide the proper host and port for
    # the coordinator.
    ############################################################
    # export DMTCP_COORD_HOST=$h
    # export DMTCP_COORD_PORT=$p
    # return

    fname=dmtcp_command.$SLURM_JOBID
    h=`hostname`

    check_coordinator=`which dmtcp_coordinator`
    if [ -z "$check_coordinator" ]; then
        echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
        exit 1    # fail the job: nothing can be restarted without a coordinator
    fi

    # -p 0 lets the coordinator pick a free port, which it writes to $fname
    dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname "$@" 1>/dev/null 2>&1

    # Wait until the coordinator has written its port number
    while true; do
        if [ -f "$fname" ]; then
            p=`cat $fname`
            if [ -n "$p" ]; then
                # try to communicate ? dmtcp_command -p $p l
                break
            fi
        fi
    done

    # Create a dmtcp_command wrapper for easy communication with the coordinator.
    p=`cat $fname`
    chmod +x $fname
    echo "#!/bin/bash" > $fname
    echo >> $fname
    echo "export PATH=$PATH" >> $fname
    echo "export DMTCP_COORD_HOST=$h" >> $fname
    echo "export DMTCP_COORD_PORT=$p" >> $fname
    echo "dmtcp_command \$@" >> $fname

    # Set up local environment for DMTCP
    export DMTCP_COORD_HOST=$h
    export DMTCP_COORD_PORT=$p
}

#----------------------- Some routine steps and information output -------------------------#

###################################################################################
# Print out the SLURM job information. Remove this if you don't need it.
###################################################################################
echo "SLURM_JOBID=$SLURM_JOBID"
echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
echo "SLURM_NNODES=$SLURM_NNODES"
echo "SLURMTMPDIR=$SLURMTMPDIR"
echo "working directory = $SLURM_SUBMIT_DIR"

# changedir to workdir
cd $SLURM_SUBMIT_DIR

#----------------------------------- Set up job environment ------------------#


export PATH=/work/iqr224/dmtcp/bin:$PATH
export LD_LIBRARY_PATH=/work/iqr224/dmtcp/lib:$LD_LIBRARY_PATH

. /etc/profile.d/modules.sh
module load dmtcp/2.6.0
#load your other modules here


#------------------------------------- Launch application ---------------------#

################################################################################
# 1. Start DMTCP coordinator
################################################################################

start_coordinator -i 600 # -i 120 ... <put dmtcp coordinator options here>

################################################################################
# 2. Restart application
################################################################################

/bin/bash ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT
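
The submit and restart scripts differ mainly in their last command, so the two can be folded into one script that restarts when a restart script is present and launches fresh otherwise. A sketch of that last step, assuming start_coordinator has already exported DMTCP_COORD_HOST and DMTCP_COORD_PORT; launch_or_restart is a hypothetical name and ./your-app stands in for your binary:

```bash
#!/bin/bash
# Hypothetical helper: resume from a checkpoint if one exists in the
# current directory, otherwise start the application under DMTCP.
launch_or_restart() {
    if [ -e ./dmtcp_restart_script.sh ]; then
        /bin/bash ./dmtcp_restart_script.sh -h "$DMTCP_COORD_HOST" -p "$DMTCP_COORD_PORT"
    else
        dmtcp_launch --ckpt-signal 10 "$@"
    fi
}

# In the job script, after start_coordinator:
# launch_or_restart ./your-app
```

With this approach, checkpoint files left over from an earlier, unrelated run in the same directory would be picked up by mistake, so clear them out before starting a new calculation.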

Write a C/C++ Program with Embedded Checkpoint-and-Restart

In the previous examples, checkpointing is controlled by the coordinator: either periodically through the -i <seconds> option, or manually by typing 'c' in the coordinator's terminal when the coordinator runs in the foreground in a separate shell rather than as a daemon. Either way, a checkpoint can occur at any point during the execution of the application. In some cases, users may want to control exactly when checkpoints take place. They can do so by calling DMTCP routines from within the program. Here is an example of a C program with embedded checkpoint-and-restart:
#include <stdlib.h>
#include <assert.h>
#include <stdio.h>
#include <unistd.h>   /* sleep() */

/* Be sure to compile with -I<path>; see Makefile in this directory. */
#include "dmtcp.h"

#define INTS_PER_LOOP 5

// Prints the integers 0 to 99 to both the screen and the file out.txt
// at a rate of one integer per second.
// A checkpoint is requested after every INTS_PER_LOOP integers.

int main(int argc, char* argv[])
{
    unsigned long i = 0;
    int count = 0;
    int rr;

    int numCheckpoints, numRestarts;
    FILE *f;
    f = fopen("out.txt","w");
    if (f == NULL) {
        perror("out.txt");
        return 1;
    }
    while (i<100)
    {
        if(dmtcp_is_enabled()){
            dmtcp_get_local_status(&numCheckpoints, &numRestarts);
            printf("on iteration %d: this process has checkpointed"
                   " %d times and restarted %d times\n",
                   ++count, numCheckpoints, numRestarts);
        }else{
            printf("on iteration %d; DMTCP not enabled!\n", ++count);
        }
        do {
            printf("%lu ", i);      /* i is unsigned long, so use %lu */
            fflush(stdout);
            fprintf(f, "%lu\n", i);
            fflush(f);
            sleep(1);
            i++;
        } while (i % INTS_PER_LOOP != 0);
        printf("\n");
        // Checkpoint and print result
        if(dmtcp_is_enabled()){
            printf("\n");
            rr = dmtcp_checkpoint();
            if(rr == DMTCP_NOT_PRESENT)
                printf("***** Error, DMTCP not running; checkpoint skipped ***** \n");
            if(rr == DMTCP_AFTER_CHECKPOINT)
                printf("***** after checkpoint *****\n");
            if(rr == DMTCP_AFTER_RESTART)
                printf("***** after restart *****\n");
        }else{
            printf(" dmtcp disabled -- nevermind\n");
        }

    }
    fclose(f);
    return 0;
}

Here is the Makefile for compiling:
ifndef CC
CC=gcc
endif

your-program : your-program.c
	${CC} -fPIC ${CFLAGS} -I${DMTCP_HOME}/include your-program.c -o your-program

To run the above application with DMTCP:
dmtcp_coordinator --daemon --exit-on-last
dmtcp_launch ./your-program

The output looks like this:
on iteration 1: this process has checkpointed 0 times and restarted 0 times
0 1 2 3 4

***** after checkpoint *****
on iteration 2: this process has checkpointed 1 times and restarted 0 times
5 6 7 8 9

You can press Ctrl-C to terminate the execution and then restart it from the last checkpoint as follows:
dmtcp_coordinator --daemon --exit-on-last
dmtcp_restart ckpt*

The output looks like this:
on iteration 5: this process has checkpointed 3 times and restarted 1 times
20 21 22 23 24

***** after checkpoint *****
on iteration 6: this process has checkpointed 4 times and restarted 1 times
25 26 27 28 29
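
Since dmtcp_restart ckpt* relies on the shell finding checkpoint images in the current directory, it can be wrapped so that a missing image gives a clear message. A minimal sketch; restart_from_images is a hypothetical helper, and the ckpt_*.dmtcp pattern assumes DMTCP's default image naming:

```bash
#!/bin/bash
# Hypothetical helper: pass every checkpoint image in a directory
# (default: current directory) to dmtcp_restart, or fail clearly.
restart_from_images() {
    img_dir=${1:-.}
    # Replace the positional parameters with the matching image files.
    # If nothing matches, the pattern stays literal and the -e test fails.
    set -- "$img_dir"/ckpt_*.dmtcp
    if [ ! -e "$1" ]; then
        echo "no checkpoint images (ckpt_*.dmtcp) in $img_dir" >&2
        return 1
    fi
    dmtcp_restart "$@"
}

# Example, after starting the coordinator:
# restart_from_images .
```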

-- Zhiwei - 08 Jul 2020
Topic revision: r1 - 12 Jul 2020, AdminUser