Checkpoint-and-Restart for R

If your R script is expected to run beyond the 72-hour limit on Arc, we suggest implementing a checkpoint-and-restart mechanism in it. The script saves its execution state to a file (writes a checkpoint), so that a later run can resume from the saved state without losing earlier progress. You can use Slurm job dependencies to run the checkpoint and restart steps as separate jobs, one after another.

As an example, let us consider a simple R script that reads a matrix from a file, updates the matrix, and writes the updated matrix to an output file.

Here is the R code that can be pasted in a file named "test.r":

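# Simulate a long-running computation.
Sys.sleep(120)
args <- commandArgs(trailingOnly = TRUE)
if (length(args) == 0) {
  # First run: read the initial matrix from input.csv.
  dd <- read.csv("input.csv", header = FALSE)
  mat <- as.matrix(dd)
  mat <- mat + 1
  print(mat)
  # Write the first checkpoint.
  save(mat, file = "m1.Rdata")
} else if (length(args) == 2) {
  # Restart: load the matrix saved by the previous job (args[1]).
  load(args[1])
  mat <- mat + 1
  print(mat)
  # Write the next checkpoint (args[2]).
  save(mat, file = args[2])
} else {
  print("Wrong Command Line Arguments")
}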

The code shown above checks for command-line arguments. If there are none, it reads the data from the "input.csv" file, converts it to a matrix, adds 1 to each element, and saves the matrix in a file named "m1.Rdata".
If there are two arguments, it loads the matrix from the file named by the first argument, adds 1 to each element, and saves the result to the file named by the second argument.
Here is the content of input.csv:

$ cat input.csv 
1,1,1,1 
1,1,1,1 
1,1,1,1 
1,1,1,1

A Slurm job script (saved here as "first_script") to run the R script in batch mode is as follows:

#!/bin/bash
#SBATCH --job-name=X-M # Change to your job name 
#SBATCH --partition=compute1 
#SBATCH --ntasks=1 
#SBATCH --time=00:05:00 
module load R 
Rscript test.r

You can submit the job from a login node, and note its job ID, which can be used for submitting the next job that will start only after this first job has completed:

$ sbatch first_script 
Submitted batch job 8474

The second job script is shown below. In this case, "test.r" is launched with two arguments: the name of the output file created by the previous job (here, job 8474) and the name of the output file for this job:

#!/bin/bash
#SBATCH --job-name=X-M # Change to your job name 
#SBATCH --partition=compute1 
#SBATCH --ntasks=1 
#SBATCH --time=00:05:00 
module load R 
Rscript test.r m1.Rdata m2.Rdata

This second job-script can be submitted while the first job is still running, and can be made dependent on the first job (# 8474):

$ sbatch --dependency=afterok:8474 second_script
Submitted batch job 8476

Here, "--dependency=afterok:8474" means that the current job (ID 8476) will start only after job 8474 has completed successfully, that is, with an exit code of zero.
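
The manual steps above (note the job ID, then pass it to the next submission) can also be scripted. What follows is a minimal sketch, not part of the original instructions: the function name "submit_chain" is hypothetical, and it assumes that "sbatch" is on your PATH. It relies on the "--parsable" option of sbatch, which prints only the job ID (optionally followed by ";" and the cluster name) instead of the "Submitted batch job" message.

```shell
#!/bin/bash
# Hypothetical helper: submit the given job scripts as a chain in which
# each job starts only after the previous one has completed successfully.
submit_chain() {
    local prev="" script jobid
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # First job in the chain: no dependency.
            jobid=$(sbatch --parsable "$script") || return 1
        else
            # Later jobs: wait for the previous job to finish with exit code 0.
            jobid=$(sbatch --parsable --dependency=afterok:"$prev" "$script") || return 1
        fi
        jobid=${jobid%%;*}    # drop the ";cluster" suffix if sbatch printed one
        echo "Submitted $script as job $jobid" >&2
        prev=$jobid
    done
    # Print the ID of the last job so that the caller can chain further.
    echo "$prev"
}
```

With this function defined in your shell, "submit_chain first_script second_script" would submit both jobs from this example in one step.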

You can submit a third job while the second job is still pending. Consider the following as the third job script:

#!/bin/bash
#SBATCH --job-name=X-M # Change to your job name 
#SBATCH --partition=compute1 
#SBATCH --ntasks=1 
#SBATCH --time=00:05:00 
module load R 
Rscript test.r m2.Rdata m3.Rdata

This script will take the output file created by the second job as input and will produce "m3.Rdata" as output. It can be submitted as follows:

$ sbatch --dependency=afterok:8476 third_script 
Submitted batch job 8477
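
The three job scripts above differ only in the file names passed to "test.r". As a sketch of how a single job script could serve every step, the hypothetical helper below (the name "next_args" is an assumption, not part of the original example) looks in the current directory for checkpoints named "m<N>.Rdata", as in this example, and prints the two arguments for the next step. It prints nothing when no checkpoint exists, so the first run falls back to reading "input.csv". It also assumes that every checkpoint file present was written completely.

```shell
#!/bin/bash
# Hypothetical helper: print the arguments for test.r, based on the
# highest-numbered checkpoint m<N>.Rdata found in the current directory.
next_args() {
    local latest=0 f n
    for f in m*.Rdata; do
        [ -e "$f" ] || continue            # the glob matched no file
        n=${f#m}; n=${n%.Rdata}            # extract the step number N
        case $n in
            ''|*[!0-9]*) continue ;;       # skip names that are not m<N>.Rdata
        esac
        if [ "$n" -gt "$latest" ]; then
            latest=$n
        fi
    done
    if [ "$latest" -gt 0 ]; then
        # Input is the latest checkpoint; output is the next one.
        echo "m${latest}.Rdata m$((latest + 1)).Rdata"
    fi
}
```

In a job script, the last line could then read "Rscript test.r $(next_args)". The command substitution is left unquoted on purpose so that the two file names are passed as two separate arguments.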

Topic revision: 04 Oct 2021, AdminUser
