CUDA
CUDA is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on GPU devices. CUDA applications can run dramatically faster by harnessing the power of GPUs.

Run multiple CUDA applications on a node with multiple GPU devices

Some GPU nodes on Shamu are equipped with multiple GPU devices. Users can program their applications to select a particular GPU device from the available ones and run the computation on it. Here is a sample CUDA C program for device selection:
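#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define N 512

/* Kernel: each block adds one pair of elements. */
__global__ void add(int *a, int *b, int *c){
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

/* Fill p with n pseudo-random integers. */
int random_ints(int *p, int n){
    int i;
    for (i = 0; i < n; i++)
        *p++ = rand();
    return 0;
}

int main(int argc, char** argv){

    int *a, *b, *c;
    int *d_a, *d_b, *d_c;
    int size = N * sizeof(int);
    int count = 0, dev = 0;
    cudaError_t err;

    /* Ask the CUDA runtime how many devices are visible to this process. */
    err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0){
        printf("No cuda device is assigned by Slurm, or You are not on a GPU node\n");
        return 1;
    }
    printf("device count = %d\n", count);

    /* The first argument selects the device; default to device 0. */
    if (argc < 2){
        printf("no device selection in program arguments, set it 0\n");
        dev = 0;
    } else {
        sscanf(argv[1], "%d", &dev);
    }
    printf("Selected Device is %d\n", dev);
    if (dev < 0 || dev >= count){
        printf("Selected Device %d is out of range\n", dev);
        return 1;
    }
    if ((err = cudaSetDevice(dev)) != cudaSuccess){
        printf("cudaSetDevice error, %d\n", err);
        return 1;
    }
    /* The second argument names the data set; it is only echoed here. */
    printf("CUDA Program is running on Device %d with data set %s\n", dev, argc > 2 ? argv[2] : "(none)");

    /* Allocate device buffers on the selected GPU. */
    cudaMalloc((void**)&d_a, size);
    cudaMalloc((void**)&d_b, size);
    cudaMalloc((void**)&d_c, size);

    /* Allocate and initialize host buffers. */
    a = (int *)malloc(size);
    random_ints(a, N);
    b = (int *)malloc(size);
    random_ints(b, N);
    c = (int *)malloc(size);

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    add<<<N,1>>>(d_a, d_b, d_c);
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost); //cudaMemcpy is a synchronous function, cudaDeviceSynchronize() is not necessary here
    printf("The computation is completed on the selected device\n");

    /* Release device and host memory. */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);

    return 0;
}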

The sample program takes two arguments: device and data. The program can be compiled and run interactively as follows (assuming you want it to run on GPU 3 on the node):

[abc123@shamu ~]$ module load cuda90/toolkit/9.0.176
[abc123@shamu ~]$ nvcc -o device-select
[abc123@shamu ~]$ ./device-select 3 data1

You can submit a batch job to run multiple instances of the program (or multiple CUDA applications) on a GPU node, with each instance running on a different GPU device at the same time.

In the following Slurm script, four instances of the program will be running on four different GPU devices on the same GPU node at the same time.  

Please note, you need to specify  

#SBATCH --gres=gpu:k80:4

in the script if you need to access four GPU devices.  
#!/bin/bash
#SBATCH --job-name=cuda_job
#SBATCH --output=my_output_file.txt
#SBATCH --partition="gpu"
#SBATCH --nodes=1
#SBATCH --gres=gpu:k80:4
#SBATCH --ntasks=1
#SBATCH --mail-type=ALL

. /etc/profile.d/

module load cuda90/toolkit/9.0.176
./device-select 0 data0 &
./device-select 1 data1 &
./device-select 2 data2 &
./device-select 3 data3 &
# Wait for all background instances to finish before the job exits
wait

The output file:

[abc123@shamu ~]$ cat my_output_file.txt 
device count = 4
Selected Device is 0
CUDA Program is running on Device 0 with data set data0

device count = 4
Selected Device is 1
CUDA Program is running on Device 1 with data set data1

device count = 4
Selected Device is 2
CUDA Program is running on Device 2 with data set data2

device count = 4
Selected Device is 3
CUDA Program is running on Device 3 with data set data3
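
When several instances write to one output file at the same time, their lines can interleave. As a minimal sketch (not part of the original setup), a small Python launcher could start the instances together, wait for all of them, and keep each instance's output separate. The function names `device_select_commands` and `run_instances` are hypothetical; the `./device-select` binary and the `dataN` arguments simply follow the example above.

```python
import subprocess


def device_select_commands(n_devices, binary="./device-select"):
    """Build one command line per GPU: <binary> <device index> data<index>."""
    return [[binary, str(i), "data%d" % i] for i in range(n_devices)]


def run_instances(commands):
    """Start every command at once, then wait for all of them.

    Returns a list of (return code, captured output) tuples in the same
    order as `commands`, so each instance's output stays in one piece.
    """
    procs = [
        subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into the captured output
            text=True,
        )
        for cmd in commands
    ]
    results = []
    for proc in procs:
        output, _ = proc.communicate()  # blocks until this instance exits
        results.append((proc.returncode, output))
    return results
```

Inside a job that has been granted four GPUs, calling `run_instances(device_select_commands(4))` and printing each captured output in turn would give one uninterrupted block per device.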

Run multiple Deep Learning models on a node with multiple GPU devices

Running deep learning models on a GPU-enabled node can significantly reduce the training time. Most of the GPU nodes on Shamu have multiple GPU devices installed. To improve efficiency, a user can run more than one model at the same time, each on a different GPU device of the same compute node. Here is an example of a deep learning model written in TensorFlow and Keras:
import tensorflow as tf
import os
import json as json
from tensorflow.python.client import device_lib

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# GPU info: list the devices TensorFlow can see
print(device_lib.list_local_devices())

# Linear Stack layers
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    # Fully connected layers are defined using the Dense class.
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile the model before training (assumed settings: Adam optimizer,
# sparse categorical cross-entropy for integer class labels)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

# Train the model, saving the weights through the callback
model.fit(x_train, y_train, epochs=10, callbacks=[cp_callback])

model.evaluate(x_test, y_test, verbose=2)

With GPU-enabled TensorFlow installed on the GPU node, the model automatically uses the available GPU devices, which can dramatically improve performance. The GPU devices that a model can use are controlled by the environment variable CUDA_VISIBLE_DEVICES. For example, if CUDA_VISIBLE_DEVICES is set to 2, the model will use device 2. A user can choose the GPU device on which the model runs by setting this variable before running the model. For instance, if device 2 is idle, a user can run the model with the command shown below:
[abc123@shamu ~]$ CUDA_VISIBLE_DEVICES=2 python
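
The same restriction can also be applied from inside a Python script, provided the variable is set before TensorFlow initializes CUDA (in practice, before the first `import tensorflow`). The sketch below is illustrative only; the helper name `restrict_to_gpus` is hypothetical. Note that the devices left visible are renumbered from 0 inside the process, so with CUDA_VISIBLE_DEVICES=2 the framework reports physical device 2 as its GPU 0.

```python
import os


def restrict_to_gpus(indices, environ=os.environ):
    """Expose only the given physical GPU indices to CUDA in this process.

    CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices.
    Returns the string that was written to the variable.
    """
    value = ",".join(str(i) for i in indices)
    environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Usage (must run before TensorFlow is imported):
#   restrict_to_gpus([2])
#   import tensorflow as tf
```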

The following Slurm job script can be used to submit a job to run four models on a GPU node simultaneously:
#!/bin/bash
#SBATCH --job-name=test_k80_gpu
#SBATCH --partition=gpu
##SBATCH --nodes=1
#SBATCH --gres=gpu:k80:4
#SBATCH --mail-type=ALL
#SBATCH -t 00:15:00
#SBATCH --ntasks=1

. /etc/profile.d/
module load cuda90/toolkit/9.0.176
module load anaconda3
conda activate tf-gpu
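
The commands that start the four models are not shown in the script above. As a minimal sketch under stated assumptions, one way to start them is a small Python launcher that runs the same command once per GPU, giving each copy its own CUDA_VISIBLE_DEVICES value. The function name `launch_per_gpu` and the script name `model.py` used in the usage comment are hypothetical.

```python
import os
import subprocess


def launch_per_gpu(command, gpu_ids):
    """Start one copy of `command` per GPU id and wait for all of them.

    Each copy inherits the current environment with CUDA_VISIBLE_DEVICES
    set to a single GPU id, so it can only see that device.
    Returns the exit codes in the same order as `gpu_ids`.
    """
    procs = []
    for gpu in gpu_ids:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(command, env=env))
    return [proc.wait() for proc in procs]


# Usage (model.py is a hypothetical training script):
#   import sys
#   exit_codes = launch_per_gpu([sys.executable, "model.py"], range(4))
```

Because every copy sees exactly one device, the training script itself needs no device-selection logic.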


-- Zhiwei - 03 Aug 2020
Topic revision: r5 - 06 Aug 2020, AdminUser