CUDA is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on GPU devices. CUDA applications can run dramatically faster by harnessing the power of GPUs.
Run multiple CUDA applications on a node with multiple GPU devices
Some GPU nodes on Shamu are equipped with multiple GPU devices. Users can program their applications to select a particular GPU device from the available ones and run the computation on it. Here is a sample CUDA C program for device selection:
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 512

/* Kernel: each block adds one pair of elements. */
__global__ void add(int *a, int *b, int *c){
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

/* Fill p with n pseudo-random integers. */
int random_ints(int *p, int n){
    int i;
    for (i = 0; i < n; i++)
        *p++ = rand();
    return 0;
}

int main(int argc, char** argv){
    int *a, *b, *c;
    int *d_a, *d_b, *d_c;
    int size = N * sizeof(int);
    int count = 0, dev;
    cudaError_t err;

    cudaGetDeviceCount(&count);
    if (count == 0){
        printf("No CUDA device is assigned by Slurm, or you are not on a GPU node\n");
        exit(1);
    }
    printf("device count = %d\n", count);

    /* The first argument selects the device; default to device 0. */
    if (argc < 2){
        printf("no device selection in program arguments, set it 0\n");
        dev = 0;
    }
    else{
        if (sscanf(argv[1], "%d", &dev) != 1){
            printf("Device argument '%s' is not a number\n", argv[1]);
            exit(1);
        }
        printf("Selected Device is %d\n", dev);
    }

    if (dev < 0 || dev >= count){
        printf("Selected Device %d is out of range\n", dev);
        exit(1);
    }

    /* Bind this process to the requested device. */
    err = cudaSetDevice(dev);
    if (err != cudaSuccess){
        printf("cudaSetDevice error, %d\n", (int)err);
        exit(1);
    }
    /* The second argument is only a label for the data set. */
    printf("CUDA Program is running on Device %d with data set %s\n",
           dev, argc > 2 ? argv[2] : "(none)");

    /* Allocate device and host buffers. */
    cudaMalloc((void**)&d_a, size);
    cudaMalloc((void**)&d_b, size);
    cudaMalloc((void**)&d_c, size);
    a = (int*)malloc(size);
    random_ints(a, N);
    b = (int*)malloc(size);
    random_ints(b, N);
    c = (int*)malloc(size);

    /* Copy the inputs to the device and launch N blocks of one thread each. */
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    add<<<N,1>>>(d_a, d_b, d_c);

    /* cudaMemcpy is synchronous, so this explicit synchronization is not strictly necessary. */
    cudaDeviceSynchronize();
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    printf("The computation is completed on the selected device\n");

    free(a);
    free(b);
    free(c);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
The sample program takes two arguments: the device index and a data set label (which this sample only prints). The program can be compiled and run interactively as follows (assuming you want it to run on GPU 3 of the node):
[abc123@shamu ~]$ module load cuda90/toolkit/9.0.176
[abc123@shamu ~]$ nvcc device-select.cu -o device-select
[abc123@shamu ~]$ ./device-select 3 data1
You can submit a batch job to run multiple instances of the program (or multiple CUDA applications) on a GPU node simultaneously, each on a different GPU device. In the following Slurm script, four instances of the program run at the same time on four different GPU devices of the same GPU node. Note that the script must include the line #SBATCH --gres=gpu:k80:4 to request access to four GPU devices.
#!/bin/bash
#
#SBATCH --job-name=cuda_job
#SBATCH --output=my_output_file.txt
#SBATCH --partition="gpu"
#SBATCH --nodes=1
#SBATCH --gres=gpu:k80:4
#SBATCH --ntasks=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my-email@utsa.edu
. /etc/profile.d/modules.sh
module load cuda90/toolkit/9.0.176
./device-select 0 data0&
./device-select 1 data1&
./device-select 2 data2&
./device-select 3 data3 &
# Wait for all background instances to finish before the job exits
wait
The output file:
[abc123@shamu ~]$ cat my_output_file.txt
device count = 4
Selected Device is 0
CUDA Program is running on Device 0 with data set data0
device count = 4
Selected Device is 1
CUDA Program is running on Device 1 with data set data1
device count = 4
Selected Device is 2
CUDA Program is running on Device 2 with data set data2
device count = 4
Selected Device is 3
CUDA Program is running on Device 3 with data set data3
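When several instances write to one output file, it can be handy to confirm that every requested device was actually used. The following is a minimal Python sketch of such a check, assuming the output lines have the exact format printed by the sample program above; the helper names devices_in_output and missing_devices are hypothetical and not part of any Shamu tooling.

```python
import re

# Matches the status line printed by the sample program, e.g.
# "CUDA Program is running on Device 2 with data set data2"
STATUS_LINE = re.compile(
    r"CUDA Program is running on Device (\d+) with data set (\S+)"
)


def devices_in_output(text):
    """Map each device index reported in the job output to its data set label."""
    return {int(m.group(1)): m.group(2) for m in STATUS_LINE.finditer(text)}


def missing_devices(text, expected_count):
    """Return the device indices in range(expected_count) that never reported."""
    seen = devices_in_output(text)
    return [dev for dev in range(expected_count) if dev not in seen]
```

For example, reading my_output_file.txt into a string and calling missing_devices(text, 4) would return an empty list if all four instances reported a device.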
Run multiple Deep Learning models on a node with multiple GPU devices
Running Deep Learning models on a GPU-enabled node can significantly reduce training time. Most GPU nodes on Shamu have multiple GPU devices installed. To improve efficiency, a user can run several models simultaneously, each on a different GPU device of a multi-GPU compute node. Here is an example of a Deep Learning model written with TensorFlow and Keras:
import tensorflow as tf
import os
import json as json
from tensorflow.python.client import device_lib
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
# GPU info
print(device_lib.list_local_devices())
# Linear Stack layers
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
#Fully connected layers are defined using the Dense class.
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)
# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
save_weights_only=True,
verbose=1)
model.fit(x_train, y_train, epochs=10, callbacks=[cp_callback])
model.evaluate(x_test, y_test, verbose=2)
With GPU-enabled TensorFlow installed on the GPU node, the model automatically uses the GPU devices, which can dramatically improve performance. The GPU devices that a model uses are controlled by the environment variable CUDA_VISIBLE_DEVICES. For example, if CUDA_VISIBLE_DEVICES is set to 2, the model will use device 2. A user can choose the GPU device on which the model runs by setting this variable before running the model. For instance, if device 2 is idle, a user can run the model with the following command:
[abc123@shamu ~]$ CUDA_VISIBLE_DEVICES=2 python mnist.py
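Note that the example model always saves its weights to training_1/cp.ckpt, so several instances started from the same directory would write to the same checkpoint file. One possible way around this, shown here only as a sketch, is to derive the checkpoint directory from CUDA_VISIBLE_DEVICES. The helper names visible_devices and checkpoint_path_for and the directory naming scheme are assumptions for illustration, and the parser handles only plain integer indices, not GPU UUIDs.

```python
import os


def visible_devices(environ=os.environ):
    """Return the GPU indices listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (all GPUs are visible) and an
    empty list when it is set to an empty string.
    Assumes plain integer indices separated by commas.
    """
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [int(tok) for tok in value.split(",") if tok.strip()]


def checkpoint_path_for(environ=os.environ, base="training"):
    """Build a checkpoint path that differs for each device selection."""
    devices = visible_devices(environ)
    if devices is None:
        tag = "all"
    else:
        tag = "-".join(str(d) for d in devices) or "none"
    # Hypothetical layout: training_gpu2/cp.ckpt for CUDA_VISIBLE_DEVICES=2
    return os.path.join(f"{base}_gpu{tag}", "cp.ckpt")
```

In mnist.py, the line that sets checkpoint_path could then be replaced with checkpoint_path = checkpoint_path_for(), so that a run started with CUDA_VISIBLE_DEVICES=2 writes under training_gpu2.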
The following Slurm job script can be used to submit a job that runs four models on a GPU node simultaneously:
#!/bin/bash
#SBATCH --job-name=test_k80_gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:k80:4
#SBATCH --mail-type=ALL
#SBATCH --mail-user=zhiwei.wang@utsa.edu
#SBATCH -t 00:15:00
#SBATCH --ntasks=1
. /etc/profile.d/modules.sh
module load cuda90/toolkit/9.0.176
module load anaconda3
conda activate tf-gpu
CUDA_VISIBLE_DEVICES=0 python mnist.py&
CUDA_VISIBLE_DEVICES=1 python mnist.py&
CUDA_VISIBLE_DEVICES=2 python mnist.py&
CUDA_VISIBLE_DEVICES=3 python mnist.py &
# Wait for all background runs to finish before the job exits
wait
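The launch lines of the job script above could also be written as a small Python launcher, which makes it easy to collect the exit code of each run. This is only a sketch under the assumption that each run needs nothing more than its own CUDA_VISIBLE_DEVICES value; the function name launch_on_gpus is hypothetical.

```python
import os
import subprocess


def launch_on_gpus(command, gpu_ids):
    """Start one copy of `command` per GPU index and wait for all of them.

    Each child process inherits the current environment plus its own
    CUDA_VISIBLE_DEVICES value. Returns a dict mapping each GPU index
    to the exit code of the process that was assigned to it.
    """
    procs = {}
    for gpu in gpu_ids:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs[gpu] = subprocess.Popen(command, env=env)
    # All children are already running; now block until each one exits.
    return {gpu: proc.wait() for gpu, proc in procs.items()}
```

Inside the job script, the four launch lines could then be replaced by a single Python call such as launch_on_gpus(["python", "mnist.py"], range(4)), and a nonzero value in the returned dict would indicate which device's run failed.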
-- Zhiwei - 03 Aug 2020