With GPU support, you can significantly improve the performance of a Deep Learning model built with TensorFlow. Running a model on a single GPU is simple: run the model on a node with a GPU, and TensorFlow will automatically detect the GPU and perform the computation on it without any additional coding. Running a model on multiple GPUs on a single node, however, requires a few changes to the code. In this tutorial, you will learn how to run a TensorFlow-based Deep Learning model using multiple GPUs on the Arc HPC cluster.

1. Log into a node with multiple GPUs

Use the following command to log into a node with two GPUs on Arc:

$ srun -p gpu2v100 -N 1 -n 1 -t 01:00:00 --pty bash

Note: Replace gpu2v100 with gpu1v100 if you want to log into a node with a single GPU.

2. Create a Python virtual environment and install TensorFlow and related packages
We recommend using Anaconda to create the virtual environment as instructed below (you can replace p39-tf with any name):

$ module load anaconda3
$ module load cuda/toolkit cuda/cudnn
$ conda create -n "p39-tf" python=3.9.0
$ conda activate p39-tf
$ pip install tensorflow keras tensorflow_datasets

Note: For TensorFlow 1.x, the CPU and GPU packages are separate. For TensorFlow 2, a single package supports both, so you no longer need to maintain two separate versions for CPU and GPU.
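
Before building the model, it can help to confirm that TensorFlow in the new environment actually sees the GPUs on the node. The following is a minimal sketch rather than part of the original example; visible_gpus is a hypothetical helper name, and the check assumes you run it inside the activated p39-tf environment on the GPU node.

```python
import importlib.util


def visible_gpus():
    """Return the names of the GPUs TensorFlow can see.

    Returns None when TensorFlow is not installed in the current environment.
    (visible_gpus is a hypothetical helper, not a TensorFlow function.)
    """
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf
    return [gpu.name for gpu in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
else:
    print(f"TensorFlow sees {len(gpus)} GPU(s): {gpus}")
```

On a gpu2v100 node you would expect two GPUs to be listed. An empty list usually means the CUDA modules were not loaded or the job is running on a node without GPUs.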

3. Create a TensorFlow-based model with MirroredStrategy

To run a model on multiple GPUs, create a strategy with tf.distribute.MirroredStrategy(). A model built within the scope of the strategy is replicated on each GPU, and during training each batch of input data is divided evenly among the GPUs. Here is an example model with the distributed mirrored strategy:

import os

# Select the GPUs to use. Set this before TensorFlow initializes the GPUs.
os.environ["CUDA_VISIBLE_DEVICES"]='0,1'

import tensorflow_datasets as tfds
import tensorflow as tf
tfds.disable_progress_bar()

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)

mnist_train, mnist_test = datasets['train'], datasets['test']

# Use the strategy for data parallelism across multiple GPUs
strategy = tf.distribute.MirroredStrategy()

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

BUFFER_SIZE = 10000

print(f'strategy.num_replicas_in_sync: {strategy.num_replicas_in_sync}')

BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

def scale(image, label):
    # Convert pixel values from integers in [0, 255] to floats in [0, 1]
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

#Apply this function to the training and test data, shuffle the training data, and batch it for training. Notice we are also keeping an in-memory cache of the training data to improve performance.

train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)

# Build and compile the model within the strategy scope so that its variables are mirrored on every GPU. During training, each batch is distributed evenly across the GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(10)
    ])

    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])

model.fit(train_dataset, epochs=10)

eval_loss, eval_acc = model.evaluate(eval_dataset)
print('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))

4. Run the model

Assuming you saved the example above as test.py, you can use the following command to run the model on a node with two GPUs (the time command measures how long the run takes):

$ time python test.py
Epoch 1/10
2022-04-19 17:39:18.045397: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8201
2022-04-19 17:39:18.775654: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8201
469/469 [==============================] - 10s 4ms/step - loss: 0.2315 - accuracy: 0.9299
Epoch 2/10
469/469 [==============================] - 2s 5ms/step - loss: 0.0654 - accuracy: 0.9794
Epoch 3/10
.....
Epoch 10/10
469/469 [==============================] - 2s 4ms/step - loss: 0.0086 - accuracy: 0.9969
2022-04-19 17:39:44.911466: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:547] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
79/79 [==============================] - 2s 4ms/step - loss: 0.0560 - accuracy: 0.9860
Eval loss: 0.05603577941656113, Eval Accuracy: 0.9860000014305115

real 0m44.520s
user 1m20.748s
sys 0m30.741s
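
The step counts in the log (469 training steps and 79 evaluation steps) follow from the global batch size, which is BATCH_SIZE_PER_REPLICA multiplied by the number of GPUs. The sketch below reproduces that arithmetic in plain Python. It assumes the standard MNIST split sizes (60000 training and 10000 test examples) and that the last partial batch is kept, which is the default for batch(); steps is a hypothetical helper name.

```python
import math

TRAIN_EXAMPLES = 60000       # size of the MNIST training split
TEST_EXAMPLES = 10000        # size of the MNIST test split
BATCH_SIZE_PER_REPLICA = 64  # same value as in the example script


def steps(num_examples, num_replicas):
    """Number of batches in one pass over the data, keeping the last partial batch."""
    global_batch = BATCH_SIZE_PER_REPLICA * num_replicas
    return math.ceil(num_examples / global_batch)


for replicas in (1, 2):
    print(f"{replicas} GPU(s): global batch {BATCH_SIZE_PER_REPLICA * replicas}, "
          f"{steps(TRAIN_EXAMPLES, replicas)} training steps, "
          f"{steps(TEST_EXAMPLES, replicas)} evaluation steps")
```

With two GPUs the global batch is 128, so each epoch needs half as many steps as a single-GPU run with a batch of 64.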




To make sure that both GPUs are used for the training, start a new ssh session and log onto the GPU node where your model is running (you are only allowed to ssh to a node where you have a job running). Once you are on that node, use the "nvidia-smi" command to check the status of the GPUs. Output like the following, with memory allocated and non-zero utilization on both devices, indicates that both GPUs are in use:

$ nvidia-smi
Tue Apr 19 17:36:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100S-PCI...  Off  | 00000000:3B:00.0 Off |                  Off |
| N/A   31C    P0    39W / 250W |  31933MiB / 32510MiB |     23%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100S-PCI...  Off  | 00000000:D8:00.0 Off |                  Off |
| N/A   33C    P0    40W / 250W |  31933MiB / 32510MiB |     22%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
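
If you prefer to check the GPUs from a script instead of reading the table, nvidia-smi can also print selected fields as CSV. The following is a minimal sketch, not part of the original tutorial: query_gpus and parse_gpu_status are hypothetical helper names, and the sample string simply repeats the utilization and memory values from the table above.

```python
import subprocess

# nvidia-smi options that print one CSV line per GPU: index, utilization (%), memory used (MiB)
QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]


def query_gpus():
    """Run nvidia-smi on the current node and return its CSV output as text."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def parse_gpu_status(csv_text):
    """Turn 'index, utilization, memory' CSV lines into a list of dictionaries."""
    status = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        status.append({"gpu": int(index),
                       "util_percent": int(util),
                       "memory_mib": int(mem)})
    return status


# Sample input that mirrors the values in the nvidia-smi table above.
SAMPLE = "0, 23, 31933\n1, 22, 31933\n"

for gpu in parse_gpu_status(SAMPLE):
    busy = "in use" if gpu["util_percent"] > 0 else "idle"
    print(f"GPU {gpu['gpu']}: {gpu['util_percent']}% utilization, "
          f"{gpu['memory_mib']} MiB used ({busy})")
```

On the GPU node itself, replace SAMPLE with query_gpus() to read live values. Note that high memory usage alone is not proof of activity, because TensorFlow by default reserves most of the memory on every visible GPU; the utilization figure is the better indicator.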

5. Evaluate the performance

Now we will find out how much the performance improves when using two GPUs compared to running the model on a single GPU. Locate the following line in the example:

os.environ["CUDA_VISIBLE_DEVICES"]='0,1'

and change it to:

os.environ["CUDA_VISIBLE_DEVICES"]='0'

The model will now use only one GPU on the node. Run the model again with timing:

$ time python test.py
........
........
Epoch 10/10
938/938 [==============================] - 3s 3ms/step - loss: 0.0100 - accuracy: 0.9966
2022-04-19 17:56:08.400111: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:547] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
157/157 [==============================] - 2s 3ms/step - loss: 0.0765 - accuracy: 0.9842
Eval loss: 0.07646699249744415, Eval Accuracy: 0.9842000007629395

real 1m5.604s
user 1m11.126s
sys 0m39.322s

As you can see, the run takes about 1 minute 6 seconds with one GPU, compared to about 45 seconds with two GPUs, which makes the two-GPU run roughly 1.5 times faster.
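
To put a number on the comparison, the speedup can be computed from the two "real" times reported by the time command. This is a small sketch of that arithmetic; to_seconds is a hypothetical helper name, and the two time strings are copied from the runs above.

```python
import re


def to_seconds(real_time):
    """Convert a value printed by `time`, such as '1m5.604s', into seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", real_time).groups()
    return int(minutes) * 60 + float(seconds)


one_gpu = to_seconds("1m5.604s")    # 'real' time of the single-GPU run
two_gpus = to_seconds("0m44.520s")  # 'real' time of the two-GPU run

speedup = one_gpu / two_gpus
efficiency = speedup / 2  # fraction of the ideal 2x speedup with two GPUs

print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
```

The "real" time covers the whole script, including start-up, loading the dataset, and evaluation, so it likely understates the gain in the training loop itself. For a small model such as this MNIST example, that fixed overhead is a large share of the run.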
Topic revision: r3 - 20 Apr 2022, AdminUser