With GPU support, you can significantly improve the performance of a Deep Learning model built with TensorFlow. Running a model on a single GPU is very simple: all you need to do is run the model on a node with a GPU, and TensorFlow will automatically detect the GPU and perform the computation on it without any additional coding. However, running a model on multiple GPUs on a single node is a bit trickier. In this tutorial, you will learn how to run a TensorFlow-based Deep Learning model on multiple GPUs on the Arc HPC cluster.
1. Log into a node with multiple GPUs
Use the following command to log into a node with two GPUs on Arc:
$ srun -p gpu2v100 -N 1 -n 1 -t 01:00:00 --pty bash
Note: Replace gpu2v100 with gpu1v100 if you want to log into a node with a single GPU.
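Alternatively, once you have created the Python environment in step 2 and saved the example script from step 3, you can run the same workload non-interactively with a Slurm batch script. Below is a minimal sketch; the job name, output file name, and script file name are placeholders you should adapt to your own setup:
#!/bin/bash
#SBATCH -J tf-multi-gpu            # job name (placeholder)
#SBATCH -p gpu2v100                # partition with two V100 GPUs
#SBATCH -N 1                       # one node
#SBATCH -n 1                       # one task
#SBATCH -t 01:00:00                # one hour of walltime
#SBATCH -o tf-multi-gpu.%j.out     # output file (placeholder)

module load anaconda3
module load cuda/toolkit cuda/cudnn
conda activate p39-tf

time python test.py
Save the script (for example as tf-multi-gpu.slurm) and submit it with:
$ sbatch tf-multi-gpu.slurm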
2. Create a Python virtual environment and install TensorFlow and related packages
We recommend using Anaconda to create the virtual environment as shown below (you can replace p39-tf with any name you like):
$ module load anaconda3
$ module load cuda/toolkit cuda/cudnn
$ conda create -n "p39-tf" python=3.9.0
$ conda activate p39-tf
$ pip install tensorflow keras tensorflow_datasets
Note: For TensorFlow 1.x, the CPU and GPU packages are separate. For TensorFlow 2, you no longer need to maintain two separate installations for CPU and GPU.
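Once the packages are installed, you can quickly verify that TensorFlow can see the GPUs on the node (a short sanity check; the exact list depends on the node you are logged into):
$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
On a node with two GPUs, this should print something like:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]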
3. Create a TensorFlow-based model with MirroredStrategy
In order to run a model on multiple GPUs, we need to create a distribution strategy with tf.distribute.MirroredStrategy(). The model is built within the scope of the strategy, and each batch of input data is then split evenly across the GPUs. Here is an example model using the distributed mirrored strategy:
import tensorflow_datasets as tfds
import tensorflow as tf
tfds.disable_progress_bar()

import os

# Make both GPUs on the node visible to TensorFlow
os.environ["CUDA_VISIBLE_DEVICES"] = '0,1'

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
mnist_train, mnist_test = datasets['train'], datasets['test']

# Use the strategy for data parallelism across multiple GPUs
strategy = tf.distribute.MirroredStrategy()

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples
BUFFER_SIZE = 10000

print(f'strategy.num_replicas_in_sync: {strategy.num_replicas_in_sync}')

BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

# Normalize the pixel values from the [0, 255] range to the [0, 1] range
def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

# Apply this function to the training and test data, shuffle the training data,
# and batch it for training. Notice we are also keeping an in-memory cache of
# the training data to improve performance.
train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)

# Within the distributed strategy scope, the data in a batch is evenly distributed across the GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(10)
    ])

    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])

model.fit(train_dataset, epochs=10)

eval_loss, eval_acc = model.evaluate(eval_dataset)
print('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))
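One thing to note before running the example: the global batch size is BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync, so with two GPUs each training step processes 128 images instead of 64. This is why the training output in the next step shows 469 steps per epoch with two GPUs but 938 steps per epoch with one GPU. A quick sanity check of those numbers, assuming MNIST's 60,000 training images:
# Steps per epoch = ceil(num_train_examples / global_batch_size)
import math
print(math.ceil(60000 / (64 * 2)))   # 469 steps per epoch with two GPUs
print(math.ceil(60000 / (64 * 1)))   # 938 steps per epoch with one GPU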
4. Run the model
Assuming you saved the example above as test.py, you can use the following command to run the model on a node with two GPUs (time is used to measure the runtime):
$ time python test.py
Epoch 1/10
2022-04-19 17:39:18.045397: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8201
2022-04-19 17:39:18.775654: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8201
469/469 [==============================] - 10s 4ms/step - loss: 0.2315 - accuracy: 0.9299
Epoch 2/10
469/469 [==============================] - 2s 5ms/step - loss: 0.0654 - accuracy: 0.9794
Epoch 3/10
.....
Epoch 10/10
469/469 [==============================] - 2s 4ms/step - loss: 0.0086 - accuracy: 0.9969
2022-04-19 17:39:44.911466: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:547] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
79/79 [==============================] - 2s 4ms/step - loss: 0.0560 - accuracy: 0.9860
Eval loss: 0.05603577941656113, Eval Accuracy: 0.9860000014305115
real 0m44.520s
user 1m20.748s
sys 0m30.741s
To make sure that both GPUs are used for the training, you can start a new ssh session and log onto the same GPU node where your model is running (you are only allowed to ssh to a node where you have a job running). Once you are on that node, use the "nvidia-smi" command to check the status of the GPUs. The output will look like the following, which indicates that both GPUs are being utilized:
$ nvidia-smi
Tue Apr 19 17:36:28 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100S-PCI...  Off  | 00000000:3B:00.0 Off |                  Off |
| N/A   31C    P0    39W / 250W |  31933MiB / 32510MiB |     23%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100S-PCI...  Off  | 00000000:D8:00.0 Off |                  Off |
| N/A   33C    P0    40W / 250W |  31933MiB / 32510MiB |     22%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
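If you want to monitor the GPU utilization continuously while the job is running, you can wrap the command with watch, which re-runs nvidia-smi every second (press Ctrl+C to stop):
$ watch -n 1 nvidia-smi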
5. Evaluate the performance
Now we will find out how much the performance is improved by using two GPUs compared to running the model on a single GPU. Locate the following line in the example:
os.environ["CUDA_VISIBLE_DEVICES"]='0,1'
and change it to:
os.environ["CUDA_VISIBLE_DEVICES"]='0'
The model will now use only one GPU on the node. Run the model again with timing:
$ time python test.py
........
........
Epoch 10/10
938/938 [==============================] - 3s 3ms/step - loss: 0.0100 - accuracy: 0.9966
2022-04-19 17:56:08.400111: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:547] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
157/157 [==============================] - 2s 3ms/step - loss: 0.0765 - accuracy: 0.9842
Eval loss: 0.07646699249744415, Eval Accuracy: 0.9842000007629395
real 1m5.604s
user 1m11.126s
sys 0m39.322s
As you can see, the run takes one minute and five seconds to complete with one GPU, compared to about 45 seconds with two GPUs.
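To put a number on the improvement, you can compute the speedup and parallel efficiency from the two "real" wall-clock times above (a quick back-of-the-envelope calculation; your exact timings will vary from run to run):
# Rough speedup estimate from the wall-clock times reported above
single_gpu_seconds = 65.6   # 1m5.604s with one GPU
dual_gpu_seconds = 44.5     # 0m44.520s with two GPUs

speedup = single_gpu_seconds / dual_gpu_seconds
efficiency = speedup / 2    # two GPUs

print(f"Speedup: {speedup:.2f}x")                # ~1.47x
print(f"Parallel efficiency: {efficiency:.0%}")  # ~74%
With a small model like this one, fixed startup costs and per-step communication overhead keep the speedup below the ideal 2x; larger models and larger batches typically scale better across GPUs.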