Parallelize Deep Learning Models Across Multiple GPU Devices

Deep Learning models written in TensorFlow can automatically take advantage of a GPU device on a compute node, as long as the TensorFlow package is GPU-enabled. We suggest using Anaconda to create a virtual environment and installing the tensorflow-gpu package in that environment.

First, you need to log onto one of the GPU nodes by typing the following command on the login node:
srun -p gpu --gres=gpu:k80:4 --pty bash

"-p gpu" means that you would like to grab a node in the gpu partition, which includes two nodes with eight K80 GPU devices each. The "gpu" can be changed to "gpu-v100" if you would like to use the node in the gpu-v100 partition, which has one node with four V100 GPU devices, or to "gpu-v100-shared", which also has one node with four V100 GPU devices but allows up to four users to log onto the node at the same time. "--gres=gpu:k80:4" indicates that four GPU devices will be used. That number can be up to eight for the gpu partition and up to four for both gpu-v100 and gpu-v100-shared.

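For reference, the corresponding commands for the V100 partitions would look like the following. Note that the gres type name shown here (v100) is an assumption; check the output of sinfo or ask the HPC support team if these do not match your system.
srun -p gpu-v100 --gres=gpu:v100:4 --pty bash
srun -p gpu-v100-shared --gres=gpu:v100:1 --pty bash
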
Once you log onto a GPU node, please refer to https://hpcsupport.utsa.edu/foswiki/bin/view/Main/PythonVmsInAnaconda for activating Anaconda and creating a virtual environment. The following commands will install the tensorflow-gpu package in your virtual environment.
module load cuda90/toolkit/9.0.176
conda install -c anaconda tensorflow-gpu

The cuda90 module is needed for both installing and using the package.
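To confirm that the GPU-enabled package is installed correctly and can see the GPU devices, you can run a quick check such as the one below. This is only a minimal sketch; the exact output depends on the TensorFlow version that conda installs.
python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
The output should list one entry per visible GPU device in addition to the CPU device.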
Note: By default, tensorflow-gpu will grab all available GPU devices on the node, even if only one is used in your model code. Make sure to add the following lines to your code if only one GPU is needed:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="1"

"1" is the GPU device ID, which ranges from 0 to n-1, where n is the total number of GPU devices on the node. You can decide which one to use by running the command below, which shows the status and IDs of the available devices:
[abc123@gpu02 ~]$ nvidia-smi

If you want to run a DL model across multiple devices, make sure you grab enough GPU devices (say, four GPU devices are needed) when logging onto the node with --gres=gpu:k80:4. Then, in your code, add the following lines to give the model access to four GPU devices:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"

TensorFlow comes with a built-in mechanism (tf.distribute.MirroredStrategy) to train a Deep Learning model across multiple GPU devices on the same computer or node. Here is a sample Deep Learning model written in Keras. In this example, the model is replicated on all the GPU devices. Each device runs the same model in the training stage with a portion of the training data set. The training speed is significantly improved due to parallelization.
import tensorflow_datasets as tfds
import tensorflow as tf
tfds.disable_progress_bar()
import os

os.environ["CUDA_VISIBLE_DEVICES"]='0,1,2,3'
datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
print(f'dataset:\n {datasets}')
print(f'info:\n {info}')

mnist_train, mnist_test = datasets['train'], datasets['test']
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

BUFFER_SIZE = 10000

print(f'strategy.num_replicas_in_sync: {strategy.num_replicas_in_sync}')

BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

print(f'batch size: {BATCH_SIZE}')


#Pixel values, which are 0-255, have to be normalized to the 0-1 range. Define this scale in a function.
def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

#Apply this function to the training and test data, shuffle the training data, and batch it for training. Notice we are also keeping an in-memory cache of the training data to improve performance.

train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)


#Create and compile the Keras model in the context of strategy.scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1000, activation='relu'),
        tf.keras.layers.Dense(1000, activation='relu'),
        tf.keras.layers.Dense(10)
    ])

    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])

model.fit(train_dataset, epochs=20)

eval_loss, eval_acc = model.evaluate(eval_dataset)
print('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))

Run the model on one of the nodes in the gpu partition, and time it as follows:
>time python mnist-multi-gpus.py 
235/235 [==============================] - 2s 10ms/step - accuracy: 0.9197 - loss: 0.2582
Epoch 2/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9791 - loss: 0.0685
Epoch 3/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9859 - loss: 0.0459
Epoch 4/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9889 - loss: 0.0349
Epoch 5/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9917 - loss: 0.0242
Epoch 6/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9933 - loss: 0.0201
Epoch 7/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9952 - loss: 0.0137
Epoch 8/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9964 - loss: 0.0112
Epoch 9/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9960 - loss: 0.0121
Epoch 10/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9969 - loss: 0.0093
Epoch 11/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9971 - loss: 0.0093
Epoch 12/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9977 - loss: 0.0064
Epoch 13/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9978 - loss: 0.0067
Epoch 14/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9979 - loss: 0.0065
Epoch 15/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9976 - loss: 0.0069
Epoch 16/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9980 - loss: 0.0074
Epoch 17/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9973 - loss: 0.0078
Epoch 18/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9980 - loss: 0.0060
Epoch 19/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9985 - loss: 0.0045
Epoch 20/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9982 - loss: 0.0053
40/40 [==============================] - 0s 12ms/step - accuracy: 0.9866 - loss: 0.0625
Eval loss: 0.06250440329313278, Eval Accuracy: 0.9865999817848206

real 0m46.234s
user 1m28.432s
sys 0m22.238s


If you change the device line to the following:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

and run the model again with time:
>time python mnist-multi-gpus.py 
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9483 - loss: 0.1615
Epoch 2/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9826 - loss: 0.0571
Epoch 3/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9883 - loss: 0.0384
Epoch 4/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9912 - loss: 0.0294
Epoch 5/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9930 - loss: 0.0226
Epoch 6/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9947 - loss: 0.0175
Epoch 7/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9950 - loss: 0.0158
Epoch 8/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9962 - loss: 0.0120
Epoch 9/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9967 - loss: 0.0119
Epoch 10/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9965 - loss: 0.0109
Epoch 11/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9966 - loss: 0.0118
Epoch 12/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9977 - loss: 0.0071
Epoch 13/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9977 - loss: 0.0085
Epoch 14/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9980 - loss: 0.0074
Epoch 15/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9979 - loss: 0.0076
Epoch 16/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9980 - loss: 0.0075
Epoch 17/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9979 - loss: 0.0077
Epoch 18/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9982 - loss: 0.0062
Epoch 19/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9976 - loss: 0.0090
Epoch 20/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9984 - loss: 0.0052
157/157 [==============================] - 0s 3ms/step - accuracy: 0.9875 - loss: 0.0757
Eval loss: 0.07569707930088043, Eval Accuracy: 0.987500011920929

real 1m27.720s
user 1m46.541s
sys 0m26.354s

The training speed with four GPU devices is almost twice as fast as the speed with one GPU.
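The difference in the number of steps per epoch (235 vs. 938) comes from the global batch size: MirroredStrategy multiplies the per-replica batch size by the number of replicas, so each epoch is covered in fewer, larger steps. A quick sanity check of the numbers, assuming the 60,000-image MNIST training set used above:
import math
num_train_examples = 60000
print(math.ceil(num_train_examples / 64))        # 1 GPU:  938 steps per epoch
print(math.ceil(num_train_examples / (64 * 4)))  # 4 GPUs: 235 steps per epoch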

Parallelize Deep Learning Models on a Multi-CPU Node

GPU devices can speed up the training process, as shown in the previous section. However, GPU-enabled compute nodes may not be available due to limited resources. In this case, users can run a DL model on a common compute node, which usually has 40 to 80 cores on the Shamu cluster. A DL model written in TensorFlow or Keras automatically runs across the multiple CPU cores available on the node to improve training speed. In our experiments, even though there are no GPU devices on a common compute node, the GPU-enabled TensorFlow package makes a DL model run much faster than the non-GPU TensorFlow package. Our tests show that a typical DL model (6 layers, 12800 nodes per layer) running on an 80-core CPU-only compute node takes only about twice as much time as it does on a V100 GPU device. More importantly, the memory space (300 GB+) in the CPU-based environment is much larger than that on a GPU device (32 GB max for the V100 GPUs on Shamu), which allows a model with a large data set to run in the CPU-based environment without out-of-memory issues.
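If you want to explicitly control how many CPU cores TensorFlow uses on a CPU-only node, recent TensorFlow 2.x releases expose threading controls such as the ones sketched below. This is only a sketch, and the thread counts shown are placeholders; older 1.x releases configure the same settings through tf.ConfigProto instead.
import tensorflow as tf

# Must be called before any ops are executed.
tf.config.threading.set_intra_op_parallelism_threads(40)  # threads used inside a single op
tf.config.threading.set_inter_op_parallelism_threads(2)   # ops that may run concurrently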

-- AdminUser - 17 Sep 2020