We have the GPU version of TensorFlow installed on the GPU nodes; it is used through the Python 3.6.1 module (the native Python 2.7 version is currently not working). This document describes how to use TensorFlow and perform a quick test. Since we have a limited number of GPU nodes, we also provide instructions for building the CPU version of TensorFlow in your home directory.

TensorFlow CPU Version

First, grab a compute node with qlogin and create a Python virtualenv environment:

[abc123@login-0-0 ~]$ qlogin
local configuration login-0-0.cm.cluster not defined - using global configuration
Your job 71050 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 71050 has been successfully scheduled.
Establishing /cm/shared/apps/sge/var/cm/qlogin_wrapper session to host compute045.cm.cluster ...

[abc123@compute045 ~]$ virtualenv --system-site-packages tensorflow-cpu
New python executable in tensorflow-cpu/bin/python
Installing Setuptools..............................................................................................................................................................................................................................done.
Installing Pip.....................................................................................................................................................................................................................................................................................................................................done.

Activate the newly created Python virtualenv:

[abc123@compute045 ~]$ source ~/tensorflow-cpu/bin/activate
(tensorflow-cpu)[abc123@compute045 ~]$

Now we need to upgrade some default packages which are required by TensorFlow:

(tensorflow-cpu)[abc123@compute045 ~]$ pip install --upgrade setuptools
(tensorflow-cpu)[abc123@compute045 ~]$ pip install --upgrade enum34 futures

Now let's download and install TensorFlow into your Python virtualenv:

(tensorflow-cpu)[abc123@compute045 ~]$ pip install --upgrade tensorflow

Once it is installed, let's perform a quick test to verify that the CPU version is working (you can disregard the warning about AVX2 and FMA extensions):

(tensorflow-cpu)[abc123@compute045 ~]$ python
Python 2.7.5 (default, Aug 4 2017, 00:39:18)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-05-09 10:29:54.945974: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
>>> print(sess.run(hello))
Hello, TensorFlow!
>>>

Once TensorFlow is installed via the above instructions, you do not need to repeat the process. All you have to do is source the "activate" file the next time you want to run TensorFlow.

To disable the above warning message, add this to your Python code:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import tensorflow as tf

TensorFlow GPU Version

Grab a GPU node with qlogin (note: you first need permission to use the GPU nodes; if you require access, send an email to rcsg@utsa.edu to request it):

[abc123@login-0-0 ~]$ qlogin -q gpu.q
local configuration login-0-0.cm.cluster not defined - using global configuration
Your job 54721 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 54721 has been successfully scheduled.
Establishing /cm/shared/apps/sge/var/cm/qlogin_wrapper session to host gpu02.cm.cluster ...
Last login: Mon Feb 26 16:31:34 2018 from login-0-0.cm.cluster

Load the modules required for TensorFlow (CUDA, cuDNN, etc.):

[abc123@gpu02 ~]$ module load python/3.6.1 cuda90/toolkit cudnn/7.0

Enter the Python shell:

[abc123@gpu02 ~]$ python3
Python 3.6.1 (default, May 16 2017, 15:27:50)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Enter the following Hello World test program:

>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-02-27 08:14:55.501487: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-02-27 08:14:56.177280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:06:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:56.412525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 1 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:07:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:56.654330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 2 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:0a:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:56.902953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 3 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:0b:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:57.180367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 4 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:86:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:57.465869: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 5 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:87:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:57.768378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 6 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:8a:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:58.047947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 7 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:8b:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:58.052993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Device peer to peer matrix
2018-02-27 08:14:58.053262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1126] DMA: 0 1 2 3 4 5 6 7
2018-02-27 08:14:58.053276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 0: Y Y Y Y N N N N
2018-02-27 08:14:58.053284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 1: Y Y Y Y N N N N
2018-02-27 08:14:58.053292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 2: Y Y Y Y N N N N
2018-02-27 08:14:58.053300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 3: Y Y Y Y N N N N
2018-02-27 08:14:58.053308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 4: N N N N Y Y Y Y
2018-02-27 08:14:58.053321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 5: N N N N Y Y Y Y
2018-02-27 08:14:58.053328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 6: N N N N Y Y Y Y
2018-02-27 08:14:58.053336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 7: N N N N Y Y Y Y
2018-02-27 08:14:58.053357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:06:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053379: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:07:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:0a:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:0b:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:4) -> (device: 4, name: Tesla K80, pci bus id: 0000:86:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:5) -> (device: 5, name: Tesla K80, pci bus id: 0000:87:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:6) -> (device: 6, name: Tesla K80, pci bus id: 0000:8a:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:7) -> (device: 7, name: Tesla K80, pci bus id: 0000:8b:00.0, compute capability: 3.7)
>>> print(sess.run(hello))
Hello, TensorFlow!
>>>

If you want to disable the above warning messages, add this to your code:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import tensorflow as tf

-- Jeremy - 09 May 2018

Important Note: By default, TensorFlow takes all GPU resources on a GPU node (8 GPU units on each GPU node on Shamu). You must change your code to avoid hogging the GPU node. For example, put the following in your code if only GPU units 1 and 3 are needed:

import os
os.environ["CUDA_VISIBLE_DEVICES"]="1,3"

To check the availability of the GPU units on a node, use the following command:
[abc123@gpu02 ~]$ nvidia-smi

-- Zhiwei - 01 Oct. 2018

Deep Learning Model with Checkpoint and Restart

By using the checkpoint feature, model progress can be saved during training. The model can then resume training where it left off instead of starting from scratch if something interrupts the training.

Here is example code with a checkpoint and restart feature. The model is designed to solve the MNIST handwritten digit classification problem. The training dataset is included in the Keras package and can be loaded by calling the mnist.load_data() function.

Executing the following lines of code in a Jupyter notebook environment shows the image at index 130 in the dataset:

import tensorflow as tf
import matplotlib.pyplot as plt

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

plt.imshow(x_train[130])

The MNIST handwritten digit classification model with a checkpoint and restart feature:

import tensorflow as tf
from keras.callbacks import ModelCheckpoint
import os.path
from os import path

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

filename = "mymodel.h5"

# Check whether a checkpoint file exists. If it does, load the model and skip rebuilding it.
if path.isfile(filename):
    print("Resuming")
    model = tf.keras.models.load_model(filename)
else:
    print('Build the model from scratch')

    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

checkpoint = ModelCheckpoint(filename, monitor='loss', verbose=1, save_best_only=True, mode='min')

model.fit(x_train, y_train, epochs=5, batch_size = 1000, validation_split = 0.1, callbacks=[checkpoint])

model.evaluate(x_test, y_test, verbose=2)

When the model is trained for the first time, it is built from scratch because there is no checkpoint file yet. The output looks like the following:
Using TensorFlow backend.
Build the model from scratch
2020-06-24 09:41:42.120914: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2020-06-24 09:41:42.121145: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 4. Tune using inter_op_parallelism_threads for best performance.
Train on 54000 samples, validate on 6000 samples
Epoch 1/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.9134 - accuracy: 0.7435
Epoch 00001: loss improved from inf to 0.88658, saving model to mymodel.h5
54000/54000 [==============================] - 2s 44us/sample - loss: 0.8866 - accuracy: 0.7507 - val_loss: 0.3239 - val_accuracy: 0.9155
Epoch 2/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.3818 - accuracy: 0.8910
Epoch 00002: loss improved from 0.88658 to 0.37801, saving model to mymodel.h5
54000/54000 [==============================] - 1s 27us/sample - loss: 0.3780 - accuracy: 0.8920 - val_loss: 0.2441 - val_accuracy: 0.9340
Epoch 3/5
52000/54000 [===========================>..] - ETA: 0s - loss: 0.3066 - accuracy: 0.9125
Epoch 00003: loss improved from 0.37801 to 0.30647, saving model to mymodel.h5
54000/54000 [==============================] - 1s 24us/sample - loss: 0.3065 - accuracy: 0.9125 - val_loss: 0.2033 - val_accuracy: 0.9455
Epoch 4/5
52000/54000 [===========================>..] - ETA: 0s - loss: 0.2621 - accuracy: 0.9250
Epoch 00004: loss improved from 0.30647 to 0.26205, saving model to mymodel.h5
54000/54000 [==============================] - 1s 27us/sample - loss: 0.2620 - accuracy: 0.9245 - val_loss: 0.1770 - val_accuracy: 0.9540
Epoch 5/5
53000/54000 [============================>.] - ETA: 0s - loss: 0.2324 - accuracy: 0.9341
Epoch 00005: loss improved from 0.26205 to 0.23274, saving model to mymodel.h5
54000/54000 [==============================] - 1s 26us/sample - loss: 0.2327 - accuracy: 0.9341 - val_loss: 0.1583 - val_accuracy: 0.9595
10000/1 - 1s - loss: 0.1237 - accuracy: 0.9462

Process finished with exit code 0

When the script is executed again in the same directory, the model is loaded from the checkpoint file and training continues from where it left off. The output looks like the following:
Using TensorFlow backend.
Resuming
2020-06-24 10:11:01.443935: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2020-06-24 10:11:01.445295: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 4. Tune using inter_op_parallelism_threads for best performance.
Train on 54000 samples, validate on 6000 samples
Epoch 1/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.2098 - accuracy: 0.9394
Epoch 00001: loss improved from inf to 0.20846, saving model to mymodel.h5
54000/54000 [==============================] - 2s 38us/sample - loss: 0.2085 - accuracy: 0.9398 - val_loss: 0.1432 - val_accuracy: 0.9615
Epoch 2/5
53000/54000 [============================>.] - ETA: 0s - loss: 0.1888 - accuracy: 0.9464
Epoch 00002: loss improved from 0.20846 to 0.18883, saving model to mymodel.h5
54000/54000 [==============================] - 1s 25us/sample - loss: 0.1888 - accuracy: 0.9464 - val_loss: 0.1319 - val_accuracy: 0.9667
Epoch 3/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.1723 - accuracy: 0.9505
Epoch 00003: loss improved from 0.18883 to 0.17294, saving model to mymodel.h5
54000/54000 [==============================] - 1s 27us/sample - loss: 0.1729 - accuracy: 0.9503 - val_loss: 0.1226 - val_accuracy: 0.9672
Epoch 4/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.1602 - accuracy: 0.9532
Epoch 00004: loss improved from 0.17294 to 0.15976, saving model to mymodel.h5
54000/54000 [==============================] - 1s 25us/sample - loss: 0.1598 - accuracy: 0.9535 - val_loss: 0.1155 - val_accuracy: 0.9705
Epoch 5/5
52000/54000 [===========================>..] - ETA: 0s - loss: 0.1500 - accuracy: 0.9570
Epoch 00005: loss improved from 0.15976 to 0.14921, saving model to mymodel.h5
54000/54000 [==============================] - 2s 28us/sample - loss: 0.1492 - accuracy: 0.9574 - val_loss: 0.1088 - val_accuracy: 0.9710
10000/1 - 1s - loss: 0.0827 - accuracy: 0.9642

Process finished with exit code 0
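Once training has finished, the saved mymodel.h5 checkpoint can also be reloaded later for evaluation or prediction without retraining. A minimal sketch, assuming the file was produced by the script above:

import tensorflow as tf

# Reload the checkpointed model saved by the training script.
model = tf.keras.models.load_model("mymodel.h5")

# Prepare the test data the same way as in the training script.
mnist = tf.keras.datasets.mnist
(_, _), (x_test, y_test) = mnist.load_data()
x_test = x_test / 255.0

model.evaluate(x_test, y_test, verbose=2)
print(model.predict(x_test[:1]))   # class probabilities for the first test image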

-- Zhiwei - 24 Jun. 2020

Deep Learning Model Parallelization Across Multiple GPUs

Parallelizing a deep learning model across multiple GPUs can improve training speed. It can also help with out-of-memory problems by distributing the training dataset.

Here is an example:
import tensorflow_datasets as tfds
import tensorflow as tf
tfds.disable_progress_bar()

import os

print(tf.__version__)
os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3'
datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
print(f'datasets:\n {datasets}')
print(f'info:\n {info}')

mnist_train, mnist_test = datasets['train'], datasets['test']

print(mnist_train)
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# You can also do info.splits.total_num_examples to get the total
# number of examples in the dataset.

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

BUFFER_SIZE = 10000

print(f'strategy.num_replicas_in_sync: {strategy.num_replicas_in_sync}')

BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

print(f'batch size: {BATCH_SIZE}')

# Pixel values, which are 0-255, have to be normalized to the 0-1 range. Define this scaling in a function.
def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255

    return image, label

#Apply this function to the training and test data, shuffle the training data, and batch it for training. Notice we are also keeping an in-memory cache of the training data to improve performance.

train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)


# Create and compile the Keras model in the context of strategy.scope().
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1000, activation='relu'),
        tf.keras.layers.Dense(1000, activation='relu'),
        tf.keras.layers.Dense(10)
    ])

    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])

model.fit(train_dataset, epochs=5)

eval_loss, eval_acc = model.evaluate(eval_dataset)

print('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))
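By default, MirroredStrategy mirrors the model across every GPU that TensorFlow can see. If you only need a subset of devices (instead of, or in addition to, setting CUDA_VISIBLE_DEVICES as above), the strategy can be limited explicitly. A minimal sketch:

import tensorflow as tf

# Sketch: mirror the model across only the first two GPUs instead of all visible devices.
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))   # expect 2

# Build and compile the model inside strategy.scope() exactly as in the example above.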

GPU-Enabled Deep Learning Model Batch Job

Interactive jobs can only run for up to 48 hours on Shamu. To run your job for more than 48 hours, you need to submit your model training as a batch job to the cluster.

Here is the command to submit a batch job on a login node:

sbatch your-job-script-file

In the job script, you need to request GPU resources with "#SBATCH --gres=gpu:k80:1", where "1" is the number of GPU devices you need. The number can be up to 8 on a K80 GPU node and up to 4 on a V100 node.

Here is an example of a job script:

#!/bin/bash
#SBATCH --job-name=test_k80_gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:k80:1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your-email-address

. /etc/profile.d/modules.sh
module load cuda90/toolkit/9.0.176
module load anaconda3
conda activate tf-gpu

srun python mnist-multi-gpus.py

-- Zhiwei - 26 Jun. 2020