TensorFlow CPU Version

First, grab a compute node with srun and start a Python Virtualenv environment:
[abc123@login-0-0 ~]$ srun -n 80 -N 1 --time=48:00:00 --pty bash

[abc123@compute045 ~]$ virtualenv --system-site-packages tensorflow-cpu
New python executable in tensorflow-cpu/bin/python
Installing Setuptools..............................................................................................................................................................................................................................done.
Installing Pip.....................................................................................................................................................................................................................................................................................................................................done.

Activate the newly created Python virtualenv:
[abc123@compute045 ~]$ source ~/tensorflow-cpu/bin/activate
(tensorflow-cpu)[abc123@compute045 ~]$

Now we need to upgrade some default packages which are required by TensorFlow:
(tensorflow-cpu)[abc123@compute045 ~]$ pip install --upgrade setuptools
(tensorflow-cpu)[abc123@compute045 ~]$ pip install --upgrade enum34 futures

Now lets download, compile, and install TensorFlow into your Python virtualenv:
(tensorflow-cpu)[abc123@compute045 ~]$ pip install --upgrade tensorflow

Once it is installed, let's perform a quick test to verify the CPU version is working (you can disregard the warning about AVX2 and FMA extensions):
(tensorflow-cpu)[abc123@compute045 ~]$ python
Python 2.7.5 (default, Aug 4 2017, 00:39:18)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-05-09 10:29:54.945974: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
>>> print(sess.run(hello))
Hello, TensorFlow!

Once TensorFlow is installed via the above instructions, you do not need to repeat the process. All you have to do is source the "activate" file the next time you want to run TensorFlow. To disable the above warning message, add this to your Python code:
import os
import tensorflow as tf

TensorFlow GPU Version

Grab a GPU node with qlogin (note - you first need permission to use the GPU nodes. If you require access to the GPU nodes send an email to rcsg@utsa.edu to request access):

[abc123@login-0-0 ~]$ srun -p gpu --gres=gpu:k80:4 --time=12:00:00 --pty bash

Load the modules required for TensorFlow (CUDA, cudNN, etc..)

[abc123@gpu02 ~]$ module load python/3.6.1 cuda90/toolkit cudnn/7.0

Enter the python shell:

[abc123@gpu02 ~]$ python3
Python 3.6.1 (default, May 16 2017, 15:27:50)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.

Enter the following Hello World test program:

>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-02-27 08:14:55.501487: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-02-27 08:14:56.177280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:06:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:56.412525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 1 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:07:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:56.654330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 2 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:0a:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:56.902953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 3 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:0b:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:57.180367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 4 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:86:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:57.465869: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 5 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:87:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:57.768378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 6 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:8a:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:58.047947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 7 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:8b:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:58.052993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Device peer to peer matrix
2018-02-27 08:14:58.053262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1126] DMA: 0 1 2 3 4 5 6 7
2018-02-27 08:14:58.053276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 0: Y Y Y Y N N N N
2018-02-27 08:14:58.053284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 1: Y Y Y Y N N N N
2018-02-27 08:14:58.053292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 2: Y Y Y Y N N N N
2018-02-27 08:14:58.053300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 3: Y Y Y Y N N N N
2018-02-27 08:14:58.053308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 4: N N N N Y Y Y Y
2018-02-27 08:14:58.053321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 5: N N N N Y Y Y Y
2018-02-27 08:14:58.053328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 6: N N N N Y Y Y Y
2018-02-27 08:14:58.053336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 7: N N N N Y Y Y Y
2018-02-27 08:14:58.053357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:06:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053379: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:07:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:0a:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:0b:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:4) -> (device: 4, name: Tesla K80, pci bus id: 0000:86:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:5) -> (device: 5, name: Tesla K80, pci bus id: 0000:87:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:6) -> (device: 6, name: Tesla K80, pci bus id: 0000:8a:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:7) -> (device: 7, name: Tesla K80, pci bus id: 0000:8b:00.0, compute capability: 3.7)
>>> print(sess.run(hello))
Hello, TensorFlow!

If you want to disable the above warning messages, add this to your code:

import os
import tensorflow as tf

Important Note: By default, TensorFlow takes all GPU resources on a GPU node (8 GPU units on each GPU nodes on Shamu). It is required to change your code to avoid hogging the GPU node. For example, you need to put in the followings in your code if only GPU unit 1 and 3 are needed:

import os​

To check the availability of the GPU units on a node, use the following command:

[abc123@gpu02 ~]$ nvidia-smi

Deep Learning Model Batch Job

Interactive jobs can only run less than 48 hours on Shamu. In order to run your job for more than 48 hours, you need to submit your model training as a batch job to the cluster. Here is the command to submit a batch job on a login node: sbatch your-job-script-file In the job script, you need to specify the GPU resources by "#SBATCH --gres=gpu:k80:1", where "1" is the number of GPU devices you need. The number can be up to 8 on a K80 GPU node and can be up to 4 on a V100 node. Here is an example of a job script:
#SBATCH --job-name=test_k80_gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:k80:1
#SBATCH --time=02:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your-email-address

. /etc/profile.d/modules.sh
module load cuda90/toolkit/9.0.176
module load anaconda3
conda activate tf-gpu

srun python mnist-multi-gpus.py

Parallelize Deep Learning Models Across Multiple GPU Devices

Deep Learning models written in Tensorflow can automatically take advantage of a GPU device on a compute node, as long as the Tensorflow package is GPU enabled. We suggest using Anaconda to create a virtual environment and install tensorflow-gpu package in the virtual environment. First of all, you need to log onto one of the GPU node by typing the following command on the login node:
srun -p gpu --gres=gpu:k80:4 --pty bash

"-p gpu" means that you would like to grab a node in the gpu partition, which includes two nodes with eight K80 GPU devices each. The "GPU" can be changed to "gpu-v100" if you would like to use one of the nodes in gpu-v100 partition, which has one node with four V100 GPU devices. Or use "gpu-v100-shared" which also has one node with four V100 GPU devices, but up to four users can log onto the node at the same time. "--gres=gpu:k80:4" indicates that four GPU devices will be used. That number can be up to eight for gpu partition, and four for both gpu-v100 and gpu-v100-shared. Once you log onto a GPU node, please refer to https://hpcsupport.utsa.edu/foswiki/bin/view/Main/PythonVmsInAnaconda for activating Anaconda and creating a virtual environment. The following command will install tensorflow-gpu package in your virtual environment.
module load cuda90/toolkit/9.0.176
conda install -c anaconda tensorflow-gpu

The coda90 module is needed for both installing and using the package.
Note: By default, tensorflow-gpu will grab all available GPU devices on the node, even only one is used in your model code. Make sure to add the following lines in your code if one GPU is needed:
import os​

"1" is the GPU device ID, which starts from 0 to n-1, where n is the GPU device's total number on the node. You can decide which one to use by running the command below, which shows the available device status and IDs
[abc123@gpu02 ~]$ nvidia-smi

If you want to run a DP model across multiple devices, make sure you grab enough GPU devices (say, four GPU devices are needed) when logging onto the node with --gres=gpu:k80:4. And in your code, put in the following lines to have the model to access for GPU devices:
import os​

Tensorflow comes with a build-in mechanism to train a Deep Learning model across multiple GPU devices on the same computer or node. Here is a sample Deep Leaning model written in Keras. In this example, the learning model is duplicated to all the GPU devices. Each device runs the same model in the training stage with a portion of the training data set. The training speed is significantly improved due to parallelization.
import tensorflow_datasets as tfds
import tensorflow as tf
import os

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)

mnist_train, mnist_test = datasets['train'], datasets['test']

# Ues the strategy for data parallelization across multi-gpu
strategy = tf.distribute.MirroredStrategy()

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples


print(f'strategy.num_replicas_in_sync: {strategy.num_replicas_in_sync}')

BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

def scale(image, label):
image = tf.cast(image, tf.float32)
image /= 255
return image, label

#Apply this function to the training and test data, shuffle the training data, and batch it for training. Notice we are also keeping an in-memory cache of the training data to improve performance.

train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)

#Within the distributed strategy scope, the data in a batch is evenly distrinuted across the GPUs.
with strategy.scope():
model = tf.keras.Sequential([
tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(256, activation='relu'),
tf.keras.layers.Dense(256, activation='relu'),


model.fit(train_dataset, epochs=10)

eval_loss, eval_acc = model.evaluate(eval_dataset)
print('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))

Run the model on one of the nodes in the gpu partition, and time it as following:
>time python mnist-multi-gpus.py 
235/235 [==============================] - 2s 10ms/step - accuracy: 0.9197 - loss: 0.2582
Epoch 2/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9791 - loss: 0.0685
Epoch 3/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9859 - loss: 0.0459
Epoch 4/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9889 - loss: 0.0349
Epoch 5/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9917 - loss: 0.0242
Epoch 6/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9933 - loss: 0.0201
Epoch 7/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9952 - loss: 0.0137
Epoch 8/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9964 - loss: 0.0112
Epoch 9/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9960 - loss: 0.0121
Epoch 10/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9969 - loss: 0.0093
Epoch 11/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9971 - loss: 0.0093
Epoch 12/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9977 - loss: 0.0064
Epoch 13/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9978 - loss: 0.0067
Epoch 14/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9979 - loss: 0.0065
Epoch 15/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9976 - loss: 0.0069
Epoch 16/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9980 - loss: 0.0074
Epoch 17/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9973 - loss: 0.0078
Epoch 18/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9980 - loss: 0.0060
Epoch 19/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9985 - loss: 0.0045
Epoch 20/20
235/235 [==============================] - 1s 6ms/step - accuracy: 0.9982 - loss: 0.0053
40/40 [==============================] - 0s 12ms/step - accuracy: 0.9866 - loss: 0.0625
Eval loss: 0.06250440329313278, Eval Accuracy: 0.9865999817848206

real 0m46.234s
user 1m28.432s
sys 0m22.238s

If change the device line to the following:
import os​

And run the model with time:
>time python mnist-multi-gpus.py 
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9483 - loss: 0.1615
Epoch 2/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9826 - loss: 0.0571
Epoch 3/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9883 - loss: 0.0384
Epoch 4/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9912 - loss: 0.0294
Epoch 5/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9930 - loss: 0.0226
Epoch 6/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9947 - loss: 0.0175
Epoch 7/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9950 - loss: 0.0158
Epoch 8/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9962 - loss: 0.0120
Epoch 9/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9967 - loss: 0.0119
Epoch 10/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9965 - loss: 0.0109
Epoch 11/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9966 - loss: 0.0118
Epoch 12/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9977 - loss: 0.0071
Epoch 13/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9977 - loss: 0.0085
Epoch 14/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9980 - loss: 0.0074
Epoch 15/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9979 - loss: 0.0076
Epoch 16/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9980 - loss: 0.0075
Epoch 17/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9979 - loss: 0.0077
Epoch 18/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9982 - loss: 0.0062
Epoch 19/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9976 - loss: 0.0090
Epoch 20/20
938/938 [==============================] - 4s 4ms/step - accuracy: 0.9984 - loss: 0.0052
157/157 [==============================] - 0s 3ms/step - accuracy: 0.9875 - loss: 0.0757
Eval loss: 0.07569707930088043, Eval Accuracy: 0.987500011920929

real 1m27.720s
user 1m46.541s
sys 0m26.354s

The training speed with 4 GPU device is almost twice as faster as the speed trained with one GPU.

Parallelize Deep Learning Models on a Multiple-CPU node

GPU devices can speed up the training processes, as shown in the previous section. However, GPU-enabled compute-nodes may not be available due to limited resources. In this case, users can run a DP model on a common compute node, which usually has 40 to 80 cores on the Shamu cluster. A DP model written with Tensorflow or Keras automatically utilizes all available cores on the node to improve training speed. In our experiments, although there are no GPU devices on the common compute node, GPU enabled Tensorflow can make a DP model runs much faster than using non-GPU Tensorflow. Our tests show that a typical DP model (6 layers, 12800 nodes per layer) running on an 80-core CPU only compute node takes only twice as much time as it running on a K80 GPU device. More importantly, the memory space (300G+) on the CPU-based environment is much larger than it on a GPU device (11G max for the K80 GPU on Shamu), which allows the model with a large data set to run on CPU-based environment without the out of memory issues. The following table shows the benchmark results:

Model info One k80 (11441MB) Four k80 11441MB) 80 v-cores 291G RAM @ 2.50GHz
mnist 64x1000x1000 modle, 20 epochs 1m35.461s 0m45.451s 2m29.401s
mnist 64x10000x1000 modle, 20 epochs 3m2.699s 1m39.629s 5m33.840s
mnist 64x10000x10000 modle, 20 epochs 15m13.231s 9m3.970s 40m3.251s
mnist 64x10000x5000x10000 modle, 20 epochs 15m27.995s 9m8.889s 35m10.203s
mnist 64x50000x10000 modle, 1 epoch 3m50.561s Resource exhausted 28m43.907s
mnist 64x100000x10000 modle, 1 epochs ResourceExhausted ResourceExhausted 56m51.650s
mnist 64x100000x100000 modle, 1 epoch ResourceExhausted ResourceExhausted egmentation fault (core dumped)

Parallelize Deep Learning Models Memory Issues on GPU devices

There is limited memory resource on GPU devices. Running large models with large data set may run into memory-related issues. Typically, memory problems can be solved by the following approaches:

- Use data generators

- Reduce the batch size

- Apply the gradient accumulation approach

Use data generators

Large datasets are increasingly becoming part of our lives. Usually, a deep learning model needs to load the entire data set onto the GPU device before starting the training process. A large dataset that consumes too much valuable memory may make the training stop due to the limited memory resources. Data generator solves the problem by feeding only a portion of data to the model. The following example demonstrates how it works:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy as np
import tensorflow

# Num rows
num_rows = 5e8 # Five hundred million
batch_size = 250

# Load data
def generate_arrays_from_file(path, batchsize):
inputs = []
targets = []
batchcount = 0
while True:
with open(path) as f:
for line in f:
x,y = line.split(',')
batchcount += 1
if batchcount > batchsize:
X = np.array(inputs, dtype='float32')
y = np.array(targets, dtype='float32')
yield (X, y)
inputs = []
targets = []
batchcount = 0

# Create the model
model = Sequential()
model.add(Dense(16, input_dim=1, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='linear'))

# Compile the model

# Fit data to model
model.fit(generate_arrays_from_file('./five_hundred.csv', batch_size),
steps_per_epoch=num_rows / batch_size, epochs=10)

Here is another example of the data generator. unlike the example above where the data generator function fetches a batch of data from the file and feed the data to the model, the following example reloads the data into memory.

import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):
def __init__(self, data, labels, batch_size=100, shuffle=True):
self.batch_size = batch_size
self.labels = labels
self.data = data
self.shuffle = shuffle

def __len__(self):
return int(np.floor(len(self.data) / self.batch_size))

def __getitem__(self, index):
X = self.data[index*self.batch_size:(index+1)*self.batch_size]
y = self.labels[index*self.batch_size:(index+1)*self.batch_size]
return X, y

from my_classes import DataGenerator
import tensorflow as tf
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Generators
training_generator = DataGenerator(x_train, y_train)
validation_generator = DataGenerator(x_test, y_test)

# GPU info
print("\nTensorflow Version: " + tf.__version__)

model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(input_shape=(28, 28)),
tf.keras.layers.Dense(100, activation='relu'),
tf.keras.layers.Dense(200, activation='relu'),
tf.keras.layers.Dense(10, activation='softmax')
checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
model.fit(training_generator, epochs=5, callbacks=[cp_callback])
model.evaluate(validation_generator, verbose=2)

Reduce the batch size

Large batch sizes take more memory to compute the gradients for each batch. Reduce the size of the batch can save memory. In the following example, changing the batch_size from 512 to 128 will significantly reduce memory consumption. Please note: data parallelization across multiple GPU devices can also reduce the batch size on each device while maintaining the expected batch size overall. But it does not reduce the memory consumption based on our benchmarks. The cross-device synchronization may add more memory cost.

(x_train, y_train), (X_test, Y_test) = tf.keras.datasets.mnist.load_data()
X_train = x_train.reshape(x_train.shape[0], 28, 28,1)
y_train = tf.keras.utils.to_categorical(y_train,10)
X_test = X_test.reshape(X_test.shape[0], 28, 28,1)
Y_test = tf.keras.utils.to_categorical(Y_test,10)

batch_size = 512

train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_dataset = train_dataset.batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((X_test, Y_test))
test_dataset = test_dataset.batch(batch_size)

model.compile(optimizer=optimizer_grad_accumulattion, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_dataset, epochs = 10, steps_per_epoch = len(x_train)//batch_size,validation_data=test_dataset,validation_steps=1, callbacks=tbCallback)Data parallelization across multiple-gpu devices

Apply the gradient accumulation approach

The gradient accumulation approach can solve the problem by computing a mini-batch of data at a time on a GPU device, and apply the accumulated gradient after the whole batch is finished ON ONE GPU. The following example uses a python package named runai. The package does not compatible with the newest Python and Keras. Users need to create a virtual environment with Python 3.7 and standalone Keras-gpu v2.2.4, and pip installed runai.

import runai.ga.keras


optimizer = runai.ga.keras.optimizers.Adam(STEPS)


history = model.fit(x_train, y_train,
batch_size=BATCH_SIZE // STEPS,
validation_data=(x_test, y_test))

-- Zhiwei - 13 Jan 2021
Topic revision: r5 - 18 Jan 2021, AdminUser
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback