The checkpoint feature saves model progress during training, so training can resume where it left off instead of starting from scratch if something goes wrong mid-run. The example below solves the MNIST handwritten digit classification problem with a checkpoint and restart feature. The training dataset is included in the Keras package and can be loaded by calling the mnist.load_data() function. Executing the following lines of code in a Jupyter notebook environment displays the image at index 130 in the dataset:

import tensorflow as tf
import matplotlib.pyplot as plt

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
plt.imshow(x_train[130])

The MNIST handwritten digit classification model with a checkpoint and restart feature:

import tensorflow as tf
from keras.callbacks import ModelCheckpoint
import os.path
from os import path

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

filename = "mymodel.h5"

# Check whether a checkpoint file exists. If it does, load the model and skip building it.
if path.isfile(filename):
    print("Resuming")
    model = tf.keras.models.load_model(filename)
else:
    print('Build the model from scratch')

    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Save the model whenever the monitored training loss improves.
checkpoint = ModelCheckpoint(filename, monitor='loss', verbose=1, save_best_only=True, mode='min')

model.fit(x_train, y_train, epochs=5, batch_size=1000, validation_split=0.1, callbacks=[checkpoint])

model.evaluate(x_test, y_test, verbose=2)

When the script is run for the first time, it builds the model from scratch because there is no checkpoint file yet. The output looks like the following:

Using TensorFlow backend.
Build the model from scratch
2020-06-24 09:41:42.120914: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2020-06-24 09:41:42.121145: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 4. Tune using inter_op_parallelism_threads for best performance.
Train on 54000 samples, validate on 6000 samples
Epoch 1/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.9134 - accuracy: 0.7435
Epoch 00001: loss improved from inf to 0.88658, saving model to mymodel.h5
54000/54000 [==============================] - 2s 44us/sample - loss: 0.8866 - accuracy: 0.7507 - val_loss: 0.3239 - val_accuracy: 0.9155
Epoch 2/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.3818 - accuracy: 0.8910
Epoch 00002: loss improved from 0.88658 to 0.37801, saving model to mymodel.h5
54000/54000 [==============================] - 1s 27us/sample - loss: 0.3780 - accuracy: 0.8920 - val_loss: 0.2441 - val_accuracy: 0.9340
Epoch 3/5
52000/54000 [===========================>..] - ETA: 0s - loss: 0.3066 - accuracy: 0.9125
Epoch 00003: loss improved from 0.37801 to 0.30647, saving model to mymodel.h5
54000/54000 [==============================] - 1s 24us/sample - loss: 0.3065 - accuracy: 0.9125 - val_loss: 0.2033 - val_accuracy: 0.9455
Epoch 4/5
52000/54000 [===========================>..] - ETA: 0s - loss: 0.2621 - accuracy: 0.9250
Epoch 00004: loss improved from 0.30647 to 0.26205, saving model to mymodel.h5
54000/54000 [==============================] - 1s 27us/sample - loss: 0.2620 - accuracy: 0.9245 - val_loss: 0.1770 - val_accuracy: 0.9540
Epoch 5/5
53000/54000 [============================>.] - ETA: 0s - loss: 0.2324 - accuracy: 0.9341
Epoch 00005: loss improved from 0.26205 to 0.23274, saving model to mymodel.h5
54000/54000 [==============================] - 1s 26us/sample - loss: 0.2327 - accuracy: 0.9341 - val_loss: 0.1583 - val_accuracy: 0.9595
10000/1 - 1s - loss: 0.1237 - accuracy: 0.9462

Process finished with exit code 0

When the script is executed again in the same directory, the model is loaded from the checkpoint file and training continues from where it left off. The output looks like the following:

Using TensorFlow backend.
Resuming
2020-06-24 10:11:01.443935: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2020-06-24 10:11:01.445295: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 4. Tune using inter_op_parallelism_threads for best performance.
Train on 54000 samples, validate on 6000 samples
Epoch 1/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.2098 - accuracy: 0.9394
Epoch 00001: loss improved from inf to 0.20846, saving model to mymodel.h5
54000/54000 [==============================] - 2s 38us/sample - loss: 0.2085 - accuracy: 0.9398 - val_loss: 0.1432 - val_accuracy: 0.9615
Epoch 2/5
53000/54000 [============================>.] - ETA: 0s - loss: 0.1888 - accuracy: 0.9464
Epoch 00002: loss improved from 0.20846 to 0.18883, saving model to mymodel.h5
54000/54000 [==============================] - 1s 25us/sample - loss: 0.1888 - accuracy: 0.9464 - val_loss: 0.1319 - val_accuracy: 0.9667
Epoch 3/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.1723 - accuracy: 0.9505
Epoch 00003: loss improved from 0.18883 to 0.17294, saving model to mymodel.h5
54000/54000 [==============================] - 1s 27us/sample - loss: 0.1729 - accuracy: 0.9503 - val_loss: 0.1226 - val_accuracy: 0.9672
Epoch 4/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.1602 - accuracy: 0.9532
Epoch 00004: loss improved from 0.17294 to 0.15976, saving model to mymodel.h5
54000/54000 [==============================] - 1s 25us/sample - loss: 0.1598 - accuracy: 0.9535 - val_loss: 0.1155 - val_accuracy: 0.9705
Epoch 5/5
52000/54000 [===========================>..] - ETA: 0s - loss: 0.1500 - accuracy: 0.9570
Epoch 00005: loss improved from 0.15976 to 0.14921, saving model to mymodel.h5
54000/54000 [==============================] - 2s 28us/sample - loss: 0.1492 - accuracy: 0.9574 - val_loss: 0.1088 - val_accuracy: 0.9710
10000/1 - 1s - loss: 0.0827 - accuracy: 0.9642

Process finished with exit code 0
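
Note that, as written above, the resumed run restarts the epoch counter at 1 even though the weights continue from the checkpoint. If you also want the epoch numbering to continue across restarts, one option is to record the last completed epoch in a small side file and pass it to model.fit through its initial_epoch argument. Below is a minimal sketch of that idea; it assumes the model, x_train and y_train from the script above, and the names epoch_file, TOTAL_EPOCHS and record_epoch are introduced here purely for illustration.

import json
import os.path
from keras.callbacks import ModelCheckpoint, LambdaCallback

filename = "mymodel.h5"
epoch_file = "last_epoch.json"   # hypothetical side file, not part of the original script
TOTAL_EPOCHS = 10

# Work out where the previous run stopped (0 if this is the first run).
initial_epoch = 0
if os.path.isfile(epoch_file):
    with open(epoch_file) as f:
        initial_epoch = json.load(f)["epoch"] + 1

# Record the index of each completed epoch so the next run can continue counting.
record_epoch = LambdaCallback(
    on_epoch_end=lambda epoch, logs: json.dump({"epoch": epoch}, open(epoch_file, "w")))

checkpoint = ModelCheckpoint(filename, monitor='loss', verbose=1, save_best_only=True, mode='min')

model.fit(x_train, y_train, epochs=TOTAL_EPOCHS, batch_size=1000, validation_split=0.1,
          initial_epoch=initial_epoch, callbacks=[checkpoint, record_epoch])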

-- Zhiwei - 24 Jun. 2020

Deep Learning Model Parallelization Across Multiple GPUs

Parallelizing a deep learning model across multiple GPUs can improve training speed. It can also help with out-of-memory problems by distributing the training data across devices. Here is an example:

import tensorflow_datasets as tfds
import tensorflow as tf
tfds.disable_progress_bar()

import os

print(tf.__version__)
os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3'

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
print(f'datasets:\n {datasets}')
print(f'info:\n {info}')

mnist_train, mnist_test = datasets['train'], datasets['test']

print(mnist_train)
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# You can also do info.splits.total_num_examples to get the total
# number of examples in the dataset.

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

BUFFER_SIZE = 10000

print(f'strategy.num_replicas_in_sync: {strategy.num_replicas_in_sync}')

BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

print(f'batch size: {BATCH_SIZE}')

#Pixel values, which are 0-255, have to be normalized to the 0-1 range. Define this scale in a function.
def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255

    return image, label

#Apply this function to the training and test data, shuffle the training data, and batch it for training. Notice we are also keeping an in-memory cache of the training data to improve performance.

train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)


#Create and compile the Keras model in the context of strategy.scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1000, activation='relu'),
        tf.keras.layers.Dense(1000, activation='relu'),
        tf.keras.layers.Dense(10)
    ])

    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])

model.fit(train_dataset, epochs=5)

eval_loss, eval_acc = model.evaluate(eval_dataset)

print('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))
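
The checkpoint-and-restart idea from the first example can be combined with MirroredStrategy as well. The sketch below is only an illustration: it assumes the strategy and train_dataset defined above, the file name multi_gpu_model.h5 is made up for this example, and the resume decision works exactly as in the single-GPU script.

import os.path
import tensorflow as tf

ckpt_file = "multi_gpu_model.h5"   # illustrative file name, not part of the example above

with strategy.scope():
    if os.path.isfile(ckpt_file):
        # Resume: load_model restores both the weights and the compile settings.
        model = tf.keras.models.load_model(ckpt_file)
    else:
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(10)
        ])
        model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                      optimizer=tf.keras.optimizers.Adam(),
                      metrics=['accuracy'])

checkpoint = tf.keras.callbacks.ModelCheckpoint(ckpt_file, monitor='loss',
                                                save_best_only=True, mode='min')
model.fit(train_dataset, epochs=5, callbacks=[checkpoint])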

GPU Enabled Deep Learning Model Batch Job

Interactive jobs can run for at most 48 hours on Shamu. To train a model for longer than 48 hours, submit the training as a batch job to the cluster. On a login node, submit a batch job with:

sbatch your-job-script-file

In the job script, request GPU resources with "#SBATCH --gres=gpu:k80:1", where "1" is the number of GPU devices you need. The number can be up to 8 on a K80 GPU node and up to 4 on a V100 node. Here is an example of a job script:

#!/bin/bash
#SBATCH --job-name=test_k80_gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:k80:1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your-email-address

. /etc/profile.d/modules.sh
module load cuda90/toolkit/9.0.176
module load anaconda3
conda activate tf-gpu

srun python mnist-multi-gpus.py
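
Before submitting a long training run, it can also help to confirm that TensorFlow actually sees the GPUs the scheduler allocated. The short script below is a sketch (the name check_gpus.py is hypothetical); it could be run with the same job script by pointing the srun line at it.

# check_gpus.py -- prints the GPUs visible to TensorFlow
import tensorflow as tf

print("TensorFlow version:", tf.__version__)

gpus = tf.config.experimental.list_physical_devices('GPU')
print("Number of visible GPUs:", len(gpus))
for gpu in gpus:
    print(gpu)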

-- Zhiwei - 12 Jul 2020