The checkpoint feature saves model progress during training, so if something interrupts a run, training can resume where it left off instead of starting from scratch. Here is example code with a checkpoint-and-restart feature.
import tensorflow as tf
from keras.callbacks import ModelCheckpoint
from os import path

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

filename = "mymodel.h5"

# Check whether a checkpoint file exists. If it does, load the model and
# skip building it from scratch.
if path.isfile(filename):
    print("Resuming")
    model = tf.keras.models.load_model(filename)
else:
    print('Build the model from scratch')

    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Save the model to the checkpoint file whenever the monitored loss improves.
checkpoint = ModelCheckpoint(filename, monitor='loss', verbose=1,
                             save_best_only=True, mode='min')

model.fit(x_train, y_train, epochs=5, batch_size=1000,
          validation_split=0.1, callbacks=[checkpoint])

model.evaluate(x_test, y_test, verbose=2)
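
In this script, ModelCheckpoint monitors the training loss, so mymodel.h5 is overwritten whenever the training loss improves. Because the fit call already holds out 10% of the data with validation_split=0.1, a common variant is to monitor the validation loss instead, so the saved checkpoint is the model that generalizes best. A minimal sketch of that variant (not part of the original script) follows:

# Sketch: checkpoint on validation loss instead of training loss.
# Assumes the same model, filename, and training data as the script above.
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1,
                             save_best_only=True, mode='min')

model.fit(x_train, y_train, epochs=5, batch_size=1000,
          validation_split=0.1, callbacks=[checkpoint])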

When the script runs for the first time, it builds the model from scratch because no checkpoint file exists yet. The output looks like the following:
Using TensorFlow backend.
Build the model from scratch
2020-06-24 09:41:42.120914: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2020-06-24 09:41:42.121145: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 4. Tune using inter_op_parallelism_threads for best performance.
Train on 54000 samples, validate on 6000 samples
Epoch 1/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.9134 - accuracy: 0.7435
Epoch 00001: loss improved from inf to 0.88658, saving model to mymodel.h5
54000/54000 [==============================] - 2s 44us/sample - loss: 0.8866 - accuracy: 0.7507 - val_loss: 0.3239 - val_accuracy: 0.9155
Epoch 2/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.3818 - accuracy: 0.8910
Epoch 00002: loss improved from 0.88658 to 0.37801, saving model to mymodel.h5
54000/54000 [==============================] - 1s 27us/sample - loss: 0.3780 - accuracy: 0.8920 - val_loss: 0.2441 - val_accuracy: 0.9340
Epoch 3/5
52000/54000 [===========================>..] - ETA: 0s - loss: 0.3066 - accuracy: 0.9125
Epoch 00003: loss improved from 0.37801 to 0.30647, saving model to mymodel.h5
54000/54000 [==============================] - 1s 24us/sample - loss: 0.3065 - accuracy: 0.9125 - val_loss: 0.2033 - val_accuracy: 0.9455
Epoch 4/5
52000/54000 [===========================>..] - ETA: 0s - loss: 0.2621 - accuracy: 0.9250
Epoch 00004: loss improved from 0.30647 to 0.26205, saving model to mymodel.h5
54000/54000 [==============================] - 1s 27us/sample - loss: 0.2620 - accuracy: 0.9245 - val_loss: 0.1770 - val_accuracy: 0.9540
Epoch 5/5
53000/54000 [============================>.] - ETA: 0s - loss: 0.2324 - accuracy: 0.9341
Epoch 00005: loss improved from 0.26205 to 0.23274, saving model to mymodel.h5
54000/54000 [==============================] - 1s 26us/sample - loss: 0.2327 - accuracy: 0.9341 - val_loss: 0.1583 - val_accuracy: 0.9595
10000/1 - 1s - loss: 0.1237 - accuracy: 0.9462

Process finished with exit code 0
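
A quick way to confirm that mymodel.h5 is a complete, compiled model is to load it in a fresh session and evaluate it again. This is a minimal sketch, assuming the file sits in the current working directory; it is not part of the original script:

import tensorflow as tf

# Sketch: reload the checkpoint written by the first run and evaluate it.
(_, _), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_test = x_test / 255.0

restored = tf.keras.models.load_model("mymodel.h5")
restored.evaluate(x_test, y_test, verbose=2)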

When the script is executed again in the same directory, the model is loaded from the checkpoint file and training continues from where it left off. The output looks like the following:
Using TensorFlow backend.
Resuming
2020-06-24 10:11:01.443935: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2020-06-24 10:11:01.445295: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 4. Tune using inter_op_parallelism_threads for best performance.
Train on 54000 samples, validate on 6000 samples
Epoch 1/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.2098 - accuracy: 0.9394
Epoch 00001: loss improved from inf to 0.20846, saving model to mymodel.h5
54000/54000 [==============================] - 2s 38us/sample - loss: 0.2085 - accuracy: 0.9398 - val_loss: 0.1432 - val_accuracy: 0.9615
Epoch 2/5
53000/54000 [============================>.] - ETA: 0s - loss: 0.1888 - accuracy: 0.9464
Epoch 00002: loss improved from 0.20846 to 0.18883, saving model to mymodel.h5
54000/54000 [==============================] - 1s 25us/sample - loss: 0.1888 - accuracy: 0.9464 - val_loss: 0.1319 - val_accuracy: 0.9667
Epoch 3/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.1723 - accuracy: 0.9505
Epoch 00003: loss improved from 0.18883 to 0.17294, saving model to mymodel.h5
54000/54000 [==============================] - 1s 27us/sample - loss: 0.1729 - accuracy: 0.9503 - val_loss: 0.1226 - val_accuracy: 0.9672
Epoch 4/5
51000/54000 [===========================>..] - ETA: 0s - loss: 0.1602 - accuracy: 0.9532
Epoch 00004: loss improved from 0.17294 to 0.15976, saving model to mymodel.h5
54000/54000 [==============================] - 1s 25us/sample - loss: 0.1598 - accuracy: 0.9535 - val_loss: 0.1155 - val_accuracy: 0.9705
Epoch 5/5
52000/54000 [===========================>..] - ETA: 0s - loss: 0.1500 - accuracy: 0.9570
Epoch 00005: loss improved from 0.15976 to 0.14921, saving model to mymodel.h5
54000/54000 [==============================] - 2s 28us/sample - loss: 0.1492 - accuracy: 0.9574 - val_loss: 0.1088 - val_accuracy: 0.9710
10000/1 - 1s - loss: 0.0827 - accuracy: 0.9642

Process finished with exit code 0
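
Note that the resumed run starts counting again at Epoch 1/5: load_model restores the weights and optimizer state, but Keras does not know how many epochs have already been completed. If a continuous epoch count is needed, one option is to track the completed epochs in a small side file and pass it to fit as initial_epoch. The sketch below assumes a hypothetical epoch.txt file and that each run finishes all of its epochs; it is not part of the original script:

# Sketch: continue the epoch numbering across restarts.
total_epochs = 10
start_epoch = 0
if path.isfile("epoch.txt"):                 # hypothetical side file
    with open("epoch.txt") as f:
        start_epoch = int(f.read())

model.fit(x_train, y_train, epochs=total_epochs, batch_size=1000,
          validation_split=0.1, initial_epoch=start_epoch,
          callbacks=[checkpoint])

with open("epoch.txt", "w") as f:            # record progress for the next run
    f.write(str(total_epochs))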

-- Zhiwei - 12 Jul 2020
