PyTorch is a Python-based machine learning framework that can run on both CPUs and GPUs. It can take advantage of GPUs on single nodes or multiple nodes using data parallelism and model parallelism.

Data parallelism involves splitting the data into smaller subsets and using multiple GPUs to process the subsets in parallel while using the same model. PyTorch supports data parallelism through two classes, namely DataParallel and DistributedDataParallel. The DataParallel class can be used when multiple GPUs are available on a single node, while the DistributedDataParallel class can be used with GPUs on either a single node or multiple nodes. The PyTorch documentation, however, recommends DistributedDataParallel for both single-node and multi-node training, as it reduces the overall training time.
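For illustration only, the two wrappers are applied roughly as in the sketch below. This is not a complete training script; in particular, the DistributedDataParallel case assumes each process has already joined a process group, which is covered in the Across Node Data Parallelism section later on this page.

import torch
import torch.nn as nn

def wrap_single_node(model):
    # DataParallel: one process drives all GPUs available on the node.
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model.to("cuda" if torch.cuda.is_available() else "cpu")

def wrap_distributed(model, local_rank):
    # DistributedDataParallel: one process per GPU, on one or more nodes.
    # Assumes torch.distributed.init_process_group() has already been called
    # in this process (see the Across Node Data Parallelism section).
    torch.cuda.set_device(local_rank)
    return nn.parallel.DistributedDataParallel(model.cuda(local_rank),
                                               device_ids=[local_rank])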
Model parallelism involves splitting the model into smaller segments such that each segment can run on a different GPU. This method can be used when a model is too large to fit on a single GPU. It creates dependencies between the GPUs: a segment on one GPU must wait for the output of a segment running on another GPU before it can proceed (i.e., the segments of the model are prevented from running in parallel). Therefore, this kind of parallelism should not be used for training models that fit on a single GPU.
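As a simplified sketch of this dependency (the layer sizes below are arbitrary and for illustration only; a full MNIST example appears in the Model Parallelism section later on this page), a model split across two GPUs might look like this:

import torch
import torch.nn as nn

class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # The first segment lives on GPU 0 and the second segment on GPU 1.
        self.part1 = nn.Linear(1024, 256).to('cuda:0')
        self.part2 = nn.Linear(256, 10).to('cuda:1')

    def forward(self, x):
        h = self.part1(x.to('cuda:0'))
        # GPU 1 cannot start until GPU 0's output has been copied over.
        return self.part2(h.to('cuda:1'))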
To run a PyTorch model on Arc, it is recommended to create a Python Virtual Environment (VE) with the PyTorch-related packages. You can use Anaconda to create a VE with Python 3.9.0 and install the required packages as shown below once you are connected to a compute node interactively (a sample command for getting interactive access to a node in the GPU queue is shown below):
$ srun -p gpu2v100 -N 1 -t 02:00:00 --pty bash
$ module load anaconda3
$ conda create --prefix=p39-torch python=3.9.0
After running the second command listed above, you will be prompted with a "y/n" question to accept the list of packages to be installed, as shown below; type "y" to proceed:
The following NEW packages will be INSTALLED:
_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
_openmp_mutex pkgs/main/linux-64::_openmp_mutex-4.5-1_gnu
ca-certificates pkgs/main/linux-64::ca-certificates-2022.3.29-h06a4308_1
certifi pkgs/main/linux-64::certifi-2021.10.8-py39h06a4308_2
ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.35.1-h7274673_9
libffi pkgs/main/linux-64::libffi-3.3-he6710b0_2
libgcc-ng pkgs/main/linux-64::libgcc-ng-9.3.0-h5101ec6_17
libgomp pkgs/main/linux-64::libgomp-9.3.0-h5101ec6_17
libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.3.0-hd4cf53a_17
ncurses pkgs/main/linux-64::ncurses-6.3-h7f8727e_2
openssl pkgs/main/linux-64::openssl-1.1.1n-h7f8727e_0
pip pkgs/main/linux-64::pip-21.2.4-py39h06a4308_0
python pkgs/main/linux-64::python-3.9.0-hdb3f193_2
readline pkgs/main/linux-64::readline-8.1.2-h7f8727e_1
setuptools pkgs/main/linux-64::setuptools-61.2.0-py39h06a4308_0
sqlite pkgs/main/linux-64::sqlite-3.38.2-hc218d9a_0
tk pkgs/main/linux-64::tk-8.6.11-h1ccaba5_0
tzdata pkgs/main/noarch::tzdata-2022a-hda174b7_0
wheel pkgs/main/noarch::wheel-0.37.1-pyhd3eb1b0_0
xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0
zlib pkgs/main/linux-64::zlib-1.2.12-h7f8727e_2
Proceed ([y]/n)? y
Once the download and installation of the packages is complete, you will see messages as shown below:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate /work/vrv207/ml/p39-torch
#
# To deactivate an active environment, use
#
# $ conda deactivate
The conda activate path is specific to your environment, so you will see a different path in place of the "/work/vrv207/ml/p39-torch" path shown above. You can copy the conda activate command shown on your screen during the installation process and proceed as follows:
$ conda activate /work/vrv207/ml/p39-torch
# Note: please replace the path in the command above with your own installation path
$ conda install -c pytorch pytorch torchvision
# Note: after running the command above, you will be prompted to accept or decline the download and installation of the PyTorch packages. Please accept by typing "y".
$ pip install Pillow==6.1
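Once the packages are installed, you can quickly confirm that PyTorch was built with CUDA support and can see the GPUs on the node (the exact version string printed will depend on what conda installed):
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"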
Data Parallelism
We can use Data Parallelism on a single node with multiple GPUs or across multiple nodes, each having one or more GPUs.
Single Node Data Parallelism
Here is an example of MNIST data classification that utilizes multiple GPUs on a single node to improve performance. In the example, we use the nn.DataParallel wrapper to implement Data Parallelism when multiple GPUs are detected (see the "if torch.cuda.device_count() > 1:" block in the code below). The same code can be used in a single-GPU environment without any changes, since it automatically detects the available GPU settings.
import collections
import collections.abc
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
n_epochs = 10
batch_size_train = 64
batch_size_test = 1000
learning_rate = 0.01
momentum = 0.5
log_interval = 10
random_seed = 1
torch.backends.cudnn.enabled = False
torch.manual_seed(random_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('files/', train=True, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.1307,), (0.3081,))
                               ])),
    batch_size=batch_size_train, shuffle=True)

test_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('files/', train=False, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.1307,), (0.3081,))
                               ])),
    batch_size=batch_size_test, shuffle=True)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

network = Net()
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    network = nn.DataParallel(network)
network.to(device)

optimizer = optim.SGD(network.parameters(), lr=learning_rate,
                      momentum=momentum)

train_losses = []
train_counter = []
test_losses = []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

def train(epoch):
    network.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        inputs, label = data.to(device), target.to(device)
        output = network(inputs)
        loss = F.nll_loss(output, label)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            train_losses.append(loss.item())
            train_counter.append(
                (batch_idx*64) + ((epoch-1)*len(train_loader.dataset)))
            torch.save(network.state_dict(), 'results/model.pth')
            torch.save(optimizer.state_dict(), 'results/optimizer.pth')

for epoch in range(1, n_epochs + 1):
    train(epoch)
To test the Data Parallelism model, you need to log onto a node with two
GPUs in the gpu2v100 partition from a login node using the following command:
$ srun -p gpu2v100 -t 8:00:00 -n 1 -N 1 --pty bash
Assuming you save the example above as test.py, you can use the following command to run the model:
$ python test.py
Train Epoch: 1 [0/60000 (0%)] Loss: 2.327492
Train Epoch: 1 [640/60000 (1%)] Loss: 2.328194
Train Epoch: 1 [1280/60000 (2%)] Loss: 2.278235
........
To verify that both GPUs are being utilized to run the model, you can start a new ssh session and log onto the same GPU node where your model is running (you are only allowed to ssh to a node where you have a job running). Once on that node, use the "nvidia-smi" command to check the status of the GPUs. You should see output like the following, indicating that both GPUs are being utilized:
[gpu031: iqr224]$ nvidia-smi
Sat Apr 16 14:43:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56 Driver Version: 460.56 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100S-PCI... Off | 00000000:3B:00.0 Off | Off |
| N/A 31C P0 35W / 250W | 1275MiB / 32510MiB | 21% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100S-PCI... Off | 00000000:D8:00.0 Off | Off |
| N/A 33C P0 36W / 250W | 1275MiB / 32510MiB | 20% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 207116 C python 1271MiB |
| 1 N/A N/A 207116 C python 1271MiB |
+-----------------------------------------------------------------------------+
Across Node Data Parallelism
If you need more than two GPUs to run your model and further improve performance, you will need to use Distributed Data Parallelism. Here is a tutorial on how to train a PyTorch-based deep learning model using multiple GPU devices across multiple nodes on an HPC cluster:
https://tuni-itc.github.io/wiki/Technical-Notes/Distributed_dataparallel_pytorch/
There are a few bugs in the example code in the tutorial. Make sure to change the following lines of code:
model = AE(input_shape=784).cuda(args.gpus)
model = torch.nn.parallel.DistributedDataParallel( model_sync, device_ids=[args.gpu], find_unused_parameters=True )
to
model = AE(input_shape=784).cuda(args.gpu)
model = torch.nn.parallel.DistributedDataParallel( model, device_ids=[args.gpu], find_unused_parameters=True )
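For reference, the essential pieces that the tutorial puts together look roughly like the sketch below. The setup_ddp helper name, the LOCAL_RANK environment variable, and the batch size are illustrative assumptions; the actual rank, world size, and master address come from how you launch the job (for example via torchrun or Slurm environment variables).

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def setup_ddp(model, dataset, batch_size=64):
    # Each process joins the process group before wrapping its model.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index on this node
    torch.cuda.set_device(local_rank)
    model = nn.parallel.DistributedDataParallel(model.cuda(local_rank),
                                                device_ids=[local_rank])
    # DistributedSampler gives each process a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return model, loader

Each process then trains on its own shard of the data, and the gradients are synchronized across all GPUs automatically during loss.backward().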
Model Parallelism
In order to run a model on a GPU, the entire model needs to be transferred to the GPU(s). Data Parallelism (on a single node or multiple nodes) can improve performance, but it cannot handle the situation where a model is too large to fit into the memory of a single GPU. Model Parallelism resolves this problem by dividing the model into a few sub-models, depending on the number of GPUs available on a node, and letting each GPU host one of the sub-models. To apply Model Parallelism, we need to redefine the model by manually dividing its neural network into sub-networks. In the following example, we redefine the same neural network used in the Data Parallelism example above.
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.seq1 = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5),
            nn.MaxPool2d(2),
            nn.ReLU(),
        ).to('cuda:0')
        self.seq2 = nn.Sequential(
            nn.Conv2d(10, 20, kernel_size=5),
            nn.Dropout2d(),
            nn.MaxPool2d(2),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(320, 50),
            nn.ReLU(),
            nn.Dropout2d(),
            nn.Linear(50, 10)
        ).to('cuda:1')

    def forward(self, x):
        x = self.seq1(x).to('cuda:1')
        x = self.seq2(x)
        return F.log_softmax(x, dim=1)

network = Net()
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    #network = nn.DataParallel(network)
#network.to(device)

optimizer = optim.SGD(network.parameters(), lr=learning_rate,
                      momentum=momentum)

train_losses = []
train_counter = []
test_losses = []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

def train(epoch):
    network.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        inputs, label = data.to('cuda:0'), target.to('cuda:1')
        output = network(inputs)
        loss = F.nll_loss(output, label)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            train_losses.append(loss.item())
            train_counter.append(
                (batch_idx*64) + ((epoch-1)*len(train_loader.dataset)))
            torch.save(network.state_dict(), 'results/model.pth')
            torch.save(optimizer.state_dict(), 'results/optimizer.pth')

for epoch in range(1, n_epochs + 1):
    train(epoch)
As you can see, the neural network is divided into two sub-networks: self.seq1 and self.seq2. Please note that network = nn.DataParallel(network) and network.to(device) are commented out. You can run the model on a node with two GPUs in exactly the same way as the single-node Data Parallelism example. Although the "nvidia-smi" command shows that both GPUs are utilized while the model is running, they actually process the data alternately (each GPU sits idle while the other works), so no overall performance improvement can be observed.
Model Parallelism with Pipelining
Model Parallelism can divide a large model into sub-models and run one sub-network on each GPU, making it possible to use GPUs for models whose neural networks are too large to fit on a single GPU. However, this method typically does not speed up training the way Data Parallelism does. In this section of the tutorial, we introduce a technique called pipelining, which can improve the training performance of Model Parallelism.
In the following example, we modify the simple Model Parallelism example above by rewriting the forward() function in the network class definition and keeping the rest of the code essentially unchanged (the new forward() also relies on a self.split_size attribute; see the note after the code).
def forward(self, x):
    splits = iter(x.split(self.split_size, dim=0))
    s_next = next(splits)
    s_prev = self.seq1(s_next).to('cuda:1')
    ret = []

    for s_next in splits:
        # A. s_prev runs on cuda:1
        s_prev = self.seq2(s_prev)
        ret.append(F.log_softmax(s_prev, dim=1))

        # B. s_next runs on cuda:0, which can run concurrently with A
        s_prev = self.seq1(s_next).to('cuda:1')

    s_prev = self.seq2(s_prev)
    ret.append(F.log_softmax(s_prev, dim=1))

    return torch.cat(ret)
As you can see, the input batch is split into smaller micro-batches; each micro-batch is processed by seq1 and then by seq2. Unlike the simple Model Parallelism example, where the two GPUs process the entire input alternately, here the two GPUs process different micro-batches simultaneously, like a pipeline, which improves performance.
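Note that the pipelined forward() above reads self.split_size, which is not defined in the original class. One way to provide it, keeping the rest of the script (including network = Net()) unchanged, is to give it a default value in the constructor; the value of 20 below is only an illustrative choice and should be tuned for your batch size and model:

def __init__(self, split_size=20):
    super(Net, self).__init__()
    self.split_size = split_size  # number of samples in each micro-batch
    # seq1 and seq2 are defined exactly as in the Model Parallelism example above.
    self.seq1 = nn.Sequential(
        nn.Conv2d(1, 10, kernel_size=5),
        nn.MaxPool2d(2),
        nn.ReLU(),
    ).to('cuda:0')
    self.seq2 = nn.Sequential(
        nn.Conv2d(10, 20, kernel_size=5),
        nn.Dropout2d(),
        nn.MaxPool2d(2),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(320, 50),
        nn.ReLU(),
        nn.Dropout2d(),
        nn.Linear(50, 10)
    ).to('cuda:1')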