PyTorch supports options for taking advantage of GPUs on single nodes or multiple nodes using data parallelism and model parallelism.
Data parallelism involves splitting the data into smaller subsets and using multiple GPUs to process the subsets in parallel while using the same model on each GPU. The GPUs can be available on either a single node or multiple nodes. Data parallelism is useful on both single and multiple nodes as it reduces the overall training time.
Model parallelism involves splitting the model into smaller segments such that each segment can be run on a different GPU. This method can be used when a model is too large to fit on a single GPU. This option creates dependencies between the segments, as each segment must wait for the output of the previous one before it can proceed (i.e., the segments of the model are prevented from running in parallel). Therefore, this kind of parallelism should not be used for training models that can fit on a single GPU.
To train a PyTorch model on Arc, it is recommended to create a Python Virtual Environment (VE) with the PyTorch-related packages. You can use Anaconda to create a VE with Python 3.9.0 and install the required packages as shown below once you are connected to a compute node interactively (a sample srun command to get interactive access to a GPU node is included in the sketch that follows).
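The following is a minimal sketch of these steps. The srun command matches the one used later for testing; the Anaconda module name and the environment path /work/vrv207/ml/p39-torch (which matches the example output below) are placeholders to adapt to your own setup:
$ srun -p gpu2v100 -t 8:00:00 -n 1 -N 1 --pty bash
$ module load anaconda3
# Note: the Anaconda module name may differ on your system; check with "module avail"
$ conda create --prefix=/work/vrv207/ml/p39-torch python=3.9.0
# Note: replace the path above with a directory under your own work space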
After running the conda create command listed above, you will be prompted for a "y/n" response to accept the list of packages to be installed; type "y" to proceed.
Once the download and installation of the packages is complete, you will see messages as shown below:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate /work/vrv207/ml/p39-torch
#
# To deactivate an active environment, use
#
# $ conda deactivate
The conda activate path is specific to your environment, so you will see a different path in place of the "/work/vrv207/ml/p39-torch" path shown above. Copy the conda activate command shown on your screen during the installation process and proceed as follows:
$ conda activate /work/vrv207/ml/p39-torch
# Note: please replace the path in the command above with your own installation path
$ conda install -c pytorch pytorch torchvision
# Note: after running the command above you will be prompted for accepting or declining the download and installation of the PyTorch packages. Please accept by typing "y".
$ pip install Pillow==6.1
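As a quick sanity check (optional), you can verify that PyTorch is installed in the environment and can see the GPUs on the node:
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"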
Data Parallelism
We can use Data Parallelism on a single node with multiple GPUs, or across multiple nodes each having one or more GPUs.
Single Node Data Parallelism
Here is an example of MNIST data classification that utilizes multiple GPUs on a single node to improve performance. In the example, we use the nn.DataParallel() wrapper to implement Data Parallelism when multiple GPUs are detected (see the if torch.cuda.device_count() > 1: block in the code below). The same code can be used in a single-GPU environment without any changes, since it automatically detects the GPU settings.
import collections
import collections.abc
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
n_epochs = 10
batch_size_train = 64
batch_size_test = 1000
learning_rate = 0.01
momentum = 0.5
log_interval = 10
random_seed = 1
torch.backends.cudnn.enabled = False
torch.manual_seed(random_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('files/', train=True, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.1307,), (0.3081,))
                               ])),
    batch_size=batch_size_train, shuffle=True)
test_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('files/', train=False, download=True,
                               transform=torchvision.transforms.Compose([
                                   torchvision.transforms.ToTensor(),
                                   torchvision.transforms.Normalize(
                                       (0.1307,), (0.3081,))
                               ])),
    batch_size=batch_size_test, shuffle=True)
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
network = Net()
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # nn.DataParallel replicates the model and splits each input batch across the GPUs
    network = nn.DataParallel(network)
network.to(device)
optimizer = optim.SGD(network.parameters(), lr=learning_rate,
momentum=momentum)
train_losses = []
train_counter = []
test_losses = []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]
def train(epoch):
    network.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        inputs, label = data.to(device), target.to(device)
        output = network(inputs)
        loss = F.nll_loss(output, label)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            train_losses.append(loss.item())
            train_counter.append(
                (batch_idx*64) + ((epoch-1)*len(train_loader.dataset)))
            torch.save(network.state_dict(), 'results/model.pth')
            torch.save(optimizer.state_dict(), 'results/optimizer.pth')

for epoch in range(1, n_epochs + 1):
    train(epoch)
To test the Data Parallelism model, you need to log onto a node with two GPUs in the gpu2v100 partition from a login node using the following command:
$ srun -p gpu2v100 -t 8:00:00 -n 1 -N 1 --pty bash
Assuming you save the example above as test.py, you can use the following command to run the model:
$ python test.py
Train Epoch: 1 [0/60000 (0%)] Loss: 2.327492
Train Epoch: 1 [640/60000 (1%)] Loss: 2.328194
Train Epoch: 1 [1280/60000 (2%)] Loss: 2.278235
........
To verify that both GPUs are utilized to run the model, you can start a new ssh session and log onto the same GPU node where your model is running (you are only allowed to ssh to a node where you have a job running). Once you are on that node, use the "nvidia-smi" command to check the status of the GPUs. The output will look like the following, which indicates that both GPUs are utilized:
[gpu031: iqr224]$ nvidia-smi
Sat Apr 16 14:43:04 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56 Driver Version: 460.56 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100S-PCI... Off | 00000000:3B:00.0 Off | Off |
| N/A 31C P0 35W / 250W | 1275MiB / 32510MiB | 21% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100S-PCI... Off | 00000000:D8:00.0 Off | Off |
| N/A 33C P0 36W / 250W | 1275MiB / 32510MiB | 20% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 207116 C python 1271MiB |
| 1 N/A N/A 207116 C python 1271MiB |
+-----------------------------------------------------------------------------+
Across Node Data Parallelism
In case you need more than two GPUs to run your model to further improve the performance, you need to use Distributed Data Parallelism. Here is a tutorial on how to train a PyTorch-based Deep Learning model using multiple GPU devices across multiple nodes on an HPC cluster:
https://tuni-itc.github.io/wiki/Technical-Notes/Distributed_dataparallel_pytorch/
There are a few bugs in the example code in the tutorial. Make sure to change the following lines of code:
model = AE(input_shape=784).cuda(args.gpus)
model = torch.nn.parallel.DistributedDataParallel( model_sync, device_ids=[args.gpu], find_unused_parameters=True )
to
model = AE(input_shape=784).cuda(args.gpu)
model = torch.nn.parallel.DistributedDataParallel( model, device_ids=[args.gpu], find_unused_parameters=True )
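For orientation, below is a minimal, self-contained sketch of the one-process-per-GPU pattern that DistributedDataParallel relies on. It is not the tutorial's code: it assumes the job is launched with torchrun (which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables for each process), and the small linear model and hyperparameters are placeholders.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main():
    # One process per GPU; torchrun sets RANK/LOCAL_RANK/WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A small illustrative model; replace it with your own network
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = torchvision.datasets.MNIST('files/', train=True, download=True,
                                         transform=torchvision.transforms.ToTensor())
    # DistributedSampler gives each process a distinct shard of the training data
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
    for epoch in range(1, 11):
        sampler.set_epoch(epoch)  # reshuffle the shards differently each epoch
        for data, target in loader:
            data, target = data.cuda(local_rank), target.cuda(local_rank)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(data), target)
            loss.backward()       # DDP averages the gradients across all processes here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
The tutorial linked above uses its own mp.spawn-based launcher and argument parsing; the core ideas (one process per GPU, a DistributedSampler to shard the data, and the DDP wrapper to average gradients) are the same.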
Model Parallelism
In order to run a model on a GPU, the entire model needs to be transferred to the GPU(s). Data Parallelism (single node or multiple nodes) can improve the performance, but it cannot address the situation where a model is too large to fit into the memory of a single GPU. Model Parallelism resolves this problem by dividing the model into a few sub-models, depending on the number of GPUs available on a node, and letting each GPU host one of the sub-models. In order to apply Model Parallelism, we need to redefine the model by manually dividing its neural network into sub-networks. In the following example, we redefine the same neural network used in the Data Parallelism example above.
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # first half of the network is placed on GPU 0
        self.seq1 = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5),
            nn.MaxPool2d(2),
            nn.ReLU(),
        ).to('cuda:0')
        # second half of the network is placed on GPU 1
        self.seq2 = nn.Sequential(
            nn.Conv2d(10, 20, kernel_size=5),
            nn.Dropout2d(),
            nn.MaxPool2d(2),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(320, 50),
            nn.ReLU(),
            nn.Dropout2d(),
            nn.Linear(50, 10)
        ).to('cuda:1')

    def forward(self, x):
        # move the intermediate activations from GPU 0 to GPU 1
        x = self.seq1(x).to('cuda:1')
        x = self.seq2(x)
        return F.log_softmax(x, dim=1)
network = Net()
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
# network = nn.DataParallel(network)
# network.to(device)
optimizer = optim.SGD(network.parameters(), lr=learning_rate,
momentum=momentum)
train_losses = []
train_counter = []
test_losses = []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]
def train(epoch):
    network.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        inputs, label = data.to('cuda:0'), target.to('cuda:1')
        output = network(inputs)
        loss = F.nll_loss(output, label)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            train_losses.append(loss.item())
            train_counter.append(
                (batch_idx*64) + ((epoch-1)*len(train_loader.dataset)))
            torch.save(network.state_dict(), 'results/model.pth')
            torch.save(optimizer.state_dict(), 'results/optimizer.pth')

for epoch in range(1, n_epochs + 1):
    train(epoch)
As you can see, the neural network is divided into two sub-networks: self.seq1 and self.seq2. Please note that network = nn.DataParallel(network) and network.to(device) are commented out. You can run the model on a node with two GPUs in exactly the same way as the single-node Data Parallelism example. Although the "nvidia-smi" command shows both GPUs being utilized while the model is running, they are actually processing the data alternately, so no overall performance improvement can be observed.
Model Parallelism with Pipelining
Model Parallelism can divide a large model into sub-models and run one sub-network on each GPU, making it possible to utilize GPUs for models whose neural networks are too large to fit in a single GPU. However, this method typically does not speed up the training process the way Data Parallelism can. In this section of the tutorial, we introduce a technique called pipelining which can improve the training performance of Model Parallelism.
In the following example, we modify the simple Model Parallelism example above by rewriting the forward() function in the network class definition, keeping the rest of the code largely unchanged (the constructor also needs a split_size attribute, as noted after the code).
def forward(self, x):
    splits = iter(x.split(self.split_size, dim=0))
    s_next = next(splits)
    s_prev = self.seq1(s_next).to('cuda:1')
    ret = []

    for s_next in splits:
        # A. s_prev runs on cuda:1
        s_prev = self.seq2(s_prev)
        ret.append(F.log_softmax(s_prev, dim=1))

        # B. s_next runs on cuda:0, which can run concurrently with A
        s_prev = self.seq1(s_next).to('cuda:1')

    s_prev = self.seq2(s_prev)
    ret.append(F.log_softmax(s_prev, dim=1))

    return torch.cat(ret)
As you can see, the input data is split into mini-batches; each mini-batch is processed by seq1 and then by seq2. Unlike simple Model Parallelism, where the two GPUs process the entire input alternately, here the two GPUs process different mini-batches simultaneously like a pipeline, thus improving the performance.
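Note that this forward() references self.split_size, which the __init__() shown earlier does not define, so the constructor needs one small addition. The default of split_size=20 below is an illustrative choice (with a training batch size of 64 it yields four pipelined micro-batches per step); the rest of the constructor is unchanged:
class Net(nn.Module):
    def __init__(self, split_size=20):
        super(Net, self).__init__()
        self.split_size = split_size   # number of samples processed per pipeline stage
        self.seq1 = nn.Sequential(     # unchanged: first half of the network on GPU 0
            nn.Conv2d(1, 10, kernel_size=5),
            nn.MaxPool2d(2),
            nn.ReLU(),
        ).to('cuda:0')
        self.seq2 = nn.Sequential(     # unchanged: second half of the network on GPU 1
            nn.Conv2d(10, 20, kernel_size=5),
            nn.Dropout2d(),
            nn.MaxPool2d(2),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(320, 50),
            nn.ReLU(),
            nn.Dropout2d(),
            nn.Linear(50, 10)
        ).to('cuda:1')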