PyTorch is a Python-based machine learning framework that can run on both CPUs and GPUs. It can take advantage of GPUs on a single node or across multiple nodes using data parallelism and model parallelism.

Data parallelism involves splitting the data into smaller subsets and using multiple GPUs to process the subsets in parallel while using the same model. PyTorch supports data parallelism through two classes: DataParallel and DistributedDataParallel. The DataParallel class can be used when multiple GPUs are available on a single node, while DistributedDataParallel can be used with GPUs on either a single node or multiple nodes. The PyTorch documentation, however, advises using DistributedDataParallel in both cases, as it reduces the overall training time.
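
As a quick illustration of the difference, the two wrappers are applied as shown below. This is a minimal sketch: the placeholder model is an assumption, and the DistributedDataParallel line additionally requires a process group to have been initialized and a local_rank to be known (see the sketch in the Across Node Data Parallelism section below).

import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder model

# DataParallel: one process drives all GPUs on a single node
dp_model = nn.DataParallel(model).to('cuda')

# DistributedDataParallel: one process per GPU, on one or many nodes
# (requires torch.distributed.init_process_group() to have been called first)
# ddp_model = nn.parallel.DistributedDataParallel(model.to('cuda'), device_ids=[local_rank])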

Model parallelism involves splitting the model into smaller segments such that each segment can run on a different GPU. This method can be used when a model is too large to fit on a single GPU. This option creates dependencies between the GPUs: a segment on one GPU waits for input from a segment running on another GPU before it can proceed (i.e., the segments of the model are prevented from running in parallel). Therefore, this kind of parallelism should not be used for models that can fit on a single GPU.
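
As a minimal sketch of the idea (the layer sizes here are illustrative and two visible GPUs are assumed; a full MNIST example appears in the Model Parallelism section below):

import torch
import torch.nn as nn

class SplitModel(nn.Module):
    def __init__(self):
        super(SplitModel, self).__init__()
        self.part1 = nn.Linear(1024, 512).to('cuda:0')  # first segment lives on GPU 0
        self.part2 = nn.Linear(512, 10).to('cuda:1')    # second segment lives on GPU 1

    def forward(self, x):
        x = self.part1(x.to('cuda:0'))
        return self.part2(x.to('cuda:1'))  # activations hop from GPU 0 to GPU 1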

To run a PyTorch model on Arc, it is recommended to create a Python Virtual Environment (VE) with the PyTorch-related packages. You can use Anaconda to create a VE with Python 3.9.0 and install the required packages as shown below, once you are connected to a compute node interactively. A sample command for getting interactive access to a node in the GPU queue, followed by the environment-creation commands, is as follows:

$ srun -p gpu2v100 -N 1 -t 02:00:00 --pty bash
$ module load anaconda3
$ conda create --prefix=p39-torch python=3.9.0

After running the conda create command listed above, you will be prompted for a "y/n" response to accept the list of packages to be installed, as shown below; type "y" to proceed:

The following NEW packages will be INSTALLED:

_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
_openmp_mutex pkgs/main/linux-64::_openmp_mutex-4.5-1_gnu
ca-certificates pkgs/main/linux-64::ca-certificates-2022.3.29-h06a4308_1
certifi pkgs/main/linux-64::certifi-2021.10.8-py39h06a4308_2
ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.35.1-h7274673_9
libffi pkgs/main/linux-64::libffi-3.3-he6710b0_2
libgcc-ng pkgs/main/linux-64::libgcc-ng-9.3.0-h5101ec6_17
libgomp pkgs/main/linux-64::libgomp-9.3.0-h5101ec6_17
libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-9.3.0-hd4cf53a_17
ncurses pkgs/main/linux-64::ncurses-6.3-h7f8727e_2
openssl pkgs/main/linux-64::openssl-1.1.1n-h7f8727e_0
pip pkgs/main/linux-64::pip-21.2.4-py39h06a4308_0
python pkgs/main/linux-64::python-3.9.0-hdb3f193_2
readline pkgs/main/linux-64::readline-8.1.2-h7f8727e_1
setuptools pkgs/main/linux-64::setuptools-61.2.0-py39h06a4308_0
sqlite pkgs/main/linux-64::sqlite-3.38.2-hc218d9a_0
tk pkgs/main/linux-64::tk-8.6.11-h1ccaba5_0
tzdata pkgs/main/noarch::tzdata-2022a-hda174b7_0
wheel pkgs/main/noarch::wheel-0.37.1-pyhd3eb1b0_0
xz pkgs/main/linux-64::xz-5.2.5-h7b6447c_0
zlib pkgs/main/linux-64::zlib-1.2.12-h7f8727e_2


Proceed ([y]/n)? y

Once the download and installation of the packages is complete, you will see messages like the following:
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate /work/vrv207/ml/p39-torch
#
# To deactivate an active environment, use
#
# $ conda deactivate

The conda activate path is specific to your environment, so you will see your own path in place of the "/work/vrv207/ml/p39-torch" path shown above. You can copy the conda activate command shown on your screen during the installation process and proceed as follows:
$ conda activate /work/vrv207/ml/p39-torch
# Note: please replace the path in the command above with your own installation path

$ conda install -c pytorch pytorch torchvision
# Note: after running the command above, you will be prompted to accept or decline the download and installation of the PyTorch packages. Please accept by typing "y".

$ pip install Pillow==6.1
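
At this point, you may want to verify that PyTorch was installed correctly and can see the GPUs on the node (the version printed will depend on what conda installed):

$ python -c "import torch; print(torch.__version__); print(torch.cuda.is_available()); print(torch.cuda.device_count())"

If the installation succeeded and the node's GPUs are visible, torch.cuda.is_available() prints True and torch.cuda.device_count() reports the number of GPUs.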

Data Parallelism

We can use Data Parallelism on a single node with multiple GPUs, or across multiple nodes with one or more GPUs each.
Single Node Data Parallelism

Here is an example of MNIST data classification utilizing multiple GPUs on a single node to improve performance. In the example, we use the "nn.DataParallel()" wrapper to implement Data Parallelism if multiple GPUs are detected. The same code can be used in a single-GPU environment without any changes, since it automatically detects the GPU settings.

import collections
import collections.abc

import os
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

n_epochs = 10
batch_size_train = 64
batch_size_test = 1000
learning_rate = 0.01
momentum = 0.5
log_interval = 10

random_seed = 1
torch.backends.cudnn.enabled = False
torch.manual_seed(random_seed)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST('files/', train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,))
                             ])),
  batch_size=batch_size_train, shuffle=True)
test_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST('files/', train=False, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,))
                             ])),
  batch_size=batch_size_test, shuffle=True)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x,  dim=1)

network = Net()
if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  network = nn.DataParallel(network)
network.to(device)
optimizer = optim.SGD(network.parameters(), lr=learning_rate,
                      momentum=momentum)

train_losses = []
train_counter = []
test_losses = []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

os.makedirs('results', exist_ok=True)  # torch.save below fails if this directory does not exist

def train(epoch):
  network.train()
  for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    inputs, label = data.to(device), target.to(device)
    output = network(inputs)
    loss = F.nll_loss(output, label)
    loss.backward()
    optimizer.step()
    if batch_idx % log_interval == 0:
      print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
        epoch, batch_idx * len(data), len(train_loader.dataset),
        100. * batch_idx / len(train_loader), loss.item()))
      train_losses.append(loss.item())
      train_counter.append(
        (batch_idx*64) + ((epoch-1)*len(train_loader.dataset)))
      torch.save(network.state_dict(), 'results/model.pth')
      torch.save(optimizer.state_dict(), 'results/optimizer.pth')


for epoch in range(1, n_epochs + 1):
  train(epoch)

To test the Data Parallelism model, you need to log onto a node with two GPUs in the gpu2v100 partition from a login node using the following command:
$ srun -p gpu2v100 -t 8:00:00 -n 1 -N 1 --pty bash

Assuming you save the example above as test.py, you can use the following command to run the model:
$ python test.py
Train Epoch: 1 [0/60000 (0%)]    Loss: 2.327492
Train Epoch: 1 [640/60000 (1%)]    Loss: 2.328194
Train Epoch: 1 [1280/60000 (2%)]    Loss: 2.278235
........

To verify that both GPUs are being utilized to run the model, you can start a new SSH session and log onto the same GPU node where your model is running (you are only allowed to SSH to a node where you have a job running). Once you are on that node, use the "nvidia-smi" command to check the status of the GPUs. Output like the following indicates that both GPUs are utilized:
[gpu031: iqr224]$ nvidia-smi
Sat Apr 16 14:43:04 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100S-PCI...  Off  | 00000000:3B:00.0 Off |                  Off |
| N/A   31C    P0    35W / 250W |   1275MiB / 32510MiB |     21%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100S-PCI...  Off  | 00000000:D8:00.0 Off |                  Off |
| N/A   33C    P0    36W / 250W |   1275MiB / 32510MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    207116      C   python                           1271MiB |
|    1   N/A  N/A    207116      C   python                           1271MiB |
+-----------------------------------------------------------------------------+

Across Node Data Parallelism

In case you need more than two GPUs to run your model and further improve performance, you would need to use DistributedDataParallel. Here is a tutorial on how to train a PyTorch-based deep learning model using multiple GPU devices across multiple nodes on an HPC cluster:

https://tuni-itc.github.io/wiki/Technical-Notes/Distributed_dataparallel_pytorch/

There are a few bugs in the example code in the tutorial. Make sure to change the following lines of code:
model = AE(input_shape=784).cuda(args.gpus) 
model = torch.nn.parallel.DistributedDataParallel( model_sync, device_ids=[args.gpu], find_unused_parameters=True )

to
model = AE(input_shape=784).cuda(args.gpu) 
model = torch.nn.parallel.DistributedDataParallel( model, device_ids=[args.gpu], find_unused_parameters=True )
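
For orientation, the overall structure of a DistributedDataParallel script is sketched below. This is a generic outline, not Arc-specific: it assumes the script is launched with one process per GPU (e.g., via torchrun or the SLURM-based launch described in the tutorial), which sets the RANK, WORLD_SIZE, and LOCAL_RANK environment variables, and the placeholder model stands in for your own.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Provided by the launcher; MASTER_ADDR and MASTER_PORT must also be set
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 10).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ... training loop goes here; use a DistributedSampler in the DataLoader
    # so that each process sees a different shard of the dataset ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()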

Model Parallelism

In order to run a model on a GPU, the entire model needs to be transferred to the GPU. Data Parallelism (single node or multiple nodes) can improve performance, but it cannot help when a model is too large to fit into the memory of a single GPU. Model Parallelism resolves this problem by dividing the model into a few sub-models, depending on the number of GPUs available on a node, and letting each GPU host one of the sub-models. In order to apply Model Parallelism, we need to redefine the model by manually dividing its neural network into sub-networks. In the following example, we redefine the neural network from the Data Parallelism example above.
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.seq1 = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5),
            nn.MaxPool2d(2),
            nn.ReLU(),
        ).to('cuda:0')

        self.seq2 = nn.Sequential(
            nn.Conv2d(10, 20, kernel_size=5),
            nn.Dropout2d(),
            nn.MaxPool2d(2),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(320, 50),
            nn.ReLU(),
            nn.Dropout2d(),
            nn.Linear(50, 10)
        ).to('cuda:1')

    def forward(self, x):
        x = self.seq1(x).to('cuda:1')
        x = self.seq2(x)
        return F.log_softmax(x, dim=1)

network = Net()
if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  #network = nn.DataParallel(network)

#network.to(device)
optimizer = optim.SGD(network.parameters(), lr=learning_rate,
                      momentum=momentum)

train_losses = []
train_counter = []
test_losses = []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

def train(epoch):
  network.train()
  for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    inputs, label = data.to('cuda:0'), target.to('cuda:1')
    output = network(inputs)
    loss = F.nll_loss(output, label)
    loss.backward()
    optimizer.step()
    if batch_idx % log_interval == 0:
      print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
        epoch, batch_idx * len(data), len(train_loader.dataset),
        100. * batch_idx / len(train_loader), loss.item()))
      train_losses.append(loss.item())
      train_counter.append(
        (batch_idx*64) + ((epoch-1)*len(train_loader.dataset)))
      torch.save(network.state_dict(), 'results/model.pth')
      torch.save(optimizer.state_dict(), 'results/optimizer.pth')

for epoch in range(1, n_epochs + 1):
  train(epoch)

As you can see, the neural network is divided into two sub-networks: self.seq1 and self.seq2. Please note that "network = nn.DataParallel(network)" and "network.to(device)" are commented out. You can run the model on a node with two GPUs exactly the same way as the single-node Data Parallelism example. Although the "nvidia-smi" command shows both GPUs are utilized while the model is running, they are actually processing the data alternately (each GPU idles while the other works on its segment), so no overall performance improvement can be observed.
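
If you want to confirm this yourself, a simple wall-clock measurement of the training loop works (a minimal sketch; the torch.cuda.synchronize() calls ensure all queued GPU work has finished before the timer is read):

import time

torch.cuda.synchronize('cuda:0')
torch.cuda.synchronize('cuda:1')
start = time.time()
for epoch in range(1, n_epochs + 1):
  train(epoch)
torch.cuda.synchronize('cuda:0')
torch.cuda.synchronize('cuda:1')
print('Total training time: {:.1f} seconds'.format(time.time() - start))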

Model Parallelism with Pipelining

Model Parallelism divides a large model into sub-models and runs one sub-network on each GPU, making it possible to utilize GPUs for models whose neural networks are too large to fit on a single GPU. However, this method typically does not speed up training the way Data Parallelism does. In this section of the tutorial, we introduce a technique called pipelining, which can improve training performance under Model Parallelism.

In the following example, we modify the simple Model Parallelism example above by rewriting the "forward()" function in the network class definition and keeping the rest of the code largely unchanged (one small addition to "__init__()" is noted after the example).
    def forward(self, x):
        # self.split_size (samples per micro-batch) must be set in __init__;
        # see the note after this example
        splits = iter(x.split(self.split_size, dim=0))
        s_next = next(splits)
        s_prev = self.seq1(s_next).to('cuda:1')
        ret = []

        for s_next in splits:
            # A. s_prev runs on cuda:1
            s_prev = self.seq2(s_prev)
            ret.append(F.log_softmax(s_prev, dim=1))

            # B. s_next runs on cuda:0, which can run concurrently with A
            s_prev = self.seq1(s_next).to('cuda:1')

        s_prev = self.seq2(s_prev)
        ret.append(F.log_softmax(s_prev, dim=1))
        return torch.cat(ret)

As you can see, the input data is split into mini-batches; each mini-batch is processed by seq1 and then by seq2. Unlike simple Model Parallelism, where the two GPUs take turns processing the entire input, the two GPUs here process different mini-batches simultaneously, like a pipeline, thus improving performance.
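
One caveat: the pipelined forward() relies on a self.split_size attribute that the class definition above never sets, so you need to add it in __init__(). A minimal way to do this is shown below (the default of 20 samples per micro-batch is an illustrative choice, not a tuned value):

class Net(nn.Module):
    def __init__(self, split_size=20):
        super(Net, self).__init__()
        self.split_size = split_size  # samples per pipeline micro-batch
        # ... self.seq1 and self.seq2 definitions unchanged from above ...

Smaller split sizes increase the overlap between the two GPUs but add per-launch overhead, so the best value is worth experimenting with for your model and batch size.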