OVERVIEW

PyTorch is a Python-based scientific computing package and an automatic differentiation library that is widely used to implement neural networks. Just as you transfer a Tensor onto the GPU, you transfer the neural network onto the GPU. A Python virtual environment can be set up on Arc to install and run PyTorch. The sections below walk through an example of using PyTorch with the CIFAR-10 dataset in a Python virtual environment on Arc GPU nodes; the performance of the deep learning model on CPU and GPU nodes is tabulated at the end.
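
For example, a Tensor and a neural network are transferred onto the GPU with the same .to(device) call. A minimal sketch (the nn.Linear model here is just a stand-in for any network):

import torch
import torch.nn as nn

# pick the GPU if one is visible, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.randn(4, 3, 32, 32).to(device)        # transfer a Tensor onto the GPU
model = nn.Linear(3 * 32 * 32, 10).to(device)   # transfer a network onto the GPU

out = model(x.flatten(1))  # the computation runs on whichever device holds the data
print(out.device)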

INSTALLING AND RUNNING PYTORCH ON ARC

1. Log in to Arc with your UTSA credentials from a terminal on your local computer:

$ ssh -X username@arc.utsa.edu

Note (Mac users): Mac users will need to download and install XQuartz to launch GUI-based applications on remote Linux systems.
Note (Windows users): Windows users will need to download and install Xming or MobaXterm to launch GUI-based applications on remote Linux systems.


2. Log in to a compute node for an hour using the following command:

$ srun -p compute1 -N 1 -n 1 -t 01:00:00 --pty bash

Or log in to a GPU node with a single GPU:
$ srun -p gpu1v100 -N 1 -n 1 -t 01:00:00 --pty bash

Or log in to a GPU node with two GPUs in the gpu2v100 partition:

$ srun -p gpu2v100 -N 1 -n 1 -t 01:00:00 --pty bash

3. Create and activate a Python virtual environment (VE) by entering the following commands sequentially:

$ pip install virtualenv
$ virtualenv mypython
$ source mypython/bin/activate
$ pip install torch torchvision

Note: To deactivate the environment, enter the command "deactivate".

Or you can use Anaconda to create a VE as follows (recommended on Arc):

$ module load anaconda3
$ conda create -n "mypython" python=3.9.0
$ conda activate mypython
$ conda install -c pytorch pytorch torchvision
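
Either way, you can verify the installation before moving on (a quick check from the command line; the version string will vary with what was installed):

$ python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"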

4. To check the availability of the GPUs on a node, use the following command:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56 Driver Version: 460.56 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100S-PCI... Off | 00000000:3B:00.0 Off | Off |
| N/A 32C P0 35W / 250W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

Note: You will see two GPUs if you are on a GPU node in the gpu2v100 partition, and "bash: nvidia-smi: command not found" on a regular compute node.
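
The same check can be done from inside PyTorch, which is handy in scripts (a short sketch using the standard torch.cuda API):

import torch

if torch.cuda.is_available():
    print(torch.cuda.device_count())      # 1 on gpu1v100 nodes, 2 on gpu2v100 nodes
    print(torch.cuda.get_device_name(0))  # e.g. the V100 model reported by nvidia-smi
else:
    print("No GPU visible; running on the CPU")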

RUNNING THE PYTORCH EXAMPLE

The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class: 50000 training images and 10000 test images. The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class; the training batches contain the remaining images in random order, so an individual training batch may hold more images from one class than another, but between them the training batches contain exactly 5000 images from each class. A sample deep learning model that uses PyTorch and the computational power of the GPU to train on CIFAR-10 is shown below. The code trains the model to classify the ten CIFAR-10 categories (planes, cars, birds, cats, and so on) and then computes the prediction accuracy on the test dataset.

import torch
import torchvision
import torchvision.transforms as transforms
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Assuming that we are on a CUDA machine, this should print a CUDA device:
print(device)
transform = transforms.Compose(
    [transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
batch_size = 4
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
shuffle=False, num_workers=2)
classes = ('plane', 'car', 'bird', 'cat',
    'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
dataiter = iter(trainloader)
images, labels = next(dataiter)  # use the built-in next(); the iterator's .next() method was removed in recent PyTorch

class Net(nn.Module):
    def __init__(self):
      super().__init__()
      # two convolutional layers with 2x2 max pooling in between,
      # followed by three fully connected layers
      self.conv1 = nn.Conv2d(3, 216, 5)     # 3 input channels (RGB), 216 feature maps, 5x5 kernels
      self.pool = nn.MaxPool2d(2, 2)
      self.conv2 = nn.Conv2d(216, 16, 5)
      self.fc1 = nn.Linear(16 * 5 * 5, 120) # a 32x32 image shrinks to 16 channels of 5x5 after two conv/pool stages
      self.fc2 = nn.Linear(120, 84)
      self.fc3 = nn.Linear(84, 10)          # one output per CIFAR-10 class
    def forward(self, x):
      x = self.pool(F.relu(self.conv1(x)))
      x = self.pool(F.relu(self.conv2(x)))
      x = torch.flatten(x, 1) # flatten all dimensions except batch
      x = F.relu(self.fc1(x))
      x = F.relu(self.fc2(x))
      x = self.fc3(x)
      return x


net = Net()
net.to(device)  # transfer the network onto the GPU (if one is available)
import torch.optim as optim
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(2): # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
      # get the inputs; data is a list of [inputs, labels]
      inputs, labels = data[0].to(device), data[1].to(device)
      # zero the parameter gradients
      optimizer.zero_grad()
      # forward + backward + optimize
      outputs = net(inputs)
      loss = criterion(outputs, labels)
      loss.backward()
      optimizer.step()
      # print statistics
      running_loss += loss.item()
      if i % 2000 == 1999: # print every 2000 mini-batches
        print('[%d, %5d] loss: %.3f' %
             (epoch + 1, i + 1, running_loss / 2000))
        running_loss = 0.0

print('Finished Training')
# save the trained weights so the model can be reloaded later
PATH = './cifar_net.pth'
torch.save(net.state_dict(), PATH)
dataiter = iter(testloader)
images, labels = next(dataiter)
# re-create the network and load the saved weights; this copy stays on the CPU
net = Net()
net.load_state_dict(torch.load(PATH))
outputs = net(images)
_, predicted = torch.max(outputs, 1)
correct = 0
total = 0
# since we're not training, we don't need to calculate the gradients for our outputs
with torch.no_grad():
    for data in testloader:
      images, labels = data
      # calculate outputs by running images through the network
      outputs = net(images)
      # the class with the highest energy is what we choose as prediction
      _, predicted = torch.max(outputs.data, 1)
      total += labels.size(0)
      correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))
# prepare to count predictions for each class
correct_pred = {classname: 0 for classname in classes}
total_pred = {classname: 0 for classname in classes}
# again no gradients needed
with torch.no_grad():
    for data in testloader:
      images, labels = data
      outputs = net(images)
      _, predictions = torch.max(outputs, 1)
      # collect the correct predictions for each class
      for label, prediction in zip(labels, predictions):
        if label == prediction:
          correct_pred[classes[label]] += 1
        total_pred[classes[label]] += 1
# print accuracy for each class
for classname, correct_count in correct_pred.items():
    accuracy = 100 * float(correct_count) / total_pred[classname]
    print("Accuracy for class {:5s} is: {:.1f} %".format(classname,
                accuracy))
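
Note that the evaluation above reloads the model onto the CPU. If you would rather evaluate on the GPU as well, move the reloaded network and each test batch to the device first. A minimal variant of the test loop under that assumption:

net = Net()
net.load_state_dict(torch.load(PATH))
net.to(device)  # evaluate on the GPU
net.eval()      # disable training-only behavior such as dropout

correct, total = 0, 0
with torch.no_grad():
    for images, labels in testloader:
      # move each test batch to the same device as the network
      images, labels = images.to(device), labels.to(device)
      outputs = net(images)
      _, predicted = torch.max(outputs, 1)
      total += labels.size(0)
      correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (100 * correct / total))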

The deep learning model can be run either in batch mode, using a Slurm batch job script, or interactively on a GPU node.

Running the model in Interactive Mode:

The program can be run interactively on a GPU node using the following command; the output is displayed on the terminal:

[username@gpu001]$ time python3 program_name.py
cuda:0
[1, 2000] loss: 1.997
[1, 4000] loss: 1.656
[1, 6000] loss: 1.528
[1, 8000] loss: 1.417
[1, 10000] loss: 1.362
[1, 12000] loss: 1.326
[2, 2000] loss: 1.252
[2, 4000] loss: 1.206
[2, 6000] loss: 1.167
[2, 8000] loss: 1.164
[2, 10000] loss: 1.132
[2, 12000] loss: 1.091
Finished Training
Accuracy of the network on the 10000 test images: 62 %
Accuracy for class plane is: 69.4 %
Accuracy for class car is: 78.6 %
Accuracy for class bird is: 41.1 %
Accuracy for class cat is: 48.1 %
Accuracy for class deer is: 61.2 %
Accuracy for class dog is: 51.8 %
Accuracy for class frog is: 59.7 %
Accuracy for class horse is: 71.5 %
Accuracy for class ship is: 72.2 %
Accuracy for class truck is: 75.6 %

real 1m39.540s
user 2m40.883s
sys 0m24.707s

Note: The above example is designed to use a single GPU to accelerate training. For models that use multiple GPUs, please see the tutorial "Using PyTorch with Multiple GPUs" on our support site.
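
As a quick preview, the simplest way to use both GPUs on a gpu2v100 node is torch.nn.DataParallel, which replicates the model and splits each input batch across the visible devices (a minimal sketch; the full details are in the tutorial mentioned above):

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = Net()
if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)  # split each mini-batch across all visible GPUs
net.to(device)
# training then proceeds exactly as in the single-GPU example above
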
Running the model in Batch Mode:

A sample Slurm batch job script that runs the Python program of the deep learning model in batch mode is shown below. Submit this script from a login node:

#!/bin/bash
#SBATCH -J program_name
#SBATCH -o program_name.txt
#SBATCH -p gpu1v100
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --time=01:00:00
source mypython/bin/activate
time python3 program_name.py

If you are currently on a GPU node and would like to return to the login node, enter the exit command:

(mypython)[username@gpu001]$ exit

The job script shown above can be submitted as follows:

[username@login001]$ sbatch job_script1.slurm

The output from the Slurm batch job can be checked by opening the output file as follows:

(mypython)[username@login001]$ cat program_name.txt
cuda:0
[1, 2000] loss: 1.989
[1, 4000] loss: 1.659
[1, 6000] loss: 1.513
[1, 8000] loss: 1.415
[1, 10000] loss: 1.372
[1, 12000] loss: 1.320
[2, 2000] loss: 1.233
[2, 4000] loss: 1.211
[2, 6000] loss: 1.177
[2, 8000] loss: 1.159
[2, 10000] loss: 1.131
[2, 12000] loss: 1.122
Finished Training
Accuracy of the network on the 10000 test images: 61 %
Accuracy for class plane is: 61.9 %
Accuracy for class car is: 76.9 %
Accuracy for class bird is: 53.7 %
Accuracy for class cat is: 35.7 %
Accuracy for class deer is: 56.7 %
Accuracy for class dog is: 55.7 %
Accuracy for class frog is: 61.2 %
Accuracy for class horse is: 73.1 %
Accuracy for class ship is: 80.3 %
Accuracy for class truck is: 64.5 %
real 1m39.540s
user 2m40.883s
sys 0m24.707s

PERFORMANCE OF IMAGE CLASSIFICATION ON CPU and GPU

Performance parameters of Image Classification:
Node             GPUs   Accuracy   Time
compute1 (CPU)   0      62%        real 3m0.464s, user 112m34.797s, sys 1m48.916s
gpu1v100 (GPU)   1      62%        real 1m39.540s, user 2m40.883s, sys 0m24.707s
The table above contrasts the training performance of the model on CPU and GPU nodes. The training time drops from roughly three minutes of wall-clock time on the CPU node to well under two minutes on the GPU node, illustrating the computational advantage of the GPU for this workload, while the test accuracy is essentially unaffected by the node type.
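
One caveat when timing GPU code yourself: CUDA kernels are launched asynchronously, so a wall-clock measurement should wait for the GPU to finish before reading the clock. A short sketch:

import time
import torch

start = time.perf_counter()
# ... run the training loop from the example above ...
if torch.cuda.is_available():
    torch.cuda.synchronize()  # wait for all queued GPU work to complete
elapsed = time.perf_counter() - start
print("training took %.1f seconds" % elapsed)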

REFERENCES

1. Data Parallel Tutorial
2. CIFAR-10 Tutorial
3. CIFAR-10 Dataset