The GPU version of TensorFlow is installed on the GPU nodes and is available through the Python 3.6.1 module (the build for the native Python 2.7 is currently not working). This document describes how to use TensorFlow and run a quick test. Because the number of GPU nodes is limited, it also explains how to install the CPU version of TensorFlow in your home directory.

TensorFlow CPU Version

First, grab a compute node with qlogin and create a Python virtualenv:
[abc123@login-0-0 ~]$ qlogin
local configuration login-0-0.cm.cluster not defined - using global configuration
Your job 71050 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 71050 has been successfully scheduled.
Establishing /cm/shared/apps/sge/var/cm/qlogin_wrapper session to host compute045.cm.cluster ...

[abc123@compute045 ~]$ virtualenv --system-site-packages tensorflow-cpu
New python executable in tensorflow-cpu/bin/python
Installing Setuptools...done.
Installing Pip...done.

Activate the newly created Python virtualenv:
[abc123@compute045 ~]$ source ~/tensorflow-cpu/bin/activate
(tensorflow-cpu)[abc123@compute045 ~]$

Now we need to upgrade some default packages which are required by TensorFlow:
(tensorflow-cpu)[abc123@compute045 ~]$ pip install --upgrade setuptools
(tensorflow-cpu)[abc123@compute045 ~]$ pip install --upgrade enum34 futures

Now let's download and install TensorFlow into your Python virtualenv:
(tensorflow-cpu)[abc123@compute045 ~]$ pip install --upgrade tensorflow

Once it is installed, let's perform a quick test to verify that the CPU version is working (you can disregard the message about the AVX2 and FMA extensions):
(tensorflow-cpu)[abc123@compute045 ~]$ python
Python 2.7.5 (default, Aug 4 2017, 00:39:18)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-05-09 10:29:54.945974: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
>>> print(sess.run(hello))
Hello, TensorFlow!
>>>

Once TensorFlow is installed via the above instructions, you do not need to repeat the process. All you have to do is source the "activate" file the next time you want to run TensorFlow.
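
If you are unsure whether the virtualenv is active in a given shell or job, a short check like the one below can help. It is a minimal sketch, not part of the cluster setup: the function name in_virtualenv is made up for this example, and it relies only on the standard sys attributes that virtualenv and venv set.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtualenv or venv."""
    # Classic virtualenv sets sys.real_prefix; newer virtualenv releases and the
    # standard-library venv make sys.base_prefix differ from sys.prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtualenv:", in_virtualenv())
```

When the environment created above is active, the interpreter path should point into ~/tensorflow-cpu/bin.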

To suppress the above message, add the following to your Python code, before the TensorFlow import:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import tensorflow as tf
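
The value '2' used above is one of four levels. The sketch below wraps the same setting in a small helper and lists what each level hides. The helper name set_tf_log_level and the LOG_LEVELS table are illustrative only and not part of TensorFlow. The variable is read by TensorFlow's native logging code, so the usual advice is to set it before the import; the helper reports whether it was called early enough.

```python
import os
import sys

# Meaning of each TF_CPP_MIN_LOG_LEVEL value.
LOG_LEVELS = {
    "0": "show all messages (default)",
    "1": "hide INFO",
    "2": "hide INFO and WARNING",
    "3": "hide INFO, WARNING and ERROR",
}


def set_tf_log_level(level="2"):
    """Set TF_CPP_MIN_LOG_LEVEL; return False if TensorFlow was already imported."""
    level = str(level)
    if level not in LOG_LEVELS:
        raise ValueError("level must be one of 0, 1, 2, 3")
    # If tensorflow is already loaded, the new value may not take effect.
    too_late = "tensorflow" in sys.modules
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = level
    return not too_late


set_tf_log_level("2")
# import tensorflow as tf   # import TensorFlow only after the level is set
```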

TensorFlow GPU Version

Grab a GPU node with qlogin. Note that you first need permission to use the GPU nodes; to request access, send an email to rcsg@utsa.edu:
[abc123@login-0-0 ~]$ qlogin -q gpu.q
local configuration login-0-0.cm.cluster not defined - using global configuration
Your job 54721 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 54721 has been successfully scheduled.
Establishing /cm/shared/apps/sge/var/cm/qlogin_wrapper session to host gpu02.cm.cluster ...
Last login: Mon Feb 26 16:31:34 2018 from login-0-0.cm.cluster

Load the modules required for TensorFlow (CUDA, cuDNN, etc.):
[abc123@gpu02 ~]$ module load python/3.6.1 cuda90/toolkit cudnn/7.0

Start the Python shell:
[abc123@gpu02 ~]$ python3
Python 3.6.1 (default, May 16 2017, 15:27:50)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Enter the following Hello World test program:
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
2018-02-27 08:14:55.501487: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-02-27 08:14:56.177280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:06:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:56.412525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 1 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:07:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:56.654330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 2 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:0a:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:56.902953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 3 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:0b:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:57.180367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 4 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:86:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:57.465869: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 5 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:87:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:57.768378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 6 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:8a:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:58.047947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 7 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:8b:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-02-27 08:14:58.052993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Device peer to peer matrix
2018-02-27 08:14:58.053262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1126] DMA: 0 1 2 3 4 5 6 7
2018-02-27 08:14:58.053276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 0: Y Y Y Y N N N N
2018-02-27 08:14:58.053284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 1: Y Y Y Y N N N N
2018-02-27 08:14:58.053292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 2: Y Y Y Y N N N N
2018-02-27 08:14:58.053300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 3: Y Y Y Y N N N N
2018-02-27 08:14:58.053308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 4: N N N N Y Y Y Y
2018-02-27 08:14:58.053321: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 5: N N N N Y Y Y Y
2018-02-27 08:14:58.053328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 6: N N N N Y Y Y Y
2018-02-27 08:14:58.053336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 7: N N N N Y Y Y Y
2018-02-27 08:14:58.053357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:06:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053379: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:07:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:0a:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:0b:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:4) -> (device: 4, name: Tesla K80, pci bus id: 0000:86:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:5) -> (device: 5, name: Tesla K80, pci bus id: 0000:87:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:6) -> (device: 6, name: Tesla K80, pci bus id: 0000:8a:00.0, compute capability: 3.7)
2018-02-27 08:14:58.053429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:7) -> (device: 7, name: Tesla K80, pci bus id: 0000:8b:00.0, compute capability: 3.7)
>>> print(sess.run(hello))
Hello, TensorFlow!
>>>

If you want to suppress the informational messages above, add the following to your code, before the TensorFlow import:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import tensorflow as tf

-- Jeremy - 09 May 2018

Important Note: By default, TensorFlow claims all GPUs on a GPU node (each GPU node on Shamu has 8 GPU units). You must change your code so that it does not monopolize the node. For example, if only GPU units 1 and 3 are needed, add the following to your code before importing TensorFlow:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,3"
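
When the GPU list comes from a variable or a command-line argument, the same setting can be wrapped in a small helper. This is a minimal sketch; the function name restrict_gpus is hypothetical. Keep in mind that CUDA renumbers the visible devices from zero, so with "1,3" the two GPUs appear inside TensorFlow as /device:GPU:0 and /device:GPU:1.

```python
import os


def restrict_gpus(indices):
    """Expose only the given GPU indices to CUDA code started from this process."""
    indices = [int(i) for i in indices]
    if not indices:
        raise ValueError("give at least one GPU index")
    if any(i < 0 for i in indices):
        raise ValueError("GPU indices must be non-negative")
    value = ",".join(str(i) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


restrict_gpus([1, 3])  # same effect as setting the variable to "1,3" by hand
# import tensorflow as tf   # import TensorFlow only after the variable is set
```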

To check the availability of the GPU units on a node, use the following command:

[abc123@gpu02 ~]$ nvidia-smi
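
The same check can be scripted. The sketch below asks nvidia-smi for a compact CSV listing (using its --query-gpu and --format options) and picks out GPUs that look idle. The function names, the thresholds (100 MiB of memory in use, 5% utilization), and the sample text are assumptions chosen for illustration, not site policy or real output from Shamu.

```python
import subprocess

# nvidia-smi options for a machine-readable listing: one CSV row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(text):
    """Turn CSV rows into (index, memory_used_mib, utilization_pct) tuples."""
    rows = []
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue  # skip anything that is not a three-column row
        try:
            rows.append(tuple(int(f) for f in fields))
        except ValueError:
            continue  # skip rows with non-numeric fields
    return rows


def idle_gpus(rows, max_mem_mib=100, max_util_pct=5):
    """Return the indices of GPUs at or below both (assumed) thresholds."""
    return [idx for idx, mem, util in rows
            if mem <= max_mem_mib and util <= max_util_pct]


def query_idle_gpus():
    """Run nvidia-smi on the current node and return the idle GPU indices."""
    out = subprocess.check_output(QUERY).decode()
    return idle_gpus(parse_gpu_rows(out))


# Hypothetical sample text, so the parsing can be tried without a GPU.
sample = "0, 0, 0\n1, 10500, 97\n2, 0, 0\n3, 8040, 65\n"
print(idle_gpus(parse_gpu_rows(sample)))  # prints [0, 2]
```

On a GPU node, the list returned by query_idle_gpus() could be used to choose the value of CUDA_VISIBLE_DEVICES. Another job may claim a GPU between the check and the start of your program, so treat the result as a hint.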

-- Zhiwei - 01 Oct. 2018
Topic revision: r8 - 04 Sep 2019, AdminUser