Single Node Parallelization
A Python program can use the multiple CPUs or cores of a computer to significantly improve performance. Unlike in many other programming languages, however, multithreading in Python (more precisely, in the standard CPython interpreter) does NOT speed up CPU-bound code: only one of the threads created in the program can execute Python code at any given time. This is a known limitation caused by the GIL (global interpreter lock), which prevents threads from running simultaneously, even on a multi-CPU or multi-core computer. Instead, the threads take turns, each running for a few milliseconds of CPU time before handing over to the next. Here is an example of a multithreaded program:
import time
import concurrent.futures

start = time.perf_counter()
threads = []      # holds the Future objects returned by submit()
num_thread = 6

def workload(i):
    # CPU-bound loop: the total iteration count is divided among the workers
    x = i
    for _ in range(int(100000000/num_thread)):
        x = x + 3.14*3.14
    return x

with concurrent.futures.ThreadPoolExecutor() as executor:
    for i in range(num_thread):
        f = executor.submit(workload, i)
        threads.append(f)

# result() blocks until the corresponding task has finished
for t in threads:
    print(t.result())

finish = time.perf_counter()
run_time = finish - start
print(f'finish in {round(run_time, 3)} second')
Execute the program on a six-core computer. The result and the run time are shown below:
[abc123@shamu ~]$python3 test.py
164326660.0670211
164326661.0670211
164326662.0670211
164326663.0670211
164326664.0670211
164326665.0670211
finish in 3.166 second
Now, change num_thread to 1, and run the program again:
[abc123@shamu ~]$python3 test.py
985959997.8863599
finish in 3.045 second
As you can see, the program takes about the same amount of time with six threads as it does with one.
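The comparison above can also be made in a single run instead of editing num_thread by hand. The following is a minimal sketch, not the original test.py: the helper name timed_run and the reduced iteration count TOTAL_ITERATIONS are assumptions chosen so that it finishes quickly. It also prints the interpreter's thread switch interval, which is the "few milliseconds" each thread gets before another thread may take over.

```python
import concurrent.futures
import sys
import time

# Hypothetical size, much smaller than in the program above, so the sketch runs quickly.
TOTAL_ITERATIONS = 3_000_000

def workload(start_value, iterations):
    # Same CPU-bound loop as above, with the iteration count passed in explicitly.
    x = start_value
    for _ in range(iterations):
        x = x + 3.14 * 3.14
    return x

def timed_run(num_threads, total_iterations=TOTAL_ITERATIONS):
    """Split the loop across num_threads threads; return (results, elapsed seconds)."""
    per_thread = total_iterations // num_threads
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(workload, i, per_thread) for i in range(num_threads)]
        results = [f.result() for f in futures]
    return results, time.perf_counter() - start

if __name__ == "__main__":
    # How long a thread may hold the interpreter before a switch is requested.
    print(f"switch interval: {sys.getswitchinterval()} s")
    for n in (1, 2, 6):
        _, seconds = timed_run(n)
        print(f"{n} thread(s): {seconds:.3f} s")
```

On a standard CPython build the three timings should come out close to each other, in line with the measurements above; the exact numbers depend on the machine.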
For CPU-bound Python code like this, the way to use multiple cores is multiprocessing: the work is split across several processes, each with its own interpreter and its own GIL. Unlike in the C programming language, multiprocessing in Python is very easy. With the help of the concurrent.futures module, all the complex details of inter-process communication are hidden from the user. Here is the multiprocessing version of the above program:
import time
import concurrent.futures

start = time.perf_counter()
threads = []      # holds the Future objects returned by submit()
num_thread = 6    # here: the number of tasks handed to worker processes

def workload(i):
    # CPU-bound loop: the total iteration count is divided among the workers
    x = i
    for _ in range(int(100000000/num_thread)):
        x = x + 3.14*3.14
    return x

with concurrent.futures.ProcessPoolExecutor() as executor:
    for i in range(num_thread):
        f = executor.submit(workload, i)
        threads.append(f)

# result() blocks until the corresponding task has finished
for t in threads:
    print(t.result())

finish = time.perf_counter()
run_time = finish - start
print(f'finish in {round(run_time, 3)} second')
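The only functional difference is that it uses 'with concurrent.futures.ProcessPoolExecutor() as executor:' instead of 'with concurrent.futures.ThreadPoolExecutor() as executor:'. (On platforms where worker processes are started with the spawn method, such as Windows and macOS, the top-level code must additionally be placed under an if __name__ == '__main__': guard.)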
Let's run the program with num_thread set to 6:
[abc123@shamu ~]$python3 test.py
164326660.0670211
164326661.0670211
164326662.0670211
164326663.0670211
164326664.0670211
164326665.0670211
finish in 0.639 second
The program finishes in 0.639 seconds, roughly one fifth of the 3.166 seconds that the multithreaded version takes.
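A slightly more portable variant of the multiprocessing program is sketched below. It is an illustration under stated assumptions rather than the program measured above: the helper split_iterations, the constant TOTAL_ITERATIONS, and the choice of one task per CPU reported by os.cpu_count() are all hypothetical. It uses executor.map instead of individual submit calls, places the pool under an if __name__ == "__main__": guard so that it also works where worker processes are spawned, and distributes the remainder so that no iterations are dropped when the total is not evenly divisible.

```python
import concurrent.futures
import os
import time

# Hypothetical size, smaller than in the program above, so the sketch runs quickly.
TOTAL_ITERATIONS = 6_000_000

def split_iterations(total, parts):
    """Return `parts` chunk sizes that add up to exactly `total`."""
    base, extra = divmod(total, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]

def workload(args):
    # executor.map passes one argument per task, so both values travel as a tuple.
    start_value, iterations = args
    x = start_value
    for _ in range(iterations):
        x = x + 3.14 * 3.14
    return x

def main():
    num_tasks = os.cpu_count() or 1          # assumption: one task per available CPU
    chunks = split_iterations(TOTAL_ITERATIONS, num_tasks)
    start = time.perf_counter()
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # map returns the results in the same order as the inputs
        results = list(executor.map(workload, enumerate(chunks)))
    run_time = time.perf_counter() - start
    for value in results:
        print(value)
    print(f"finished {num_tasks} task(s) in {run_time:.3f} seconds")

if __name__ == "__main__":
    main()
```

Because every task is a plain function with picklable arguments, the same code can be switched back to threads for comparison by replacing ProcessPoolExecutor with ThreadPoolExecutor.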
Cross-node Parallelization - MPI
With the help of the mpi4py package, a Python program can run across multiple nodes of an HPC cluster. Here is sample code (assumed to be saved as mpi-sample.py), followed by a sample Slurm job script (assumed to be saved as mpi.job):
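from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()   # total number of MPI processes
rank = comm.Get_rank()   # ID of this process, from 0 to size-1

if rank == 0:
    # the master (rank 0) sends the message to every other process
    msg = "Hello, world"
    for i in range(1, size):
        comm.send(msg, dest=i)
        print(f"master send message {msg} to process {i}")
else:
    # every other process waits for the message from the master
    s = comm.recv(source=0)
    print(f"process {rank} received {s}")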
#!/bin/bash
##
#SBATCH --job-name=mpi_job
#SBATCH --output=out.txt
##
#SBATCH --mail-type=ALL
#SBATCH --mail-user=zhiwei.wang@utsa.edu
#SBATCH --ntasks=100
#SBATCH --nodes=2
. /etc/profile.d/modules.sh
# Load one of these
module load shared openmpi/3.0.0
module load python/3.6.1
mpirun -n $SLURM_NTASKS python3 mpi-sample.py
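Submit the job from a login node and, after it has finished, view the content of the output file. The listing below shows ranks 1 through 7 only; with --ntasks=100, the same pattern continues up to rank 99: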
[abc123@shamu ~]$sbatch mpi.job
[abc123@shamu ~]$cat out.txt
master send message Hello, world to process 1
master send message Hello, world to process 2
master send message Hello, world to process 3
master send message Hello, world to process 4
master send message Hello, world to process 5
master send message Hello, world to process 6
master send message Hello, world to process 7
process 1 received Hello, world
process 2 received Hello, world
process 3 received Hello, world
process 4 received Hello, world
process 5 received Hello, world
process 6 received Hello, world
process 7 received Hello, world
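The same send/receive pattern can be extended from passing a message to sharing real work. The sketch below is a hypothetical example, not part of the original sample: it splits the loop from the single-node section across MPI ranks and collects the per-rank results on rank 0 with comm.gather. The helper iterations_for, the constant TOTAL_ITERATIONS, and the single-process fallback used when mpi4py cannot be imported are assumptions added for illustration.

```python
try:
    from mpi4py import MPI
except ImportError:
    # Assumption: without mpi4py the sketch behaves like a one-rank job,
    # so the helper functions can still be tried on a machine without MPI.
    MPI = None

# Hypothetical size, kept small so the sketch runs quickly.
TOTAL_ITERATIONS = 600_000

def iterations_for(rank, size, total):
    """Number of loop iterations assigned to `rank` when `total` is split over `size` ranks."""
    base, extra = divmod(total, size)
    return base + (1 if rank < extra else 0)

def workload(start_value, iterations):
    # Same CPU-bound loop as in the single-node examples.
    x = start_value
    for _ in range(iterations):
        x = x + 3.14 * 3.14
    return x

def main():
    if MPI is None:
        comm, rank, size = None, 0, 1
    else:
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

    # Every rank computes only its own share of the loop.
    local = workload(rank, iterations_for(rank, size, TOTAL_ITERATIONS))

    # gather returns a list with one entry per rank on the root, and None elsewhere.
    results = [local] if comm is None else comm.gather(local, root=0)

    if rank == 0:
        for r, value in enumerate(results):
            print(f"rank {r}: {value}")

if __name__ == "__main__":
    main()
```

Saved under a name of your choice (for example mpi-workload.py, a hypothetical file name), it can be launched from the same kind of job script by changing the last line to mpirun -n $SLURM_NTASKS python3 mpi-workload.py.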
-- Zhiwei - 10 Aug 2020