MPI Basics
MPI (Message Passing Interface) is a standardized and portable specification that defines the syntax and semantics of library routines for writing portable parallel programs in C, C++, and Fortran. The detailed MPI documents can be found at
https://www.mpi-forum.org/docs/. Unlike multi-threading, MPI allows parallelization across multiple compute nodes on an HPC cluster. On Shamu, several MPI implementations are installed, including OpenMPI, MPICH, and MVAPICH. To use a specific version of an MPI package, load the corresponding module for it. For example, the following command loads OpenMPI 4.0.5:
abc123@shamu ~]$ module load openmpi/4.0.5
Here is a simple MPI C program. It creates multiple MPI processes, the number depending on how the program is executed. The master process (with myid equal to 0) sends the integer 2 to each child process and then waits for a reply message from every child. Each child process (with myid not equal to 0) receives the integer sent by the master, multiplies it by its myid, and sends the result back to the master process.
#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#define TAG 0
int main(int argc, char *argv[])
{
    int numprocs;
    int myid;
    int i, data, temp, result, total;
    MPI_Status stat;

    /* MPI programs start with MPI_Init; all 'N' processes exist thereafter */
    MPI_Init(&argc, &argv);
    /* find out how big the SPMD world is */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    /* and what this process's rank is */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* At this point, all processes are running equivalently; the rank
     * distinguishes their roles, with rank 0 often used specially... */
    if (myid == 0)
    {
        total = 0;
        printf("%d: We have %d processors\n", myid, numprocs);
        /* parent sends out data */
        for (i = 1; i < numprocs; i++) {
            //data = rand();
            data = 2;
            MPI_Send(&data, 1, MPI_INT, i, TAG, MPI_COMM_WORLD);
            printf("parent send %d to process %d for processing\n", data, i);
        }
        /* receive results */
        for (i = 1; i < numprocs; i++) {
            MPI_Recv(&result, 1, MPI_INT, i, TAG, MPI_COMM_WORLD, &stat);
            printf("result %d received from process %d\n", result, i);
        }
        //printf("total %ld\n", total);
    }
    else
    {
        /* workload */
        MPI_Recv(&temp, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &stat);
        temp = temp * myid;
        MPI_Send(&temp, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD);
    }

    /* MPI programs end with MPI_Finalize; this is a weak synchronization point */
    MPI_Finalize();
    return 0;
}
Assume the file is named mpi-sample1.c. The following command will compile the MPI source code and create an executable mpi-sample1:
abc123@shamu ~]$ mpicc mpi-sample1.c -o mpi-sample1
Users can run an MPI program interactively or submit it as a batch job to the cluster. The following command runs the compiled MPI program above on a single node with a total of 5 MPI processes in interactive mode:
abc123@shamu ~]$ mpirun -n 5 mpi-sample1
0: We have 5 processors
parent send 2 to process 1 for processing
parent send 2 to process 2 for processing
parent send 2 to process 3 for processing
parent send 2 to process 4 for processing
result 2 received from process 1
result 4 received from process 2
result 6 received from process 3
result 8 received from process 4
Please note: if you use openmpi/4.0.4, you will see the following warning message when you execute the above code. To eliminate the warning, add
--mca btl '^openib' after mpirun.
openmpi/4.0.5 was built to suppress this warning.
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.
Local host: compute032
Local adapter: mlx4_0
Local port: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: compute032
Local device: mlx4_0
--------------------------------------------------------------------------
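For example, with openmpi/4.0.4 loaded, the interactive 5-process run shown earlier would add the option right after mpirun (a sketch; only the extra --mca option differs from the earlier command):
abc123@shamu ~]$ mpirun --mca btl '^openib' -n 5 mpi-sample1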
Although MPI allows a user to run an MPI program across multiple nodes interactively, this is NOT permitted on Shamu in order to avoid the associated complexities. A batch job is the only option when a multi-node run is needed. Here is an example job script.
In this job script, the mpi-sample1 program will create 40 processes on 2 nodes. The results will be written to a file named my_output_file.txt:
#!/bin/bash
#
#SBATCH --job-name=my_job
#SBATCH --output=my_output_file.txt # Delete this line if you want the output in the default slurm-jobID.out format, whose name changes with every job submission.
#SBATCH --partition=defq # defq is the default queue, equivalent to all.q in SGE scripts
#SBATCH --time=00:05:00 # Time limit in hrs:min:sec; an estimate of how long it will take to complete the job.
#SBATCH --ntasks=40 # The number of processes in your parallel job
#SBATCH --nodes=2 # The minimum number of nodes the tasks above will run on. Each node on Shamu can accommodate
# at least 32 tasks. Please request as few nodes as possible to conserve computing resources; 20 nodes is the maximum allowed on Shamu.
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my-email@utsa.edu
. /etc/profile.d/modules.sh
module load openmpi/4.0.5
mpirun -n $SLURM_NTASKS ./mpi-sample1
Assuming the script file is named my-mpi-script.job, use the following command to submit the batch job
on a login node:
abc123@shamu ~]$ sbatch my-mpi-script.job
abc123@shamu ~]$ cat my_output_file.txt
Loading openmpi/4.0.5
--------------------------------------------------------------------------
0: We have 40 processors
parent send 2 to process 1 for processing
parent send 2 to process 2 for processing
parent send 2 to process 3 for processing
parent send 2 to process 4 for processing
parent send 2 to process 5 for processing
parent send 2 to process 6 for processing
parent send 2 to process 7 for processing
parent send 2 to process 8 for processing
parent send 2 to process 9 for processing
parent send 2 to process 10 for processing
parent send 2 to process 11 for processing
parent send 2 to process 12 for processing
parent send 2 to process 13 for processing
parent send 2 to process 14 for processing
parent send 2 to process 15 for processing
parent send 2 to process 16 for processing
parent send 2 to process 17 for processing
parent send 2 to process 18 for processing
parent send 2 to process 19 for processing
parent send 2 to process 20 for processing
parent send 2 to process 21 for processing
parent send 2 to process 22 for processing
parent send 2 to process 23 for processing
parent send 2 to process 24 for processing
parent send 2 to process 25 for processing
parent send 2 to process 26 for processing
parent send 2 to process 27 for processing
parent send 2 to process 28 for processing
parent send 2 to process 29 for processing
parent send 2 to process 30 for processing
parent send 2 to process 31 for processing
parent send 2 to process 32 for processing
parent send 2 to process 33 for processing
parent send 2 to process 34 for processing
parent send 2 to process 35 for processing
parent send 2 to process 36 for processing
parent send 2 to process 37 for processing
parent send 2 to process 38 for processing
parent send 2 to process 39 for processing
result 2 received from process 1
result 4 received from process 2
result 6 received from process 3
result 8 received from process 4
result 10 received from process 5
result 12 received from process 6
result 14 received from process 7
result 16 received from process 8
result 18 received from process 9
result 20 received from process 10
result 22 received from process 11
result 24 received from process 12
result 26 received from process 13
result 28 received from process 14
result 30 received from process 15
result 32 received from process 16
result 34 received from process 17
result 36 received from process 18
result 38 received from process 19
result 40 received from process 20
result 42 received from process 21
result 44 received from process 22
result 46 received from process 23
result 48 received from process 24
result 50 received from process 25
result 52 received from process 26
result 54 received from process 27
result 56 received from process 28
result 58 received from process 29
result 60 received from process 30
result 62 received from process 31
result 64 received from process 32
result 66 received from process 33
result 68 received from process 34
result 70 received from process 35
result 72 received from process 36
result 74 received from process 37
result 76 received from process 38
result 78 received from process 39
Advanced MPI Programming Tips
- Avoid Deadlocks
MPI_Recv() is a blocking function
MPI_Recv() is a blocking function: program flow stops at the call until a matching message has been received.
MPI_Send() is usually non-blocking, but this is not guaranteed
MPI implementations try to make MPI_Send() non-blocking, but the call may block when the message being sent is large. Users must be aware of this, because a deadlock can occur, as shown in the following case:
In process 0:
.
.
MPI_Send(a, 1)
MPI_Recv(a, 1)
.
.
In process 1:
.
.
MPI_Send(a, 0)
MPI_Recv(a, 0)
.
.
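If both messages are large enough that the sends block, neither process ever reaches its receive and both hang. One common remedy, sketched below with assumed names (an int a to send, an int b to receive into, tag 0, and the MPI_Status stat from the earlier sample), is to reverse the order of the two calls on one of the ranks:
if (myid == 0) {
    MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* rank 0 sends first ...    */
    MPI_Recv(&b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &stat);   /* ... then receives         */
} else if (myid == 1) {
    MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat);   /* rank 1 receives first ... */
    MPI_Send(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);          /* ... then sends            */
}
MPI_Sendrecv() combines the send and the receive in a single call and avoids this problem as well.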
- Use Non-Blocking Sends and Receives
The blocking send and receive approach can make the program control flow simpler, but it comes with a performance cost. Users should try the non-blocking send and receive functions if the cost of blocking is a concern. This approach can also be used to avoid deadlocks; a short sketch follows the list below.
Here is a list of the non-blocking functions:
MPI_Isend()
MPI_Ibsend()
MPI_Issend()
MPI_Irsend()
MPI_Irecv()
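As a minimal sketch (assuming two ranks exchanging one int each, with myid defined as in the earlier sample), the receive and the send are posted first and the program waits for both requests to complete before reusing the buffers:
MPI_Request reqs[2];
MPI_Status  stats[2];
int other = (myid == 0) ? 1 : 0;   /* assumes exactly two ranks */
int sendval = myid, recvval;
MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);  /* post the receive */
MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);  /* start the send   */
/* ... useful computation can overlap with the communication here ... */
MPI_Waitall(2, reqs, stats);   /* the buffers are safe to reuse only after this call returns */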
- Communication Modes of MPI_Send
Besides the non-blocking MPI_Isend functions, MPI_Send has four communication modes that users can choose from to improve an MPI program's efficiency (a buffered-send sketch follows the list):
- Standard Mode (MPI_Send): usually does not block for small messages, even if a matching receive has not yet been posted on the other process.
- Buffered Mode (MPI_Bsend): similar to standard mode, but the message is copied into a user-attached buffer, so the call never blocks.
- Synchronous Mode (MPI_Ssend): the send returns only after a matching receive has started.
- Ready Mode (MPI_Rsend): the send works only if a matching receive has already been posted; otherwise the behavior is undefined.
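As a rough sketch of buffered mode (assuming it appears inside an initialized MPI program, with stdlib.h included for malloc and the data/TAG names from the sample program), a buffer has to be attached before MPI_Bsend() can copy the message into it:
int bufsize = MPI_BSEND_OVERHEAD + sizeof(int);   /* room for one int message plus MPI overhead */
char *buf = malloc(bufsize);
MPI_Buffer_attach(buf, bufsize);                  /* hand the buffer to MPI */
MPI_Bsend(&data, 1, MPI_INT, 1, TAG, MPI_COMM_WORLD);   /* copies data into buf and returns right away */
MPI_Buffer_detach(&buf, &bufsize);                /* waits until all buffered messages have been delivered */
free(buf);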
- Synchronization with MPI_Barrier()
When writing an MPI program, many algorithms require the processes to reach some controlled state before proceeding. The MPI_Barrier() function can achieve this goal, as shown in the following example:
for (i = 1; i < 5; i++) {
    MPI_Barrier(MPI_COMM_WORLD);
    if (i == my_id)
        printf("Process %d's result is %d.\n", my_id, result);
}
Barriers can help with debugging and avoid possible race conditions. However, unnecessary barriers lead to wasted idle time and should eventually be eliminated.
- Collective Communications
Instead of point-to-point messages, the collective communication routines not only provide convenience but also offer much better efficiency. Here is a list of commonly used collective communication routines (a short scatter/gather sketch follows the list):
MPI_Scatter()
MPI_Gather()
MPI_Scatterv()
MPI_Gatherv()
MPI_Reduce()
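As a rough sketch of MPI_Scatter() and MPI_Gather() (assuming the numprocs, my_id, and i variables seen in the earlier examples, plus stdlib.h for malloc), the root hands one int to every rank and later collects one result from each rank in rank order:
int *senddata = NULL, *results = NULL;
int item, answer;
if (my_id == 0) {                         /* only the root needs the full-size arrays */
    senddata = malloc(numprocs * sizeof(int));
    results  = malloc(numprocs * sizeof(int));
    for (i = 0; i < numprocs; i++)
        senddata[i] = 2;
}
MPI_Scatter(senddata, 1, MPI_INT, &item, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* every rank gets one int */
answer = item * my_id;
MPI_Gather(&answer, 1, MPI_INT, results, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* root gathers one int per rank */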
Here is an example for MPI_Reduce().
Please note that the function has to be called by all processes, including the root process (with my_id equal to 0).
float sum;
MPI_Reduce(&local_value, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
// Print the result
if (my_id == 0) {
    printf("The global sum = %f\n", sum);
}
Besides MPI_SUM, there are other reduction operations:
- MPI_MAX - Returns the maximum element.
- MPI_MIN - Returns the minimum element.
- MPI_SUM - Sums the elements.
- MPI_PROD - Multiplies all elements.
- MPI_LAND - Performs a logical and across the elements.
- MPI_LOR - Performs a logical or across the elements.
- MPI_BAND - Performs a bitwise and across the bits of the elements.
- MPI_BOR - Performs a bitwise or across the bits of the elements.
- MPI_MAXLOC - Returns the maximum value and the rank of the process that owns it.
- MPI_MINLOC - Returns the minimum value and the rank of the process that owns it.
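As a rough sketch of MPI_MAXLOC (assuming it runs inside an initialized MPI program with my_id and a per-process float local_value defined), the value and the owning rank are passed together using the predefined MPI_FLOAT_INT pair type:
struct { float value; int rank; } local, global;
local.value = local_value;   /* this process's own result (assumed variable) */
local.rank  = my_id;
MPI_Reduce(&local, &global, 1, MPI_FLOAT_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
if (my_id == 0)
    printf("The maximum value %f is owned by process %d\n", global.value, global.rank);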
-- Zhiwei - 03 Sep 2020