Some notes from the MPI course at EPCC, Summer 2016

MPI is the Message Passing Interface, a standard and series of libraries for writing parallel programs to run on distributed memory computing systems. Distributed memory systems are essentially a series of network computers, or compute nodes, each with their own processors and memory. The key difference between distributed memory systems and their shared-memory counterparts is that each compute node under the distributed model (MPI) has its own memory address space, and special messages must be sent between each node to exchange data. The message sending and receiving is a key part of writing MPI programs.

A very basic example: calculating pi

There are dozens of hello world example MPI programs, but these are fairly trivial examples and don’t really show how a real computing problem might be broken up and shared between compute nodes (Do you really need a supercomputer to std::cout “Hello, World!”?). This example uses an approximation for calculating pi to many significant digits. The approximation is given by:

\[\\begin{align} \frac{\pi}{4} & = \arctan(1)\;=\;\int_0^1 \frac 1{1+x^2} \, dx \\ & = \int_0^1\left(\sum_{k=0}^n (-1)^k x^{2k}+\frac{(-1)^{n+1}\,x^{2n+2} }{1+x^2}\right) \, dx \\\\ & = \sum_{k=0}^n \frac{(-1)^k}{2k+1} +(-1)^{n+1}\int_0^1\frac{x^{2n+2}}{1+x^2} \, dx. \\end{align} \\frac{\pi}4\;=\;\sum_{k=0}^\infty\frac{(-1)^k}{2k+1}\]

where the answer becomes more acurate with increasing N.

The pseudo-code for the partial sum of pi for each iteration would be:

this_bit_of_pi = this_bit_of_pi + 1.0 / ( 1.0 + ( (i-0.5) / N)*((i-0.5) / N) );

For a basic MPI-C++ program, the first bit of the program looks like this, including the MPI header and some variables declared:

#include "mpi.h"
#include <iostream>
#include <cmath>

int main()
  int rank;
  int size;

  int istart;
  int istop;

  MPI_Status status;
  MPI_Comm comm;
  comm = MPI_COMM_WORLD;

  int N = 840;
  double pi;

  double receive_pi;
  // this is used for later computation of pi
  // from partial sums


  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  std::cout << "Hello from rank " << rank << std::endl;

  //.. continues later

First, some variables are created to hold the rank, i.e. the current process, and the size, which is used to represent the total number of ranks, or processes.

istart and istop will be used to calculate the iteration loop counter start and stop positions for each separate process.

Secondly, the MPI_Status variable is defined, then the MPI_Comm type variable. These are special types defined in the MPI headers that relate to the message passing interface.

The MPI environment is initialised with MPI_Init(NULL, NULL);. The you can initialise the rank and size variables using the corresponding commands in MPI_Comm_rank() and MPI_Comm_size, passing a reference to the communicator object, and the respective variable.

By convention, the process with rank = 1 is used as the master process, and does the managing of collating data once it has been processed by the other ranks/processes.

The parameter N is used in our pi approximation and will determine the number of iterations we do. It is used to calculate the number of iterations distributed to each process:

  // Assuming N is exactly divisible by the number of processes
  istart = N/size * rank + 1;
  istop = istart + N/size -1;

Then each loop will calculate a partial sum of pi from its given subset of N. Because the MPI processes have been initailised, as well as the variables for rank and size, the code below will have unique values of istart and istop for each rank/process:

  // Check how the iterations have been divided up among 
  // the processes:
  std::cout << "On rank " << rank << " istart = " << istart \
            << ", istop = " << istop << std::endl;

  double this_bit_of_pi=0.0;
  for (int i=istart; i<=istop; i++)
    this_bit_of_pi = this_bit_of_pi + 1.0 / ( 1.0 + ( (i-0.5) / N)*((i-0.5) / N) );

  std::cout << "On rank " << rank << "Partial pi = " << this_bit_of_pi << std::endl;

Our partial sums have now been calculated, the last task is to collate them all together on the master process (rank=0), and sum them up:

  if (rank == 0) // By convention, rank 0 is the master
    // take the value of pi from the Master's local copy of the 
    // partial sum
    pi = this_bit_of_pi;
    for (int source = 1; source < size; source++)
      // Now we want to grab the parial pi sums from ranks
      // greater than zero (so start at source =1)
      int tag = 0;

      MPI_Recv(&receive_pi, 1, MPI_DOUBLE, MPI_ANY_SOURCE, tag, comm, &status);

      std::cout << "MASTER (rank 0) receiving from rank " << status.MPI_SOURCE \
                << std::endl;

      // Now add to received partial sum to the running total
      pi += receive_pi;

The key part of the above code is that we are telling the master process (rank=0) to be ready to receive all the partial sums of pi. MPI requires both send and receive calls. However, the send command has to be issued from the respective processes themselves (There’s no ‘get’ command per se, it’s a two-stage process that has to be set up on each process correctly).

Now we tell the non-master processes to send their partial pi sums:

  else // i.e. else not rank=0...not on the master process
    // Now send the other partial sums from each process to the master
    int tag = 0;

    std::cout << "Rank " << rank << " sending to rank 0 (MASTER)" << std::endl;
    // Use a synchronous send (Ssend)
    MPI_Ssend(&this_bit_of_pi, 1, MPI_DOUBLE, 0, tag, comm);

So now we’ve issued a command to send all the bits of pi, specified the data type, MPI_DOUBLE and passed the other arguments required by MPI_Ssend().

Finally, we can do the last bit of the calculation needed in the original formual by multiplying by four. Then finalise the MPI processes.

  pi = pi * 4.0/(static_cast<double>(N));



The full program is given in a github Gist, which I will either embedd or provide a link to soon.