mpi4py on HPC Clusters

mpi4py provides a Python interface to MPI or the Message-Passing Interface. It is useful for parallelizing Python scripts. Also be aware of multiprocessingdask and Slurm job arrays.

First, do not use conda install mpi4py. This will install its own version of MPI instead of using one of the versions that already exists on the cluster. It will work, but it will be cripplingly slow. Second, while the module avail mpi4py command shows mpi4py environment modules, they may only be used for Python 2 code. Even in that case you should probably ignore the modules and follow the directions below with the appropriate modifications for Python 2.

The proper way to install mpi4py is to use pip together with one of the MPI libraries that already exists on the cluster. What follows are step-by-step instructions on how to set up mpi4py on the Tiger cluster. The steps are similar for any other cluster that uses modules (e.g., Della or Adroit).

1. Connect to Tiger

$ ssh <YourNetID>

where <YourNetID> is your Princeton University ID (e.g., aturing).

2. Create and Activate a Conda Environment

Load an Anaconda module and create a Python environment:

$ module load anaconda3/2020.11
$ conda create --name fast-mpi4py python=3.8 -y
$ conda activate fast-mpi4py

You should list all your Conda packages on the "conda create" line above so that the dependencies can be worked out correctly from the start. Later in the procedure we will use pip. One should carry out all the needed Conda installs before using pip.

3. Install mpi4py in the Conda Environment

Load the MPI version you want to use. We recommend using Open MPI in this case. In order to get a list of all the available Open MPI versions on the cluster, run module avail openmpi. Good choices among others are openmpi/gcc/4.0.4/64openmpi/gcc/3.1.5/64 or openmpi/gcc/2.0.2/64. Do not use openmpi/gcc/1.10.2/64. Replace <x.y.z> in the line below with your choice and then run the command:

$ module load openmpi/gcc/<x.y.z> rh/devtoolset/8

You will also need to replace <x.y.z> in the Slurm script which is given below. The rh module provides a more recent version of the GCC compiler suite. You should not load this module on Stellar or Traverse since the system GCC is sufficient. Set the loaded version of MPI to be used with mpi4py:

$ export MPICC=$(which mpicc)

Note: You can check that this variable was set correctly by running echo $MPICC and making sure that it prints something like: /usr/local/openmpi/<x.y.z>/gcc/x86_64/bin/mpicc.

Finally, we install mpi4py using pip:

$ pip install mpi4py

Note: If you receive a "Requirement already satisfied" message then you may have pre-loaded the mpi4py environment module by mistake or already installed the package. Make sure you are not loading this environment module in your .bashrc file.

When the installation is finished, check that it was properly installed by running

$ python -c "import mpi4py"

If the above command gives no error, mpi4py was successfully installed.


Hello World: mpi4py

In order to test our freshly installed mpi4py, we will run a simple "Hello World!" example. The following script is called and it uses mpi4py to go across multiple processors/nodes.

# usage: python

from mpi4py import MPI
import sys

def print_hello(rank, size, name):
  msg = "Hello World! I am process {0} of {1} on {2}.\n"
  sys.stdout.write(msg.format(rank, size, name))

if __name__ == "__main__":
  size = MPI.COMM_WORLD.Get_size()
  rank = MPI.COMM_WORLD.Get_rank()
  name = MPI.Get_processor_name()

  print_hello(rank, size, name)

In order to submit this script to the job scheduler, we use the following Slurm script, called Replace <x.y.z> in the module load line with the choice you made above. In this case we are running the above Python script on 4 processors on the same node (it works just the same if you need to go across nodes).

#SBATCH --job-name=mpi4py-test   # create a name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=4               # total number of tasks
#SBATCH --cpus-per-task=1        # cpu-cores per task
#SBATCH --mem-per-cpu=1G         # memory per cpu-core
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)

module purge
module load anaconda3/2020.11 openmpi/gcc/<x.y.z>  # REPLACE <x.y.z>
conda activate fast-mpi4py

srun python

Note: We explicitly load the anaconda3 and openmpi modules and activate the fast-mpi4py environment before running the script. These steps are necessary for the script to run.

We submit the job to the scheduler with:

$ sbatch

After the job finishes, we can check the output Slurm file, called slurm-xxxxxx.out, where xxxxxx is the Slurm job id. Inspecting this file, we see the following:

Hello World! I am process 0 of 4 on tiger-i26c2n1.
Hello World! I am process 1 of 4 on tiger-i26c2n1.
Hello World! I am process 2 of 4 on tiger-i26c2n1.
Hello World! I am process 3 of 4 on tiger-i26c2n1.

Which shows that our script ran successfully on 4 processors on tiger-i26c2n1 with each processor writing its status to STDOUT.

Bandwidth Test

Try running the Python script here to measure the bandwidth as a function of message size. Be sure to use srun instead of mpirun.