mpi4py on HPC Clusters

mpi4py provides a Python interface to MPI or the Message-Passing Interface. It is useful for parallelizing Python scripts. Also be aware of multiprocessingdask and Slurm job arrays.

Do not use conda install mpi4py. This will install its own version of MPI instead of using one of the optimized versions that exist on the cluster. The version that comes with conda will work, but it will be very slow.

The proper way to install mpi4py is to use pip together with one of the MPI libraries that already exists on the cluster. What follows are step-by-step instructions on how to set up mpi4py on the Tiger cluster. The steps are similar for any other cluster that uses modules (e.g., Della or Adroit).

1. Connect to Tiger

$ ssh <YourNetID>@tiger.princeton.edu

where <YourNetID> is your Princeton University ID (e.g., aturing).

2. Create and Activate a Conda Environment

Load an Anaconda module and create a Python environment (see note below if using Python 3.9+):

$ module load anaconda3/2023.9
$ conda create --name fast-mpi4py python=3.8 -y
$ conda activate fast-mpi4py

You should list all your Conda packages on the "conda create" line above so that the dependencies can be worked out correctly from the start. Later in the procedure we will use pip. One should carry out all the needed Conda installs before using pip.

If you need Python 3.9 or above then you will probably encounter this error: "Could not build wheels for mpi4py". Please note that you will need a workaround when you install mpi4py using pip in Step 3 below.

3. Install mpi4py in the Conda Environment

Load the MPI version you want to use. We recommend using Open MPI in this case. In order to get a list of all the available Open MPI versions on the cluster, run module avail openmpi. Good choices among others are openmpi/gcc/4.1.2, openmpi/gcc/4.0.4/64openmpi/gcc/3.1.5/64 or openmpi/gcc/2.0.2/64. Replace <x.y.z> in the line below with your choice and then run the command:

$ module load openmpi/gcc/<x.y.z>

You will also need to replace <x.y.z> in the Slurm script which is given below. The rh module (rh/devtoolset/8) needs to be loaded on Tiger to provide a more recent version of the GCC compiler suite. This is not needed for any other cluster.

Set the loaded version of MPI to be used with mpi4py:

$ export MPICC=$(which mpicc)

Note: You can check that this variable was set correctly by running echo $MPICC and making sure that it prints something like: /usr/local/openmpi/<x.y.z>/gcc/x86_64/bin/mpicc.

Finally, we install mpi4py using pip:

$ pip install mpi4py --no-cache-dir

Note: If you receive a "Requirement already satisfied" message then you may have pre-loaded the mpi4py environment module by mistake or already installed the package. Make sure you are not loading this environment module in your .bashrc file.

Workaround for Python 3.9+

If you remembered to look for this workaround or tried the above but received the "Could not build wheels for mpi4py" error, follow the steps below to successfully complete installation by swapping which version of ld pip will use:

cd /home/<NetID>/.conda/envs/fast-mpi4py/compiler_compat
rm -f ld
ln -s /usr/bin/ld ld

Now you can rerun the pip install command above and it should succeed. Afterwards, you should point "ld" in your conda environment back to what it was:

cd /home/<NetID>/.conda/envs/fast-mpi4py/compiler_compat
rm -f ld
ln -s ../bin/x86_64-conda-linux-gnu-ld ld

(The above was based on this Stack Overflow answer)

Verify Installation

Regardless of which version of Python you're using, you can check that mpi4py was properly installed by running

$ python -c "import mpi4py"

If the above command gives no error, mpi4py was successfully installed.

Hello World: mpi4py

In order to test our freshly installed mpi4py, we will run a simple "Hello World!" example. The following script is called hello_mpi.py and it uses mpi4py to go across multiple processors/nodes.

# hello_mpi.py:
# usage: python hello_mpi.py

from mpi4py import MPI
import sys

def print_hello(rank, size, name):
  msg = "Hello World! I am process {0} of {1} on {2}.\n"
  sys.stdout.write(msg.format(rank, size, name))

if __name__ == "__main__":
  size = MPI.COMM_WORLD.Get_size()
  rank = MPI.COMM_WORLD.Get_rank()
  name = MPI.Get_processor_name()

  print_hello(rank, size, name)

In order to submit this script to the job scheduler, we use the following Slurm script, called run.sh. Replace <x.y.z> in the module load line with the choice you made above. In this case we are running the above Python script on 4 processors on the same node (it works just the same if you need to go across nodes).

#!/bin/bash
#SBATCH --job-name=mpi4py-test   # create a name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=4               # total number of tasks
#SBATCH --cpus-per-task=1        # cpu-cores per task
#SBATCH --mem-per-cpu=1G         # memory per cpu-core
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2023.9
module load openmpi/gcc/<x.y.z>  # REPLACE <x.y.z>
conda activate fast-mpi4py

srun python hello_mpi.py

Note: We explicitly load the anaconda3 and openmpi modules and activate the fast-mpi4py environment before running the hello_mpi.py script. These steps are necessary for the script to run.

We submit the job to the scheduler with:

$ sbatch run.sh

After the job finishes, we can check the output Slurm file, called slurm-xxxxxx.out, where xxxxxx is the Slurm job id. Inspecting this file, we see the following:

Hello World! I am process 0 of 4 on tiger-i26c2n1.
Hello World! I am process 1 of 4 on tiger-i26c2n1.
Hello World! I am process 2 of 4 on tiger-i26c2n1.
Hello World! I am process 3 of 4 on tiger-i26c2n1.

Which shows that our script ran successfully on 4 processors on tiger-i26c2n1 with each processor writing its status to STDOUT.

Bandwidth Test

Try running the Python script here to measure the bandwidth as a function of message size. Be sure to use srun instead of mpirun.