mpi4py on HPC Clusters

mpi4py provides a Python interface to MPI or the Message-Passing Interface. It is useful for parallelizing Python scripts. Also be aware of multiprocessingdask and Slurm job arrays.

Do not use conda install mpi4py. This will install its own version of MPI instead of using one of the optimized versions that exist on the cluster. The version that comes with conda will work, but it will be very slow.

The proper way to install mpi4py is to use pip together with one of the MPI libraries that already exists on the cluster. What follows are step-by-step instructions on how to set up mpi4py on the Tiger cluster. The steps are similar for any other cluster that uses modules (e.g., Della or Adroit).

1. Connect to Tiger

$ ssh <YourNetID>

where <YourNetID> is your Princeton University ID (e.g., aturing).

2. Create and Activate a Conda Environment

Load an Anaconda module and create a Python environment (see note below if using Python 3.9+):

$ module load anaconda3/2023.9
$ conda create --name fast-mpi4py python=3.8 -y
$ conda activate fast-mpi4py

You should list all your Conda packages on the "conda create" line above so that the dependencies can be worked out correctly from the start. Later in the procedure we will use pip. One should carry out all the needed Conda installs before using pip.

If you need Python 3.9 or above then you will probably encounter this error: "Could not build wheels for mpi4py". The solution is explained here but the last command when revoking the change should be:

$ ln -s ../bin/x86_64-conda-linux-gnu-ld ld

3. Install mpi4py in the Conda Environment

Load the MPI version you want to use. We recommend using Open MPI in this case. In order to get a list of all the available Open MPI versions on the cluster, run module avail openmpi. Good choices among others are openmpi/gcc/4.1.2, openmpi/gcc/4.0.4/64openmpi/gcc/3.1.5/64 or openmpi/gcc/2.0.2/64. Replace <x.y.z> in the line below with your choice and then run the command:

$ module load openmpi/gcc/<x.y.z>

You will also need to replace <x.y.z> in the Slurm script which is given below. The rh module (rh/devtoolset/8) needs to be loaded on Tiger to provide a more recent version of the GCC compiler suite. This is not needed for any other cluster.

Set the loaded version of MPI to be used with mpi4py:

$ export MPICC=$(which mpicc)

Note: You can check that this variable was set correctly by running echo $MPICC and making sure that it prints something like: /usr/local/openmpi/<x.y.z>/gcc/x86_64/bin/mpicc.

Finally, we install mpi4py using pip:

$ pip install mpi4py --no-cache-dir

Note: If you receive a "Requirement already satisfied" message then you may have pre-loaded the mpi4py environment module by mistake or already installed the package. Make sure you are not loading this environment module in your .bashrc file.

When the installation is finished, check that it was properly installed by running

$ python -c "import mpi4py"

If the above command gives no error, mpi4py was successfully installed.

**NOTE** As mentioned above, these instructions work when Python 3.8 is used in the virtual environment. If you need Python 3.9 or above then you will probably encounter this error: "Could not build wheels for mpi4py". The solution is explained here but the last command when revoking the change should be:

$ ln -s ../bin/x86_64-conda-linux-gnu-ld ld

Hello World: mpi4py

In order to test our freshly installed mpi4py, we will run a simple "Hello World!" example. The following script is called and it uses mpi4py to go across multiple processors/nodes.

# usage: python

from mpi4py import MPI
import sys

def print_hello(rank, size, name):
  msg = "Hello World! I am process {0} of {1} on {2}.\n"
  sys.stdout.write(msg.format(rank, size, name))

if __name__ == "__main__":
  size = MPI.COMM_WORLD.Get_size()
  rank = MPI.COMM_WORLD.Get_rank()
  name = MPI.Get_processor_name()

  print_hello(rank, size, name)

In order to submit this script to the job scheduler, we use the following Slurm script, called Replace <x.y.z> in the module load line with the choice you made above. In this case we are running the above Python script on 4 processors on the same node (it works just the same if you need to go across nodes).

#SBATCH --job-name=mpi4py-test   # create a name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=4               # total number of tasks
#SBATCH --cpus-per-task=1        # cpu-cores per task
#SBATCH --mem-per-cpu=1G         # memory per cpu-core
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>

module purge
module load anaconda3/2023.9
module load openmpi/gcc/<x.y.z>  # REPLACE <x.y.z>
conda activate fast-mpi4py

srun python

Note: We explicitly load the anaconda3 and openmpi modules and activate the fast-mpi4py environment before running the script. These steps are necessary for the script to run.

We submit the job to the scheduler with:

$ sbatch

After the job finishes, we can check the output Slurm file, called slurm-xxxxxx.out, where xxxxxx is the Slurm job id. Inspecting this file, we see the following:

Hello World! I am process 0 of 4 on tiger-i26c2n1.
Hello World! I am process 1 of 4 on tiger-i26c2n1.
Hello World! I am process 2 of 4 on tiger-i26c2n1.
Hello World! I am process 3 of 4 on tiger-i26c2n1.

Which shows that our script ran successfully on 4 processors on tiger-i26c2n1 with each processor writing its status to STDOUT.

Bandwidth Test

Try running the Python script here to measure the bandwidth as a function of message size. Be sure to use srun instead of mpirun.