How to use MPI with OpenMP or multi-threaded Intel MKL

Normally, when you follow the instructions in each cluster's tutorial, every CPU-core reserved via Slurm is assigned to a separate MPI process. However, when an application combines MPI (usually between nodes) with OpenMP (within nodes), different instructions must be followed.

One common case of OpenMP in an MPI job is Intel MKL. When configured to use its multi-threaded layer, MKL uses OpenMP internally to run mathematical operations in parallel (see the MKL configuration instructions). By default, OpenMP and multi-threaded MKL will use all CPU-cores in a node, but it may be desirable to allocate fewer when an application makes lighter use of MKL or simply needs fewer parallel threads per process.

In either case, when using an MPI library, you must configure the number of threads per process as follows:

  • Set the number of MPI tasks by specifying the number of nodes (#SBATCH --nodes) and the number of MPI processes per node (#SBATCH --ntasks-per-node).
  • Specify the number of OpenMP threads per MPI process (#SBATCH --cpus-per-task).
  • Set OMP_NUM_THREADS to the number of OpenMP threads to be created for each MPI process.
  • Invoke srun as normal.

The job below will have 8 MPI processes (4 per node), each with 10 OpenMP threads, for a total of 80 CPU-cores:

#!/bin/bash
#SBATCH --job-name=mpi_omp       # create a short name for your job
#SBATCH --nodes=2                # node count
#SBATCH --ntasks-per-node=4      # total number of tasks per node
#SBATCH --cpus-per-task=10       # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --time=00:00:30          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>

module purge
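module load intel/ intel-mpi/intel/2019.7

# one OpenMP thread per CPU-core allocated to each MPI process
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./myprog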

When the intel/ module is loaded, the corresponding intel-mkl module is loaded automatically.
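
To confirm that a job received the intended layout, a few lines like the following could be added just before the srun line. This is only a sketch: the function name print_hybrid_layout is hypothetical, and it assumes Slurm has set SLURM_NTASKS and SLURM_CPUS_PER_TASK for the job, falling back to 1 for each when it has not.

```shell
#!/bin/bash
# Hypothetical helper: report the MPI/OpenMP layout of the current job.
# Each variable falls back to 1 so the function also runs outside Slurm.
print_hybrid_layout() {
  local ntasks=${SLURM_NTASKS:-1}
  local threads=${SLURM_CPUS_PER_TASK:-1}
  echo "${ntasks} MPI processes x ${threads} OpenMP threads = $(( ntasks * threads )) CPU-cores"
}

print_hybrid_layout
```

With SLURM_NTASKS=8 and SLURM_CPUS_PER_TASK=10, this prints "8 MPI processes x 10 OpenMP threads = 80 CPU-cores".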


Note that the product (ntasks-per-node x cpus-per-task) must be less than or equal to the number of cores per node for the particular machine.
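
The constraint can be checked before submitting. The sketch below uses a hypothetical helper, fits_on_node, and assumes a machine with 40 cores per node; substitute the actual figure for your cluster.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit status 0) when
# ntasks-per-node x cpus-per-task does not exceed the cores on one node.
fits_on_node() {
  local ntasks_per_node=$1 cpus_per_task=$2 cores_per_node=$3
  [ $(( ntasks_per_node * cpus_per_task )) -le "$cores_per_node" ]
}

# 40 cores per node is an assumed value, not a property of any given cluster.
if fits_on_node 4 10 40; then
  echo "layout fits"
else
  echo "layout oversubscribes the node"
fi
```

For the example job (4 tasks per node, 10 cores per task), the check passes on a 40-core node and would fail on a 32-core node.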

Note: If you are not using MPI and are simply running a single-process, multi-threaded job, then follow the above instructions and set both nodes and ntasks-per-node to 1.
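
A minimal sketch of such a single-process job is shown below. The job name and the thread count of 10 are placeholder values, and a single process is assumed to run on one node; module loading is omitted and should follow the example above.

```shell
#!/bin/bash
#SBATCH --job-name=omp_only      # placeholder job name
#SBATCH --nodes=1                # a single process runs on one node
#SBATCH --ntasks-per-node=1      # one process, no MPI
#SBATCH --cpus-per-task=10       # threads for that process (placeholder value)
#SBATCH --time=00:00:30          # total run time limit (HH:MM:SS)

# one OpenMP thread per allocated CPU-core
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./myprog
```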

Additionally, for hybrid MPI/OpenMP jobs, the optimal values of nodes, ntasks-per-node and cpus-per-task differ from job to job and must be determined empirically. For more, see our scaling analysis page.
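
One way to set up such an empirical comparison is to enumerate every split of a node's cores between MPI processes and OpenMP threads. The sketch below only prints the sbatch commands rather than submitting them. The function name list_hybrid_layouts, the script name job.slurm, and the 40 cores per node are all assumptions. It relies on sbatch command-line options taking precedence over the #SBATCH lines in the script, and on the script setting OMP_NUM_THREADS from SLURM_CPUS_PER_TASK.

```shell
#!/bin/bash
# Hypothetical helper: print one sbatch command for each
# (ntasks-per-node, cpus-per-task) pair whose product equals cores_per_node.
list_hybrid_layouts() {
  local nodes=$1 cores_per_node=$2 script=$3
  local tasks=1
  while [ "$tasks" -le "$cores_per_node" ]; do
    # keep only task counts that divide the node evenly
    if [ $(( cores_per_node % tasks )) -eq 0 ]; then
      echo "sbatch --nodes=$nodes --ntasks-per-node=$tasks --cpus-per-task=$(( cores_per_node / tasks )) $script"
    fi
    tasks=$(( tasks + 1 ))
  done
}

# 2 nodes, 40 cores per node (assumed), job script named job.slurm (assumed)
list_hybrid_layouts 2 40 job.slurm
```

Each printed command can then be run by hand and the resulting run times compared.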