Quantum ESPRESSO

Quantum ESPRESSO is an integrated suite of open-source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials.

The NVIDIA GPU Cloud (NGC) hosts a Quantum ESPRESSO container produced by SISSA. The container was built to run on the A100, V100, and P100 GPUs of della-gpu, adroit, and tigergpu, and it includes optimizations such as CUDA-aware MPI. On our clusters, containers must be run with Singularity, as illustrated below.

Below is a set of sample commands showing how to run the AUSURF112 benchmark:

$ ssh <YourNetID>@della-gpu.princeton.edu
$ mkdir -p software/quantum_espresso
$ cd software/quantum_espresso
$ singularity pull docker://nvcr.io/hpc/quantum_espresso:v6.7
$ cd /scratch/gpfs/<YourNetID>
$ mkdir qe_test && cd qe_test
$ wget https://repository.prace-ri.eu/git/UEABS/ueabs/-/raw/master/quantum_espresso/test_cases/small/Au.pbe-nd-van.UPF
$ wget https://repository.prace-ri.eu/git/UEABS/ueabs/-/raw/master/quantum_espresso/test_cases/small/ausurf.in
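
Before writing a job script, it can help to verify that the image runs and that pw.x is available inside the container. A quick sanity check (assuming the image was pulled to the directory used above):

$ singularity exec $HOME/software/quantum_espresso/quantum_espresso_v6.7.sif which pw.x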

Below is a sample Slurm script (job.slurm):

#!/bin/bash
#SBATCH --job-name=qe-test       # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks-per-node=8      # number of tasks per node
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=32G                # total memory per node
#SBATCH --gres=gpu:2             # number of gpus per node
#SBATCH --time=00:15:00          # total run time limit (HH:MM:SS)
#SBATCH --gpu-mps                # enable CUDA Multi-Process Service

module purge
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  # set OpenMP threads to match --cpus-per-task

srun --mpi=pmi2 \
     singularity run --nv \
     $HOME/software/quantum_espresso/quantum_espresso_v6.7.sif \
     pw.x -input ausurf.in -npool 2

Note that CUDA Multi-Process Service is only available on della-gpu and adroit.
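
On tigergpu, which lacks MPS support, omit the --gpu-mps line and request the node's four P100 GPUs. Below is a sketch of only the lines that change (the -npool value here is an assumption chosen to match the GPU count; the rest of job.slurm is unchanged):

#SBATCH --gres=gpu:4             # four GPUs per node (no --gpu-mps on tigergpu)

srun --mpi=pmi2 \
     singularity run --nv \
     $HOME/software/quantum_espresso/quantum_espresso_v6.7.sif \
     pw.x -input ausurf.in -npool 4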

Submit the job:

$ sbatch job.slurm
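
After submission, you can confirm that the job is running and that the GPUs are actually being used (della-XXX below is a placeholder for the compute node name reported by squeue):

$ squeue -u $USER     # job state and assigned node
$ ssh della-XXX       # connect to the node running the job
$ nvidia-smi          # GPU utilization and memory usage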

The following benchmark data was generated for the case above on della-gpu in June 2021:

nodes  ntasks-per-node  cpus-per-task  GPUs  execution time (s)
  1           4              1           2        228
  1           8              1           2        184
  1           8              2           2        164
  1           8              4           2        156
  1          16              1           2        175
  2          16              1           4        140

The following benchmark data was generated on tigergpu:

nodes  ntasks-per-node  cpus-per-task  GPUs  execution time (s)
  1           8              1           4        344
  1          16              1           4        353

The AUSURF112 benchmark requires a large amount of GPU memory; a single GPU does not provide enough to run it. The code was also found to fail when ntasks-per-node was set too high.
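
Since the best combination of tasks and GPUs is system dependent, one simple approach is to sweep a few values of ntasks-per-node and compare the resulting wall times; options passed to sbatch on the command line override the matching #SBATCH directives in the script. A minimal sketch (the values below are only examples):

$ for n in 4 8 16; do sbatch --ntasks-per-node=$n --job-name=qe-n$n job.slurm; done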