Software suite for electronic structure calculations

Quantum ESPRESSO is an integrated suite of open-source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves and pseudopotentials.

GPU Version

The NVIDIA GPU Cloud (NGC) hosts a Quantum ESPRESSO container that is produced by SISSA. The container was created to run on the A100, V100 and P100 GPUs of della-gpu and adroit, and it provides optimizations such as CUDA-aware MPI. Apptainer must be used on our clusters when working with containers, as illustrated below.
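
On our clusters the singularity command is provided by Apptainer, so the commands below work unchanged. You can confirm this as follows (the exact version string will differ):

$ singularity --version
apptainer version 1.x.x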

Below is a set of sample commands showing how to run the AUSURF112 benchmark:

$ ssh <YourNetID>@della-gpu.princeton.edu
$ mkdir -p software/quantum_espresso  # or another location
$ cd software/quantum_espresso
$ singularity pull docker://nvcr.io/hpc/quantum_espresso:v6.7  # check for newer version on NGC
$ cd /scratch/gpfs/<YourNetID>
$ mkdir qe_test && cd qe_test
$ wget https://repository.prace-ri.eu/git/UEABS/ueabs/-/raw/master/quantum_espresso/test_cases/small/Au.pbe-nd-van.UPF
$ wget https://repository.prace-ri.eu/git/UEABS/ueabs/-/raw/master/quantum_espresso/test_cases/small/ausurf.in
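
After the downloads complete, it is worth confirming that both files are present and that the input file references the pseudopotential by name (a quick sanity check using standard tools):

$ ls
Au.pbe-nd-van.UPF  ausurf.in
$ grep UPF ausurf.in  # the ATOMIC_SPECIES entry should list Au.pbe-nd-van.UPF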

Below is a sample Slurm script (job.slurm):

#!/bin/bash
#SBATCH --job-name=qe-test       # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks-per-node=8      # number of tasks per node
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=32G                # total memory per node
#SBATCH --gres=gpu:2             # number of gpus per node
#SBATCH --time=00:15:00          # total run time limit (HH:MM:SS)
#SBATCH --gpu-mps                # enable cuda multi-process service
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu
module purge
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --mpi=pmi2 \
singularity run --nv \
     $HOME/software/quantum_espresso/quantum_espresso_v6.7.sif \
     pw.x -input ausurf.in -npool 2

CUDA Multi-Process Service (MPS) allows multiple MPI ranks to share a GPU efficiently; in the script above, eight tasks share two GPUs. Note that MPS is only available on della-gpu and adroit.

Submit the job:

$ sbatch job.slurm
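
After submitting, the job can be monitored with the standard Slurm tools (replace <JobID> with the number printed by sbatch):

$ squeue -u $USER            # is the job pending or running?
$ scontrol show job <JobID>  # detailed information, including the assigned node
$ cat slurm-<JobID>.out      # output of the run (default output file name)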

The following benchmark data was generated for the case above on Della (GPU) in June 2021:

nodes  ntasks-per-node  cpus-per-task  GPUs  Execution time (s)
1      4                1              2     228
1      8                1              2     184
1      8                2              2     164
1      16               1              2     175
1      8                4              2     156
2      16               1              4     140

The following was generated on TigerGPU:

nodes  ntasks-per-node  cpus-per-task  GPUs  Execution time (s)
1      8                1              4     344
1      16               1              4     353

The AUSURF112 benchmark requires a large amount of GPU memory; a single GPU does not provide enough memory to run it. The code was also found to fail for large values of ntasks-per-node.
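
To see how much GPU memory a run actually uses, one approach is to connect to the compute node while the job is running and watch nvidia-smi (a sketch; obtain the node name from the NODELIST column of squeue):

$ squeue -u $USER        # note the node name in the NODELIST column
$ ssh <node-name>
$ nvidia-smi --loop=5    # report GPU utilization and memory every 5 seconds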

CPU Version

Della

The directions below can be used to build QE on Della for the CPU nodes. Users may need to modify the directions to build a custom version of the software.

$ ssh <YourNetID>@della.princeton.edu
$ mkdir -p software && cd software
$ wget https://github.com/QEF/q-e/releases/download/qe-6.8/qe-6.8-ReleasePack.tgz
$ tar zvxf qe-6.8-ReleasePack.tgz 
$ cd qe-6.8
$ mkdir build && cd build
$ module purge
$ module load openmpi/gcc/4.1.2
$ module load fftw/gcc/3.3.9
$ OPTFLAGS="-O3 -march=native -DNDEBUG"
# copy and paste the next 9 lines
$ cmake3 -DCMAKE_INSTALL_PREFIX=$HOME/.local \
         -DCMAKE_BUILD_TYPE=Release \
         -DCMAKE_Fortran_COMPILER=mpif90 \
         -DCMAKE_Fortran_FLAGS_RELEASE="$OPTFLAGS" \
         -DCMAKE_C_COMPILER=mpicc \
         -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
         -DQE_ENABLE_OPENMP=ON \
         -DQE_FFTW_VENDOR=FFTW3 \
         -DBLA_VENDOR=OpenBLAS ..
$ make
$ make install

The resulting executables will be available in ~/.local/bin.
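
To call pw.x without typing the full path, the install directory can be added to your PATH (optional; the Slurm script below uses the full path):

$ export PATH=$HOME/.local/bin:$PATH  # add this line to ~/.bashrc to make it permanent
$ which pw.x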

Below is a sample Slurm script:

#!/bin/bash
#SBATCH --job-name=qe-cpu        # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks-per-node=8      # number of tasks per node
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=64G                # total memory per node (4G per cpu-core is default)
#SBATCH --time=00:15:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu
#SBATCH --constraint=skylake     # exclude broadwell nodes
module purge
module load openmpi/gcc/4.1.2
module load fftw/gcc/3.3.9
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun $HOME/.local/bin/pw.x -input ausurf.in -npool 2 

Make sure you perform a scaling analysis to find the optimal number of nodes, ntasks-per-node and cpus-per-task. Also experiment with npool, which divides the MPI processes into pools that each work on a subset of the k-points.
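
A minimal sketch of such a scaling study, relying on the fact that options passed to sbatch on the command line override the corresponding #SBATCH lines in the script:

#!/bin/bash
# scaling.sh: submit the same job with an increasing number of tasks per node
for ntasks in 4 8 16 32; do
    sbatch --ntasks-per-node=$ntasks --job-name=qe-n$ntasks job.slurm
done

Compare the execution times (and the CPU efficiency reported by jobstats) to choose the production configuration.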

The following was generated on Della in March 2022:

nodes  ntasks-per-node  cpus-per-task  Execution time (s)  CPU efficiency
1      32               1              1019                96%
1      8                4              1595                38%

The CPU efficiency was obtained from the "jobstats" command.
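
For example, for a completed job with ID 1234567 (an illustrative ID):

$ jobstats 1234567

The report summarizes the CPU and memory utilization of the job, which indicates whether the chosen values of ntasks-per-node and cpus-per-task were effective.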

TigerCPU

Run the commands below to install version 7.0:

$ ssh <YourNetID>@tigercpu.princeton.edu
$ mkdir -p software && cd software  # or another location
$ wget https://github.com/QEF/q-e/archive/refs/tags/qe-7.0.tar.gz
$ tar zvxf qe-7.0.tar.gz
$ cd q-e-qe-7.0
$ module purge
$ module load rh/devtoolset/9
$ module load fftw/gcc/3.3.4
$ module load openmpi/gcc/3.1.5/64
$ OPTFLAGS="-O3 -march=native -DNDEBUG"
$ ./configure FFLAGS="$OPTFLAGS" CFLAGS="$OPTFLAGS" --prefix=$HOME/.local --enable-parallel
$ make pw
$ make install

Do not load the rh/devtoolset/9 module in your Slurm script; it is only needed to build the code.
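
Only the MPI and FFTW modules are needed at run time. A sketch of the corresponding lines for a TigerCPU Slurm script (the remaining #SBATCH directives can follow the Della example above):

module purge
module load fftw/gcc/3.3.4
module load openmpi/gcc/3.1.5/64
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun $HOME/.local/bin/pw.x -input ausurf.in -npool 2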


Tiger3

When running the latest NGC container (quantum_espresso_qe-7.1.sif) on tiger3, the following error is encountered:

    0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
    0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
    0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
    0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
    0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
    0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
    0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)

The errors above are probably occurring because the container does not support the H100 GPUs of tiger3.
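
One possible remedy is to check NGC for a more recent image built against a CUDA toolkit that targets the H100 (compute capability sm_90) and pull that tag instead (the tag below is a placeholder, not a verified release):

$ singularity pull docker://nvcr.io/hpc/quantum_espresso:<newer-tag>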