Software suite for electronic structure calculations

Quantum Espresso is an integrated suite of open-source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves and pseudopotentials.

GPU Version

The NVIDIA GPU Cloud (NGC) hosts a Quantum Espresso container that is produced by SISSA. The container has been created to run on the A100, V100 and P100 GPUs of della-gpu and adroit. It also provides optimizations from CUDA-aware MPI. Apptainer must be used on our clusters when working with containers, as illustrated below.

Below is a set of sample commands showing how to run the AUSURF112 benchmark:

$ ssh <YourNetID>@della-gpu.princeton.edu
$ mkdir -p software/quantum_espresso  # or another location
$ cd software/quantum_espresso
$ singularity pull docker://nvcr.io/hpc/quantum_espresso:v6.7  # check for newer version on NGC
$ cd /scratch/gpfs/<YourNetID>
$ mkdir qe_test && cd qe_test
$ wget https://repository.prace-ri.eu/git/UEABS/ueabs/-/raw/master/quantum_espresso/test_cases/small/Au.pbe-nd-van.UPF
$ wget https://repository.prace-ri.eu/git/UEABS/ueabs/-/raw/master/quantum_espresso/test_cases/small/ausurf.in

Below is a sample Slurm script (job.slurm):

#!/bin/bash
#SBATCH --job-name=qe-test       # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks-per-node=8      # number of tasks per node
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=32G                # total memory per node
#SBATCH --gres=gpu:2             # number of gpus per node
#SBATCH --time=00:15:00          # total run time limit (HH:MM:SS)
#SBATCH --gpu-mps                # enable cuda multi-process service
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --mpi=pmi2 \
     singularity run --nv \
     $HOME/software/quantum_espresso/quantum_espresso_v6.7.sif \
     pw.x -input ausurf.in -npool 2

Note that CUDA Multi-Process Service is only available on della-gpu and adroit.

Submit the job:

$ sbatch job.slurm

The following benchmark data was generated for the case above on Della (GPU) in June 2021:

nodes  ntasks-per-node  cpus-per-task  GPUs  Execution time (s)
1      4                1              2     228
1      8                1              2     184
1      8                2              2     164
1      16               1              2     175
1      8                4              2     156
2      16               1              4     140

The following was generated on TigerGPU:

nodes  ntasks-per-node  cpus-per-task  GPUs  Execution time (s)
1      8                1              4     344
1      16               1              4     353

The AUSURF112 benchmark requires a lot of GPU memory. A single GPU does not provide enough memory to run the benchmark. The code was found to fail for large values of ntasks-per-node.
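The execution times above come from the final report that pw.x prints at the end of a run (a line containing "PWSCF" together with the CPU and WALL times). Below is a minimal sketch, not part of the original instructions, for collecting those timings from a set of job output files; the slurm-*.out file name pattern is an assumption based on the script above and may need adjusting.

# Sketch: summarize the final pw.x wall-time line for each benchmark run.
# Assumes each job wrote its output to slurm-<jobid>.out in the current
# directory; pw.x ends its report with a line such as
#   PWSCF : 2m44.12s CPU 3m04.47s WALL
for f in slurm-*.out; do
    walltime=$(grep "PWSCF" "$f" | grep "WALL" | tail -n 1)
    echo "$f: $walltime"
done

Comparing the reported WALL times across runs with different values of nodes, ntasks-per-node and cpus-per-task is one simple way to build tables like those shown here.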
CPU Version

Della

The directions below can be used to build QE on Della for the CPU nodes. Users may need to modify the directions to build a custom version of the software.

$ ssh <YourNetID>@della.princeton.edu
$ mkdir -p software && cd software
$ wget https://github.com/QEF/q-e/releases/download/qe-6.8/qe-6.8-ReleasePack.tgz
$ tar zvxf qe-6.8-ReleasePack.tgz
$ cd qe-6.8
$ mkdir build && cd build
$ module purge
$ module load openmpi/gcc/4.1.2
$ module load fftw/gcc/3.3.9
$ OPTFLAGS="-O3 -march=native -DNDEBUG"
$ # copy and paste the next 9 lines
$ cmake3 -DCMAKE_INSTALL_PREFIX=$HOME/.local \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_Fortran_COMPILER=mpif90 \
  -DCMAKE_Fortran_FLAGS_RELEASE="$OPTFLAGS" \
  -DCMAKE_C_COMPILER=mpicc \
  -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
  -DQE_ENABLE_OPENMP=ON \
  -DQE_FFTW_VENDOR=FFTW3 \
  -DBLA_VENDOR=OpenBLAS ..
$ make
$ make install

The resulting executables will be available in ~/.local/bin. Below is a sample Slurm script:

#!/bin/bash
#SBATCH --job-name=qe-cpu        # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks-per-node=8      # number of tasks per node
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=64G                # total memory per node (4G per cpu-core is default)
#SBATCH --time=00:15:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu
#SBATCH --constraint=skylake     # exclude broadwell nodes

module purge
module load openmpi/gcc/4.1.2
module load fftw/gcc/3.3.9
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun $HOME/.local/bin/pw.x -input ausurf.in -npool 2

Make sure you perform a scaling analysis to find the optimal number of nodes, ntasks-per-node and cpus-per-task. Also, what value should be used for npool?

The following was generated on Della in March 2022:

nodes  ntasks-per-node  cpus-per-task  Execution time (s)  CPU efficiency
1      32               1              1019                96%
1      8                4              1595                38%

The CPU efficiency was obtained from the "jobstats" command.

TigerCPU

Run the commands below to install version 7.0:

$ ssh <YourNetID>@tigercpu.princeton.edu
$ mkdir -p software && cd software  # or another location
$ wget https://github.com/QEF/q-e/archive/refs/tags/qe-7.0.tar.gz
$ tar zvxf qe-7.0.tar.gz
$ cd q-e-qe-7.0
$ module purge
$ module load rh/devtoolset/9
$ module load fftw/gcc/3.3.4
$ module load openmpi/gcc/3.1.5/64
$ OPTFLAGS="-O3 -march=native -DNDEBUG"
$ ./configure FFLAGS="$OPTFLAGS" CFLAGS="$OPTFLAGS" --prefix=$HOME/.local --enable-parallel
$ make pw
$ make install

Do not load the rh/devtoolset/9 module in your Slurm script.

Tiger3

When running the latest NGC container (quantum_espresso_qe-7.1.sif), the job fails with:

0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)

The above is probably occurring because the container does not support the H100 GPUs of tiger3.
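One quick way to check whether the container matches the hardware is to look at which GPU it actually sees on a tiger3 node. The commands below are a sketch and not part of the original instructions: the salloc resource values are illustrative, and the image path assumes the container was pulled into the current directory.

$ salloc --nodes=1 --ntasks=1 --gres=gpu:1 --time=00:05:00   # short interactive session on a GPU node (illustrative values)
$ singularity exec --nv quantum_espresso_qe-7.1.sif nvidia-smi --query-gpu=name --format=csv,noheader   # GPU model as seen inside the container
$ exit   # release the allocation

If the reported device is an H100 (compute capability 9.0) and the container's CUDA libraries only target older architectures, the GPU kernels cannot be loaded, which is consistent with the "invalid device symbol" errors shown above.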