Overview
The della-milan node features the AMD EPYC 7763 CPU (128 cores), 1 TB of RAM and 2 AMD MI210 GPUs. The Frontier supercomputer, which is the fastest machine in the US, features the MI250X GPU.
Connecting
If you have an account on the Della cluster and have written to [email protected] to request access to della-milan (you must be added to the video group), you can connect to the node with:
$ ssh <YourNetID>@della-milan.princeton.edu
The examples below will not work if you are not in the video group.
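As a quick check once logged in, the sketch below reports whether the current session belongs to the video group. It uses only the Python standard library; the helper name `in_group` is introduced here for illustration. Note that group changes normally take effect only in a new login session, so a fresh `ssh` connection may be needed after being added.

```python
import grp
import os


def in_group(group_name: str) -> bool:
    """Return True if the current process belongs to the named Unix group."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        # The group does not exist on this machine.
        return False
    # os.getgroups() lists supplementary groups; os.getgid() is the primary group.
    return gid in os.getgroups() or gid == os.getgid()


if in_group("video"):
    print("You are in the video group.")
else:
    print("Not in the video group; the GPU examples will not work.")
```

The shell command `groups` prints the same membership information.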
Getting Started
The software stack for AMD GPUs is called ROCm (Radeon Open Compute platforM or Radeon Open ECosystem). There is no environment module for this. You may consider adding the following to your ~/.bashrc file:
export PATH=/opt/rocm-5.4.1/bin:$PATH
export LD_LIBRARY_PATH=/opt/rocm-5.4.1/lib:$LD_LIBRARY_PATH
Here are the contents of the above directory:
$ ls -lL /opt/rocm-5.4.1/bin/
total 258276
-rwxr-xr-x. 1 root root    550920 Dec 6 21:55 amdclang
-rwxr-xr-x. 1 root root    550920 Dec 6 21:55 amdclang++
-rwxr-xr-x. 1 root root    550920 Dec 6 21:55 amdclang-cl
-rwxr-xr-x. 1 root root    550920 Dec 6 21:55 amdclang-cpp
-rwxr-xr-x. 1 root root    550920 Dec 6 21:55 amdflang
-rwxr-xr-x. 1 root root    550920 Dec 6 21:55 amdlld
-rwxr-xr-x. 1 root root     14529 Dec 6 21:58 aompcc
-rwxr-xr-x. 1 root root      4556 Dec 6 21:55 clang-ocl
-rwxr-xr-x. 1 root root     63072 Dec 6 21:57 clinfo
-rwxrwxr-x. 1 root root      2544 Dec 6 21:48 hipcc
-rwxrwxr-x. 1 root root      1508 Dec 6 21:48 hipcc_cmake_linker_helper
-rwxrwxr-x. 1 root root     27961 Dec 6 21:48 hipcc.pl
-rwxrwxr-x. 1 root root      2449 Dec 6 21:48 hipconfig
-rwxrwxr-x. 1 root root      8244 Dec 6 21:48 hipconfig.pl
-rwxr-xr-x. 1 root root       766 Dec 6 21:48 hipconvertinplace-perl.sh
-rwxr-xr-x. 1 root root       655 Dec 6 21:48 hipconvertinplace.sh
-rwxrwxr-x. 1 root root      1857 Dec 6 21:48 hipdemangleatp
-rwxr-xr-x. 1 root root       388 Dec 6 21:48 hipexamine-perl.sh
-rwxr-xr-x. 1 root root       538 Dec 6 21:48 hipexamine.sh
-rwxr-xr-x. 1 root root     15263 Dec 6 22:31 hipfc
-rwxr-xr-x. 1 root root  47783176 Dec 6 22:03 hipify-clang
-rwxr-xr-x. 1 root root    453755 Dec 6 21:48 hipify-perl
-rw-rw-r--. 1 root root      6076 Dec 6 21:48 hipvars.pm
-rwxr-xr-x. 1 root root      1328 Dec 6 22:25 install_precompiled_kernels.sh
-rwxr-xr-x. 1 root root   2587424 Dec 7 00:06 MIOpenDriver
-rwxr-xr-x. 1 root root      4800 Dec 6 21:58 mygpu
-rwxr-xr-x. 1 root root      4800 Dec 6 21:58 mymcpu
-rwxr-xr-x. 1 root root     20928 Dec 6 23:53 rocfft_rtc_helper
-rwxr-xr-x. 1 root root 209814240 Dec 6 22:00 rocgdb
-r-xr-xr-x. 1 root root      8844 Dec 6 21:56 rocm_agent_enumerator
-r-xr-xr-x. 1 root root     68840 Dec 6 21:56 rocminfo
-rwxrwxr-x. 1 root root    141197 Dec 6 21:49 rocm-smi
-rwxrwxr-x. 1 root root     10047 Dec 6 21:48 roc-obj
-rwxrwxr-x. 1 root root      8511 Dec 6 21:48 roc-obj-extract
-rwxrwxr-x. 1 root root      7054 Dec 6 21:48 roc-obj-ls
-r-xr-xr-x. 1 root root     19395 Dec 6 21:48 rocprof
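After updating PATH, one way to confirm that the ROCm tools are actually found is sketched below. The tool names come from the directory listing above; the helper `locate_tools` and the choice of which tools to check are illustrative assumptions.

```python
import shutil

# A few tools from the ROCm bin directory; adjust the list as needed.
ROCM_TOOLS = ["hipcc", "rocm-smi", "rocminfo"]


def locate_tools(tools):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in locate_tools(ROCM_TOOLS).items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

If a tool is reported as not found, the `export PATH=...` line has probably not been sourced in the current shell.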
Common Tools
"rocm-smi" is analogous to nvidia-smi. It is the AMD ROCm System Management Interface.
$ rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    34.0c           39.0W   800Mhz  1600Mhz  0%   auto  300.0W  0%     0%
1    31.0c           43.0W   800Mhz  1600Mhz  0%   auto  300.0W  0%     0%
================================================================================
============================= End of ROCm SMI Log ==============================
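For scripted monitoring, the concise table can be split into fields. The sketch below parses the exact output shown above, embedded as a string; the field names are chosen here for illustration, and the column layout may differ in other ROCm versions, so treat this as tied to this sample rather than a general parser.

```python
# Data rows copied from the rocm-smi output shown above.
SAMPLE = """\
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    34.0c           39.0W   800Mhz  1600Mhz  0%   auto  300.0W  0%     0%
1    31.0c           43.0W   800Mhz  1600Mhz  0%   auto  300.0W  0%     0%
"""

# Illustrative field names, one per column of the concise table.
FIELDS = ["gpu", "temp", "avg_pwr", "sclk", "mclk",
          "fan", "perf", "pwr_cap", "vram_pct", "gpu_pct"]


def parse_concise(text):
    """Return one dict per GPU row found in the concise rocm-smi table."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with the integer GPU index and have one value per field.
        if len(parts) == len(FIELDS) and parts[0].isdigit():
            rows.append(dict(zip(FIELDS, parts)))
    return rows


for row in parse_concise(SAMPLE):
    temp_c = float(row["temp"].rstrip("c"))
    power_w = float(row["avg_pwr"].rstrip("W"))
    print(f"GPU {row['gpu']}: {temp_c} C, {power_w} W, utilization {row['gpu_pct']}")
```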
HIP
HIP, or "Heterogeneous-Compute Interface for Portability", is a C++ runtime API and kernel language for writing GPU code. "hipcc" is the compiler driver used to build HIP programs:
$ hipcc --version
HIP version: 5.4.22802-aaa1e3d8
AMD clang version 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.4.1 22465 d6f0fe8b22e3d8ce0f2cbd657ea14b16043018a5)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.4.1/llvm/bin
Hello World Example using HIP
$ mkdir test && cd test
$ wget https://raw.githubusercontent.com/ROCm-Developer-Tools/HIP-Examples/master/HIP-Examples-Applications/HelloWorld/Makefile
$ wget https://raw.githubusercontent.com/ROCm-Developer-Tools/HIP-Examples/master/HIP-Examples-Applications/HelloWorld/HelloWorld.cpp
$ make
$ ./HelloWorld
The compilation performed by "make" can also be run explicitly:
$ hipcc HelloWorld.cpp -o HelloWorld
$ ./HelloWorld
Another example:
$ git clone https://github.com/ROCm-Developer-Tools/HIP.git
$ cd HIP/samples/0_Intro/square
$ make
$ ./square.out
Building LAMMPS from Source
#!/bin/bash

# Download and unpack the LAMMPS source
VERSION=22Dec2022
wget https://github.com/lammps/lammps/archive/refs/tags/patch_${VERSION}.tar.gz
tar zvxf patch_${VERSION}.tar.gz
cd lammps-patch_${VERSION}
mkdir build && cd build

# Make the ROCm compilers and libraries visible
module purge
export PATH=/opt/rocm-5.4.1/bin:$PATH
export LD_LIBRARY_PATH=/opt/rocm-5.4.1/lib:$LD_LIBRARY_PATH
export HIP_PLATFORM=amd

# Configure with the GPU package using the HIP backend
cmake3 -D BUILD_MPI=no -D BUILD_OMP=yes \
  -D PKG_OPENMP=yes -D PKG_MOLECULE=yes -D PKG_RIGID=yes \
  -D CMAKE_CXX_COMPILER=hipcc \
  -D PKG_GPU=on -D HIP_PATH=/opt/rocm-5.4.1 -D GPU_API=HIP -D HIP_ARCH=gfx908 ../cmake

# Build and install
make -j 16
make install
The executable indicates that the GPU build was successful:
$ lmp -h | grep GPU
GPU package API: HIP
GPU package precision: mixed
Compatible GPU present: yes
GPU MOLECULE OPENMP RIGID
However, the code fails to run:
$ ~/.local/bin/lmp -in in.melt
LAMMPS (22 Dec 2022)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
ERROR: Unable to initialize accelerator for use (src/GPU/gpu_extra.h:65)
Last command: package gpu 1
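One thing worth checking when diagnosing this error is whether the HIP_ARCH value used in the build matches the architecture the GPUs report. The target gfx908 is generally associated with the MI100, whereas MI200-series GPUs typically report gfx90a. Whether this mismatch is the cause of the failure above is an assumption that has not been verified here. The sketch below extracts the gfx names from `rocminfo` output and compares them with the build setting; the helper names are illustrative.

```python
import re
import shutil
import subprocess


def gfx_targets(rocminfo_text):
    """Return the sorted, de-duplicated gfx architecture names in rocminfo output."""
    return sorted(set(re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_text)))


def detect_gfx_targets():
    """Run rocminfo if it is on PATH; return None when it is unavailable."""
    if shutil.which("rocminfo") is None:
        return None
    result = subprocess.run(["rocminfo"], capture_output=True, text=True, check=False)
    return gfx_targets(result.stdout)


built_for = "gfx908"  # the HIP_ARCH value passed to cmake3 in the build script
detected = detect_gfx_targets()
if detected is None:
    print("rocminfo not found on PATH")
elif built_for in detected:
    print(f"{built_for} matches a detected GPU target: {detected}")
else:
    print(f"{built_for} is not among the detected targets: {detected}")
```

If the targets differ, rebuilding with HIP_ARCH set to the reported value would be a reasonable next experiment.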
Useful Links
AMD ROCm: HIP Programming Guide
Software Containers
Containers designed for AMD GPUs are available on the AMD Infinity Hub: https://www.amd.com/en/technologies/infinity-hub. One can also find containers on Docker Hub. Applications include TensorFlow, PyTorch, LAMMPS, GROMACS, NAMD, CP2K and SPECFEM3D.
See our Singularity page and the ROCm section of the user manual for details on running these containers. For example:
$ singularity pull docker://rocm/tensorflow:rocm5.4.1-tf2.10-dev
$ singularity run --rocm tensorflow_rocm5.4.1-tf2.10-dev.sif
Singularity> python
>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())
To run the MNIST example:
$ git clone https://github.com/PrincetonUniversity/slurm_mnist
$ cd slurm_mnist
$ singularity exec --rocm $HOME/software/tensorflow_rocm5.4.1-tf2.10-dev.sif python3 download_mnist.py
$ singularity exec --rocm $HOME/software/tensorflow_rocm5.4.1-tf2.10-dev.sif python3 mnist_classify.py
Below is a TensorFlow benchmark script, followed by results on three GPUs:
import tensorflow as tf

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)
GPU | Seconds per epoch | Speed-up relative to V100 |
---|---|---|
MI210 | 17 | 0.9 |
A100 | 9 | 1.8 |
V100 | 16 | 1.0 |
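The speed-up column follows from the timings: because a shorter time per epoch is better, the speed-up is the V100 time divided by each GPU's time. A minimal sketch using the numbers in the table (the function name is illustrative):

```python
# Seconds per epoch, taken from the table above.
seconds_per_epoch = {"MI210": 17, "A100": 9, "V100": 16}


def speedup_from_time(times, baseline="V100"):
    """Speed-up = baseline time / time, so a shorter time gives a value above 1."""
    return {gpu: round(times[baseline] / t, 1) for gpu, t in times.items()}


print(speedup_from_time(seconds_per_epoch))
# prints {'MI210': 0.9, 'A100': 1.8, 'V100': 1.0}
```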
TensorFlow with Conda/pip
The commands below can be used to install TensorFlow for ROCm:
$ module load anaconda3/2022.5
$ conda create --name tf-rocm python=3.9
$ conda activate tf-rocm
$ pip install tensorflow-rocm
$ export LD_LIBRARY_PATH=/opt/rocm-5.4.1/lib:$LD_LIBRARY_PATH
$ python
>>> import tensorflow as tf
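A successful import does not by itself show that the GPUs are usable. The sketch below, run inside the activated environment, lists the GPUs that TensorFlow can see using `tf.config.list_physical_devices`. The wrapper `visible_gpus` is an illustrative helper; it returns None when TensorFlow is not installed so that the script can run in any environment.

```python
import importlib.util


def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is missing."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf  # imported lazily so the check above can fail gracefully
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif gpus:
    print("GPUs visible to TensorFlow:", gpus)
else:
    print("TensorFlow imported, but no GPU is visible")
```

An empty list on della-milan would suggest revisiting the LD_LIBRARY_PATH setting or the video group membership.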
GROMACS
One can launch GROMACS on the AMD GPU with:
$ singularity pull docker://amdih/gromacs:2022.3.amd1_174
$ wget ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz
$ tar zxvf rnase_bench_systems.tar.gz
$ cd rnase_cubic
$ singularity exec --rocm ../gromacs_2022.3.amd1_174.sif gmx grompp -f pme_verlet.mdp -c conf.gro -p topol.top -o bench.tpr
$ singularity exec --rocm ../gromacs_2022.3.amd1_174.sif gmx mdrun -nsteps 100000 -ntmpi 1 -ntomp 16 -update gpu -s bench.tpr -pin on
GPU | ns/day | Speed-up relative to V100 |
---|---|---|
MI210 | 250 | 0.4 |
A100 | 950 | 1.4 |
V100 | 700 | 1.0 |
The A100 and V100 numbers were obtained using Adroit and this build. The number of CPU-cores was varied in all cases to find the optimal number.
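Varying the number of CPU-cores amounts to repeating the mdrun command with different `-ntomp` values and comparing the reported ns/day. The sketch below only generates the command lines, reusing the flags from the example above; the list of thread counts is a hypothetical choice, not the set used for the table.

```python
import shlex


def mdrun_command(sif, tpr, ntomp, nsteps=100000):
    """Build the mdrun command from the example above for a given OpenMP thread count."""
    return ["singularity", "exec", "--rocm", sif, "gmx", "mdrun",
            "-nsteps", str(nsteps), "-ntmpi", "1", "-ntomp", str(ntomp),
            "-update", "gpu", "-s", tpr, "-pin", "on"]


# Hypothetical thread counts to try; adjust to the cores available.
for ntomp in (4, 8, 16, 32):
    cmd = mdrun_command("../gromacs_2022.3.amd1_174.sif", "bench.tpr", ntomp)
    print(shlex.join(cmd))
```

Each printed line can be pasted into the shell from the rnase_cubic directory, and the thread count with the highest ns/day kept.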