AMD MI210 GPU Testing

Overview

The della-milan node features an AMD EPYC 7763 CPU (128 cores), 1 TB of RAM and two AMD MI210 GPUs. The Frontier supercomputer, currently the fastest machine in the US, features the closely related MI250X GPU.

 

Connecting

If you have an account on the Della cluster and have written to [email protected] for access to della-milan (you must be added to the video group), then you can connect to and use the node:

$ ssh <YourNetID>@della-milan.princeton.edu

The examples below will not work if you are not in the video group.

 

Getting Started

The software stack for AMD GPUs is called ROCm (Radeon Open Compute platform, also expanded as Radeon Open Ecosystem). There is no environment module for it on this node, so you may want to add the following to your ~/.bashrc file:

export PATH=/opt/rocm-5.4.1/bin:$PATH
export LD_LIBRARY_PATH=/opt/rocm-5.4.1/lib:$LD_LIBRARY_PATH

Here are the contents of the above directory:

$ ls -lL /opt/rocm-5.4.1/bin/
total 258276
-rwxr-xr-x. 1 root root    550920 Dec  6 21:55 amdclang
-rwxr-xr-x. 1 root root    550920 Dec  6 21:55 amdclang++
-rwxr-xr-x. 1 root root    550920 Dec  6 21:55 amdclang-cl
-rwxr-xr-x. 1 root root    550920 Dec  6 21:55 amdclang-cpp
-rwxr-xr-x. 1 root root    550920 Dec  6 21:55 amdflang
-rwxr-xr-x. 1 root root    550920 Dec  6 21:55 amdlld
-rwxr-xr-x. 1 root root     14529 Dec  6 21:58 aompcc
-rwxr-xr-x. 1 root root      4556 Dec  6 21:55 clang-ocl
-rwxr-xr-x. 1 root root     63072 Dec  6 21:57 clinfo
-rwxrwxr-x. 1 root root      2544 Dec  6 21:48 hipcc
-rwxrwxr-x. 1 root root      1508 Dec  6 21:48 hipcc_cmake_linker_helper
-rwxrwxr-x. 1 root root     27961 Dec  6 21:48 hipcc.pl
-rwxrwxr-x. 1 root root      2449 Dec  6 21:48 hipconfig
-rwxrwxr-x. 1 root root      8244 Dec  6 21:48 hipconfig.pl
-rwxr-xr-x. 1 root root       766 Dec  6 21:48 hipconvertinplace-perl.sh
-rwxr-xr-x. 1 root root       655 Dec  6 21:48 hipconvertinplace.sh
-rwxrwxr-x. 1 root root      1857 Dec  6 21:48 hipdemangleatp
-rwxr-xr-x. 1 root root       388 Dec  6 21:48 hipexamine-perl.sh
-rwxr-xr-x. 1 root root       538 Dec  6 21:48 hipexamine.sh
-rwxr-xr-x. 1 root root     15263 Dec  6 22:31 hipfc
-rwxr-xr-x. 1 root root  47783176 Dec  6 22:03 hipify-clang
-rwxr-xr-x. 1 root root    453755 Dec  6 21:48 hipify-perl
-rw-rw-r--. 1 root root      6076 Dec  6 21:48 hipvars.pm
-rwxr-xr-x. 1 root root      1328 Dec  6 22:25 install_precompiled_kernels.sh
-rwxr-xr-x. 1 root root   2587424 Dec  7 00:06 MIOpenDriver
-rwxr-xr-x. 1 root root      4800 Dec  6 21:58 mygpu
-rwxr-xr-x. 1 root root      4800 Dec  6 21:58 mymcpu
-rwxr-xr-x. 1 root root     20928 Dec  6 23:53 rocfft_rtc_helper
-rwxr-xr-x. 1 root root 209814240 Dec  6 22:00 rocgdb
-r-xr-xr-x. 1 root root      8844 Dec  6 21:56 rocm_agent_enumerator
-r-xr-xr-x. 1 root root     68840 Dec  6 21:56 rocminfo
-rwxrwxr-x. 1 root root    141197 Dec  6 21:49 rocm-smi
-rwxrwxr-x. 1 root root     10047 Dec  6 21:48 roc-obj
-rwxrwxr-x. 1 root root      8511 Dec  6 21:48 roc-obj-extract
-rwxrwxr-x. 1 root root      7054 Dec  6 21:48 roc-obj-ls
-r-xr-xr-x. 1 root root     19395 Dec  6 21:48 rocprof

Common Tools

"rocm-smi" is analogous to nvidia-smi. It is the AMD ROCm System Management Interface.

$ rocm-smi


======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0    34.0c           39.0W   800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
1    31.0c           43.0W   800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
================================================================================
============================= End of ROCm SMI Log ==============================
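
rocm-smi reports what the driver sees; the same information can also be obtained programmatically through the HIP runtime API. The short C++ sketch below was written for this page (the file name device_query.cpp is arbitrary) and simply prints the name, architecture and memory of each visible GPU; on this node the architecture string should report gfx90a for the MI210s:

// device_query.cpp: list the GPUs visible to the HIP runtime
// (minimal sketch; error handling mostly omitted)
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
  int count = 0;
  if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
    std::fprintf(stderr, "No HIP devices found\n");
    return 1;
  }
  for (int i = 0; i < count; ++i) {
    hipDeviceProp_t prop;
    hipGetDeviceProperties(&prop, i);
    std::printf("GPU %d: %s (arch %s, %.1f GB)\n", i, prop.name,
                prop.gcnArchName, prop.totalGlobalMem / 1073741824.0);
  }
  return 0;
}

It can be compiled and run with hipcc (introduced in the next section):

$ hipcc device_query.cpp -o device_query
$ ./device_query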

 

HIP

HIP, the "Heterogeneous-Compute Interface for Portability", is AMD's C++ runtime API and kernel language; it provides a C++ syntax suitable for compiling most code that commonly compiles under CUDA. "hipcc" is the C++ compiler driver:

$ hipcc --version
HIP version: 5.4.22802-aaa1e3d8
AMD clang version 15.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.4.1 22465 d6f0fe8b22e3d8ce0f2cbd657ea14b16043018a5)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.4.1/llvm/bin

Hello World Example using HIP

$ mkdir test && cd test
$ wget https://raw.githubusercontent.com/ROCm-Developer-Tools/HIP-Examples/master/HIP-Examples-Applications/HelloWorld/Makefile
$ wget https://raw.githubusercontent.com/ROCm-Developer-Tools/HIP-Examples/master/HIP-Examples-Applications/HelloWorld/HelloWorld.cpp
$ make
$ ./HelloWorld

The "make" command can be done explicitly as follows:

$ hipcc HelloWorld.cpp -o HelloWorld
$ ./HelloWorld

Another example:

$ git clone https://github.com/ROCm-Developer-Tools/HIP.git
$ cd HIP/samples/0_Intro/square
$ make
$ ./square.out
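
The files fetched above contain the authoritative sample code. For orientation, here is a small self-contained sketch (written for this page, not taken from the samples) showing the typical structure of a HIP program: a __global__ kernel, device memory allocated with hipMalloc, host-device copies with hipMemcpy, and a kernel launch with hipLaunchKernelGGL:

// square_sketch.cpp: square an array on the GPU (illustrative sketch, minimal error checking)
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

__global__ void square(float* out, const float* in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i < n) out[i] = in[i] * in[i];
}

int main() {
  const int n = 1 << 20;
  std::vector<float> h_in(n), h_out(n);
  for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

  // allocate device memory and copy the input over
  float *d_in = nullptr, *d_out = nullptr;
  hipMalloc((void**)&d_in, n * sizeof(float));
  hipMalloc((void**)&d_out, n * sizeof(float));
  hipMemcpy(d_in, h_in.data(), n * sizeof(float), hipMemcpyHostToDevice);

  // launch one thread per element
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  hipLaunchKernelGGL(square, dim3(blocks), dim3(threads), 0, 0, d_out, d_in, n);

  // copy the result back and spot-check one value
  hipMemcpy(h_out.data(), d_out, n * sizeof(float), hipMemcpyDeviceToHost);
  std::printf("h_out[3] = %.1f (expected 9.0)\n", h_out[3]);

  hipFree(d_in);
  hipFree(d_out);
  return 0;
}

As with the examples above, this compiles directly with hipcc:

$ hipcc square_sketch.cpp -o square_sketch
$ ./square_sketch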

Building LAMMPS from Source

#!/bin/bash
  
VERSION=22Dec2022
wget https://github.com/lammps/lammps/archive/refs/tags/patch_${VERSION}.tar.gz
tar zvxf patch_${VERSION}.tar.gz
cd lammps-patch_${VERSION}
mkdir build && cd build

module purge
export PATH=/opt/rocm-5.4.1/bin:$PATH
export LD_LIBRARY_PATH=/opt/rocm-5.4.1/lib:$LD_LIBRARY_PATH
export HIP_PLATFORM=amd

cmake3 -D BUILD_MPI=no -D BUILD_OMP=yes \
-D PKG_OPENMP=yes -D PKG_MOLECULE=yes -D PKG_RIGID=yes \
-D CMAKE_CXX_COMPILER=hipcc \
-D PKG_GPU=on -D HIP_PATH=/opt/rocm-5.4.1 -D GPU_API=HIP -D HIP_ARCH=gfx908 ../cmake

make -j 16
make install

The executable indicates that the GPU build was successful:

$ lmp -h | grep GPU
GPU package API: HIP
GPU package precision: mixed
Compatible GPU present: yes
GPU MOLECULE OPENMP RIGID

However, the code fails to run (note that the MI210 is a gfx90a device while the build above targets gfx908, which corresponds to the MI100; this mismatch may be the cause):

$ ~/.local/bin/lmp -in in.melt
LAMMPS (22 Dec 2022)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
ERROR: Unable to initialize accelerator for use (src/GPU/gpu_extra.h:65)
Last command: package         gpu 1

 

Useful Links

AMD ROCm: HIP Programming Guide

rocBLAS

 

Software Containers

Containers designed for AMD GPUs are available on the AMD Infinity Hub: https://www.amd.com/en/technologies/infinity-hub. One can also find containers on Docker Hub. Applications include TensorFlow, PyTorch, LAMMPS, GROMACS, NAMD, CP2K and SPECFEM3D.

See our Singularity page and the ROCm section of the user manual for details on running these images. For example:

$ singularity pull docker://rocm/tensorflow:rocm5.4.1-tf2.10-dev
$ singularity run --rocm tensorflow_rocm5.4.1-tf2.10-dev.sif
Singularity> python
>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())

To run the MNIST example:

$ git clone https://github.com/PrincetonUniversity/slurm_mnist
$ cd slurm_mnist
$ singularity exec --rocm $HOME/software/tensorflow_rocm5.4.1-tf2.10-dev.sif python3 download_mnist.py
$ singularity exec --rocm $HOME/software/tensorflow_rocm5.4.1-tf2.10-dev.sif python3 mnist_classify.py

Below is a TensorFlow benchmark code and the corresponding timings:

import tensorflow as tf

# CIFAR-100: 32x32 color images in 100 classes
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

# ResNet50 trained from scratch (random initial weights)
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64)

 

GPU     Seconds per epoch   Speed-up relative to V100
MI210          17                     0.9
A100            9                     1.8
V100           16                     1.0

 

TensorFlow with Conda/pip

The commands below can be used to install TensorFlow for ROCm:

$ module load anaconda3/2022.5
$ conda create --name tf-rocm python=3.9
$ conda activate tf-rocm
$ pip install tensorflow-rocm
$ export LD_LIBRARY_PATH=/opt/rocm-5.4.1/lib:$LD_LIBRARY_PATH
$ python
>>> import tensorflow as tf

 

GROMACS

One can launch GROMACS on the AMD GPU with:

$ singularity pull docker://amdih/gromacs:2022.3.amd1_174
$ wget ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz
$ tar zxvf rnase_bench_systems.tar.gz
$ cd rnase_cubic
$ singularity exec --rocm ../gromacs_2022.3.amd1_174.sif gmx grompp -f pme_verlet.mdp -c conf.gro -p topol.top -o bench.tpr
$ singularity exec --rocm ../gromacs_2022.3.amd1_174.sif gmx mdrun -nsteps 100000 -ntmpi 1 -ntomp 16 -update gpu -s bench.tpr -pin on

GPU     ns/day   Speed-up relative to V100
MI210     250               0.4
A100      950               1.4
V100      700               1.0

The A100 and V100 numbers were obtained using Adroit and this build. The number of CPU-cores was varied in all cases to find the optimal number.