The Traverse cluster is primarily intended to support research at the Princeton Plasma Physics Lab (PPPL). Traverse is also available to Princeton researchers whose work is particularly suited to the architecture of this system, either because it is very similar to the Summit cluster at Oak Ridge National Laboratory or because the application to be run can take particular advantage of the NVLink architecture. Programs that move a lot of data into or out of the GPU should see an especially large speedup.
Some Technical Specifications:
The Traverse cluster is composed of 46 IBM POWER9 nodes with four NVIDIA V100 GPUs per node.
How to Access the Traverse Cluster
To use the Traverse cluster you must obtain an account and then log in through SSH.
- Requesting Access to Traverse
Traverse is 80% owned by PPPL and the remaining 20% was purchased with funds from a small number of Princeton faculty. There is no public portion to the cluster so accounts for non-PPPL research groups are generally not granted.
PPPL members should write to firstname.lastname@example.org to request an account. Please include a few sentences about the nature of your code and whether or not it is GPU-enabled. PU members who are part of a research group that has contributed to or has an approved project on Traverse can obtain an account by having the faculty member send a request to email@example.com. Any non-Princeton user must be sponsored by a Princeton or PPPL PI for a Research Computer User (RCU) account.
- Logging into Traverse
Once you have been granted access to Traverse, you can connect via SSH (VPN required from off-campus):
$ ssh <YourNetID>@traverse.princeton.edu
For more on how to SSH, see the Knowledge Base article Secure Shell (SSH): Frequently Asked Questions (FAQ). If you have trouble connecting then see our SSH page.
How to Use the Traverse Cluster
Since Traverse is a Linux system, knowing some basic Linux commands is highly recommended. For an introduction to navigating a Linux system, view the material associated with our Intro to Linux Command Line workshop.
Using Traverse also requires some knowledge of the file system, the module system, and the scheduler that handles each user's jobs. For an introduction to navigating Princeton's High Performance Computing systems, view the material associated with our Getting Started with the Research Computing Clusters workshop. Additional information specific to Traverse's file system, priority for job scheduling, etc. can be found below.
Please remember that these are shared resources for all users.
The system head node, traverse, should be used for interactive work only, such as compiling programs and submitting jobs as described below. No jobs should be run on the head node other than brief tests that last no more than a few minutes. Where practical, we ask that you entirely fill the nodes so that CPU core fragmentation is minimized. All jobs must be run through the Slurm scheduler.
Jobs are prioritized by the Slurm scheduler based on a number of factors: job size, run times, node availability, wait times, and percentage of usage over a 30-day period, as well as a fairshare mechanism to provide access for large contributors. The policy below may change as the job mix changes on the machine.
Jobs will move to the test, short, medium or long quality of service as determined by the scheduler. They are differentiated by the wallclock time requested as follows:
| QOS | Time Limit | Jobs per user | Cores per Job | Cores Available |
| test | 1 hour | 2 jobs | no limit | no limit |
| short | 4 hours | 10 jobs | no limit | no limit |
| medium | 24 hours | 6 jobs | 3072 cores | 5888 cores/120 GPUs |
| long | | 4 jobs | no limit | 2944 cores/92 GPUs |
In most cases, these are the maximum numbers and limits may be placed if demand requires. Use the "qos" command to view the actual values in effect.
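The way a requested wall-clock time maps onto a QOS can be sketched in a few lines of Python. This is only an illustration of the thresholds in the table above: the function name `qos_for_walltime` is hypothetical, the real assignment is made by the scheduler, and because the table does not give a time limit for the long QOS, anything above 24 hours is simply labeled "long" here.

```python
from datetime import timedelta

# Time limits taken from the QOS table. The "long" limit is not listed there,
# so this sketch treats every request above 24 hours as "long" (an assumption).
QOS_LIMITS = [
    ("test", timedelta(hours=1)),
    ("short", timedelta(hours=4)),
    ("medium", timedelta(hours=24)),
]


def qos_for_walltime(walltime):
    """Return the name of the first QOS whose time limit covers the request."""
    if walltime <= timedelta(0):
        raise ValueError("walltime must be positive")
    for name, limit in QOS_LIMITS:
        if walltime <= limit:
            return name
    return "long"


for hours in (0.5, 3, 12, 48):
    print(f"{hours} hours -> {qos_for_walltime(timedelta(hours=hours))}")
```

Requesting the shortest wall-clock time that safely covers a run places the job in a QOS with fewer restrictions, which is why an accurate time estimate matters.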
There is also a special system reservation available for the "pppl" group of 4 nodes. This is set from 9 AM - 5 PM on weekdays and should be used for quick testing of code and not for any production runs. To use this add the --reservation=test flag to your job script.
Wording of Acknowledgement of Support and/or Use of Research Computing Resources
"The author(s) are pleased to acknowledge that the work reported on in this paper was substantially performed using the Princeton Research Computing resources at Princeton University which is a consortium of groups led by the Princeton Institute for Computational Science and Engineering (PICSciE) and Office of Information Technology's Research Computing."
"The simulations presented in this article were performed on computational resources managed and supported by Princeton Research Computing, a consortium of groups including the Princeton Institute for Computational Science and Engineering (PICSciE) and the Office of Information Technology's High Performance Computing Center and Visualization Laboratory at Princeton University."
TRAVERSE USER GUIDE
The Traverse supercomputer is located in Princeton University's HPCRC data center. Eighty percent of the cluster is reserved for PPPL research while the balance belongs to a small number of research groups on main campus.
This guide covers the following topics:
• NVLink | GPUDirect | Tensor Cores
• Python | Jupyter | Machine Learning
• GPU-Accelerated Libraries | OpenACC | CUDA
• GNU GCC | NVIDIA HPC SDK | PGI | IBM XL
• NVIDIA Math | MAGMA | ESSL | PETSc
• Debuggers and Profilers
• Slurm | Reservation Queue | Simultaneous Multithreading | CUDA Multi-Process Service
Traverse consists of:
- 46 IBM AC922 POWER9 nodes, with each node having:
- 2 IBM POWER9 processors (sockets)
- 16 cores per processor
- 4 hardware threads per core
- 32 cores (128 hardware threads) per node
- 256 GB of RAM per node
- 4 NVIDIA V100 GPUs (2 per socket) with 32 GB of memory each
- 3.2 TB NVMe (solid state) local storage (not shared between nodes)
- EDR InfiniBand (100 Gb/s bi-directional per port), 1:1 per rack, 2:1 rack-to-rack interconnect
- GPFS high-performance parallel scratch storage
- Globus data transfer node (10 GbE external, EDR to storage)
- InfiniBand Network
- EDR InfiniBand (100 Gb/s)
- Fully non-blocking (1:1) within a chassis, 2:1 oversubscription between chassis
- Home directories (/home)
- 5 TB total space
- User quota: 10 GB (request more)
- Uses InfiniBand (using IP over InfiniBand, or IPoIB)
- backed up
- Scratch space (/scratch/gpfs)
- GPFS parallel filesystem
- 2.9 PB
- User quota: 500 GB
- Uses InfiniBand
- NOT BACKED UP
- Local scratch (/tmp)
- NVMe drive per compute node (3.2 TB)
- See this page for usage
- NOT BACKED UP
The head node traverse.princeton.edu can be used for compiling codes, running short tests, submitting jobs, etc. Make sure that you do not use more than 10% of the machine (cores and memory) for more than 10 minutes at a time since it is shared by all users. There are two V100 GPUs on the head node.
Traverse was upgraded to the RHEL8 operating system in September of 2020.
Traverse provides an on-ramp to the Summit supercomputer at the Oak Ridge National Laboratory, which is also composed of POWER9/V100 nodes. Summit has 150 times the number of GPUs of Traverse. You can learn a lot about Traverse by reading the Summit user guide.
The Traverse nodes are arranged in racks of 12 nodes. Compute node traverse-k04g5 is node 5 in rack 4.
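The naming convention can be decoded programmatically, which is handy when reading Slurm node lists. The sketch below is an assumption generalized from the single example `traverse-k04g5`; the pattern and the helper name `parse_node_name` are illustrative and not an official specification of the naming scheme.

```python
import re

# Pattern generalized from the one example "traverse-k04g5" (rack 4, node 5).
# This is an assumption, not a documented naming rule.
NODE_NAME = re.compile(r"^traverse-k(\d+)g(\d+)$")


def parse_node_name(name):
    """Return (rack, node) as integers for a compute node name."""
    match = NODE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized node name: {name!r}")
    rack, node = (int(group) for group in match.groups())
    return rack, node


print(parse_node_name("traverse-k04g5"))  # (4, 5)
```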
The High Performance LINPACK (HPL) benchmark was used to measure the performance of Traverse. The theoretical peak of the system is about 1.3 petaflops and HPL measured 1.1 petaflops. Note that 97% of the compute power of Traverse comes from the NVIDIA V100 Volta GPUs. Read an article about the debut of Traverse in 2019.
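As a quick sanity check, the cluster-wide totals and the HPL efficiency follow directly from the per-node figures listed earlier in this guide. The snippet below is plain arithmetic on those quoted numbers; the variable names are illustrative, and the peak figure is the approximate value stated above.

```python
# Per-node figures from the hardware list in this guide.
NODES = 46
CORES_PER_NODE = 32
THREADS_PER_CORE = 4
GPUS_PER_NODE = 4

total_cores = NODES * CORES_PER_NODE            # 1472 physical cores
total_threads = total_cores * THREADS_PER_CORE  # 5888 hardware threads
total_gpus = NODES * GPUS_PER_NODE              # 184 GPUs

# Performance figures quoted above (the peak is approximate).
PEAK_PFLOPS = 1.3
HPL_PFLOPS = 1.1
hpl_efficiency = HPL_PFLOPS / PEAK_PFLOPS       # roughly 0.85

print(f"cores={total_cores} threads={total_threads} gpus={total_gpus}")
print(f"HPL efficiency = {hpl_efficiency:.1%}")
```

The hardware-thread total of 5888 matches the largest "cores" figure in the QOS table, which suggests that the scheduler counts each hardware thread as one CPU. That reading is an inference from the numbers, not something stated in the scheduler documentation.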
To request access to Traverse, please email firstname.lastname@example.org and include a brief description of your code and whether or not it is GPU-enabled. Note that Prentice Bisbal or Stephane Ethier will need to approve the request.
Once you have been given access to Traverse, you can log in with the following command:
$ ssh <YourNetID>@traverse.princeton.edu
The command above will work from any system on the PPPL network. Traverse uses the university's central authentication system (CAS), so you log in using your PU NetID and its associated password, followed by a challenge from the DUO authentication system. If you have never used DUO to access the systems on campus, please refer to the instructions on this page. Also, if you have never logged into the PU Linux systems, or haven't recently, you may need to request Unix access for your NetID.
NOTE: There is a way to authenticate only once with DUO during a session. See the Multiplexing Approach in these instructions.
Accessing Traverse outside of PPPL and Princeton University Networks
If connecting to Traverse from a location outside of the PPPL or Princeton University networks, a VPN connection is required. You can use either the PPPL "Secure Pulse" connection to the PPPL network or the Princeton University "Secure Remote Access (SRA)" connection. See these instructions to use the Princeton University VPN.
Each node has 2 POWER9 CPUs at 2.7 GHz. Each CPU is composed of 16 cores where each core supports 4 hardware threads. Slurm allows jobs with up to 128 tasks per node. Note that many applications will run faster when only using 1 hardware thread per core. To configure this see the Simultaneous Multithreading section below. There is 256 GB of RAM per node.
Below is a schematic diagram of a single node of Traverse:
Information about the CPU is available from the lscpu command:
$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        6
Model:               2.3 (pvr 004e 1203)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2300.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):
We see from the above that there is a NUMA node associated with each CPU and GPU. To learn more about the NUMA nodes run this command: numactl -H
As shown in the schematic diagram above, each node has 4 NVIDIA V100 GPUs. Each GPU has 32 GB of memory. The transfer speed between the GPU and its memory is about 800 GB/s. Note that there are 6 channels (or 3 bricks) for the NVLink giving 75 GB/s per direction (150 GB/s bi-directional). Summit is limited to only 50 GB/s per direction. To see the topology run: nvidia-smi topo -m
Each V100 GPU contains 80 streaming multiprocessors (SM). Below is a diagram of an SM on Traverse:
NVLink is the name of the fast GPU-to-GPU and GPU-to-CPU interconnect on Traverse. As shown in the schematic diagram above, one can achieve transfer rates of 150 GB/s. This is much faster than the limit of 16 GB/s on TigerGPU.
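To get a feel for what these rates mean in practice, the sketch below estimates the idealized time to move a buffer at the 75 GB/s per-direction NVLink rate quoted earlier versus the 16 GB/s limit on TigerGPU. It is a lower bound only: latency, protocol overhead and contention are ignored, and the helper name `transfer_time_seconds` is hypothetical.

```python
def transfer_time_seconds(size_gb, bandwidth_gb_per_s):
    """Idealized transfer time: size divided by bandwidth, no overheads."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


NVLINK_PER_DIRECTION = 75.0  # GB/s on Traverse, figure quoted in this guide
TIGERGPU_LIMIT = 16.0        # GB/s on TigerGPU, figure quoted in this guide

size_gb = 32.0  # the full memory of one V100

nvlink_time = transfer_time_seconds(size_gb, NVLINK_PER_DIRECTION)  # about 0.43 s
tigergpu_time = transfer_time_seconds(size_gb, TIGERGPU_LIMIT)      # 2.0 s

print(f"NVLink:   {nvlink_time:.2f} s")
print(f"TigerGPU: {tigergpu_time:.2f} s")
print(f"ratio:    {tigergpu_time / nvlink_time:.1f}x")
```

Codes that repeatedly shuttle large arrays between host and device are the ones most likely to see this difference in their overall run time.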
Using GPUDirect, multiple GPUs, network adapters, solid-state drives and NVMe drives can directly read and write CUDA host and device memory, eliminating unnecessary memory copies, dramatically lowering CPU overhead, and reducing latency, resulting in significant performance improvements in data transfer times for applications running on NVIDIA Tesla GPUs.
The V100 GPUs have 640 Tensor Cores (8 per streaming multiprocessor) where half-precision (16-bit, FP16) Warp Matrix Multiply and Accumulate (WMMA) operations can be carried out. That is, each Tensor Core can multiply two 4 x 4 matrices together in half-precision and add the result to a third matrix which is in full precision. This is useful for training and inference on deep neural networks and many other computations that are rooted in linear algebra.
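The arithmetic of a single Tensor Core operation, D = A x B + C with A and B in FP16 and C and D in FP32, can be emulated on the CPU to see where the precision is lost. The sketch below uses only the Python standard library (`struct` format codes `e` and `f` for half and single precision). It illustrates the numerics only and says nothing about how the hardware schedules the work; the function names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mma_4x4(a, b, c):
    """Return D = A*B + C for 4x4 matrices given as lists of lists.

    A and B are rounded to FP16 first; the accumulator stays in FP32.
    """
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = []
    for i in range(4):
        row = []
        for j in range(4):
            acc = to_fp32(c[i][j])
            for k in range(4):
                # The product of two FP16 values fits exactly in FP32.
                acc = to_fp32(acc + a16[i][k] * b16[k][j])
            row.append(acc)
        d.append(row)
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]

print(mma_4x4(identity, b, zeros) == b)  # small integers survive FP16 rounding
print(to_fp16(0.1))                      # 0.1 is not representable in FP16
```

Inputs that are not exactly representable in FP16, such as 0.1, are rounded before the multiply, which is why mixed-precision codes usually keep a full-precision copy of sensitive data.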
There are several use cases where the Tensor Cores can be utilized on the V100 GPUs of Traverse. In general, these are algorithms built on Level 3 BLAS routines. In almost all cases the user needs to take explicit action to use the Tensor Cores.
The NVIDIA Apex library allows for automatic mixed-precision (AMP) training and distributed training of neural networks. It is included with an installation of PyTorch from WML-CE. To see the performance benefit of the Tensor Cores, download the dcgan example and run it with and without using the Tensor Cores. Using 16 hardware threads one finds a speed-up of about 10%. Note that to use the fp16 kernels the dimension of each matrix must be a multiple of 8. Read about the constraints here.
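One common way to satisfy the multiple-of-8 constraint is to round layer widths, batch sizes or vocabulary sizes up to the next multiple of 8 when defining a model. The helper below is a minimal sketch of that rounding; its name is hypothetical, and whether padding is appropriate depends on the model.

```python
def round_up_to_multiple(n, multiple=8):
    """Return the smallest multiple of `multiple` that is >= n."""
    if n < 0 or multiple <= 0:
        raise ValueError("n must be non-negative and multiple must be positive")
    return ((n + multiple - 1) // multiple) * multiple


for size in (100, 1000, 1001):
    print(size, "->", round_up_to_multiple(size))
```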
Another example using Fortran is here. There are algorithms in the MAGMA library (discussed below) that can utilize the Tensor Cores of V100 GPUs. Mixed precision Krylov and Multigrid solvers have also been developed, as discussed in this presentation.
NVIDIA has introduced a larger number and different types of Tensor Cores in the A100 GPU. Additionally, in many cases the Tensor Cores are automatically used and many of the constraints have been relaxed. There are no Tensor Cores on the P100 GPUs on TigerGPU.
There are two locations where you can store your files: /home and /scratch/gpfs. Home directories are on an NFS filesystem and are backed up. /scratch/gpfs is a high-performance GPFS parallel filesystem where you should run your simulations, and it is NOT backed up. Note that /scratch/gpfs is shared with Stellar, as indicated in the figure below. Home directories have a user quota of 10 GB (request an increase). Your space on /scratch/gpfs has a user quota of 500 GB. When your account on Traverse is created, a directory named /scratch/gpfs/<YourNetID> is created for you, in addition to your home directory.
For PPPL users: Please note that directories on Traverse are named /home/<username>, which differs from PPPL's conventions of /u/<username>. If you have the full path to your home directory hard-coded in any scripts you plan on running on Traverse, please be sure to modify them to use this different path. It is best to use the environment variable $HOME to refer to your home directory, since that is much more portable.
GPFS stands for General Parallel File System. GPFS is a high-performance parallel filesystem that provides much higher IOPS and bandwidth than a non-parallel filesystem. Due to its parallel performance and larger quota, /scratch/gpfs should be used for all file I/O for jobs running on Traverse. That means all input data should be copied to /scratch/gpfs before you start your job, and your job should write all its results to /scratch/gpfs. When your job is done, results can be compressed and copied over to your home directory, and/or copied back to PPPL for long-term storage. Only Traverse users who are also members of a PU research group will have read/write access to /tigress and /projects.
Each compute node has an NVMe (non-volatile memory express) drive with 3.2 TB capacity for fast reads and writes. These drives are local to each node. The path to the NVMe drive is /tmp. Note that all files written to /tmp are removed when the job finishes. Therefore you must copy any files you want to keep from /tmp to /scratch/gpfs before the job finishes. For more information on using the NVMe drives see this page. On Traverse, one should find that a 12 GB file can be copied from /scratch/gpfs to /tmp on a compute node in about 2 seconds. A much longer time is needed to copy the same file to /home. Note that you may not see fast writes to /tmp if your application writes in small chunks or even line-by-line. /tmp is an alias for /scratch. The only difference between the two is that if you write to /scratch then your files will not be deleted after the job finishes. This is not preferred. Please write to /tmp.
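The stage-in, compute, stage-out pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `run_with_local_scratch` and the `work` callback are hypothetical, and the directories are passed in by the caller. In a real job `local_dir` would be a directory under /tmp and `results_dir` a directory under /scratch/gpfs/<YourNetID>.

```python
import shutil
from pathlib import Path


def run_with_local_scratch(input_file, local_dir, results_dir, work):
    """Copy the input to node-local storage, run `work`, and save its output.

    `work` receives the local copy of the input and must return the path of
    the output file it wrote. The output is copied to `results_dir` before
    returning, since node-local files are removed when the job ends.
    """
    local_dir = Path(local_dir)
    results_dir = Path(results_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    results_dir.mkdir(parents=True, exist_ok=True)

    local_input = Path(shutil.copy2(input_file, local_dir))   # stage in
    local_output = Path(work(local_input))                    # compute locally
    return Path(shutil.copy2(local_output, results_dir))      # stage out
```

Writing results in large blocks inside `work`, rather than line by line, is what lets the NVMe drive deliver its full bandwidth.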
Currently, the best way to move a small amount of data back and forth between PPPL and Traverse is to use scp, rsync over ssh, or bbcp (official bbcp documentation, more user-friendly documentation). To transfer a large amount of data, a dedicated Globus endpoint is available. The name of the endpoint is "Princeton Traverse/Stellar Scratch DTN" and, as its name implies, it has direct access to the data on the /scratch/gpfs filesystem on Traverse.
The software environment on Traverse is very similar to the other Research Computing clusters such as Tiger. See the general documentation for Princeton University Research Computing.
If you find that you need software packages that are not installed on Traverse then please send a request via e-mail to email@example.com. Please note that commercial applications are not always available for the POWER architecture (e.g., MATLAB).
The Anaconda Python distribution should be used when working with Python on Traverse:
$ module avail anaconda3
$ module load anaconda3/2020.7
$ python --version

$ module load anaconda/2019.10
$ python --version
See this page for more information on using the Anaconda Python distribution on the Research Computing clusters. One may also consider installing Anaconda or Miniconda (see "Other Resources") for POWER9.
The system Python is available if needed. This can be useful for some tasks such as building codes:
$ python
-bash: python: command not found
$ python3
Python 3.6.8 (default, Dec 5 2019, 16:11:43)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
$ python2
Python 2.7.17 (default, Oct 30 2019, 17:39:41)
[GCC 8.3.1 20190507 (Red Hat 8.3.1-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
In general, for scientific work one wants to use the Anaconda Python distribution which is described above.
To run a Jupyter Notebook or JupyterLab on the Traverse head node follow the directions under "Running on Tigressdata" on this page while substituting "traverse" for "tigressdata". There are also directions for running on a compute node. If using a VPN is not an option then use the directions under "Avoiding Using a VPN from Off-Campus".
There are many useful Anaconda packages in the IBM Watson Machine Learning Community Edition channel. Here is a partial list of popular packages:
- PyTorch (built with Large Model Support and Apex)
- TensorFlow (built with the Distributed Deep Learning package)
- NVIDIA RAPIDS (cuDF, cuML, dask-cuDF or install powerai-rapids to get all packages)
- Snap ML (conventional ML models with API similar to Scikit-Learn; Apache Spark and MPI APIs)
- TensorRT for GPU-accelerated inference
Research Computing also maintains dedicated documentation for TensorFlow and PyTorch. As those pages note, if you need a newer version of the software found in the IBM WML-CE channel then consider using the early access channel: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
To get started with RAPIDS, create a conda environment:
$ ssh <YourNetID>@traverse.princeton.edu
$ module load anaconda3
$ CHNL=https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
$ conda create --name rapids-env --channel $CHNL cudf cuml
# accept the license agreement
To take advantage of the GPUs on Traverse, one needs the CUDA package for Julia, which requires Julia 1.3 or later. Here is a procedure to build version 1.5:
$ cd /scratch/gpfs/<YourNetID>
$ wget https://github.com/JuliaLang/julia/releases/download/v1.5.2/julia-1.5.2-full.tar.gz
$ tar xvf julia-1.5.2-full.tar.gz
$ cd julia-1.5.2
# create a Make.user file containing these 3 lines:
USE_BINARYBUILDER:=0
USE_BINARYBUILDER_LLVM:=1
JULIA_PRECOMPILE:=0
$ make
$ prefix=$HOME/local/julia/1.5.2  # or choose a different install location
$ make prefix=$prefix install
$ export PATH=$prefix/bin:$PATH
Other software can be seen by running the module avail command. There is a small number of software packages in the /opt directory.
There are several ways to write a code that will run on GPUs. Here is a list of the most widely used methods, from the easiest to implement to the most difficult (but most powerful).
There are many "CUDA-based" libraries that can be used to run parts of your code on a GPU. It can be as simple as calling a library function or routine. Just make sure that the section of code that you put on the GPU is compute intensive otherwise you will not see a speedup. See the Numerical Libraries section below.
OpenACC is a "directive-based" programming model designed for scientists and engineers interested in porting their codes to a wide variety of heterogeneous HPC hardware platforms and architectures with significantly less programming effort than required with a low-level model (such as CUDA). In practice, OpenACC is mainly used for porting Fortran, C, and C++ codes to GPUs. Directives are special comments (Fortran) or preprocessor pragmas (C/C++) that instruct the compiler to generate GPU instructions for given sections of a code. The best OpenACC support is found in the PGI compilers and their successor, the NVIDIA HPC SDK. GCC also supports OpenACC, but the GPU code it generates is usually slower. Here is an example of OpenACC usage:
program laplace
  implicit none
  integer, parameter :: fp_kind=kind(1.0d0)
  integer, parameter :: n=1024, m=1024, iter_max=1000
  integer :: i, j, iter
  real(fp_kind), dimension (:,:), allocatable :: A, Anew
  real(fp_kind) :: tol=1.0e-6_fp_kind, error=1.0_fp_kind
  real(fp_kind) :: start_time, stop_time

  allocate ( A(0:n-1,0:m-1), Anew(0:n-1,0:m-1) )

  A    = 0.0_fp_kind
  Anew = 0.0_fp_kind

  ! Set B.C.
  A(0,:)    = 1.0_fp_kind
  Anew(0,:) = 1.0_fp_kind

  write(*,'(a,i5,a,i5,a)') 'Jacobi relaxation Calculation:', n, ' x', m, ' mesh'

  call cpu_time(start_time)

  iter=0

  ! Keep both arrays on the GPU for the duration of the iteration loop
!$acc data copyin(Anew), copy(A)
  do while ( error .gt. tol .and. iter .lt. iter_max )
    error=0.0_fp_kind

    ! Jacobi update and max-norm of the change
!$acc kernels
    do j=1,m-2
      do i=1,n-2
        Anew(i,j) = 0.25_fp_kind * ( A(i+1,j  ) + A(i-1,j  ) + &
                                     A(i  ,j-1) + A(i  ,j+1) )
        error = max( error, abs(Anew(i,j)-A(i,j)) )
      end do
    end do
!$acc end kernels

    if(mod(iter,100).eq.0 ) write(*,'(i5,f10.6)') iter, error
    iter = iter + 1

    ! Copy the new values back into A for the next sweep
!$acc kernels
    do j=1,m-2
      do i=1,n-2
        A(i,j) = Anew(i,j)
      end do
    end do
!$acc end kernels

  end do
!$acc end data

  call cpu_time(stop_time)
  write(*,'(a,f10.3,a)') ' completed in ', stop_time-start_time, ' seconds'

  deallocate (A,Anew)
end program laplace
Here is a link to a short OpenACC tutorial.
You should make every effort to take advantage of the GPU-enabled libraries from NVIDIA and other vendors to accelerate your code. If the APIs are too rigid and they do not fit your needs then consider OpenACC as described above. Finally, there is the option of writing custom GPU kernels from scratch in C++ or Fortran. See this workshop for an introduction to CUDA at Princeton.
The system version of GCC is 8.3.1. For instance:
$ gcc --version
gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
$ g++ --version
g++ (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
$ gfortran --version
GNU Fortran (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
A good starting point for GCC optimization flags on Traverse is:
$ gcc -Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG -o myprog myprog.c
Take a look at man gcc for more. While you should prefer the system version of GCC (8.3.1), in some cases it may be necessary to use an earlier version such as for codes that require older versions of the CUDA Toolkit and PGI compiler. The rh/devtoolset/7 environment module exists for this purpose:
$ module load rh/devtoolset/7
$ gcc --version
gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
$ g++ --version
g++ (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
$ gfortran --version
GNU Fortran (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
NVIDIA acquired PGI in 2013. There will be no future releases of PGI compilers. Instead NVIDIA offers the HPC SDK which is a collection of compilers, libraries and tools supporting multiple programming models on multicore CPUs and GPU nodes. See the documentation. The HPC SDK provides the recommended compilers for OpenACC codes.
- nvcc is the C++ compiler for CUDA kernels
- nvc is the C compiler
- nvc++ is the C++ compiler
- nvfortran is the Fortran compiler (it can generate GPU code automatically for the V100)
NVIDIA recommends using NVSHMEM instead of Open MPI when developing a parallel code for GPU nodes. One can also replace slow MPI calls with the appropriate routine in NVSHMEM.
Traverse provides the NVIDIA HPC SDK as a module in two variants:
$ module avail nvhpc
--- /opt/share/Modules/modulefiles ---
nvhpc-nocompiler/20.7  nvhpc/20.7
The V100 GPU has a compute capability of 7.0 so use these flags to compile a kernel:
$ module load nvhpc
$ nvcc -O3 -arch=sm_70 --use_fast_math -o mykernel mykernel.cu
To compile an OpenACC code written in C:
$ module load nvhpc
$ nvc -O3 -acc -gpu=cc70 -Minfo -Mneginfo -o laplace laplace2d.c
Note that the CUDA Toolkit is included in the nvhpc module.
There will be no future releases of the PGI compilers. Users should favor their replacements which are available in the NVIDIA HPC SDK (see above). Here are the available PGI modules:
$ module avail pgi
--- /opt/share/Modules/modulefiles ---
pgi/19.5/64  pgi/19.9/64  pgi/20.4/64
Also, we've made changes to the CUDA Toolkit 10.2 to allow use with PGI 20.4. To do this you will have to add the following define flag to the nvcc compile line (e.g., for building PETSc one would do that by adding it to CUDAFLAGS):
We have the community edition (version 16.1.1) of the IBM XL compilers. This version was released in December 2018. While it offers several GPU features, it can only go as high as version 10.1 of CUDA. See this video for an overview.
Users should favor GCC or one of the other compilers over XL. A newer, paid version with additional optimizations is available.
To get started:
$ xlc --version
IBM XL C/C++ for Linux, V16.1.1 (Community Edition)
Version: 16.01.0001.0003
$ xlc -qhelp
$ xlC -qhelp
$ xlf -qhelp
A good starting point for XL optimization flags on Traverse is:
$ xlc -Ofast -qarch=pwr9 -qtune=pwr9 -qsimd=auto -DNDEBUG -o myprog myprog.c
For more on optimization see Code Optimization with IBM XL Compilers. This guide says: "VMX and VSX machine instructions can execute up to sixteen operations in parallel."
The Intel compilers cannot be used on the POWER architecture of Traverse.
In addition to traditional builds of the MPI library, Traverse also offers CUDA-aware MPI builds which allow data on one GPU to be sent to another GPU without going through a CPU. According to NVIDIA, regular MPI implementations pass pointers to host memory, staging GPU buffers through host memory using cudaMemcpy. With CUDA-aware MPI, the MPI library can send and receive GPU buffers directly, without having to first stage them in host memory. To see the CUDA-aware MPI modules:
$ module avail openmpi/cuda
A simple code that uses CUDA-aware MPI is here. In the figure below, RDMA is remote direct memory access.
NVIDIA offers a range of GPU-accelerated math libraries. They are designed as drop-in replacements for commonly used CPU-only libraries making it easy to incorporate them in your code. Here is a list of selected libraries:
- cuDNN - GPU-accelerated library of primitives for deep neural networks
- cuBLAS - GPU-accelerated standard BLAS library
- cuSPARSE - GPU-accelerated BLAS for sparse matrices
- cuRAND - GPU-accelerated random number generation (RNG)
- cuSOLVER - Dense and sparse direct solvers for computer vision, CFD and other applications
- cuFFT - GPU-accelerated library for Fast Fourier Transforms
- NPP - GPU-accelerated image, video, and signal processing functions
- NCCL - Collective Communications Library for scaling apps across multiple GPUs and nodes
- nvGRAPH - GPU-accelerated library for graph analytics
For the complete list see GPU libraries by NVIDIA. You can inspect the available libraries for a given CUDA Toolkit version like so:
$ ls -lL /usr/local/cuda-11.1/lib64/lib*.so
Note that NVIDIA moved cuBLAS out of the toolkit in version 10 (to /usr/lib64 on our system) and then moved it back in for version 11.
MAGMA is a linear algebra library that is designed for multicore nodes with GPUs. It can be thought of as an improvement over LAPACK for such nodes. MAGMA is capable of using the Tensor Cores of the V100 GPUs of Traverse.
MAGMA is available on Anaconda Cloud from the IBM WML-CE channel.
Here is a sample build from source of MAGMA on Traverse:
$ ssh traverse
$ cd software
$ wget http://icl.utk.edu/projectsfiles/magma/downloads/magma-2.5.3.tar.gz
$ tar zxf magma-2.5.3.tar.gz
$ cd magma-2.5.3
$ wget https://raw.githubusercontent.com/jdh4/advanced_traverse/master/numerical_libraries/make.inc
$ module load cudatoolkit/10.2
$ export CUDADIR=/usr/local/cuda-10.2
$ make
$ make install prefix=$HOME/software/magma
ESSL is a numerical library for linear algebra, eigensystem analysis, Fourier transforms, convolutions and correlations, sorting and searching, interpolation, numerical quadrature and random number generation. With respect to its linear algebra routines, ESSL is not a full implementation of BLAS/LAPACK.
The header files and libraries are here:
$ ls -lL /usr/include/*essl*
-rw-r--r--. 1 bin bin 171727 Feb 24 2018 /usr/include/essl.h
-rw-r--r--. 1 bin bin 4187 Jun 3 2016 /usr/include/essl_lapacke_config.h
-rw-r--r--. 1 bin bin 64882 Jan 16 2018 /usr/include/essl_lapacke.h
$ ls -lL /usr/lib64/*essl*.so
-rw-r--r--. 1 bin bin 45719787 Mar 29 2018 /usr/lib64/libessl6464.so
-rw-r--r--. 1 bin bin 53379191 Mar 29 2018 /usr/lib64/libesslsmp6464.so
-rw-r--r--. 1 bin bin 54737430 Mar 29 2018 /usr/lib64/libesslsmpcuda.so
-rw-r--r--. 1 bin bin 53925425 Mar 29 2018 /usr/lib64/libesslsmp.so
-rw-r--r--. 1 bin bin 46826939 Mar 29 2018 /usr/lib64/libessl.so
The functionality that can run on GPUs using libesslsmpcuda.so is listed on this page.
According to the PETSc installation page:
"Sadly, IBM's ESSL does not have all the routines of BLAS and LAPACK that some packages, such as SuperLU expect; in particular slamch, dlamch and xerbla. In this case instead of using ESSL we suggest --download-fblaslapack."
There are example builds of PETSc on Traverse on this page.
For profiling there is gprof and MAP. For GPU codes, the NVIDIA toolkit provides nsys and nv-nsight-cu-cli. To view the output you will need to move the report file to an x86_64 machine to use nsight-sys or nv-nsight-cu (documentation).
Our Arm Forge license includes the DDT parallel debugger and the MAP profiler. Both of these tools can be used for parallel codes that use GPUs. Each can be used on up to 512 processes across all users at once.
Traverse uses environment modules to manage software packages and simplify environment setup. To see which modules are available, use the following command:
$ module avail
Automatic Module Substitution
To ease the transition from RHEL7 to RHEL8, modules from RHEL7 are automatically substituted for the appropriate module on RHEL8. For instance:
$ module load openmpi/gcc/3.1.4/64
$ module list
Currently Loaded Modulefiles:
 1) openmpi/gcc/4.0.4/64
Any module that ends in "@" in the output of module avail is an alias and it will be replaced when loaded.
The university uses a 'flat' hierarchy so that you can see all the modules that are available all at once with the module avail command. To make it clear what library or application modules go with what compiler or MPI implementation, their modules have multiple levels of version information.
When loading modules in this environment, be sure to specify the complete module name. Otherwise, you may not get the module you hoped for. For example, if you load a compiler module, and then just say module load openmpi, you may not get the correct version of Open MPI for your compiler.
Modules are added as needed. You should always check for the latest versions available by using the module avail command.
For PPPL users: Traverse uses the Slurm scheduler, which we have been using at PPPL since January of 2017. The main differences between Slurm on Traverse and the other PPPL clusters are the job limits and the availability of GPUs on all the nodes. Slurm implements limits through a concept named QOS (Quality of Service). Run the command below to see each QOS, its priority (the higher the number, the higher the priority), and its limits:
To see the time limits on the different queues, see the "Job Scheduling (QOS parameters)" section near the top of this page.
Some notes about the terms used above:
- maxTRESPU = Max number of trackable resources (TRES) used by that QOS. Using traverse-short as an example, all of the CPU-cores simultaneously in use in QOS traverse-short cannot exceed 5888 CPU-cores, regardless of how jobs are being run, or by how many different users.
- MaxCPUsPU = The maximum number of trackable resources a single user can use at one time in that QOS. Using traverse-short as an example, no single user can use more than 588 CPU-cores at a time, regardless of how many jobs are running in that user's name in that QOS.
- MaxJobsPU = The maximum number of jobs that one user can have running in that QOS. For example, in traverse-medium, one user can have 10 jobs running at once, but only 2 in traverse-test.
For PPPL users: Since Traverse uses Slurm, running jobs is very similar to running jobs on PPPL's existing clusters. There are several important differences:
- You must specify your account as 'pppl' with the -A switch. This is not the same type of account as a user account. This is more like a 'bank' account that is used to keep track of which groups are using the cluster and how much. In this case, all PPPLers are in the same account, 'pppl'.
- You do not need to specify which QOS you want to use. The job submission filter looks at job size and time limit and automatically assigns the appropriate QOS based on the resource requirements of the job.
- Note that when running the sbatch command to submit your job, your environment may not be fully exported. Because of this, you must load the required modules for the job in the Slurm script. This is also true for interactive allocations with salloc. That is, when you land on the compute node you must reload your modules.
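The points above can be put together in a short job script. The following is a minimal sketch rather than a recommended template: it writes a script to job.slurm (a hypothetical file name) that sets the account to pppl, specifies no QOS, and loads its modules inside the script. The module version is the one used in the examples on this page, the resource values are arbitrary, and ./my_program is a placeholder for your own executable.

```shell
# Write a minimal job script; job.slurm and ./my_program are placeholder names.
job_script=job.slurm
cat > "$job_script" <<'EOF'
#!/bin/bash
# The shared PPPL account (long form of -A pppl).
#SBATCH --account=pppl
# No QOS is requested: it is assigned from the job size and time limit.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

# Load modules inside the script, since the environment of the
# submitting shell may not be fully exported to the job.
module purge
module load openmpi/gcc/4.0.4/64

srun ./my_program
EOF

# Submit on Traverse with:
#   sbatch job.slurm
```

Keeping the module commands in the script also makes the job reproducible, because it no longer depends on what happened to be loaded in the login shell.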
See example Slurm scripts for running jobs on Traverse and the other Princeton HPC clusters. Below are a few examples specific to Traverse.
The Slurm script below runs 4 MPI tasks on a single node (--ntasks-per-node=4) with 8 OpenMP threads per task (export OMP_NUM_THREADS=8) and 1 thread per physical core (export OMP_PLACES=cores). Each MPI task is also bound to 1 GPU (--gpus-per-task=1), the one that is closest to it (--gpu-bind=map_gpu:0,1,2,3). Each POWER9 socket (processor) has 16 physical cores and 4 hardware threads per core, so there are 64 hardware threads per socket and 128 per node. In Slurm, a "cpu" corresponds to a hardware thread, so for OMP_PLACES=cores to work, Slurm must be told how many "cpus" are available to each task. With 4 MPI tasks per node (2 per socket), that is 128 / 4 = 32 cpus per task (--cpus-per-task=32).
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=2
#SBATCH --cpus-per-task=32
#SBATCH --gpu-bind=map_gpu:0,1,2,3
#SBATCH --gpus-per-task=1
#SBATCH --time=00:30:00

module purge
module load cudatoolkit/11.1
module load openmpi/gcc/4.0.4/64

export OMP_NUM_THREADS=8
export OMP_PLACES=cores

srun ./hello_jsrun_new | sort >& job_4MPI_8_OMP_1thread_per_core.out
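The --cpus-per-task value in this script follows from the node layout. As a quick check, the arithmetic can be reproduced with shell integer arithmetic. This is only a sketch; the variable names are illustrative and are not Slurm settings.

```shell
# Node layout as described on this page.
sockets_per_node=2
cores_per_socket=16
threads_per_core=4
tasks_per_node=4

# A Slurm "cpu" on this system is one hardware thread.
cpus_per_node=$(( sockets_per_node * cores_per_socket * threads_per_core ))
cpus_per_task=$(( cpus_per_node / tasks_per_node ))

# With OMP_PLACES=cores, one OpenMP thread is placed per physical core.
cores_per_task=$(( cpus_per_task / threads_per_core ))

echo "cpus per node:  $cpus_per_node"
echo "cpus per task:  $cpus_per_task"
echo "cores per task: $cores_per_task"
```

The cores-per-task result matches OMP_NUM_THREADS=8 in the script. Changing tasks_per_node shows how --cpus-per-task and the OpenMP thread count would need to change together.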
Four nodes have been set aside during normal working hours for test jobs. To qualify for this queue, your job must run for less than an hour and use 4 nodes or fewer, and you must have the following directive in your Slurm script:
The above also holds true for interactive allocations through the salloc command (for more see this page).
Below is a schematic diagram of a single node of Traverse:
Recall that there are 16 physical cores per CPU on Traverse with each core supporting up to 4 hardware threads. The ability to run multiple threads of execution per core is called Simultaneous Multithreading (SMT). The hardware threads within a core share resources such as the L1 cache. For this reason it is often found that using only 1 or 2 hardware threads per core leads to optimal performance.
Let's say that you want to run an MPI application using all 32 cores per node with only one process per core (i.e., 1 hardware thread per core). The following directives may be used for this case:
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-core=1
To verify that only a single thread is being used per core you can ssh to the compute node where the job is running and run the htop command.
Note that optimal values of nodes, ntasks, cpus-per-task and ntasks-per-core must be determined empirically for each code. One should construct a table and carry out the appropriate benchmark runs to determine these values.
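One way to start such a table is to generate the candidate configurations with a small loop and then fill in the measured run times by hand. The sketch below assumes a pure MPI code with one Slurm cpu per task, as in the example above. The ranges of node counts and tasks per core are illustrative assumptions, not recommendations, and the wall_time column name is hypothetical.

```shell
# Physical cores per node (2 sockets x 16 cores), as described on this page.
cores_per_node=32

# Candidate values to sweep; these ranges are illustrative assumptions.
node_counts="1 2"
tasks_per_core_values="1 2 4"

# Build the table as CSV text. The last column is left empty and is
# meant to be filled in with the measured run time of each benchmark.
table="nodes,ntasks,cpus-per-task,ntasks-per-core,wall_time"
for nodes in $node_counts; do
    for tpc in $tasks_per_core_values; do
        ntasks=$(( nodes * cores_per_node * tpc ))
        table="$table
$nodes,$ntasks,1,$tpc,"
    done
done

echo "$table"
```

Each row corresponds to one benchmark run. For a hybrid MPI/OpenMP code, cpus-per-task would also need to be varied instead of being fixed at 1.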
Certain MPI codes that use GPUs may benefit from CUDA MPS (see ORNL docs), which enables multiple processes to concurrently share the resources on a single GPU. To use MPS simply add this directive to your Slurm script:
In most cases users will see no speed-up. Codes where the individual MPI processes underutilize the GPU should see a performance gain.
If you need to use graphical applications on the Traverse head node, such as DDT, MAP or an IDE, then consider using TurboVNC. TurboVNC is based on VNC, which has many performance advantages over X11 forwarding (i.e., ssh -X). Begin by reading this page while substituting "traverse" for "tigressdata". Be sure to use the shell functions at the bottom of the page to quickly set up a TurboVNC session.
Send an email to firstname.lastname@example.org. User support for Traverse is being shared between PPPL and Princeton University. For that reason, do not open tickets in the PPPL "Service Now" system for issues regarding Traverse. Instead, send an email describing your problem to email@example.com. This will automatically create a ticket that will be seen by both the PPPL and Princeton University Research Computing support staff.