Traverse

Traverse Supercomputer

The Traverse cluster is composed of 46 IBM POWER9 nodes with four NVIDIA V100 GPUs per node. The cluster is primarily intended to support research at the Princeton Plasma Physics Lab (PPPL). Traverse is also available to Princeton researchers whose work is particularly suited to the architecture of this system, either because it is very similar to the Summit cluster at Oak Ridge National Laboratory or because the application to be run can take particular advantage of the NVLink architecture. Programs that move a lot of data in or out of the GPU should see an especially large speedup.

System Configuration and Usage

General Guidelines

The system head node, traverse, should be used for interactive work only, such as compiling programs and submitting jobs as described below. No jobs should be run on the head node other than brief tests that last no more than a few minutes. Where practical, we ask that you entirely fill the nodes so that CPU core fragmentation is minimized.

Please remember that these are shared resources for all users.

Running Jobs

All jobs must be run through the Slurm scheduler.

Maintenance Window

Traverse will be down for maintenance the second Tuesday of the month from 6 AM - 2 PM.

Acknowledging Research Computing

Please see this page for the Acknowledgements section of your publications and presentations.

Hardware Configuration

Traverse (IBM Linux Cluster)

                    CPU                    GPU
Processor Speed     2.7 GHz IBM POWER9     1530 MHz V100 SXM2
Nodes               46                     -
Cores per Node      32 (128 with SMT)      4 GPUs per node
Memory per Node     256 GB                 32 GB per GPU
Total Cores         1472 (5888 with SMT)   186 GPUs
Interconnect        EDR InfiniBand         NVLink
Peak Performance    1435 TFLOPS            7.8 TFLOPS per GPU

Distribution of CPU and memory

There are 5888 SMT hardware threads available, 128 per node. Each node contains 256 GB of memory. The nodes are grouped 12 per rack, and each rack has a 1:1 EDR InfiniBand connection. Between racks there is 2:1 oversubscription. Each node has NVLink-connected GPUs, with two GPUs per CPU socket. EDR InfiniBand is connected at full speed to each CPU socket over PCIe Gen4 connections.

SMT stands for simultaneous multithreading. Each node of Traverse has 2 CPUs with 16 physical cores per CPU. Each physical core supports 4 hardware threads. Hence, SMT enables 128 threads per node.
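The totals quoted above follow from simple multiplication. The short Python sketch below reproduces them from the per-node figures. The variable names are illustrative only, and the GPU peak assumes the 7.8 TFLOPS per GPU figure from the hardware table, counting only the 4 GPUs on each of the 46 compute nodes.

```python
# Per-node figures taken from the hardware description above.
NODES = 46
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 16
THREADS_PER_CORE = 4      # SMT4
GPUS_PER_NODE = 4
TFLOPS_PER_GPU = 7.8      # peak per V100, from the hardware table

cores_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET
threads_per_node = cores_per_node * THREADS_PER_CORE
total_cores = NODES * cores_per_node
total_threads = NODES * threads_per_node
compute_gpus = NODES * GPUS_PER_NODE
gpu_peak_tflops = compute_gpus * TFLOPS_PER_GPU

print("cores/threads per node:", cores_per_node, threads_per_node)
print("total cores/threads:", total_cores, total_threads)
print("compute-node GPUs:", compute_gpus)
print("GPU peak (TFLOPS):", round(gpu_peak_tflops, 1))
```

The result for the compute nodes is 184 GPUs, which is two fewer than the 186 listed in the table; the difference may be the two V100 GPUs on the head node, but that is an inference rather than something the table states.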

The V100 GPUs have 640 Tensor Cores (8 per streaming multiprocessor) where half-precision Warp Matrix Multiply and Accumulate (WMMA) operations can be carried out. That is, each Tensor Core can multiply two 4 x 4 matrices together in half-precision and add the result to a third matrix which is in full precision. This is especially useful for training and inference on deep neural networks.

The nodes are all connected through InfiniBand switches for MPI traffic, /scratch/gpfs file system, and NFS.  Both /tigress and /projects are connected over NFS. Gigabit Ethernet is used for other communication.

Job Scheduling (QOS parameters)

All jobs must be run through the scheduler on Traverse.

Jobs are prioritized through the Slurm scheduler based on a number of factors: job size, run times, node availability, wait times, and percentage of usage over a 30 day period as well as a fairshare mechanism to provide access for large contributors. The policy below may change as the job mix changes on the machine.

Jobs will move to the test, short, medium or long quality of service as determined by the scheduler. They are differentiated by the wallclock time requested as follows:

QOS      Time Limit           Jobs per User   Cores per Job   Cores Available
test     1 hour               2 jobs          no limit        no limit
short    4 hours              10 jobs         no limit        no limit
medium   24 hours             6 jobs          3072 cores      5888 cores/120 GPUs
long     144 hours (6 days)   4 jobs          no limit        2944 cores/92 GPUs
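As a rough illustration of how the requested wallclock time maps onto the QOS table above, the Python sketch below picks the first QOS whose time limit covers the request. The function name qos_for_walltime is hypothetical, the limits are copied from the table, and the real assignment is made by Slurm and may take other factors into account.

```python
# Time limits in hours, copied from the QOS table above (shortest first).
QOS_TIME_LIMITS_HOURS = [
    ("test", 1),
    ("short", 4),
    ("medium", 24),
    ("long", 144),
]


def qos_for_walltime(hours):
    """Return the first QOS whose limit covers the requested hours.

    Hypothetical helper for illustration; Slurm makes the real decision.
    """
    if hours <= 0:
        raise ValueError("requested time must be positive")
    for name, limit in QOS_TIME_LIMITS_HOURS:
        if hours <= limit:
            return name
    raise ValueError("requested time exceeds the 144 hour maximum")


for requested in (0.5, 3, 24, 100):
    print(requested, "hours ->", qos_for_walltime(requested))
```

Requesting only the time a job really needs keeps it in a shorter QOS, which according to the table allows more simultaneous jobs per user.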

In most cases, these are the maximum numbers and limits may be placed if demand requires. Use the "qos" command to view the actual values in effect.

There is also a special 4-node system reservation available for the "pppl" group. It is active from 9 AM - 5 PM on weekdays and should be used for quick testing of code and not for any production runs. To use it, add the --reservation=test flag to your job script.

Recommended File System Usage (/home, /scratch, /tigress)

/home (shared via NFS to all the compute nodes) is intended for scripts, source code, executables and small static data sets that may be needed as standard input/configuration for codes.

/scratch/gpfs is intended for dynamic data that requires higher bandwidth I/O. Files are NOT backed up so this data should be copied or moved to persistent storage as soon as it is no longer needed for computations.

/tigress and /projects (shared using GPFS) are intended for more persistent storage. Users are provided with a default quota of 512 GB when they request a directory in this storage, and that default can be increased by requesting more. We do ask people to consider what they really need, and to make sure they regularly clean out data that is no longer needed since this filesystem is shared by the users of all our systems. See /tigress Usage Guidelines for more information.

/tmp (local to each compute node with a capacity of 3.2 TB) is intended for data local to each task of a job. It will be cleaned out at the end of each job. This is the fastest storage for access.

Traverse User Guide

The Traverse supercomputer is located in Princeton University's HPCRC data center. Eighty percent of the cluster is reserved for PPPL research while the balance belongs to a small number of research groups on main campus.

Outline


     Hardware Overview
     Performance
     Access
     Hardware Details
     GPU Features
           NVLink  |  GPUDirect  |  Tensor Cores
     Storage
     Data Transfer
     Software
           Python  |  Jupyter  |  Machine Learning
     GPU Programming
           GPU-Accelerated Libraries  |  OpenACC  | CUDA
     Compilers
           GNU GCC  |  NVIDIA HPC SDK  |  PGI  |  IBM XL
     Numerical Libraries
           NVIDIA Math  |  MAGMA  |  ESSL  |  PETSc
     Debuggers and Profilers
     Scheduler Policies
     Submitting Jobs
           Slurm  |  Reservation Queue  |  Simultaneous Multithreading  |  CUDA Multi-Process Service
     TurboVNC
     Getting Help

 

Hardware Overview

Traverse consists of:

  • 46 IBM AC922 POWER9 nodes, with each node having:
    • 2 IBM POWER9 processors (sockets)
      • 16 cores per processor
      • 4 hardware threads per core
    • 32 cores (128 hardware threads) per node
    • 256 GB of RAM per node
    • 4 NVIDIA V100 GPUs (2 per socket) with 32 GB of memory each
    • 3.2 TB NVMe (solid state) local storage (not shared between nodes)
    • EDR InfiniBand (100 Gb/s bi-directional per port), 1:1 per rack, 2:1 rack-to-rack interconnect
    • GPFS high-performance parallel scratch storage
    • Globus data transfer node (10 GbE external, EDR to storage)
  • InfiniBand Network
    • EDR InfiniBand (100 Gb/s)
    • Fully non-blocking (1:1) within a chassis, 2:1 oversubscription between chassis
  • Storage
    • Home directories (/home)
      • NFS
      • 5 TB total space
      • User quota: 10 GB (request more)
      • Uses InfiniBand (using IP over InfiniBand, or IPoIB)
      • backed up
    • Scratch space (/scratch/gpfs)
      • GPFS parallel filesystem
      • 2.9 PB
      • User quota: 500 GB
      • Uses InfiniBand
      • NOT BACKED UP
    • Local scratch (/tmp)
      • NVMe drive per compute node (3.2 TB)
      • See this page for usage
      • NOT BACKED UP

The head node traverse.princeton.edu can be used for compiling codes, running short tests, submitting jobs, etc. Make sure that you do not use more than 10% of the machine (cores and memory) for more than 10 minutes at a time since it is shared by all users. There are two V100 GPUs on the head node.

Traverse was upgraded to the RHEL8 operating system in September of 2020.

Traverse provides an on-ramp to the Summit supercomputer at the Oak Ridge National Laboratory, which is also composed of POWER9/V100 nodes. Summit has 150 times the number of GPUs of Traverse. You can learn a lot about Traverse by reading the Summit user guide.

The Traverse nodes are arranged in racks of 12 nodes. Compute node traverse-k04g5 is node 5 in rack 4.
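The naming scheme can be decoded programmatically, for example when grouping the nodes of a job by rack. The Python sketch below is a minimal illustration; the pattern "traverse-k<rack>g<node>" is inferred from the single example above, and the function name parse_node_name is hypothetical.

```python
import re

# Assumed pattern: "traverse-k<rack>g<node>", inferred from the single
# example traverse-k04g5; other hostnames may not follow it.
NODE_NAME = re.compile(r"^traverse-k(\d+)g(\d+)$")


def parse_node_name(hostname):
    """Return (rack, node) for a compute-node hostname."""
    match = NODE_NAME.match(hostname)
    if match is None:
        raise ValueError(f"unrecognized node name: {hostname!r}")
    return int(match.group(1)), int(match.group(2))


rack, node = parse_node_name("traverse-k04g5")
print(f"rack {rack}, node {node}")  # prints "rack 4, node 5"
```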

 

Performance

The High Performance LINPACK (HPL) benchmark was used to measure the performance of Traverse. The theoretical peak of the system is about 1.3 petaflops and HPL measured 1.1 petaflops. Note that 97% of the compute power of Traverse comes from the NVIDIA V100 Volta GPUs. Read an article about the debut of Traverse in 2019.
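The ratio of the measured HPL result to the theoretical peak is often quoted as the HPL efficiency. The small Python calculation below uses only the two rounded figures given above, so the result is approximate.

```python
# Figures quoted above, in petaflops (both are rounded values).
theoretical_peak_pflops = 1.3
hpl_measured_pflops = 1.1

efficiency = hpl_measured_pflops / theoretical_peak_pflops
print(f"HPL efficiency: {efficiency:.0%}")  # prints "HPL efficiency: 85%"
```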

 

Access

To request access to Traverse, please email Prentice Bisbal (pbisbal@pppl.gov) or Stephane Ethier (ethier@pppl.gov) and include a brief description of your code and whether or not it is GPU-enabled.

Once you have been given access to Traverse, you can log in with the following command:

$ ssh <YourNetID>@traverse.princeton.edu

The command above will work from any system on the PPPL network. Traverse uses the university's central authentication system (CAS), so you log in using your PU NetID and its associated password, followed by a challenge from the DUO authentication system. If you have never used DUO to access the systems on campus, please refer to the instructions on this page. Also, if you have never logged into the PU Linux systems, or haven't recently, you may need to request Unix access for your NetID.

NOTE: There is a way to authenticate only once with DUO during a session. See the Multiplexing Approach in these instructions.

Accessing Traverse outside of PPPL and Princeton University Networks

If connecting to Traverse from a location outside of the PPPL or Princeton University networks, a VPN connection is required. You can use either the PPPL "Secure Pulse" connection to the PPPL network or the Princeton University "Secure Remote Access (SRA)" connection. See these instructions to use the Princeton University VPN.

 

Hardware Details

Each node has 2 POWER9 CPUs at 2.7 GHz. Each CPU is composed of 16 cores where each core supports 4 hardware threads. Slurm allows jobs with up to 128 tasks per node. Note that many applications will run faster when only using 1 hardware thread per core. To configure this see the Simultaneous Multithreading section below. There is 256 GB of RAM per node.

Below is a schematic diagram of a single node of Traverse:

[Figure: schematic diagram of a single Traverse node]

 

Information about the CPU is available from the lscpu command:

$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        6
Model:               2.3 (pvr 004e 1203)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2300.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
NUMA node252 CPU(s): 
NUMA node253 CPU(s): 
NUMA node254 CPU(s): 
NUMA node255 CPU(s):

We see from the above that there is a NUMA node associated with each CPU and GPU. To learn more about the NUMA nodes run this command: numactl -H
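To make the NUMA layout easier to see, the Python sketch below separates the NUMA nodes in the lscpu output into those that own CPUs and those that do not. The function name split_numa_nodes is hypothetical, and the reading that the four CPU-less nodes belong to the GPUs is taken from the sentence above rather than derived from the output itself.

```python
# Excerpt of the lscpu output shown above.
LSCPU_EXCERPT = """\
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):
"""


def split_numa_nodes(lscpu_text):
    """Return (nodes_with_cpus, nodes_without_cpus) as lists of node IDs."""
    with_cpus, without_cpus = [], []
    for line in lscpu_text.splitlines():
        # Skip lines such as "NUMA node(s): 6" that carry no CPU list.
        if not line.startswith("NUMA node") or "CPU(s):" not in line:
            continue
        label, _, cpus = line.partition("CPU(s):")
        node_id = int(label.replace("NUMA node", "").strip())
        (with_cpus if cpus.strip() else without_cpus).append(node_id)
    return with_cpus, without_cpus


cpu_nodes, other_nodes = split_numa_nodes(LSCPU_EXCERPT)
print("NUMA nodes with CPUs:", cpu_nodes)
print("NUMA nodes without CPUs:", other_nodes)
```

Two nodes carry the 128 hardware threads (one per socket) and four carry none, which matches the total of 6 reported by lscpu.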

As shown in the schematic diagram above, each node has 4 NVIDIA V100 GPUs. Each GPU has 32 GB of memory. The transfer speed between the GPU and its memory is about 800 GB/s. Note that there are 6 channels (or 3 bricks) for the NVLink giving 75 GB/s per direction (150 GB/s bi-directional). Summit is limited to only 50 GB/s per direction.

 

GPU Features

Each V100 GPU contains 80 streaming multiprocessors (SMs). Below is a diagram of an SM on Traverse:

[Figure: diagram of a V100 streaming multiprocessor]

NVLink

NVLink is the fast interconnect used for GPU-to-GPU and GPU-to-CPU communication on Traverse. As shown in the schematic diagram above, one can achieve transfer rates of 150 GB/s. This is much faster than the limit of 16 GB/s on TigerGPU.

[Figure: NVLink]
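To put the NVLink figures in perspective, the Python sketch below estimates a best-case time for a one-way copy. It is only a lower bound under stated assumptions: it uses the 75 GB/s per-direction NVLink rate given in the Hardware Details section, treats the 16 GB/s TigerGPU figure as a one-way rate, ignores latency, and uses a 32 GB payload (the memory of one V100) purely as an example.

```python
def ideal_transfer_seconds(size_gb, rate_gb_per_s):
    """Lower-bound transfer time at a sustained peak rate (no latency)."""
    return size_gb / rate_gb_per_s


SIZE_GB = 32                  # example payload: the memory of one V100
NVLINK_ONE_WAY_GB_S = 75      # per direction on Traverse
TIGERGPU_GB_S = 16            # limit quoted above for TigerGPU

nvlink_s = ideal_transfer_seconds(SIZE_GB, NVLINK_ONE_WAY_GB_S)
tigergpu_s = ideal_transfer_seconds(SIZE_GB, TIGERGPU_GB_S)
print(f"NVLink: {nvlink_s:.2f} s, TigerGPU: {tigergpu_s:.2f} s")
print(f"ratio: {tigergpu_s / nvlink_s:.1f}x")
```

Real transfers will be slower than these estimates, but the ratio indicates why codes that move a lot of data between CPU and GPU benefit most from Traverse.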

GPUDirect

Using GPUDirect, multiple GPUs, network adapters, solid-state drives and NVMe drives can directly read and write CUDA host and device memory, eliminating unnecessary memory copies, dramatically lowering CPU overhead, and reducing latency, resulting in significant performance improvements in data transfer times for applications running on NVIDIA Tesla GPUs.

[Figure: NVIDIA GPUDirect]

Tensor Cores

The V100 GPUs have 640 Tensor Cores (8 per streaming multiprocessor) where half-precision (16-bit, FP16) Warp Matrix Multiply and Accumulate (WMMA) operations can be carried out. That is, each Tensor Core can multiply two 4 x 4 matrices together in half-precision and add the result to a third matrix which is in full precision. This is useful for training and inference on deep neural networks and many other computations that are rooted in linear algebra.
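The operation described above, D = A x B + C with half-precision inputs and a higher-precision accumulator, can be emulated on a CPU to see its numerical effect. The Python sketch below is only an illustration of the arithmetic, not of how Tensor Cores are programmed: the function names are hypothetical, and Python floats (64-bit) stand in for the FP32 accumulator of the real hardware.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def tensor_core_fma(a, b, c):
    """Compute D = A x B + C for 4 x 4 matrices given as nested lists.

    A and B are rounded to FP16 first; products are accumulated in
    higher precision (Python floats here, FP32 on the real hardware).
    """
    n = 4
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    return [
        [c[i][j] + sum(a16[i][k] * b16[k][j] for k in range(n))
         for j in range(n)]
        for i in range(n)
    ]


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[0.1] * 4 for _ in range(4)]
c = [[1.0] * 4 for _ in range(4)]
d = tensor_core_fma(a, identity, c)
print("0.1 rounded to FP16:", to_fp16(0.1))
print("d[0][0]:", d[0][0])
```

The example shows that 0.1 is not exactly representable in FP16, so the rounding of the inputs is the main source of error, while the accumulation itself loses nothing further in this small case.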

There are several use cases where the Tensor Cores can be utilized on the V100 GPUs of Traverse. In general, these are algorithms built on Level 3 BLAS routines. In almost all cases the user needs to explicitly take action to use the Tensor Cores.

The NVIDIA Apex library allows for automatic mixed-precision (AMP) training and distributed training of neural networks. It is included with an installation of PyTorch from WML-CE. To see the performance benefit of the Tensor Cores, download the dcgan example and run it with and without using the Tensor Cores. Using 16 hardware threads one finds a speed-up of about 10%. Note that to use the fp16 kernels the dimension of each matrix must be a multiple of 8. Read about the constraints here.
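One common way to meet the multiple-of-8 constraint is to round matrix or layer dimensions upward before building the model. The Python helper below is a minimal sketch of that idea; pad_to_multiple is a hypothetical name and is not part of Apex or PyTorch.

```python
def pad_to_multiple(n, multiple=8):
    """Round n up to the nearest multiple (8 for the FP16 kernels)."""
    if n <= 0:
        raise ValueError("dimension must be positive")
    return ((n + multiple - 1) // multiple) * multiple


for dim in (8, 100, 1000, 1023):
    print(dim, "->", pad_to_multiple(dim))
```

For example, a hidden layer of width 100 would be widened to 104, while 1000 already satisfies the constraint.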

Another example using Fortran is here. There are algorithms in the MAGMA library (discussed below) that can utilize the Tensor Cores of V100 GPUs. Mixed precision Krylov and Multigrid solvers have also been developed, as discussed in this presentation.

NVIDIA has introduced a larger number and different types of Tensor Cores in the A100 GPU. Additionally, in many cases the Tensor Cores are automatically used and many of the constraints have been relaxed. There are no Tensor Cores on the P100 GPUs on TigerGPU.
 

Storage

There are two locations where you can store your files: /home and /scratch/gpfs. Home directories are on an NFS filesystem and are backed up. /scratch/gpfs is a high-performance GPFS parallel filesystem where you should run your simulations, and it is NOT backed up. Home directories have a user quota of 10 GB (request an increase). Your space on /scratch/gpfs has a user quota of 500 GB. When your account on Traverse is created, a directory named /scratch/gpfs/<YourNetID> is created for you, in addition to your home directory.

For PPPL users: Please note that directories on Traverse are named /home/<username>, which differs from PPPL's conventions of /u/<username>.  If you have the full path to your home directory hard-coded in any scripts you plan on running on Traverse, please be sure to modify them to use this different path. It is best to use the environment variable $HOME to refer to your home directory, since that is much more portable.

GPFS stands for General Parallel File System. GPFS is a high-performance parallel filesystem that provides much higher IOPS and bandwidth than a non-parallel filesystem. Due to its parallel performance and larger quota, /scratch/gpfs should be used for all file I/O for jobs running on Traverse. That means all input data should be copied to /scratch/gpfs before you start your job, and your job should write all its results to /scratch/gpfs. When your job is done, results can be compressed and copied over to your home directory, and/or copied back to PPPL for long-term storage.

Each compute node has an NVMe (non-volatile memory express) drive with 3.2 TB capacity for fast reads and writes. These drives are local to each node. The path to the NVMe drive is /tmp. Note that all files written to /tmp are removed when the job finishes, so you must copy any files you need from /tmp to /scratch/gpfs before the job finishes. For more information on using the NVMe drives see this page. On Traverse, one should find that a 12 GB file can be copied from /scratch/gpfs to /tmp on a compute node in about 2 seconds. A much longer time is needed to copy the same file to /home. Note that you may not see fast writes to /tmp if your application writes in small chunks or even line-by-line. /tmp is an alias for /scratch. The only difference between the two is that if you write to /scratch then your files will not be deleted after the job finishes. This is not preferred. Please write to /tmp.
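The stage-in, compute, stage-out pattern described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name run_with_local_scratch and the file names are hypothetical, the "computation" is a placeholder, and the directories are passed as parameters so that the sketch can run anywhere. On Traverse, local_root would be /tmp and results_dir would be under /scratch/gpfs/<YourNetID>.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_local_scratch(input_file, results_dir, local_root):
    """Stage input to node-local storage, work there, copy results back."""
    workdir = Path(tempfile.mkdtemp(dir=local_root))
    try:
        # Stage in: copy the input onto the fast local drive.
        local_input = Path(shutil.copy2(input_file, workdir))
        # Placeholder for the real computation.
        local_output = workdir / "result.txt"
        local_output.write_text(local_input.read_text().upper())
        # Stage out: copy results off the local drive before the job ends.
        Path(results_dir).mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy2(local_output, results_dir))
    finally:
        # Clean up the local working directory.
        shutil.rmtree(workdir, ignore_errors=True)


# Demonstration using temporary directories in place of the real paths.
base = Path(tempfile.mkdtemp())
(base / "input.txt").write_text("hello traverse")
saved = run_with_local_scratch(base / "input.txt", base / "results", base)
print(saved.read_text())
```

Writing the output as one file and copying it once at the end also avoids the slow small-chunk write pattern mentioned above.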

 

Data Transfer

Currently, the best way to move a small amount of data back and forth between PPPL and Traverse is to use scp, rsync over ssh, or bbcp (official bbcp documentation, more user-friendly documentation). To transfer a large amount of data, a dedicated Globus endpoint is available. The name of the endpoint is "Princeton Traverse Scratch DTN" and, as its name implies, it has direct access to the data on the /scratch/gpfs filesystem on Traverse.

 

Software

The software environment on Traverse is very similar to the other Research Computing clusters such as Tiger. See the general documentation for Princeton University Research Computing.

If you find that you need software packages that are not installed on Traverse then please send a request via e-mail to cses@princeton.edu. Please note that commercial applications are not always available for the POWER architecture (e.g., MATLAB).

Anaconda Python

The Anaconda Python distribution should be used when working with Python on Traverse:

Python 3

$ module avail anaconda3
$ module load anaconda3/2020.7
$ python --version

Python 2

$ module load anaconda/2019.10
$ python --version

See this page for more information on using the Anaconda Python distribution on the Research Computing clusters. One may also consider installing Anaconda or Miniconda (see "Other Resources") for POWER9. There are many useful Anaconda packages in the IBM Watson Machine Learning Community Edition channel.

System Python

The system Python is available if needed. This can be useful for some tasks such as building codes:

$ python
-bash: python: command not found

$ python3
Python 3.6.8 (default, Dec  5 2019, 16:11:43) 
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

$ python2
Python 2.7.17 (default, Oct 30 2019, 17:39:41) 
[GCC 8.3.1 20190507 (Red Hat 8.3.1-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

In general, for scientific work one wants to use the Anaconda Python distribution which is described above.

Jupyter Notebooks

To run a Jupyter Notebook or JupyterLab on the Traverse head node follow the directions under "Running on Tigressdata" on this page while substituting "traverse" for "tigressdata". There are also directions for running on a compute node. If using a VPN is not an option then use the directions under "Avoiding Using a VPN from Off-Campus".

Machine Learning

There are many useful Anaconda packages in the IBM Watson Machine Learning Community Edition channel. Here is a partial list of popular packages:

Research Computing also maintains dedicated documentation for TensorFlow and PyTorch. As those pages note, if you need a newer version of the software found in the IBM WML-CE channel then consider using the early access channel: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access

To get started with RAPIDS, create a conda environment:

$ ssh <YourNetID>@traverse.princeton.edu
$ module load anaconda3
$ CHNL=https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda
$ conda create --name rapids-env --channel $CHNL cudf cuml
# accept the license agreement

Julia

To take advantage of the GPUs on Traverse, one needs the CUDA package for Julia, which requires Julia 1.3 or later. Here is a procedure to build version 1.5:

$ cd /scratch/gpfs/<YourNetID>

$ wget https://github.com/JuliaLang/julia/releases/download/v1.5.2/julia-1.5.2-full.tar.gz
$ tar xvf julia-1.5.2-full.tar.gz
$ cd julia-1.5.2

# create a Make.user file containing these 3 lines:
USE_BINARYBUILDER:=0
USE_BINARYBUILDER_LLVM:=1
JULIA_PRECOMPILE:=0

$ make
$ prefix=$HOME/local/julia/1.5.2  # or choose a different install location
$ make prefix=$prefix install
$ export PATH=$prefix/bin:$PATH

Other Software

Other software can be seen by running the module avail command. There is a small number of software packages in the /opt directory.

 

GPU Programming

There are several ways to write a code that will run on GPUs. Here is a list of the most widely used methods, from the easiest to implement to the most difficult (but most powerful). 

GPU-Accelerated Libraries

There are many "CUDA-based" libraries that can be used to run parts of your code on a GPU. It can be as simple as calling a library function or routine. Just make sure that the section of code that you put on the GPU is compute-intensive; otherwise you will not see a speedup. See the Numerical Libraries section below.

OpenACC

OpenACC is a "directive-based" programming model designed for scientists and engineers interested in porting their codes to a wide variety of heterogeneous HPC hardware platforms and architectures with significantly less programming effort than required with a low-level model (such as CUDA). In practice, OpenACC is mainly used for porting Fortran, C, and C++ codes to GPUs. Directives are special comments (Fortran) or preprocessor pragmas (C/C++) that instruct the compiler to generate GPU instructions for given sections of a code. The best OpenACC support comes from the PGI compilers and their successor, the NVIDIA HPC SDK. GCC also supports OpenACC but the GPU code is usually not as performant. Here is an example of OpenACC usage:

program laplace
  implicit none
  integer, parameter :: fp_kind=kind(1.0d0)
  integer, parameter :: n=1024, m=1024, iter_max=1000
  integer :: i, j, iter
  real(fp_kind), dimension (:,:), allocatable :: A, Anew
  real(fp_kind) :: tol=1.0e-6_fp_kind, error=1.0_fp_kind
  real(fp_kind) :: start_time, stop_time

  allocate ( A(0:n-1,0:m-1), Anew(0:n-1,0:m-1) )

  A    = 0.0_fp_kind
  Anew = 0.0_fp_kind

  ! Set B.C.
  A(0,:)    = 1.0_fp_kind
  Anew(0,:) = 1.0_fp_kind

  write(*,'(a,i5,a,i5,a)') 'Jacobi relaxation Calculation:', n, ' x', m, ' mesh'

  call cpu_time(start_time)

  iter=0

!$acc data copyin(Anew), copy(A)
  do while ( error .gt. tol .and. iter .lt. iter_max )
    error=0.0_fp_kind

!$acc kernels
    do j=1,m-2
      do i=1,n-2
        Anew(i,j) = 0.25_fp_kind * ( A(i+1,j  ) + A(i-1,j  ) + &
                                     A(i  ,j-1) + A(i  ,j+1) )
        error = max( error, abs(Anew(i,j)-A(i,j)) )
      end do
    end do
!$acc end kernels

    if (mod(iter,100).eq.0) write(*,'(i5,f10.6)') iter, error
    iter = iter + 1

!$acc kernels
    do j=1,m-2
      do i=1,n-2
        A(i,j) = Anew(i,j)
      end do
    end do
!$acc end kernels

  end do
!$acc end data

  call cpu_time(stop_time)
  write(*,'(a,f10.3,a)')  ' completed in ', stop_time-start_time, ' seconds'

  deallocate (A,Anew)
end program laplace

Here is a link to a short OpenACC tutorial.

CUDA

You should make every effort to take advantage of the GPU-enabled libraries from NVIDIA and other vendors to accelerate your code. If the APIs are too rigid and they do not fit your needs then consider OpenACC as described above. Finally, there is the option of writing custom GPU kernels from scratch in C++ or Fortran. See this workshop for an introduction to CUDA at Princeton.

 

Compilers

GCC

The system version of GCC is 8.3.1. For instance:

$ gcc --version
gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)

$ g++ --version
g++ (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)

$ gfortran --version
GNU Fortran (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)

A good starting point for GCC optimization flags on Traverse is:

$ gcc -Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG -o myprog myprog.c

Take a look at man gcc for more. While you should prefer the system version of GCC (8.3.1), in some cases it may be necessary to use an earlier version such as for codes that require older versions of the CUDA Toolkit and PGI compiler. The rh/devtoolset/7 environment module exists for this purpose:

$ module load rh/devtoolset/7
$ gcc --version
gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)

$ g++ --version
g++ (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)

$ gfortran --version
GNU Fortran (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)

 

NVIDIA HPC SDK

NVIDIA acquired PGI in 2013. There will be no future releases of PGI compilers. Instead NVIDIA offers the HPC SDK which is a collection of compilers, libraries and tools supporting multiple programming models on multicore CPUs and GPU nodes. See the documentation. The HPC SDK provides the recommended compilers for OpenACC codes.

  • nvcc is the C++ compiler for CUDA kernels
  • nvc is the C compiler
  • nvc++ is the C++ compiler
  • nvfortran is the Fortran compiler (it can generate GPU code automatically for the V100)

NVIDIA recommends using NVSHMEM instead of Open MPI when developing a parallel code for GPU nodes. One can also replace slow MPI calls with the appropriate routine in NVSHMEM.

Traverse provides the NVIDIA HPC SDK as a module in two variants:

$ module avail nvhpc
--- /opt/share/Modules/modulefiles ---
nvhpc-nocompiler/20.7  nvhpc/20.7

The V100 GPU has a compute capability of 7.0 so use these flags to compile a kernel:

$ module load nvhpc
$ nvcc -O3 -arch=sm_70 --use_fast_math -o mykernel mykernel.cu

To compile an OpenACC code written in C:

$ module load nvhpc
$ nvc -O3 -acc -gpu=cc70 -Minfo -Mneginfo -o laplace laplace2d.c

OpenACC codes can be found here in C and Fortran and a workshop is here.

Note that the CUDA Toolkit is included in the nvhpc module.

PGI

There will be no future releases of the PGI compilers. Users should favor their replacements which are available in the NVIDIA HPC SDK (see above). Here are the available PGI modules:

$ module avail pgi
--- /opt/share/Modules/modulefiles ---
pgi/19.5/64  pgi/19.9/64  pgi/20.4/64

Also, we've made changes to the CUDA Toolkit 10.2 to allow use with PGI 20.4. To do this you will have to add the following define flag to the nvcc compile line (e.g., for building PETSc one would do that by adding it to CUDAFLAGS):

-D_PGIC_PRINCETON_OVERRIDE_

 

IBM XL C/C++ and Fortran

We have the community edition (version 16.1.1) of the IBM XL compilers. This version was released in December 2018. While it offers several GPU features, it can only go as high as version 10.1 of CUDA. See this video for an overview.

Users should favor GCC or one of the other compilers over XL. A newer, paid version with additional optimizations is available.

To get started:

$ xlc --version
IBM XL C/C++ for Linux, V16.1.1 (Community Edition)
Version: 16.01.0001.0003

$ xlc -qhelp
$ xlC -qhelp
$ xlf -qhelp

A good starting point for XL optimization flags on Traverse is:

$ xlc -Ofast -qarch=pwr9 -qtune=pwr9 -qsimd=auto -DNDEBUG -o myprog myprog.c

For more on optimization see Code Optimization with IBM XL Compilers. This guide says: "VMX and VSX machine instructions can execute up to sixteen operations in parallel."

Intel

The Intel compilers cannot be used on the POWER architecture of Traverse.

 

MPI and CUDA-Aware MPI

In addition to traditional builds of the MPI library, Traverse also offers CUDA-aware MPI builds which allow for data on a GPU to be sent to another GPU without going through a CPU. According to NVIDIA, regular MPI implementations pass pointers to host memory, staging GPU buffers through host memory using cudaMemcpy. With CUDA-aware MPI, the MPI library can send and receive GPU buffers directly, without having to first stage them in host memory. To see the CUDA-aware MPI modules:

$ module avail openmpi/cuda

A simple code that uses CUDA-aware MPI is here. In the figure below, RDMA is remote direct memory access.

 

Numerical Libraries

NVIDIA Math

NVIDIA offers a range of GPU-accelerated math libraries. They are designed as drop-in replacements for commonly used CPU-only libraries making it easy to incorporate them in your code. Here is a list of selected libraries:

  • cuDNN - GPU-accelerated library of primitives for deep neural networks
  • cuBLAS - GPU-accelerated standard BLAS library
  • cuSPARSE - GPU-accelerated BLAS for sparse matrices
  • cuRAND - GPU-accelerated random number generation (RNG)
  • cuSOLVER - Dense and sparse direct solvers for computer vision, CFD and other applications
  • cuFFT - GPU-accelerated library for Fast Fourier Transforms
  • NPP - GPU-accelerated image, video, and signal processing functions
  • NCCL - Collective Communications Library for scaling apps across multiple GPUs and nodes
  • nvGRAPH - GPU-accelerated library for graph analytics

For the complete list see GPU libraries by NVIDIA. You can inspect the available libraries for a given CUDA Toolkit version like so:

$ ls -lL /usr/local/cuda-11.1/lib64/lib*.so

Note that NVIDIA moved cuBLAS out of the toolkit in version 10 (to /usr/lib64 on our system) and then moved it back in for version 11.

MAGMA

MAGMA is a linear algebra library that is designed for multicore nodes with GPUs. It can be thought of as an improvement over LAPACK for such nodes. MAGMA is capable of using the Tensor Cores of the V100 GPUs of Traverse.

MAGMA is available on Anaconda Cloud from the IBM WML-CE channel.

Here is a sample build from source of MAGMA on Traverse:

$ ssh traverse
$ cd software
$ wget http://icl.utk.edu/projectsfiles/magma/downloads/magma-2.5.3.tar.gz
$ tar zxf magma-2.5.3.tar.gz
$ cd magma-2.5.3
$ wget https://raw.githubusercontent.com/jdh4/advanced_traverse/master/numerical_libraries/make.inc
$ module load cudatoolkit/10.2
$ export CUDADIR=/usr/local/cuda-10.2
$ make
$ make install prefix=$HOME/software/magma

IBM Engineering and Scientific Subroutine Library (ESSL)

ESSL is a numerical library for linear algebra, eigensystem analysis, Fourier transforms, convolutions and correlations, sorting and searching, interpolation, numerical quadrature and random number generation. With respect to its linear algebra routines, ESSL is not a full implementation of BLAS/LAPACK.

The header files and libraries are here:

$ ls -lL /usr/include/*essl*
-rw-r--r--. 1 bin bin 171727 Feb 24  2018 /usr/include/essl.h
-rw-r--r--. 1 bin bin   4187 Jun  3  2016 /usr/include/essl_lapacke_config.h
-rw-r--r--. 1 bin bin  64882 Jan 16  2018 /usr/include/essl_lapacke.h

$ ls -lL /usr/lib64/*essl*.so
-rw-r--r--. 1 bin bin 45719787 Mar 29  2018 /usr/lib64/libessl6464.so
-rw-r--r--. 1 bin bin 53379191 Mar 29  2018 /usr/lib64/libesslsmp6464.so
-rw-r--r--. 1 bin bin 54737430 Mar 29  2018 /usr/lib64/libesslsmpcuda.so
-rw-r--r--. 1 bin bin 53925425 Mar 29  2018 /usr/lib64/libesslsmp.so
-rw-r--r--. 1 bin bin 46826939 Mar 29  2018 /usr/lib64/libessl.so

The functionality that can run on GPUs using libesslsmpcuda.so is listed on this page.

According to the PETSc installation page:

"Sadly, IBM's ESSL does not have all the routines of BLAS and LAPACK that some packages, such as SuperLU expect; in particular slamch, dlamch and xerbla. In this case instead of using ESSL we suggest --download-fblaslapack."

PETSc

There are example builds of PETSc on Traverse on this page.

 

Debuggers and Profilers

For debugging Python scripts one can use PDB or an IDE like PyCharm. GDB and DDT are also available. For DDT see this page for examples as well as the RC website page.

For profiling there is gprof and MAP. For GPU codes, the NVIDIA toolkit provides nsys and nv-nsight-cu-cli. To view the output you will need to move the report file to an x86_64 machine to use nsight-sys or nv-nsight-cu (documentation).

Our Arm Forge license includes the DDT parallel debugger and the MAP profiler. Both of these tools can be used for parallel codes that use GPUs. Each can be used on up to 512 processes across all users at once.

 

Environment Modules

Traverse uses environment modules to manage software packages and simplify environment setup. To see which modules are available, use the following command:

$ module avail

Automatic Module Substitution

To ease the transition from RHEL7 to RHEL8, a request for a RHEL7 module is automatically replaced with the appropriate RHEL8 module. For instance:

$ module load openmpi/gcc/3.1.4/64
$ module list
Currently Loaded Modulefiles:
 1) openmpi/gcc/4.0.4/64

Any module that ends in "@" in the output of module avail is an alias and it will be replaced when loaded.
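To pick the aliases out of a long listing, one can filter the output of module avail for names ending in "@". The sketch below assumes a helper named list_aliases (hypothetical, not a site-provided command) and assumes that module avail writes its listing to stderr, as Environment Modules traditionally does. The sample input line is illustrative only.

```shell
#!/bin/bash
# list_aliases: hypothetical helper. Reads a module listing on stdin and
# prints only the names that end in "@" (aliases that are replaced on load).
list_aliases() {
    # Put each whitespace-separated name on its own line, then keep aliases.
    tr -s ' \t' '\n\n' | grep '@$'
}

# Illustrative input. On Traverse one would instead run:
#   module avail 2>&1 | list_aliases
printf 'openmpi/gcc/3.1.4/64@   openmpi/gcc/4.0.4/64\n' | list_aliases
```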

The university uses a 'flat' hierarchy so that you can see all the modules that are available all at once with the module avail command. To make it clear what library or application modules go with what compiler or MPI implementation, their modules have multiple levels of version information.

When loading modules in this environment, be sure to specify the complete module name. Otherwise, you may not get the module you hoped for. For example, if you load a compiler module, and then just say module load openmpi, you may not get the correct version of Open MPI for your compiler.

Modules are added as needed. You should always check for the latest versions available by using the module avail command.

 

Scheduler Policies

For PPPL users: Traverse uses the Slurm scheduler, which we have been using at PPPL since January of 2017. The main differences between Slurm on Traverse and the other PPPL clusters are the job limits and the availability of GPUs on all the nodes. Slurm implements limits through a concept named QOS (Quality of Service). Run the command below to see each QOS, its priority (the higher the number, the higher the priority), and its limits:

$ qos

To see the time limits on the different queues look at the "Job Scheduling (QOS parameters)" section on this page near the top.

Some notes about the terms used above:

  • maxTRESPU = Max number of trackable resources (TRES) used by that QOS. Using traverse-short as an example, all of the CPU-cores simultaneously in use in QOS traverse-short cannot exceed 5888 CPU-cores, regardless of how jobs are being run, or by how many different users.
  • MaxCPUsPU = The maximum number of trackable resources a single user can use at one time in that QOS. Using traverse-short as an example, no single user can use more than 588 CPU-cores at a time, regardless of how many jobs are running in that user's name in that QOS.
  • MaxJobsPU = The maximum number of jobs that one user can have running at once in that QOS. For example, in traverse-medium, one user can have 10 jobs running at once, but only 2 in traverse-test.

 

Submitting Jobs

For PPPL users: Since Traverse uses Slurm, running jobs is very similar to running jobs on PPPL's existing clusters. There are several important differences:

  1. You must specify your account as 'pppl' with the -A switch. This is not the same type of account as a user account. This is more like a 'bank' account that is used to keep track of which groups are using the cluster and how much. In this case, all PPPLers are in the same account, 'pppl'.
  2. You do not need to specify which QOS you want to use. The job submission filter looks at job size and time limit and automatically assigns the appropriate QOS based on the resource requirements of the job.
  3. Note that when running the sbatch command to submit your job, your environment may not be fully exported. Because of this, you must load the required modules for the job in the Slurm script. This is also true for interactive allocations with salloc. That is, when you land on the compute node you must reload your modules.
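The three points above can be combined into a minimal job script. The sketch below writes a file named job.slurm (a hypothetical name) that sets the account, leaves the QOS to the submission filter, and loads its modules inside the script. The executable ./my_app is a placeholder, and the module name is the one used elsewhere on this page.

```shell
#!/bin/bash
# Write a minimal Slurm script for a PPPL user (the file name is illustrative).
cat > job.slurm <<'EOF'
#!/bin/bash
# Account: equivalent to passing "-A pppl" on the command line.
#SBATCH --account=pppl
#SBATCH --nodes=1
#SBATCH --ntasks=1
# No QOS is requested; the submission filter assigns one.
#SBATCH --time=00:10:00

# The submission environment may not be exported, so load modules here.
module purge
module load openmpi/gcc/4.0.4/64

# Placeholder executable.
srun ./my_app
EOF

echo "Submit with: sbatch job.slurm"
```

For an interactive allocation with salloc, the same module commands would be run by hand after landing on the compute node.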

EXAMPLE SLURM SCRIPTS

See example Slurm scripts for running jobs on Traverse and the other Princeton HPC clusters. Below are a few examples specific to Traverse.

The Slurm script below runs 4 MPI tasks on a single node (--ntasks-per-node=4), and 8 OpenMP threads per task (export OMP_NUM_THREADS=8), with 1 thread per physical core (export OMP_PLACES=cores). Each MPI task is also bound to 1 GPU (--gpus-per-task=1), the one that is the closest (--gpu-bind=map_gpu:0,1,2,3). Note that each POWER9 socket (processor) has 16 physical cores and 4 hardware threads per core, so the number of hardware threads per socket is 64, and the total number per node is 128. In Slurm, a "cpu" corresponds to a hardware thread, so in order for OMP_PLACES=cores to work we need to let Slurm know how many "cpus" are available to each task. Since we have 4 MPI tasks per node (2 per socket), the number of cpus per task is 128 / 4 = 32 (--cpus-per-task=32).

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=2
#SBATCH --cpus-per-task=32
#SBATCH --gpu-bind=map_gpu:0,1,2,3
#SBATCH --gpus-per-task=1
#SBATCH --time=00:30:00

module purge
module load cudatoolkit/11.1
module load openmpi/gcc/4.0.4/64
export OMP_NUM_THREADS=8   # 8 OpenMP threads per MPI task
export OMP_PLACES=cores    # place one thread on each physical core

srun ./hello_jsrun_new | sort >& job_4MPI_8_OMP_1thread_per_core.out
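
The values of --cpus-per-task and OMP_NUM_THREADS used in the script above follow from the node layout. As a rough sketch, the arithmetic can be written out in shell. The variable names are illustrative, and the figure of 2 sockets per node is inferred from 32 physical cores per node at 16 per socket.

```shell
#!/bin/bash
# Node layout as described on this page (variable names are illustrative).
sockets_per_node=2
cores_per_socket=16
threads_per_core=4
tasks_per_node=4

# Slurm counts each hardware thread as one "cpu".
hw_threads_per_node=$(( sockets_per_node * cores_per_socket * threads_per_core ))
cpus_per_task=$(( hw_threads_per_node / tasks_per_node ))

# With OMP_PLACES=cores, one OpenMP thread runs on each physical core of the task.
cores_per_task=$(( cpus_per_task / threads_per_core ))

echo "--cpus-per-task=${cpus_per_task}"
echo "OMP_NUM_THREADS=${cores_per_task}"
```

Changing tasks_per_node to another divisor of the core count gives the matching pair of values for a different MPI/OpenMP split.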

Reservation Queue

Four nodes have been set aside during normal working hours for test jobs. To qualify for this queue your job must run for less than an hour, use 4 nodes or fewer, and you must have the following directive in your Slurm script:

#SBATCH --reservation=test

The above also holds true for interactive allocations through the salloc command (for more see this page).

Simultaneous multithreading

Below is a schematic diagram of a single node of Traverse:

Recall that there are 16 physical cores per CPU on Traverse with each core supporting up to 4 hardware threads. The ability to run multiple threads of execution per core is called Simultaneous Multithreading (SMT). The hardware threads within a core share resources such as the L1 cache. For this reason it is often found that using only 1 or 2 hardware threads per core leads to optimal performance.

Let's say that you want to run an MPI application using all 32 cores per node with only one process per core (i.e., 1 hardware thread per core). The following directives may be used for this case:

#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-core=1

To verify that only a single thread is being used per core you can ssh to the compute node where the job is running and run the htop command.

Note that optimal values of nodes, ntasks, cpus-per-task and ntasks-per-core must be determined empirically for each code. One should construct a table and carry out the appropriate benchmark runs to determine these values.
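One way to organize such benchmark runs is to generate one submission per SMT setting and record the timings in a table. The sketch below only prints the sbatch commands rather than submitting them. smt_sweep and job.slurm are hypothetical names, and scaling the task count with the number of tasks per core is an assumption about how one might set up the comparison, not a site recommendation. Options given on the sbatch command line take precedence over #SBATCH directives in the script.

```shell
#!/bin/bash
# smt_sweep: hypothetical helper that prints (does not run) one sbatch
# command for each number of tasks per physical core.
smt_sweep() {
    cores_per_node=32   # physical cores per Traverse node
    for tpc in 1 2 4; do
        echo "sbatch --nodes=1 --ntasks=$(( cores_per_node * tpc ))" \
             "--cpus-per-task=1 --ntasks-per-core=${tpc}" \
             "--job-name=smt${tpc} job.slurm"
    done
}

smt_sweep
```

After reviewing the printed commands, they can be run by piping the output to a shell, for example smt_sweep | bash.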

CUDA Multi-Process Service (MPS)

Certain MPI codes that use GPUs may benefit from CUDA MPS (see ORNL docs), which enables multiple processes to concurrently share the resources on a single GPU. To use MPS simply add this directive to your Slurm script:

#SBATCH --gpu-mps

In most cases users will see no speed-up. Codes where the individual MPI processes underutilize the GPU should see a performance gain.

 

TurboVNC for Graphical Applications

If you need to use graphical applications on the Traverse head node such as DDT, MAP or an IDE then consider using TurboVNC. TurboVNC is based on VNC, which has many performance advantages over X11 forwarding (i.e., ssh -X). Begin by reading this page while substituting "traverse" for "tigressdata". Be sure to use the shell functions at the bottom of the page to quickly set up a TurboVNC session.

 

Getting Help

User support for Traverse is shared between PPPL and Princeton University. For that reason, do not open tickets in the PPPL "Service Now" system for issues regarding Traverse. Instead, send an email describing your problem to cses@princeton.edu. This will automatically create a ticket that will be seen by both the PPPL and Princeton University Research Computing support staff.