Traverse

The Traverse cluster is composed of 46 IBM POWER9 nodes with four NVIDIA V100 GPUs per node. The cluster is primarily intended to support research at the Princeton Plasma Physics Lab (PPPL). Traverse is also available to Princeton researchers whose work is particularly suited to the architecture of this system, either because it closely resembles the Summit cluster at Oak Ridge National Laboratory or because the application can take particular advantage of the NVLink interconnect, which speeds up access between GPU memory and CPU memory. Programs that move large amounts of data in or out of the GPUs should see an especially large speedup.
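
As a rough illustration of that kind of data movement, the minimal CUDA sketch below times a single large host-to-device copy; the 1 GB buffer size and the use of pinned host memory are arbitrary choices for the example. Compile it with nvcc and run it on a compute node through the scheduler (not on the head node); on Traverse the copy travels over NVLink rather than PCIe.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t n = size_t(1) << 28;            // ~268 million floats, about 1 GB
        float *host, *dev;
        cudaMallocHost(&host, n * sizeof(float));    // pinned host memory
        cudaMalloc(&dev, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Time one host-to-device copy and report the effective bandwidth.
        cudaEventRecord(start);
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Host-to-device bandwidth: %.1f GB/s\n",
               (n * sizeof(float) / 1e9) / (ms / 1e3));

        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }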

System Configuration and Usage

General Guidelines

The system head node, traverse, should be used for interactive work only, such as compiling programs and submitting jobs as described below. No jobs should be run on the head node other than brief tests that last no more than a few minutes. Where practical, we ask that you entirely fill the nodes so that CPU core fragmentation is minimized.

Please remember that these are shared resources for all users.

Running Jobs

All jobs must be run through the Slurm scheduler.

Maintenance Window

Traverse will be down for maintenance on the second Tuesday of each month from 6 AM to 2 PM.

Hardware Configuration

Traverse (IBM Linux Cluster)

  Processor:                2.7 GHz IBM POWER9
  Nodes:                    46
  Cores per node:           32 physical (128 SMT threads)
  Memory per node:          256 GB
  Total cores:              1472 physical (5888 SMT threads)
  Node interconnect:        EDR InfiniBand
  GPUs:                     4 NVIDIA V100 SXM2 (1530 MHz) per node, 32 GB per GPU, 184 GPUs total
  GPU interconnect:         NVLink
  Theoretical performance:  1435 TFLOPS

Distribution of CPU and Memory

There are 5888 SMT hardware threads available, 128 per node. Each node contains 256 GB of memory. The nodes are grouped 12 to a rack, with a 1:1 EDR InfiniBand connection within each rack and 2:1 oversubscription between racks. Each node has NVLink-connected GPUs, two per CPU socket. EDR InfiniBand is connected at full speed to each CPU socket over PCIe Gen4.
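
One way to inspect this layout from within a job is to enumerate the visible GPUs and check which pairs can access each other's memory directly; the minimal CUDA sketch below does this. The output depends on how many GPUs Slurm assigns to the job, so the device count and peer-access pattern shown are not fixed.

    #include <cuda_runtime.h>
    #include <cstdio>

    // List the GPUs visible to this job and report which pairs support
    // direct peer-to-peer access (over NVLink on Traverse).
    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);
        printf("GPUs visible: %d\n", n);

        for (int i = 0; i < n; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("GPU %d: %s, %.0f GB\n", i, prop.name, prop.totalGlobalMem / 1e9);
            for (int j = 0; j < n; ++j) {
                if (i == j) continue;
                int peer = 0;
                cudaDeviceCanAccessPeer(&peer, i, j);
                printf("  peer access %d -> %d: %s\n", i, j, peer ? "yes" : "no");
            }
        }
        return 0;
    }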

SMT stands for simultaneous multithreading. Each node of Traverse has 2 CPUs with 16 physical cores per CPU, and each physical core supports 4 hardware threads (SMT4). Hence SMT provides 2 x 16 x 4 = 128 threads per node.

The V100 GPUs each have 640 Tensor Cores (8 per streaming multiprocessor), which carry out half-precision Warp Matrix Multiply-Accumulate (WMMA) operations. That is, each Tensor Core multiplies two 4 x 4 half-precision matrices and adds the result to a third matrix held in single precision. This is especially useful for training and inference on deep neural networks.
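
At the programming level, the Tensor Cores are exposed per warp as tile operations (for example 16 x 16 x 16) through CUDA's WMMA API. In the minimal sketch below, a single warp multiplies two 16 x 16 half-precision matrices and accumulates the result in FP32; the all-ones input values are arbitrary and serve only to make the result easy to check. Compile with nvcc -arch=sm_70 and run on a GPU node.

    #include <mma.h>
    #include <cuda_fp16.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    using namespace nvcuda;

    // One warp computes D = A * B on the Tensor Cores, with A and B in
    // half precision and the accumulator in single precision.
    __global__ void wmma_16x16x16(const half *a, const half *b, float *d) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

        wmma::fill_fragment(acc, 0.0f);              // start from C = 0
        wmma::load_matrix_sync(a_frag, a, 16);       // leading dimension 16
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(acc, a_frag, b_frag, acc);    // Tensor Core multiply-accumulate
        wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
    }

    int main() {
        std::vector<half> ha(256, __float2half(1.0f)), hb(256, __float2half(1.0f));
        half *da, *db; float *dd;
        cudaMalloc(&da, 256 * sizeof(half));
        cudaMalloc(&db, 256 * sizeof(half));
        cudaMalloc(&dd, 256 * sizeof(float));
        cudaMemcpy(da, ha.data(), 256 * sizeof(half), cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb.data(), 256 * sizeof(half), cudaMemcpyHostToDevice);

        wmma_16x16x16<<<1, 32>>>(da, db, dd);        // one warp of 32 threads
        float out[256];
        cudaMemcpy(out, dd, 256 * sizeof(float), cudaMemcpyDeviceToHost);
        printf("D[0][0] = %.1f (expect 16.0 for all-ones inputs)\n", out[0]);

        cudaFree(da); cudaFree(db); cudaFree(dd);
        return 0;
    }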

The nodes are all connected through InfiniBand switches for MPI traffic, the /scratch/gpfs file system, and NFS. Both /tigress and /projects are connected over NFS. Gigabit Ethernet is used for other communication.

Job Scheduling (QOS parameters)

All jobs must be run through the scheduler on Traverse.

Jobs are prioritized by the Slurm scheduler based on a number of factors: job size, run time, node availability, wait time, and percentage of usage over a 30-day period, as well as a fairshare mechanism that provides access for large contributors. The policy below may change as the job mix on the machine changes.

Jobs will be assigned to the test, short, medium, or long quality of service (QOS) by the scheduler. The QOS levels are differentiated by the wallclock time requested, as follows:

QOS      Time Limit            Jobs per User   Cores per Job   Cores Available
test     15 minutes            2 jobs          no limit        no limit
short    4 hours               10 jobs         no limit        no limit
medium   24 hours              6 jobs          3072 cores      5888 cores / 120 GPUs
long     144 hours (6 days)    4 jobs          no limit        2944 cores / 92 GPUs

In most cases these are the maximum values; additional limits may be imposed if demand requires. Use the "qos" command to view the actual values in effect.

There is also a special system reservation of 4 nodes available for the "pppl" group. It is in effect from 9 AM to 5 PM on weekdays and should be used for quick testing of code, not for production runs. To use it, add the --reservation=test flag to your job script.

Recommended File System Usage (/home, /scratch, /tigress)

/home (shared via NFS to all the compute nodes) is intended for scripts, source code, executables and small static data sets that may be needed as standard input/configuration for codes.

/scratch/gpfs is intended for dynamic data that requires higher-bandwidth I/O. Files are NOT backed up, so this data should be moved to persistent storage as soon as it is no longer needed for computations. The file system is purged nightly of files older than 180 days.

/tigress and /projects (shared using GPFS) are intended for more persistent storage. Users are provided with a default quota of 512 GB when they request a directory in this storage, and that default can be increased on request. We ask that you consider what you really need and regularly clean out data that is no longer needed, since this file system is shared by the users of all our systems. See the /tigress Usage Guidelines for more information.

/tmp (local to each compute node; 1.8 TB available on each node) is intended for data local to each task of a job. It will be cleaned out at the end of each job. This is the fastest storage to access.