The Tiger cluster has two parts: tigercpu, an HPE Apollo cluster comprising 408 Intel Skylake CPU nodes, and tigergpu, a Dell compute cluster comprising 320 NVIDIA P100 GPUs across 80 Broadwell nodes.

On tigercpu, each CPU core has at least 4.8 GB of memory. The 40-core nodes are interconnected by an Omnipath fabric with oversubscription: the 24 nodes within each chassis are connected at full bandwidth, while traffic between chassis is oversubscribed.

On tigergpu, each GPU has 16 GB of memory. The nodes are interconnected by an Intel Omnipath fabric, and each GPU is on a dedicated x16 PCIe bus. Every node has 2.9 TB of NVMe-connected scratch storage and 256 GB of RAM. The CPUs are Intel Broadwell E5-2680 v4, with 28 cores per node. A dashboard of GPU utilization is also available.

System Configuration and Usage

General Guidelines

The head nodes, tigercpu and tigergpu, should be used for interactive work only, such as compiling programs and submitting jobs as described below. No jobs should be run on the head nodes other than brief tests lasting no more than a few minutes. Where practical, we ask that you fill nodes entirely so that CPU core fragmentation is minimized.
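As a sketch of filling a node completely, a Slurm batch script for tigercpu might request all 40 cores of one node; the job name and program below are placeholders, not part of this guide:

```shell
#!/bin/bash
# Sketch of a full-node request on tigercpu (assumes Slurm; executable is a placeholder)
#SBATCH --job-name=fullnode       # arbitrary job name
#SBATCH --nodes=1                 # one whole node
#SBATCH --ntasks-per-node=40      # all 40 cores, avoiding core fragmentation
#SBATCH --time=00:30:00           # requested wall-clock time

srun ./my_program                 # placeholder executable
```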

Please remember that these are shared resources for all users.

Running Jobs

Jobs can be submitted for either portion of the Tiger system from either head node, but it is best to compile programs on the head node associated with the portion of the system where the program will run.  That is, compile GPU jobs on tigergpu and non-GPU jobs on tigercpu.  Running a job on the GPU nodes requires additional specifications in the job script.  Refer to the Tiger Tutorial for instructions and examples.
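The Tiger Tutorial is the authoritative source for those examples; as a rough sketch, the extra specification for a GPU job is typically a Slurm GRES request. Everything named below (module name, executable) is an assumption to be checked against the tutorial:

```shell
#!/bin/bash
# Sketch of a tigergpu job script (assumes Slurm with GRES-configured GPUs)
#SBATCH --job-name=gpu-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1              # request one of the node's four P100s
#SBATCH --time=01:00:00

module load cudatoolkit           # module name is an assumption; check `module avail`
srun ./my_gpu_program             # placeholder executable
```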

Maintenance Window

Tiger will be down for maintenance the second Tuesday of the month from 6-10 AM.


Schematic diagram of the Tiger cluster, and schematics of a CPU node and a GPU node

How to Word an Acknowledgement of Support and/or Use of Research Computing Resources in Publications

"The author(s) are pleased to acknowledge that the work reported on in this paper was substantially performed using the Princeton Research Computing resources at Princeton University, which is a consortium of groups led by the Princeton Institute for Computational Science and Engineering (PICSciE) and the Office of Information Technology's Research Computing."

"The simulations presented in this article were performed on computational resources managed and supported by Princeton Research Computing, a consortium of groups including the Princeton Institute for Computational Science and Engineering (PICSciE) and the Office of Information Technology's High Performance Computing Center and Visualization Laboratory at Princeton University."

Hardware Configuration

System                                    Nodes  Cores per Node  Memory per Node  Total Cores  Interconnect  Performance
Dell Linux Cluster
(2.4 GHz Xeon Broadwell E5-2680 v4)       80     28              256 GB           2240         Omnipath      1500 TFLOPS
  GPUs: 1328 MHz P100, 4 per node,
  16 GB per GPU                           -      -               -                320 GPUs     Omnipath      1504 TFLOPS
HPE Linux Cluster (2.4 GHz Skylake)       408    40              192 GB           16320        Omnipath      xx TFLOPS

Distribution of CPU and memory

There are 16,320 processors available, 40 per node. Each node contains at least 192 GB of memory (4.8 GB per core). The nodes are assembled into 24-node chassis; within each chassis the Omnipath connection is 1:1, while between chassis it is oversubscribed 2:1.
There are also 40 nodes with 768 GB of memory (19 GB per core). These large-memory nodes also have SSD drives for faster local I/O.
The nodes are all connected through Omnipath switches for MPI traffic and for GPFS and NFS I/O, and over Gigabit Ethernet for other communication.


Job Scheduling (QOS parameters)

The values in the table below may not be accurate since changes are made regularly to maximize job throughput. Use the "qos" command to see the currently active values.
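If the site-specific "qos" wrapper is unavailable, the underlying limits can usually be read with Slurm's sacctmgr. The field names below are standard Slurm, but which limits Tiger actually configures is an assumption; output will vary:

```shell
# List wall-time and per-user limits for each QOS (assumes Slurm accounting is enabled)
sacctmgr show qos format=Name%10,MaxWall,MaxJobsPU,MaxTRESPU%20
```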

Jobs are prioritized by the Slurm scheduler based on a number of factors: job size, run time, node availability, wait time, and percentage of usage over a 30-day period, as well as a fairshare mechanism that provides access for large contributors. The policy below may change as the job mix on the machine changes.

The scheduler assigns jobs to the test, vshort, short, medium, or long quality of service (QOS), differentiated by the wall-clock time requested as follows:

QOS     Time Limit          Jobs per User  Cores per Job  Cores Available
test    1 hour              2              no limit       15560
vshort  6 hours             64             no limit       no limit
short   24 hours            32             4000           13360
medium  72 hours            16             2000           7840
long    144 hours (6 days)  8              1000           5520
In most cases, these are the maximum numbers and limits may be changed if demand requires. Use the "qos" command to view the actual values in effect.

Recommended File System Usage (/home, /scratch, /tigress)

/home (shared via NFS to all the compute nodes) is intended for scripts, source code, executables and small static data sets that may be needed as standard input/configuration for codes.

/scratch/gpfs is intended for dynamic data that requires higher-bandwidth I/O. Files are NOT backed up, so this data should be moved to persistent storage as soon as it is no longer needed for computations. Please remove files on /scratch/gpfs that you no longer need.
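As a sketch of that cleanup, finished results could be copied from /scratch/gpfs to persistent storage and then removed; the <YourNetID> and run-directory paths below are placeholders, not a documented layout:

```shell
# Move finished results from unbacked-up scratch to persistent /tigress storage.
# <YourNetID> and run01 are placeholders for your own directories.
rsync -av /scratch/gpfs/<YourNetID>/run01/ /tigress/<YourNetID>/run01/
rm -r /scratch/gpfs/<YourNetID>/run01      # only after verifying the copy succeeded
```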
/tigress (shared using GPFS) is intended for more persistent storage and should provide high-bandwidth I/O (8 GB/s aggregate bandwidth for jobs across 16 or more nodes). Users are given a default quota of 512 GB when they request a directory in this storage, and that default can be increased on request. We ask users to consider what they really need and to regularly clean out data that is no longer needed, since this filesystem is shared by the users of all our systems. See the /tigress Usage Guidelines for more information.
/scratch (local to each compute node; 1.8 TB available on each node) is intended for data local to each task of a job, and it should be cleaned out at the end of each job. This is the fastest storage to access. Note that these scratch directories are purged nightly of files older than 30 days. You can also use /tmp instead of /scratch, which draws on the same space; the advantage is that /tmp is removed automatically when your job completes, whereas /scratch must be cleaned manually.
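Putting this together, a job can stage its working files through node-local /scratch and clean up on exit. The per-job directory name, input/output files, and executable below are all assumptions for illustration:

```shell
#!/bin/bash
# Sketch: stage data through node-local /scratch and clean up afterwards (assumes Slurm)
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

WORKDIR=/scratch/$USER.$SLURM_JOB_ID    # per-job directory name is an assumption
mkdir -p "$WORKDIR"
trap 'rm -rf "$WORKDIR"' EXIT           # remove local scratch even if the job fails

cp input.dat "$WORKDIR"/                # placeholder input file
cd "$WORKDIR"
srun ./my_program input.dat             # placeholder executable
cp output.dat "$SLURM_SUBMIT_DIR"/      # copy results back before cleanup
```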