The Tiger cluster is one of Princeton's most powerful clusters. It is intended for parallel jobs; all other types of jobs are given low priority here.
Some Technical Specifications:
The Tiger cluster has two parts: tigercpu and tigergpu.
The tigercpu part is an HPE Apollo cluster composed of 408 Intel Skylake CPU nodes. Each CPU core has at least 4.8 GB of memory. The 40-core nodes are interconnected by an Omnipath fabric with oversubscription. Each chassis holds 24 nodes, all connected at full bandwidth.
The tigergpu part is a Dell computer cluster with 320 NVIDIA P100 GPUs across 80 Broadwell nodes; each GPU has 16 GB of memory. The nodes are interconnected by an Intel Omnipath fabric, and each GPU is on a dedicated x16 PCI bus. Every node has 2.9 TB of NVMe-connected scratch and 256 GB of RAM. The CPUs are Intel Broadwell E5-2680 v4 with 28 cores per node. GPU utilization can be monitored with tools such as gpudash and nvidia-smi.
For more hardware details, see the Hardware Configuration information below.
How to Access the Tiger Cluster
To use the Tiger cluster you have to request an account and then log in through SSH.
- Requesting Access to Tiger
Access to the large clusters like Tiger is granted on the basis of brief faculty-sponsored proposals (see For large clusters: Submit a proposal or contribute).
If, however, you are part of a research group with a faculty member who has contributed to or has an approved project on Tiger, that faculty member can sponsor additional users by sending a request to email@example.com. Any non-Princeton user must be sponsored by a Princeton faculty or staff member for a Research Computer User (RCU) account.
- Logging into Tiger
Once you have been granted access to Tiger, you can connect by opening an SSH client and using the SSH command as displayed below. Use tigercpu for CPU-only usage, and use tigergpu for added GPU support.
To log into tigercpu (VPN required from off-campus):
$ ssh <YourNetID>@tiger.princeton.edu
To log into tigergpu (VPN required from off-campus):
$ ssh <YourNetID>@tigergpu.princeton.edu
For more on how to SSH, see the Knowledge Base article Secure Shell (SSH): Frequently Asked Questions (FAQ). If you have trouble connecting then see our SSH page.
How to Use the Tiger Cluster
Since Tiger is a Linux system, knowing some basic Linux commands is highly recommended. For an introduction to navigating a Linux system, view the material associated with our Intro to Linux Command Line workshop.
Using Tiger also requires some knowledge of the file system, the module system, and the scheduler that handles each user's jobs. For an introduction to navigating Princeton's High Performance Computing systems, view our Guide to Princeton's Research Computing Clusters. Additional information specific to Tiger's file system, priority for job scheduling, etc. can be found below.
Please remember that these are shared resources for all users.
The login nodes, tigercpu and tigergpu, should be used for interactive work only, such as compiling programs and submitting jobs as described below. No jobs should be run on the login nodes other than brief tests that last no more than a few minutes. Where practical, we ask that you entirely fill the compute nodes so that CPU core fragmentation is minimized.
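One way to follow the request to fill compute nodes entirely is to round a job's core count up to whole 40-core tigercpu nodes. The sketch below is only an illustration: the helper names `nodes_for_cores` and `full_node_flags` are hypothetical, and whether rounding up suits a given code depends on how that code scales. The `--nodes` and `--ntasks-per-node` options are standard Slurm options.

```shell
# Cores per tigercpu node, taken from the hardware description on this page.
CORES_PER_NODE=40

# Hypothetical helper: number of whole nodes needed for a given core count.
nodes_for_cores() {
    local cores=$1
    # Integer ceiling division: round up to the next whole node.
    echo $(( (cores + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# Hypothetical helper: Slurm flags that request only completely filled nodes.
full_node_flags() {
    local nodes
    nodes=$(nodes_for_cores "$1")
    echo "--nodes=${nodes} --ntasks-per-node=${CORES_PER_NODE}"
}

full_node_flags 100   # prints: --nodes=3 --ntasks-per-node=40
```

Note that a request for 100 cores becomes 3 full nodes (120 cores), so the job should be sized to use all of them.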
Jobs can be submitted for either portion of the Tiger system from either login node, but it is best to compile programs on the login node associated with the portion of the system where the program will run. That is, compile GPU jobs on tigergpu and non-GPU jobs on tigercpu. Running a job on the GPU nodes requires additional specifications in the job script. Refer to our Slurm page for instructions.
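As a rough illustration of the additional specifications a GPU job needs, the following is a minimal job script sketch, not the site's official template. It assumes the GPUs are requested with the standard Slurm option `--gres=gpu:N`; the job name and time limit are placeholders, and the Slurm page remains the authoritative reference.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-check     # hypothetical job name
#SBATCH --nodes=1                # one node
#SBATCH --ntasks=1               # one task
#SBATCH --cpus-per-task=1        # one CPU core for that task
#SBATCH --gres=gpu:1             # request one GPU on the node
#SBATCH --time=00:05:00          # wallclock limit (HH:MM:SS)

# Slurm typically exports CUDA_VISIBLE_DEVICES for jobs that request GPUs;
# fall back to "none" when the variable is not set.
gpu_list="${CUDA_VISIBLE_DEVICES:-none}"
echo "GPUs visible to this job: ${gpu_list}"

# Show the allocated GPU(s) if nvidia-smi is available on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi could not query a GPU"
else
    echo "nvidia-smi not found on this node"
fi
```

Saved under a name of your choice (for example `job.slurm`, a hypothetical filename), the script would be submitted with `sbatch job.slurm`.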
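Hardware Configuration
Dell Linux Cluster (tigergpu)
|Processor||Nodes||Cores per Node||Memory per Node||Total Cores||Interconnect||Performance|
|2.4 GHz Xeon Broadwell||80||28||256 GB||2240||Omnipath||86 TFLOPS|
|(GPU info)||1328 MHz P100||4 GPU/node||16 GB/GPU||320 GPUs||Omnipath||1504 TFLOPS|
HPE Linux Cluster (tigercpu)
|Processor||Nodes||Cores per Node||Memory per Node||Total Cores||Interconnect||Performance|
|2.4 GHz Skylake||408||40||192 GB or 768 GB||16320||Omnipath||>1103 TFLOPS|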
Distribution of CPU and memory
There are 16,320 processor cores available, 40 per node. Each node contains at least 192 GB of memory (4.8 GB per core). The nodes are assembled into 24-node chassis, where each chassis has a 1:1 Omnipath connection. Between chassis there is 2:1 oversubscription.
There are also 40 nodes with memory of 768 GB (19 GB per core). These larger memory nodes also have SSD drives for faster I/O locally.
The nodes are all connected through Omnipath switches for MPI traffic, GPFS, and NFS I/O, and over Gigabit Ethernet for other communication.
For more technical details, see the full version of the systems table.
Jobs are prioritized by the Slurm scheduler based on a number of factors: job size, run time, node availability, wait time, and percentage of usage over a 30-day period. A fairshare mechanism also provides access for large contributors. The policy below may change as the job mix changes on the machine.
Jobs will move to the test, vshort, short, medium, or long quality of service as determined by the scheduler. They are differentiated by the wallclock time requested as follows:
|QOS||Time Limit||Jobs per User||Cores per Job||Cores Available|
|tiger-vshort||5 hours||16||no limit||no limit|
|QOS||Time Limit||Jobs per User||GPUs per User|
|gpu-test||1 hour||2||no limit|
Note that the above numbers and limits may be changed if demand requires. Use the "qos" command to view the actual values in effect.
Tiger will be down for routine maintenance on the second Tuesday of every month from approximately 6 AM to 2 PM. This includes the associated filesystems of /scratch/gpfs, /projects and /tigress. Please mark your calendar. Jobs submitted close to downtime will remain in the queue unless they can be scheduled to finish before downtime (see more). Users will receive an email when the cluster is returned to service.
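To plan job submissions around the maintenance window, it can help to compute the date of the second Tuesday of a given month. The sketch below assumes GNU `date` (for the `-d` option), which is standard on Linux systems; the function name `second_tuesday` is hypothetical. It relies on the fact that the second Tuesday always falls between the 8th and the 14th.

```shell
# Hypothetical helper: print the second Tuesday of a month as YYYY-MM-DD.
# Usage: second_tuesday YEAR MONTH
second_tuesday() {
    local year=$1
    local month=$((10#$2))   # force base 10 so "08" and "09" are accepted
    local day candidate
    # The second Tuesday of any month lies between the 8th and the 14th.
    for day in 8 9 10 11 12 13 14; do
        candidate=$(printf '%04d-%02d-%02d' "$year" "$month" "$day")
        # %u prints the ISO weekday number: 1 = Monday, 2 = Tuesday, ...
        if [ "$(date -d "$candidate" +%u)" = "2" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

second_tuesday 2024 1   # prints: 2024-01-09
```

A job whose requested wallclock time would run past 6 AM on that date will wait in the queue until after the downtime, so requesting a shorter `--time` may let it start sooner.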
Filesystem Usage and Quotas
/home (shared via NFS to all the compute nodes) is intended for scripts, source code, executables and small static data sets that may be needed as standard input/configuration for codes.
/scratch/gpfs is intended for dynamic data that requires higher bandwidth I/O. Files are NOT backed up so this data should be moved to persistent storage as soon as it is no longer needed for computations. Please remove files on /scratch/gpfs that you no longer need.
/tigress (shared using GPFS) is intended for more persistent storage and should provide high bandwidth I/O (8 GB/s aggregate bandwidth for jobs across 16 or more nodes). Users are provided with a default quota of 512 GB when they request a directory in this storage, and that default can be increased by requesting more. We do ask people to consider what they really need, and to make sure they regularly clean out data that is no longer needed since this filesystem is shared by the users of all our systems. See /tigress Usage Guidelines for more information.
/tmp (local to each compute node) is intended for data local to each task of a job, and it should be cleaned out at the end of each job. This is the fastest storage for access.
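The guidance to clean out /tmp at the end of each job can be sketched as a small wrapper that runs a command inside a private directory under /tmp and removes that directory afterwards. This is an illustration only: the function name `run_in_local_tmp` and the directory naming are hypothetical. `SLURM_JOB_ID` is the environment variable Slurm sets inside a job; the sketch falls back to `manual` when it is unset.

```shell
# Hypothetical helper: run a command in a private node-local /tmp directory,
# then remove the directory and return the command's exit status.
run_in_local_tmp() {
    local workdir status
    # Create a unique working directory, tagged with the Slurm job ID if set.
    workdir=$(mktemp -d "/tmp/job_${SLURM_JOB_ID:-manual}_XXXXXX") || return 1
    # Run the command in a subshell so the caller's directory is unchanged.
    ( cd "$workdir" && "$@" )
    status=$?
    # Clean up the node-local data once the command has finished.
    rm -rf -- "$workdir"
    return "$status"
}
```

Because the directory is deleted when the command returns, the command itself must copy any results worth keeping to persistent storage such as /scratch/gpfs or /tigress before it exits.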
Wording of Acknowledgement of Support and/or Use of Research Computing Resources
"The author(s) are pleased to acknowledge that the work reported on in this paper was substantially performed using the Princeton Research Computing resources at Princeton University, which is a consortium of groups led by the Princeton Institute for Computational Science and Engineering (PICSciE) and Office of Information Technology's Research Computing."
"The simulations presented in this article were performed on computational resources managed and supported by Princeton Research Computing, a consortium of groups including the Princeton Institute for Computational Science and Engineering (PICSciE) and the Office of Information Technology's High Performance Computing Center and Visualization Laboratory at Princeton University."