Guide to Princeton Clusters

Getting Started Guide

All users of the Princeton Research Computing Clusters are expected to have a basic working knowledge of the systems. This is important because a naive user might unknowingly waste resources or even adversely affect the work of others. The guide below has been prepared to help everyone use the clusters effectively. It covers the following topics:

  • What is an HPC cluster?
  • How to connect to the Princeton clusters
  • Step-by-step directions for running your first job
  • Using the Slurm job scheduler
  • Effective usage of resources
  • Installing software
  • Example jobs for popular applications

To view the guide, choose either the GitHub or the Video version below:

Getting Started with the Research Computing Clusters: GitHub Version | Video Version (2.5 hours)

 

Top 10 Mistakes to Avoid on the Research Computing Clusters

  1. Do not go over your storage quota. Exceeding your storage quota can lead to many problems, including failed batch jobs, confusing error messages and the inability to use X11 forwarding. Be sure to routinely run the checkquota command to check your usage (a short example appears after this list). If more space is needed, remove unneeded files or request a quota increase.
     
  2. Do not run jobs on the login nodes. When you connect to a cluster via SSH you land on the login node, which is shared by all users. The login node is reserved for submitting jobs, compiling codes, installing software and running short tests that use only a few CPU-cores and finish within a few minutes. Anything more intensive must be submitted to the Slurm job scheduler as either a batch or an interactive job (see the batch-script sketch after this list). Failure to comply with this rule may result in your account being temporarily suspended.
     
  3. Do not write the output of actively running jobs to /tigress or /projects. The /tigress and /projects storage systems are shared by all of the large clusters via a single, slow connection, and they are designed for non-volatile files only (i.e., files that do not change over time). Writing job output to these systems may adversely affect the work of other users, and it may cause your job to run inefficiently or fail. Instead, write your output to the much faster /scratch/gpfs/<YourNetID> and then, after the job completes, copy or move it to /tigress or /projects if a backup is desired (see the sketch for this tip after the list). See Data Storage for more.
     
  4. Do not try to access the Internet from batch or interactive jobs. All jobs submitted to the Slurm job scheduler run on the compute nodes, which do not have Internet access. This includes OnDemand sessions via MyAdroit and MyDella. Because of this, a running job cannot download files, install packages or connect to GitHub. Perform these operations on the login node before submitting the job (an example appears after the list).
     
  5. Do not allocate more than one CPU-core for serial jobs. Serial codes cannot run in parallel, so using more than one CPU-core will not make the job run faster. Instead, doing so will waste resources and lower the priority of your next job. See the Slurm page for tips on determining whether your code can run in parallel and for information about Job Arrays, which allow one to run many jobs simultaneously (see the job-array sketch after the list).
     
  6. Do not run jobs with a parallel code without first conducting a scaling analysis. If your code runs in parallel, you need to determine the optimal number of nodes and CPU-cores to use; the same is true if it can use multiple GPUs. To do this, perform a scaling analysis as described in Choosing the Number of Nodes, CPU-cores and GPUs (a minimal sketch appears after the list).
     
  7. Do not request a GPU for a code that can only use CPUs. Only codes that have been explicitly written to use GPUs can take advantage of them. Allocating a GPU for a CPU-only code will not speed up the execution time; it will increase your queue time, waste resources and lower the priority of your next job submission. Furthermore, some codes are written to use only a single GPU (see the GPU-request sketch after the list). For more, see GPU Computing and Choosing the Number of Nodes, CPU-cores and GPUs.
     
  8. Do not use the system GCC when a newer version is needed. The GNU Compiler Collection (GCC) provides a suite of compilers and related tools. The system version of GCC on most clusters is 4.8.5. When you need to compile code with a newer version of gcc, g++ or gfortran, first load one of the rh/devtoolset environment modules (see the sketch after the list). This is essential for building R packages, for instance. See the Software page of the guide above for more. Traverse, Stellar and Della-GPU run either the RHEL 8 or Springdale Linux 8 operating systems, which provide GCC 8.3.1 as the system version, so on those clusters one does not need to load an environment module.
     
  9. Do not load environment modules using only the partial name. A common practice for Python users is to issue the "module load anaconda3" command. You should always specify the full name of the environment module (e.g., module load anaconda3/2020.11); on some clusters, failing to do so will result in an error (an example appears after the list). Also, avoid loading environment modules in your ~/.bashrc file. Instead, load them in Slurm scripts and on the command line when needed.
     
  10. Do not waste your time struggling with software or job scheduler problems. The Research Computing staff have a broad range of experience. If you encounter a problem that is proving difficult to solve, please see How to Get Help. Additionally, to improve your HPC skills, be sure to attend the workshops and training sessions that we offer throughout the year.
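
Example Sketches for the Tips Above

The short sketches below illustrate several of the tips. They are not complete recipes: names such as job.slurm and myscript.py are placeholders, and the exact module versions and resource values will differ from cluster to cluster.

Tip #1 — checkquota is the command named above; the standard du command is one way to find the directories that take up the most space (the path below is a placeholder for your home directory):

     checkquota                                        # report current usage against your quotas
     du -h --max-depth=1 /home/<YourNetID> | sort -h   # list directories, largest last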
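
Tip #2 — a minimal Slurm batch script, saved here under the made-up name job.slurm and submitted from the login node with sbatch. The resource values and the final command are placeholders for your own code:

     #!/bin/bash
     #SBATCH --job-name=myjob           # short name for the job
     #SBATCH --nodes=1                  # number of nodes
     #SBATCH --ntasks=1                 # total number of tasks
     #SBATCH --cpus-per-task=1          # CPU-cores per task
     #SBATCH --mem-per-cpu=4G           # memory per CPU-core
     #SBATCH --time=00:10:00            # run time limit (HH:MM:SS)

     module purge
     module load anaconda3/2020.11      # full module name, as in tip #9

     python myscript.py

Submit it with "sbatch job.slurm", or use salloc to obtain an interactive session on a compute node instead.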
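
Tip #3 — run from /scratch/gpfs and copy the results afterwards; the directory and file names are placeholders:

     # launch the job from the fast scratch filesystem
     cd /scratch/gpfs/<YourNetID>/myjob
     sbatch job.slurm

     # after the job has finished, back up only what you want to keep
     cp -r results /tigress/<YourNetID>/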
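
Tip #4 — perform the network-dependent steps on the login node before submitting; the repository and URL below are placeholders:

     # on the login node (compute nodes have no Internet access)
     git clone https://github.com/<user>/<repo>.git
     wget https://example.com/input.dat
     # install any packages your code needs here as well

     # only then submit the job, which should read local files only
     sbatch job.slurm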
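
Tip #5 — a serial code gets exactly one CPU-core, and many independent serial runs can be bundled into a Slurm job array. The array range and the use of $SLURM_ARRAY_TASK_ID as a command-line argument are illustrative assumptions:

     #!/bin/bash
     #SBATCH --job-name=array-job
     #SBATCH --nodes=1
     #SBATCH --ntasks=1
     #SBATCH --cpus-per-task=1          # one core is enough for a serial code
     #SBATCH --time=00:30:00
     #SBATCH --array=0-9                # ten independent copies of this job

     module purge
     module load anaconda3/2020.11

     python myscript.py $SLURM_ARRAY_TASK_ID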
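
Tip #6 — a scaling analysis simply means timing the same problem at several core counts. A minimal sketch, assuming a parallel code launched inside job.slurm; options passed to sbatch on the command line override the matching #SBATCH directives in the script:

     for n in 1 2 4 8 16; do
         sbatch --ntasks=$n --job-name=scale-$n job.slurm
     done
     # compare the run times and choose the core count beyond which
     # the speed-up levels off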
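
Tip #7 — request a GPU only when the code can actually use one. The relevant directives in a batch script look like this; --gres=gpu:1 is the Slurm syntax for a single GPU, which also suits codes that cannot use more than one:

     #SBATCH --nodes=1
     #SBATCH --ntasks=1
     #SBATCH --cpus-per-task=1
     #SBATCH --gres=gpu:1               # omit this line entirely for CPU-only codes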
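
Tip #8 — load a devtoolset module before compiling with a newer GCC. The exact version string below is an assumption; run module avail to see what your cluster provides:

     module avail rh                    # list the rh/devtoolset modules on this cluster
     module load rh/devtoolset/8        # use whichever version module avail shows
     gcc --version                      # should now report a version newer than 4.8.5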
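
Tip #9 — find the full module name first, then load it explicitly (the 2020.11 version comes from the example above; substitute whatever your cluster lists):

     module avail anaconda3             # shows the full names, e.g. anaconda3/2020.11
     module load anaconda3/2020.11      # always load by the full name, in Slurm scripts too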

 

Tiger Cluster

Tiger cluster in the High Performance Computing Research Center. Photo credit: Floe Fusin-Wischusen, PICSciE