Top 10 Mistakes to Avoid on the Research Computing Clusters

Tiger Cluster

Tiger cluster in the High Performance Computing Research Center. Photo credit: Floe Fusin-Wischusen, PICSciE

  1. Do not go over your storage quota. Exceeding your storage quota can lead to many problems including batch jobs failing, confusing error messages and the inability to use X11 forwarding. Be sure to routinely run the checkquota command to check your usage. If more space is needed then remove files or request a quota increase.
  2. Do not run jobs on the login nodes. When you connect to a cluster via SSH you land on the login node which is shared by all users. The login node is reserved for submitting jobs, compiling codes, installing software and running short tests that use only a few CPU-cores and finish within a few minutes. Anything more intensive must be submitted to the Slurm job scheduler as either a batch or interactive job. Failure to comply with this rule may result in your account being temporarily suspended.
  3. Do not write the output of actively running jobs to /tigress or /projects. The /tigress and /projects storage systems are shared by all of the large clusters via a single, slow connection and they are designed for non-volatile files only (i.e., files that do not change over time). Writing job output to these systems may adversely affect the work of other users and it may cause your job to run inefficiently or fail. Instead, write your output to the much faster /scratch/gpfs/<YourNetID> and then, after the job completes, copy or move the output to /projects if a backup is desired. See Data Storage for more.
  4. Do not try to access the Internet from batch or interactive jobs. All jobs submitted to the Slurm job scheduler run on the compute nodes which do not have Internet access. This includes OnDemand sessions via MyAdroit and MyDella. Because of this, a running job cannot download files, install packages or connect to GitHub. You will need to perform these operations on the login node or a visualization node before submitting the job.
  5. Do not allocate excessive amounts of CPU memory. Please use accurate values for the --mem-per-cpu or --mem Slurm directives. Allocating excess CPU memory will unnecessarily increase your queue times and it could make multiple CPU-cores unavailable for other jobs to use. Learn how to allocate memory using Slurm.
  6. Do not allocate more than one CPU-core for serial jobs. Serial codes cannot run in parallel so using more than one CPU-core will not cause the job to run faster. Instead, doing so will waste resources and cause your next job to have a lower priority. See the Slurm page for tips on figuring out if your code can run in parallel and for information about Job Arrays which allow one to run many jobs simultaneously.
  7. Do not run jobs with a parallel code without first conducting a scaling analysis. If your code runs in parallel then you need to determine the optimal number of nodes and CPU-cores to use. The same is true if it can use multiple GPUs. To do this, perform a scaling analysis as described in Choosing the Number of Nodes, CPU-cores and GPUs.
  8. Do not request a GPU for a code that can only use CPUs. Only codes that have been explicitly written to use GPUs can take advantage of GPUs. Allocating a GPU for a CPU-only code will not speed-up the execution time but it will increase your queue time, waste resources and lower the priority of your next job submission. Furthermore, some codes are written to use only a single GPU. For more see GPU Computing and Choosing the Number of Nodes, CPU-cores and GPUs.
  9. Do not use the system GCC when a newer version is needed. The GNU Compiler Collection (GCC) provides a suite of compilers (gcc, g++, gfortran) and related tools. You can determine the version of the system GCC by running the command "gcc --version". If the system version is insufficient then load the appropriate environment module to make a newer version available. On Della, Stellar and Traverse run the command "module avail gcc-toolset" to see the available modules. On Tiger use "module avail rh". Learn more about environment modules.
  10. Do not waste your time trying to solve software or job scheduler problems. Research Computing is composed of staff members with a broad range of experience. If you encounter a problem that is proving difficult to solve then please see How to Get Help. Additionally, to improve your HPC skills be sure to attend the workshops and training sessions that we offer throughout the year.