Back in the day, the machines used for high-performance computing were known as "supercomputers": big standalone machines with specialized hardware, very different from what you would find in home and office computers.
Nowadays, however, the majority of supercomputers are instead computer clusters (or just "clusters" for short): collections of relatively low-cost standalone computers that are networked together. These interconnected computers run software that coordinates programs on (or across) them, so they can work together to perform computationally intensive tasks.
The computational systems made available by Princeton Research Computing are, for the most part, clusters. Each computer in the cluster is called a node (the term "node" comes from graph theory), and we commonly talk about two types of nodes: head nodes and compute nodes.
- Head Node - The head node is the computer where we land when we log in to the cluster. This is where we edit scripts, compile code, and submit jobs to the scheduler. The head nodes are shared with other users and jobs should not be run on the head nodes themselves.
- Compute Node - The compute nodes are the computers where jobs should be run. In order to run jobs on the compute nodes we must go through the job scheduler. By submitting jobs to the job scheduler, the jobs will automatically be run on the compute nodes once the requested resources are available. Princeton's Research Computing clusters use SLURM as their scheduling program and we will get back to this in a later section.
- Cores - Shorthand for the CPU-cores (usually physical cores) on a CPU-chip in a node.
How Do Princeton's HPC Clusters Work?
To have your program run on the clusters, you can start a job on the head node. A job consists of the following files:
- your code that runs your program
- a separate script, known as a SLURM script, that will request the resources your job requires in terms of the amount of memory, the number of cores, number of nodes, etc. As mentioned previously, Princeton Research Computing uses a scheduler called SLURM, which is why this script is referred to as your SLURM script.
Once your files are submitted, the scheduler (SLURM) checks whether the resources you requested are available on the compute nodes and, if they are not, starts reserving those resources for you. Once the resources become available, the scheduler runs your program on the compute nodes.
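As a rough illustration of what a SLURM script contains, the Python sketch below builds the text of a minimal one. The `#SBATCH` options shown are standard Slurm options, but the job name, the resource values, and the command `python my_program.py` are hypothetical placeholders, not Princeton defaults. Real scripts usually need more (for example, environment setup).

```python
def make_slurm_script(job_name, nodes, cores, mem_per_core, time_limit, command):
    """Return the text of a minimal Slurm batch script (illustrative sketch only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",      # short name for the job
        f"#SBATCH --nodes={nodes}",            # number of nodes
        "#SBATCH --ntasks=1",                  # one task in total
        f"#SBATCH --cpus-per-task={cores}",    # CPU-cores for that task
        f"#SBATCH --mem-per-cpu={mem_per_core}",  # memory per CPU-core
        f"#SBATCH --time={time_limit}",        # run time limit (HH:MM:SS)
        "",
        command,                               # the program to run (placeholder)
    ]
    return "\n".join(lines) + "\n"


# All values below are made up for illustration.
script = make_slurm_script(
    job_name="test-job",
    nodes=1,
    cores=4,
    mem_per_core="4G",
    time_limit="00:10:00",
    command="python my_program.py",
)
print(script)
```

If the printed text were saved to a file (say, a hypothetical `job.slurm`), it would be submitted from the head node with `sbatch job.slurm`.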
Important Notes on Using Princeton's HPC Clusters
The 10-10 Rule.
First, it's important to know that you may run test jobs on the head nodes as long as they run for no more than 10 minutes and use no more than 10% of the CPU-cores and memory. If you exceed these limits, you will likely disrupt the work of others, and you may be contacted by our system administrators.
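The rule can be written down as a simple check. This is only a sketch of the arithmetic: the function name is made up, and the head node size used in the example (32 CPU-cores, 192 GB of memory) is an assumption for illustration, not a description of any particular cluster.

```python
def within_10_10_rule(minutes, cores_used, mem_used_gb, total_cores, total_mem_gb):
    """Return True if a test job stays inside the 10-10 rule on a head node."""
    return (
        minutes <= 10                          # at most 10 minutes of run time
        and cores_used <= 0.10 * total_cores   # at most 10% of the CPU-cores
        and mem_used_gb <= 0.10 * total_mem_gb # at most 10% of the memory
    )


# Hypothetical head node: 32 CPU-cores and 192 GB of memory,
# so the limits are 3.2 CPU-cores and 19.2 GB.
print(within_10_10_rule(5, 2, 8, total_cores=32, total_mem_gb=192))   # True
print(within_10_10_rule(5, 8, 8, total_cores=32, total_mem_gb=192))   # False: too many cores
print(within_10_10_rule(30, 2, 8, total_cores=32, total_mem_gb=192))  # False: runs too long
```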
No Internet Access on the Compute Nodes
Second, it's important to know that there is no internet access on the compute nodes. This is for security reasons. It means that when you submit your job (your program plus your SLURM script) to be run on the cluster, the job cannot involve any step that requires an internet connection. For example, downloading data from a site, scraping websites, and downloading packages will not work on the compute nodes. You need to make sure all needed files are present on the cluster before submitting the job to the scheduler.
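One way to catch a missing file early is a small check run on the head node before submitting. The sketch below is one possible approach, not an official tool; the function name and the file names in the example are hypothetical.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the subset of paths that do not exist on the file system."""
    return [str(p) for p in paths if not Path(p).exists()]


# Hypothetical list of files the job will need on the compute nodes.
needed = ["my_program.py", "data/input.csv"]

missing = missing_inputs(needed)
if missing:
    print("Download or copy these files before submitting:", missing)
else:
    print("All input files are present.")
```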
Further Details on Princeton's Clusters
Each of Princeton's Research Computing clusters typically has a small number of head (login) nodes, usually one or two, and a large number of compute nodes. For example, Princeton's Della cluster has a few hundred compute nodes.
Each node in a cluster typically contains one or more (usually two) processors, which we'll refer to as CPU-chips. Each CPU-chip is connected to the memory (RAM) and to other devices such as GPUs or networking cards. Most importantly, each CPU-chip has multiple CPU-cores, or mini-processors, on it. These cores run the actual computations. A node with 2 CPU-chips and 16 CPU-cores per CPU-chip can be used to carry out 32 tasks simultaneously.
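The arithmetic behind that last sentence is just a multiplication, and the same numbers tell you how many nodes a larger job would span. The sketch below works through it; the function names and the 100-core job are made up for illustration.

```python
import math


def cores_per_node(cpu_chips, cores_per_chip):
    """Total CPU-cores in a node: number of CPU-chips times CPU-cores per chip."""
    return cpu_chips * cores_per_chip


def nodes_needed(total_cores, cores_in_one_node):
    """Smallest number of whole nodes that provides at least total_cores."""
    return math.ceil(total_cores / cores_in_one_node)


per_node = cores_per_node(cpu_chips=2, cores_per_chip=16)
print(per_node)                     # 32 simultaneous tasks on one node

# A hypothetical job asking for 100 CPU-cores needs 100 / 32 = 3.125,
# rounded up to 4 nodes of this size.
print(nodes_needed(100, per_node))  # 4
```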
A job can be run on a single CPU-core, but if your code is modified to support parallel operations, a job may also run on multiple CPU-cores on one or more CPU-chips, and even on multiple nodes at one time. See: Compiling and Running MPI Jobs.
The nodes in each of Princeton's clusters are connected to one another by a high-speed, low-latency network, either InfiniBand or Omni-Path. This allows for parallel operations across nodes.
All of Princeton's Research Computing clusters except Nobel have GPU nodes, which are nodes with both CPU-chips and GPUs. See: What is a GPU?
Princeton's Research Computing clusters run the Linux operating system. More specifically, they run Springdale Linux, which is a customized version of Red Hat Enterprise Linux. Springdale is a Linux distribution maintained by members of Princeton University and the Institute for Advanced Study.
Why Use Princeton's HPC Clusters?
In a nutshell, Princeton's clusters are here for the moment your personal computer can no longer handle the computations you need to get done.
Some concrete advantages of using Princeton's clusters are:
- Lots of processing capacity and the ability to do parallel computing (e.g., you could use 1000 CPU-cores for a single job)
- Lots of memory (nodes have hundreds of GB of memory and, in one case, more than 6 TB)
- Ability to work with large datasets (on '/scratch/gpfs', see our page on Data Storage)
- Lots of software is available and already configured (e.g., MPI, compilers, commercial software)
- Keep your laptop free to use by running your work on the clusters
- GPUs are available on all of the clusters (except Nobel)
- There is a team of people maintaining and supporting the clusters