What is a cluster?

A cluster is a group of interconnected computers that work together to perform computationally intensive tasks.  In a cluster, each computer is referred to as a "node".  (The term "node" comes from graph theory.)

A cluster has a small number of "head nodes", usually one or two, and a large number of "compute nodes". For example, the della cluster has 240 compute nodes.  The head node is the computer to which you log in, and where you edit scripts, compile code, and submit jobs.  Your jobs are automatically run on the compute nodes by the scheduling program "SLURM" -- see: Introducing SLURM.

Each node contains one or more processors, or CPUs (usually two), on which computation takes place.  Each processor has multiple "cores".  For example, the newest of the della nodes each contain two processors, and each processor has 14 cores, for a total of 28 cores in each node.  This means that one of these della nodes can run up to 28 single-core tasks simultaneously.
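
The core count above is simple multiplication, and the short Python sketch below works it through.  The function name cores_per_node is made up for this illustration; the figures (two processors, 14 cores each) are the ones quoted above for the newest della nodes.  The last line asks Python how many logical CPUs it sees on whatever machine runs the script, which may count hardware threads rather than physical cores.

```python
import os


def cores_per_node(processors_per_node: int, cores_per_processor: int) -> int:
    """Total cores in a node = processors per node x cores per processor."""
    return processors_per_node * cores_per_processor


# Figures quoted in the text for the newest della nodes.
newest_della = cores_per_node(processors_per_node=2, cores_per_processor=14)
print(f"Cores per node: {newest_della}")

# Logical CPUs visible on the machine running this script.
# os.cpu_count() can return None when the count cannot be determined.
print(f"Logical CPUs on this machine: {os.cpu_count()}")
```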

A job can be run on a single core.  Assuming the software is written to support parallel operations, a job may also run on multiple cores on one or more processors, and even on multiple nodes at one time.  See: Compiling and Running MPI Jobs.
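
As a rough illustration of what "written to support parallel operations" can mean, the sketch below splits one calculation into independent pieces and runs them on several cores of a single node using Python's standard multiprocessing module.  This is not MPI and does not span multiple nodes; the function names, the problem size, and the worker count of 4 are assumptions chosen only for the example.

```python
from multiprocessing import Pool


def sum_of_squares(bounds):
    """Sum i*i for i in [start, stop): one independent piece of the job."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def split_range(n, pieces):
    """Split range(n) into `pieces` contiguous (start, stop) chunks."""
    step, extra = divmod(n, pieces)
    chunks, start = [], 0
    for k in range(pieces):
        # The first `extra` chunks take one additional item each.
        stop = start + step + (1 if k < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


if __name__ == "__main__":
    n, workers = 1_000_000, 4  # illustrative values, not cluster settings
    # Each worker process can be scheduled on its own core.
    with Pool(processes=workers) as pool:
        partial_sums = pool.map(sum_of_squares, split_range(n, workers))
    # Combining the pieces should match the single-core calculation.
    print(sum(partial_sums) == sum_of_squares((0, n)))
```

On a cluster, the number of workers would normally be matched to the number of cores requested from the scheduler rather than hard-coded.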

The nodes in each of our clusters are connected to one another by high-speed, low-latency networks, either InfiniBand or Omni-Path.  This helps to support parallel operations across nodes.

The tiger cluster has nodes that, in addition to their CPUs, also have GPUs.  See: What is a GPU?

Our clusters run the Linux operating system.  More specifically, they run Springdale Linux, which is a customized version of Red Hat Enterprise Linux.  Springdale is a Linux distribution maintained by members of the computing staffs of Princeton University and the Institute for Advanced Study.