Della


IMPORTANT UPDATE: The Della operating system upgrade from Springdale Linux (SDL) 7 to SDL 8 is now complete. All nodes are now running SDL 8. Users will need to recompile their codes using the new environment modules to run on the upgraded nodes.

Full environment module names, including the version, are required on SDL 8:

module load anaconda3            # fails: no version specified
module load anaconda3/2021.11    # succeeds

If you encounter an "illegal instruction" error then see the bottom of this page.

For those in Physics: You have a login node called della-feynman which is running SDL 8. Since it should only be used by members of the Physics department, it is an ideal place to log in instead of della8. All physics nodes are running SDL 8. Remember to add the line "#SBATCH -p physics" to your Slurm scripts to access these nodes; otherwise you will wind up in the general pool.
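
Below is a minimal sketch of a Slurm script that targets the physics partition; the job name, resource requests, and executable are illustrative only:

#!/bin/bash
#SBATCH --job-name=phys-test     # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH -p physics               # run on the physics nodes

module purge
module load anaconda3/2021.11    # full module name required on SDL 8

python myscript.py               # hypothetical script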

 

Overview

Della is intended as a platform for running both parallel and serial production jobs.  The system has grown over time and includes groups of nodes using different generations of Intel processor technology. The cluster features 20 AMD nodes with two NVIDIA A100 GPUs per node.

Some Technical Specifications:
Della is a 4-rack Intel computer cluster, originally acquired through a joint effort of Astrophysics, the Lewis-Sigler Institute for Integrative Genomics, PICSciE and OIT. All nodes are connected via an FDR Infiniband high-bandwidth, low-latency network. For more hardware details, see the Hardware Configuration information below.

 

How to Access the Della Cluster

To use the Della cluster you have to request an account and then log in through SSH.

  1. Requesting Access to Della

    Access to the large clusters like Della is granted on the basis of brief faculty-sponsored proposals (see For large clusters: Submit a proposal or contribute).

    If, however, you are part of a research group with a faculty member who has contributed to or has an approved project on Della, that faculty member can sponsor additional users by sending a request to cses@princeton.edu. Any non-Princeton user must be sponsored by a Princeton faculty or staff member for a Research Computer User (RCU) account.

  2. Logging into Della

    Once you have been granted access to Della, you can connect by opening an SSH client and using the SSH command:

    For CPU or GPU jobs using the Springdale Linux 8 operating system (VPN required from off-campus):
    $ ssh <YourNetID>@della.princeton.edu
    
    For GPU jobs (VPN required from off-campus)
    $ ssh <YourNetID>@della-gpu.princeton.edu
    
    For more on how to SSH, see the Knowledge Base article Secure Shell (SSH): Frequently Asked Questions (FAQ). If you have trouble connecting then see our SSH page.

    If you prefer to navigate Della through a graphical user interface rather than the Linux command line, there is also a web portal called MyDella (VPN required from off-campus):
    https://mydella.princeton.edu

    MyDella provides access to the cluster through a web browser. This enables easy file transfers and interactive jobs: RStudio, Jupyter, Stata and MATLAB. A graphical desktop environment is also available on the visualization nodes.

[Figure: The Della OnDemand (MyDella) web portal]

How to Use the Della Cluster

Since Della is a Linux system, knowing some basic Linux commands is highly recommended. For an introduction to navigating a Linux system, view the material associated with our Intro to Linux Command Line workshop. 

Using Della also requires some knowledge of how to properly use the file system, the module system, and the scheduler that handles each user's jobs. For an introduction to navigating Princeton's High Performance Computing systems, view our Guide to Princeton's Research Computing Clusters. Additional information specific to Della's file system, job scheduling priority, and more can be found below.

To attend a live session of either workshop, see our Trainings page for upcoming dates.
For more resources, see our Support - How to Get Help page.

 

Important Guidelines

The login nodes, della8 and della-gpu, should be used for interactive work only, such as compiling programs and submitting jobs as described below. No jobs should be run on the login nodes, other than brief tests that last no more than a few minutes and use only a few CPU-cores. If you'd like to run a Jupyter notebook, there are several options for doing so that avoid running on the login nodes.

Where practical, we ask that you entirely fill the nodes so that CPU-core fragmentation is minimized.
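
For example, on one of the 32-core nodes a whole-node request might look like the following sketch; the job name and executable are illustrative:

#!/bin/bash
#SBATCH --job-name=fillnode      # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks=32              # use all 32 cores of a Skylake or Cascade Lake node
#SBATCH --time=01:00:00

srun ./myapp                     # hypothetical executable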

 

Hardware Configuration

Della is composed of both CPU and GPU nodes:

Processor                      Nodes   Cores per Node   Memory per Node   Max Instruction Set   GPUs per Node
2.4 GHz Intel Broadwell           96               28            128 GB                  AVX2             N/A
2.6 GHz Intel Skylake             64               32            190 GB               AVX-512             N/A
2.8 GHz Intel Cascade Lake        64               32            190 GB               AVX-512             N/A
2.6 GHz AMD EPYC Rome             20              128            768 GB                  AVX2               2

Each GPU has 40 GB of memory. The nodes of Della are connected with FDR Infiniband. Run the "shownodes" command for additional information about the nodes. Note that there are some private nodes that belong to specific departments or research groups. If you are in the "physics" Unix group then use #SBATCH --partition=physics to access private nodes. There are a small number of large-memory nodes (>1.5 TB) available to all users that are not mentioned in the table above. These may only be used for jobs that utilize more than 190 GB of memory. Please write to cses@princeton.edu for more information. For more technical details about the Della cluster, see the full version of the systems table.

 

FPGAs

Della provides two nodes, each with two Intel Stratix 10 Field-Programmable Gate Arrays (FPGAs). If you have an account on Della then use this command to connect:

$ ssh <YourNetID>@della-fpga2.princeton.edu

The node della-fpga1.princeton.edu is also available. See this Getting Started Guide by Bei Wang which provides a general introduction as well as a "hello world" example using OpenCL.

Consider watching "FPGA Programming for Software Developers Using oneAPI" by Intel.

 

Job Scheduling (QOS Parameters)

All jobs must be run through the Slurm scheduler on Della. If a job would exceed any of the limits below, it will be held until it is eligible to run. Jobs should not specify the QOS into which they should run; the Slurm scheduler will distribute the jobs accordingly.

Jobs that run on the CPU nodes will be assigned a quality of service (QOS) according to the length of time specified for the job:

CPU Jobs

QOS      Time Limit           Jobs per User   Cores per User   Cores Available
test     61 minutes           2 jobs          [30 nodes]       no limit
short    24 hours             300 jobs        300 cores        no limit
medium   72 hours             100 jobs        250 cores        2000 cores
vlong    144 hours (6 days)   40 jobs         160 cores        1350 cores

Use the "qos" command to see the latest values for the table above.
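
For example, a job that requests a time limit between 61 minutes and 24 hours is placed in the short QOS; the value below is illustrative:

#SBATCH --time=12:00:00    # more than 61 minutes and less than 24 hours, so the job lands in the short QOS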

GPU Jobs

QOS         Jobs per User   GPUs per User
della-gpu               7              10

Use the "qos" command to see the latest values for the table above. The time limit for GPU jobs is equal to the maximum time limit for CPU jobs.

Jobs are further prioritized through the Slurm scheduler based on a number of factors: job size, run times, node availability, wait times, and percentage of usage over a 30 day period (fairshare). Also, these values reflect the minimum limits in effect and the actual values may be higher. Please use the "qos" command to see the limits in effect at the current time.

 

Running on Specific CPUs

The CPU nodes of Della feature different generations of Intel CPUs. The newer nodes (cascade) are faster than the older nodes (broadwell), with skylake in between. This can often explain why the execution time of your code varies between runs. To see the CPU generation of each node, run the "shownodes" command and look at the FEATURES column. The newest CPU generation listed determines the node type; for example, a node with features "cascade,skylake,broadwell" is a cascade node. To restrict your jobs to the newer cascade nodes, add this line to your Slurm script:

#SBATCH --constraint=cascade

Note that to run only on skylake or broadwell nodes you will need to use the --nodelist directive, since the cascade nodes also accept jobs specified for the older generations.
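
A sketch of the --nodelist approach is shown below; the node names are hypothetical, so use the output of "shownodes" to identify actual skylake or broadwell nodes:

#SBATCH --nodelist=della-r1c1n1,della-r1c1n2    # hypothetical node names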

 

GPU Jobs

Della provides 20 AMD EPYC Rome nodes with 2 NVIDIA A100 GPUs per node. The login node to the GPU portion of Della is della-gpu.princeton.edu. Be aware of the following:

  • All GPU jobs must be submitted from the della-gpu login node or della8.
  • If you are compiling from source then this must be done on the della-gpu login node.
  • The GPU nodes are for GPU jobs only. Do not run CPU jobs on the GPU nodes.

To connect to the login node:

$ ssh <YourNetID>@della-gpu.princeton.edu

Be aware that della-gpu and the GPU compute nodes are running a variant of the RHEL 8 operating system. You will notice that the environment modules are different (see "module avail") and the system version of GCC is 8.3.1. PNI has priority access to these nodes since they provided the funding.

[Figure: The Della GPU subcluster: one login node (della-gpu) and 20 compute nodes]

The figure above shows the login node and 20 compute nodes. Each compute node has 2 AMD EPYC CPUs and 2 NVIDIA A100 GPUs. Each CPU has 64 CPU-cores. Users do not need to concern themselves with the details of the CPUs or the CPU memory. Simply choose up to 128 CPU-cores per node and up to 768 GB of memory per node in your Slurm script. The compute nodes are interconnected using FDR Infiniband making multinode jobs possible.

Each A100 GPU has 40 GB of memory and 6912 CUDA cores (FP32) across 108 Streaming Multiprocessors. Make sure you are using version 11.x of the CUDA Toolkit and version 8.x of cuDNN when possible. Not all codes will be able to take full advantage of these powerful accelerators. See the GPU Computing page to learn how to monitor GPU utilization using tools like nvidia-smi and gpustat. See an example Slurm script for a GPU job.
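
A minimal sketch of such a script is shown below; the job name and executable are hypothetical and the resource requests should be adjusted for your code:

#!/bin/bash
#SBATCH --job-name=gpu-job       # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --gres=gpu:1             # request one A100 GPU
#SBATCH --time=01:00:00

module purge
module load cudatoolkit/11.3

./myapp                          # hypothetical GPU executable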

To see your GPU utilization every 10 minutes over the last hour, run this command:

$ gpudash -u $USER

One can run "gpudash" without options to see the utilization across the entire cluster. You can also directly examine your GPU utilization in real time.

To see the number of available GPUs, run this command:

$ shownodes -p gpu

PyTorch

PyTorch can take advantage of mixed-precision training on the A100 GPUs. Follow these directions to install PyTorch on della-gpu. Then see the docs on AMP. You should also try using a DataLoader with multiple CPU-cores. For more ways to optimize your PyTorch jobs see "PyTorch Performance Tuning Guide" from GTC 2021.
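
On the Slurm side, reserving several CPU-cores for the DataLoader workers is a matter of setting --cpus-per-task; the sketch below is illustrative, and the num_workers argument of your DataLoader can then be set to a similar value:

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8        # CPU-cores for the DataLoader workers (illustrative)
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00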

TensorFlow

TensorFlow can take advantage of mixed-precision training on the A100 GPUs. Follow these directions to install TensorFlow on della-gpu. Then see the Mixed Precision Guide. You should also try using a data loader from tf.data to keep the GPU busy (watch YouTube video). For more ways to optimize your TensorFlow jobs see "Train Faster: A Guide to Tensorflow Performance Optimization" from GTC 2021.

Compiling from Source

All GPU codes must be compiled on the della-gpu login node. When compiling, one should prefer the cudatoolkit/11.x modules. The A100 GPU has a compute capability of 8.0. To compile a CUDA kernel one might use the following commands:

$ module load cudatoolkit/11.3
$ nvcc -O3 -arch=sm_80 -o myapp myapp.cu

The compute capability should also be specified when building codes using CMake. For example, for LAMMPS and GROMACS one would use -DGPU_ARCH=sm_80 and -DGMX_CUDA_TARGET_SM=80, respectively.
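
As a hedged illustration, the architecture flag is passed when configuring the build; the commands below show only that flag and omit the other options a full LAMMPS or GROMACS build requires:

$ module load cudatoolkit/11.3
$ cmake ../cmake -D GPU_ARCH=sm_80       # LAMMPS (other required options omitted)
$ cmake .. -DGMX_CUDA_TARGET_SM=80       # GROMACS (other required options omitted)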

For tips on compiling the CPU code of your GPU-enabled software see "AMD Nodes" on the Stellar page.

CUDA Multi-Process Service (MPS)

Certain MPI codes that use GPUs may benefit from CUDA MPS (see ORNL docs), which enables multiple processes to concurrently share the resources on a single GPU. To use MPS simply add this directive to your Slurm script:

#SBATCH --gpu-mps

In most cases users will see no speed-up. Codes where the individual MPI processes underutilize the GPU should see a performance gain.
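
A sketch of a Slurm script using MPS with a small MPI job is shown below; the module names and executable are hypothetical and should be replaced with those your code actually uses:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4               # four MPI processes sharing one GPU via MPS
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --gpu-mps

module purge
module load openmpi/gcc/4.1.0 cudatoolkit/11.3   # hypothetical module names
srun ./my_mpi_gpu_app                            # hypothetical executable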

 

Running Software using the Previous Operating System

The operating system on most of the nodes of Della was upgraded from SDL 7 to SDL 8 in the winter of 2022. Users should reinstall or recompile their codes on the della8 login node. When this is not possible, we provide a compatibility tool for effectively running software under the old operating system (SDL 7). This involves prepending the command you want to run with /usr/licensed/bin/run7. Below are a few examples:

$ /usr/licensed/bin/run7 cat /etc/os-release
$ /usr/licensed/bin/run7 bash
Singularity> source /home/aturing/env.sh
Singularity> solar -i 42 -d /scratch/gpfs/aturing/output

 

Visualization Nodes

The Della cluster includes della-vis2, a dedicated node for visualization and post-processing tasks. Connect via SSH with the following command (VPN required from off-campus):

$ ssh <YourNetID>@della-vis2.princeton.edu

This node features 28 CPU-cores and 512 GB of memory. There is no job scheduler on della-vis2, so users must be mindful of the resources they are consuming. In addition to visualization, the node can be used for tasks that are incompatible with the Slurm job scheduler or for work that is not appropriate for the Della login nodes, such as downloading large amounts of data from the Internet.

Jupyter

Jupyter is available on https://mydella.princeton.edu (VPN required from off-campus) by choosing "Interactive Apps" and then either "Jupyter" or "Jupyter on Della Vis2". The first choice is for intensive sessions, while the latter runs on della-vis2 and is intended for light work that requires internet access. For creating Conda environments and more, see our general page on Jupyter.

Virtual Desktop

A virtual desktop is available on https://mydella.princeton.edu (VPN required from off-campus) by choosing "Interactive Apps" then "Della Vis2 Desktop". This allows users to easily work with graphical applications. There are also other approaches.


 

Filesystem Usage and Quotas

/home (shared via NFS to all the compute nodes) is intended for scripts, source code, executables and small static data sets that may be needed as input for codes.

/scratch/gpfs (shared via GPFS to all the compute nodes) is intended for dynamic data that requires high-bandwidth I/O. Files on /scratch/gpfs are NOT backed up so data should be moved to persistent storage as soon as it is no longer needed for computations.
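
For example, once the outputs of a run are no longer needed for computations, they can be copied back to persistent storage; the directory names below are illustrative:

$ rsync -av /scratch/gpfs/$USER/myrun /tigress/$USER/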

/tigress and /projects are intended for more persistent storage (20 GB/s aggregate bandwidth for jobs across 16 or more nodes). Users are provided with a default quota of 512 GB when they request a directory in this storage, and that default can be increased upon request. We do ask that you consider what you really need and regularly clean out data that is no longer needed, since these storage systems are shared by the users of all our systems.

/tmp is local scratch space available on each compute node. It is the fastest filesystem. Nodes have about 130 to 1400 GB of space available on /tmp. Learn more about local scratch.

See the Data Storage page for complete details about the file and storage systems.
 

Running Third-party Software

If you are running third-party software whose characteristics (e.g., memory usage) you are unfamiliar with, please check your job after 5-15 minutes using 'top' or 'ps -ef' on the compute node(s) being used. If the memory usage is growing rapidly, or is close to exceeding the per-processor memory limit, you should terminate your job before it causes the system to hang or crash. You can determine on which node(s) your job is running using the "scontrol show job <jobnumber>" command.
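
A sketch of this check is shown below; the node name is hypothetical and should be taken from the NodeList field reported for your own job:

$ scontrol show job <jobnumber>    # the NodeList field shows the node(s) running the job
$ ssh della-r1c2n1                 # hypothetical node name taken from NodeList
$ top -u $USER                     # watch the memory (RES) column of your processes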


Maintenance Window

Della will be down for routine maintenance on the second Tuesday of every month from approximately 6 AM to 2 PM. This includes the associated filesystems of /scratch/gpfs, /projects and /tigress. Please mark your calendar. Jobs submitted close to downtime will remain in the queue unless they can be scheduled to finish before downtime (see more). Users will receive an email when the cluster is returned to service.

 

Wording of Acknowledgement of Support and/or Use of Research Computing Resources

"The author(s) are pleased to acknowledge that the work reported on in this paper was substantially performed using the Princeton Research Computing resources at Princeton University which is consortium of groups led by the Princeton Institute for Computational Science and Engineering (PICSciE) and Office of Information Technology's Research Computing."

"The simulations presented in this article were performed on computational resources managed and supported by Princeton Research Computing, a consortium of groups including the Princeton Institute for Computational Science and Engineering (PICSciE) and the Office of Information Technology's High Performance Computing Center and Visualization Laboratory at Princeton University."