IMPORTANT: Over the next few months, an operating system upgrade (version 8 to 9) will take place on the Della cluster. Any codes that were built from source will need to be rebuilt to run on version 9. At the same time, more than $1M worth of AMD CPU nodes are being added to the cluster. Please begin using the new AMD nodes by rebuilding your codes on della9 and submitting jobs from della9:

$ ssh <YourNetID>@della9.princeton.edu

If you have trouble connecting to della9 then try using a VPN. GPU nodes with version 9 of the operating system are available, and the login node della-gpu.princeton.edu is now running version 9. For more details about the transition, see "Transition from Della 8 to Della 9" below.

Overview

Della is a general-purpose cluster for running serial and parallel production jobs. The cluster features both CPU and GPU nodes.

How to Access

To use the Della cluster you have to request an account and then log in through SSH.

Requesting Access to Della

Access to the large clusters like Della is granted on the basis of brief faculty-sponsored proposals (see "For large clusters: Submit a proposal or contribute"). If, however, you are part of a research group with a faculty member who has contributed to or has an approved project on Della, that faculty member can sponsor additional users by sending a request to [email protected]. Any non-Princeton user must be sponsored by a Princeton faculty or staff member for a Research Computer User (RCU) account.

Logging into Della

Option 1

Once you have been granted access to Della, you can connect by opening an SSH client and using the SSH command.

For CPU or GPU jobs using the Springdale Linux 8 operating system (VPN required from off-campus):

$ ssh <YourNetID>@della.princeton.edu

For GPU jobs (VPN required from off-campus):

$ ssh <YourNetID>@della-gpu.princeton.edu

For more on how to SSH, see the Knowledge Base article Secure Shell (SSH): Frequently Asked Questions (FAQ). If you have trouble connecting then see our SSH page.

Option 2

If you prefer to navigate Della through a graphical user interface rather than the Linux command line, there is also a web portal called MyDella (VPN required from off-campus):

https://mydella.princeton.edu

MyDella provides access to the cluster through a web browser. This enables easy file transfers and interactive jobs with RStudio, Jupyter, Stata and MATLAB. To work with visualizations, or applications that require graphical user interfaces (GUIs), use Della's visualization nodes instead.

Data Storage

The schematic diagram below shows the filesystems that are available on Della. See the Data Storage page to learn how to use each filesystem. Note that not all login nodes are shown in the figure.

How to Use

Since Della is a Linux system, knowing some basic Linux commands is highly recommended. For an introduction to navigating a Linux system, view the material associated with our Intro to Linux Command Line workshop. Using Della also requires some knowledge of how to properly use the filesystems, the module system, and the Slurm scheduler that handles each user's jobs. For an introduction to navigating Princeton's High Performance Computing systems, view our Guide to Princeton's Research Computing Clusters. Additional information specific to Della's filesystems, priority for job scheduling, and so on can be found below.
To work with visualizations, or applications that require graphical user interfaces (GUIs), use Della's visualization nodes. To attend a live session of either workshop, see our Trainings page for the next available workshop. For more resources, see our Support - How to Get Help page.

Important Guidelines

The login nodes, della8 and della-gpu, should be used for interactive work only, such as compiling programs and submitting jobs as described below. No jobs should be run on the login nodes, other than brief tests that last no more than a few minutes and only use a few CPU-cores. If you'd like to run a Jupyter notebook, we offer a few options for running Jupyter notebooks so that you can avoid running on the login nodes. Where practical, we ask that you entirely fill the nodes so that CPU-core fragmentation is minimized.

Hardware Configuration

Della is composed of both CPU and GPU nodes:

Processor                        Nodes   Cores per Node   Memory per Node   Max Instruction Set   GPUs per Node
2.4 GHz AMD EPYC 9654            18      192              1500 GB           AVX-512 (2 cycles)    N/A
2.8 GHz Intel Cascade Lake       64      32               190 GB            AVX-512               N/A
2.6 GHz AMD EPYC Rome            20      128              768 GB            AVX2                  2 (A100)
2.8 GHz Intel Ice Lake           69      48               1000 GB           AVX-512               4 (A100)
2.8 GHz Intel Ice Lake           2       48               1000 GB           AVX-512               28 (MIG)
2.8 GHz ARM Neoverse-V2          1       72               575 GB            --                    1 (GH200)*
2.1 GHz Intel Sapphire Rapids    42      96               1000 GB           AVX-512               8 (H100)**

Each GPU has either 10 GB, 40 GB or 80 GB of memory (see GPU Jobs below for more). The nodes of Della are connected with FDR Infiniband. Run the "shownodes" command for additional information about the nodes. Note that there are some private nodes that belong to specific departments or research groups. For more technical details about the Della cluster, see the full version of the systems table.

*There is one login node and one compute node with the Grace Hopper Superchip.
**The H100 GPUs are only available to PLI members.

Large-Memory Nodes

Della also has 14 large-memory nodes that were purchased by CSML and are available to all users:

Number of Nodes   Memory per Node   Cores per Node
1                 1510 GB           48
1                 2000 GB           56
10                3080 GB           96
3                 6150 GB           96

The large-memory nodes may only be used for jobs that require large memory. Your jobs will run on these nodes if you request more memory than what is available on the regular nodes. The large-memory nodes are monitored to ensure that only large-memory jobs run there. Use the "jobstats" command and Slurm email reports to see the memory usage of your jobs.

To see which nodes are available, run this command:

$ shownodes -p datascience

All of the large-memory nodes feature Intel Cascade Lake CPUs which support the AVX-512 instruction set.
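As a minimal sketch, a batch script that would be routed to a large-memory node might look like the following. The job name, memory value and executable are only illustrations; the point is that the memory request exceeds what any regular node provides:

#!/bin/bash
#SBATCH --job-name=bigmem        # illustrative job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1600G              # more memory than any regular node provides
#SBATCH --time=24:00:00

srun ./mycode                    # replace with your executable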
Job Scheduling (QOS Parameters)

All jobs must be run through the Slurm scheduler on Della. If a job would exceed any of the limits below, it will be held until it is eligible to run. Jobs should not specify the QOS in which they should run; this allows the Slurm scheduler to distribute the jobs accordingly. Jobs that run on the CPU nodes will be assigned a quality of service (QOS) according to the length of time specified for the job:

CPU Jobs

QOS      Time Limit           Jobs per User   Cores per User   Cores Available
test     61 minutes           2 jobs          [30 nodes]       no limit
short    24 hours             300 jobs        300 cores        no limit
medium   72 hours             100 jobs        250 cores        2000 cores
vlong    144 hours (6 days)   40 jobs         160 cores        1350 cores

Use the "qos" command to see the latest values for the table above.

GPU Jobs

QOS          Time Limit           Jobs per User   Nodes per User   GPUs per User
gpu-test     61 minutes           2 jobs          no limit         no limit
gpu-short    24 hours             30 jobs         30               35
gpu-medium   72 hours             24 jobs         24               24
gpu-long     144 hours (6 days)   7 jobs          16               16

Use the "qos" command to see the latest values for the table above. Jobs that run on the "mig" partition use the same QOSes as those above.

Jobs are further prioritized through the Slurm scheduler based on a number of factors: job size, run times, node availability, wait times, and percentage of usage over a 30-day period (fairshare). Also, the values above reflect the minimum limits in effect; the actual values may be higher.

Transition from Della 8 to Della 9

In the next few months, all of the nodes on the Della cluster will be upgraded from the RHEL 8 to the RHEL 9 operating system. Many nodes have already been upgraded. To see the nodes running RHEL 9, run this command:

$ shownodes | grep rh9

CPU jobs submitted from the della9 login node will run on the RHEL 9 nodes:

$ ssh <YourNetID>@della9.princeton.edu

If you have trouble connecting to della9 then use the university VPN. Note that the new nodes feature AMD CPUs while the RHEL 8 nodes use Intel CPUs. This has important implications for rebuilding your software to run on the new RHEL 9 nodes.

Rebuilding Software for Della 9

In most cases, users will need to rebuild their software on della9 or della-gpu to run on the new RHEL 9 nodes. A set of guidelines is listed below:

- In general, codes that were compiled from source should be rebuilt (e.g., C, C++, Fortran)
- Codes that use MPI need to be recompiled from source (this includes Python, R and Julia)
- R users should remove their existing packages and reinstall them on della9 (use "module load R/4.4.2")
- Stata users should remove and reinstall their packages on della9
- Python environments created using pip will need to be reinstalled if certain packages were built from source (see the sketch at the end of this section)
- Python environments created using conda are likely to run without issue
- There is nothing to be done for MATLAB users

What if a code built from source on Della 8 seems to run successfully on Della 9? The code should still be recompiled on della9 or della-gpu since it will likely have better performance. This is especially true if the software compilation is optimized for the new AMD CPUs.
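To illustrate the pip guideline above, here is a minimal sketch of recreating a virtual environment on della9. The environment path and requirements file are placeholders for your own setup:

$ ssh <YourNetID>@della9.princeton.edu
$ rm -rf ~/envs/myenv                # discard the environment that was built on Della 8
$ python3 -m venv ~/envs/myenv       # recreate it on the RHEL 9 login node
$ source ~/envs/myenv/bin/activate
$ pip install -r requirements.txt    # packages with compiled extensions are rebuilt here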
Optimizing Code for AMD CPUs

The AMD nodes feature the EPYC 9654 processor with AVX-512 as the highest instruction set. See the Quick Reference Guide by AMD for compiler flags for the different compilers (AOCC, GCC, Intel) and the AOCC user guide.

Here is an example of compiling a serial C++ code using the AMD compilers and AMD math libraries:

$ ssh <YourNetID>@della9.princeton.edu
$ module load aocc/5.0.0       # AMD compilers (clang, clang++, flang)
$ module load aocl/aocc/5.0.0  # AMD math libraries
$ clang++ -Ofast -march=native -o mycode mycode.cpp

For a C code that was parallelized using MPI:

$ ssh <YourNetID>@della9.princeton.edu
$ module load aocc/5.0.0
$ module load aocl/aocc/5.0.0
$ module load openmpi/aocc-5.0.0/4.1.6  # Open MPI built with the AMD compilers
$ mpicc -Ofast -march=native -o hello_world_mpi hello_world_mpi.c

To see the location of the AMD math libraries, run the command below and look at LIBRARY_PATH:

$ module show aocl/aocc/5.0.0

Load the aocl module to make available the BLIS and libFLAME linear algebra libraries by AMD, as well as FFTW3 and ScaLAPACK.

Be aware of environment modules like those below, which are optimized for the AMD CPU nodes:

fftw/aocc-5.0.0/3.3.10
fftw/aocc-5.0.0/openmpi-4.1.6/3.3.10
hdf5/aocc-5.0.0/1.14.4
hdf5/aocc-5.0.0/openmpi-4.1.6/1.14.4
netcdf/aocc-5.0.0/hdf5-1.14.4/4.9.2
netcdf/aocc-5.0.0/hdf5-1.14.4/openmpi-4.1.6/4.9.2

What is the difference between aocl/aocc/5.0.0 and aocl/aocc/5.0.0-ILP? The term ILP64 denotes that integer, long, and pointer data entities all occupy 8 bytes, whereas with LP64, integer, long, and pointer data are 4, 8, and 8 bytes, respectively. Use the aocl/aocc/5.0.0-ILP module and its corresponding libraries when you are working with data structures (e.g., arrays) with integer values exceeding 2^31 = 2147483648. In general, users should prefer the aocl module without ILP64.

Intel Compilers and Libraries for AMD CPUs

While della9 and the new CPU nodes feature AMD CPUs, users can still compile their code using the Intel compilers and libraries. See an example of AMD versus Intel for the LAMMPS simulation code. Users can encounter errors if the Intel compilers include instructions that are not available on the AMD CPUs:

Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, MOVBE, POPCNT, AVX, F16C, FMA, BMI, LZCNT, AVX2, AVX512F, AVX512DQ, ADX, AVX512CD, AVX512BW, AVX512VL, AVX512VBMI, AVX512_VPOPCNTDQ, AVX512_BITALG, AVX512_VBMI2, AVX512_VNNI and SHSTK instructions.

GPU Codes

GPU nodes running the RHEL 9 operating system are available. Use the della-gpu.princeton.edu login node for rebuilding GPU codes.

Running on Specific CPUs

The CPU nodes of Della feature CPUs of different generations and vendors. This can at times explain why the execution time of your code varies. To see the CPU generation of each node, run the "shownodes" command and look at the FEATURES column. The newest CPU generation determines the node type.
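If your job needs to land on a particular CPU type, the node features can be requested with a Slurm constraint, in the same way the GPU-node constraints further below are used. A minimal sketch, assuming the values in the FEATURES column of "shownodes" are exposed as Slurm constraints; the feature name "cascade" and the executable are only illustrations:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00
#SBATCH --constraint=cascade   # illustrative feature name; use a value from the FEATURES column

srun ./mycode                  # replace with your executable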
Illegal Instruction Errors

Some of the compute nodes on Della support a lower instruction set than the login nodes. This means that if you optimize your compiled code for the login nodes, you may encounter "illegal instruction" errors when running on certain compute nodes.

CPU Jobs

If you encounter an error such as the following when running on the CPU nodes:

Illegal instruction: illegal operand
Illegal instruction (core dumped)
Please verify that both the operating system and the processor support Intel(R) AVX512F, AVX512DQ, AVX512CD, AVX512BW and AVX512VL instructions.

then your code was probably compiled with AVX-512 instructions and it landed on a compute node that does not support those instructions (AMD vs. Intel). Run the command "shistory -j" to see the node type of your recent jobs (single-node jobs only).

One solution is to rebuild the code while removing the optimization flags that add the AVX-512 instructions, such as "-xHost" and "-march=native".

GPU Jobs

The CPUs on della-gpu are AMD, supporting AVX2, while those on della8 are Intel, supporting AVX-512. The CPUs on the nodes with the 40 GB GPUs are AMD while the CPUs on the nodes with the 80 GB GPUs are Intel. If you compile your code on della8 with AVX-512 instructions and then run it on a GPU node with AMD CPUs, it will fail with:

Illegal instruction (core dumped)

There are two solutions to this problem. The simplest is to add the following line to your Slurm script so that you always land on the GPU nodes with Intel CPUs:

#SBATCH --constraint="intel&gpu80"

One downside to this approach is that your queue times for test jobs (less than 1 hour) may increase. The second solution is to recompile the code on the della-gpu login node.

OnDemand Jobs

Illegal instruction errors can also happen with OnDemand sessions. The solution is to choose the appropriate "Node type".

GPU Jobs

The login node for the GPU portion of Della is della-gpu.princeton.edu. Be aware of the following:

- All GPU jobs should be submitted from della-gpu, but della9 or other login nodes can also be used.
- The CPUs on della-gpu are AMD. If you are compiling from source then keep the CPU model in mind. Failure to do this can lead to "illegal instruction" errors.
- The 40 GB A100 GPUs are on nodes with AMD CPUs while the 80 GB GPUs are on nodes with Intel CPUs.
- The GPU nodes are for GPU jobs only. Do not run CPU jobs on the GPU nodes.

To connect to a login node:

$ ssh <YourNetID>@della-gpu.princeton.edu

Nodes with 10 GB of Memory per GPU

Della provides three GPU options: (1) a MIG GPU with 10 GB of memory, (2) an A100 GPU with 40 GB, and (3) an A100 GPU with 80 GB. A MIG GPU is essentially a small A100 GPU with about 1/7th the performance and memory of a full A100. MIG GPUs are ideal for interactive work and for codes that do not need a powerful GPU. The queue time for a MIG GPU is, on average, much less than that for an A100.

A job can use a MIG GPU when:

- Only a single GPU is needed
- Only a single CPU-core is needed
- The required CPU memory is less than 32 GB
- The required GPU memory is less than 10 GB

Please use a MIG GPU whenever possible.

For batch jobs, add the following "partition" directive to your Slurm script to allocate a MIG GPU:

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=mig

For interactive Slurm allocations, use the following:

$ salloc --nodes=1 --ntasks=1 --time=60:00 --gres=gpu:1 --partition=mig

In the command above, only the value of --time can be changed. All MIG jobs are assigned a CPU memory of 32 GB, and the GPU memory for MIG is always 10 GB. If your job exceeds either of these memory limits then it will fail.
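Putting these directives together, a complete MIG batch script might look like the following minimal sketch. The job name, module loads, and program are placeholders for your own:

#!/bin/bash
#SBATCH --job-name=mig-job        # illustrative job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=mig
#SBATCH --gres=gpu:1              # request the MIG GPU, as in the interactive example above
#SBATCH --time=00:30:00

module purge
# load the modules your code needs here (e.g., a Python or CUDA module)
python myscript.py                # placeholder for your program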
A MIG GPU can also be used for MyDella Jupyter notebooks as explained on the Jupyter page.

To see the number of available MIG GPUs, run the command below and look at the "FREE" column:

$ shownodes -p mig

Nodes with 80 GB of Memory per GPU

There are 69 nodes with 4 GPUs per node. Each GPU has 80 GB of memory. To explicitly run on these nodes, use this Slurm directive:

#SBATCH --constraint=gpu80

Each node has two sockets with two GPUs per socket. The GPUs on the same socket are connected via NVLink.

Nodes with 40 GB of Memory per GPU

There are 20 nodes with 2 GPUs per node. Each GPU has 40 GB of memory. There is no Slurm constraint to specifically run on these nodes. By requesting a GPU you will land on either the 40 or 80 GB GPUs. To run on the 80 GB GPUs see the Slurm constraint above.

The figure above shows the login node and 20 compute nodes. Each compute node has 2 AMD EPYC CPUs and 2 NVIDIA A100 GPUs. Each CPU has 64 CPU-cores. Users do not need to concern themselves with the details of the CPUs or the CPU memory. Simply choose up to 128 CPU-cores per node and up to 768 GB of memory per node in your Slurm script. The compute nodes are interconnected using FDR Infiniband making multinode jobs possible.

Each A100 GPU has 40 GB of memory and 6912 CUDA cores (FP32) across 108 Streaming Multiprocessors. Make sure you are using version 12.x of the CUDA Toolkit when possible. Not all codes will be able to take full advantage of these powerful accelerators. Please use a MIG GPU when that is the case. See the GPU Computing page to learn how to monitor GPU utilization using tools like "jobstats". See an example Slurm script for a GPU job.

More on GPU Jobs

To see your GPU utilization every 10 minutes over the last hour, run this command:

$ gpudash -u $USER

One can run "gpudash" without options to see the utilization across the entire cluster. You can also directly examine your GPU utilization in real time.

To see the number of available GPUs, run this command and look at the "FREE" column:

$ shownodes -p gpu,mig

PyTorch

PyTorch can take advantage of mixed-precision training on the A100 and H100 GPUs. Follow these directions to install PyTorch on della-gpu. Then see the docs on AMP. You should also try using a DataLoader with multiple CPU-cores. For more ways to optimize your PyTorch jobs see the PyTorch Performance Tuning Guide.

Compiling from Source

All GPU codes should be compiled on the della-gpu login node. When compiling, one should prefer the cudatoolkit/12.x modules. The A100 GPU has a compute capability of 8.0 while the H100 is 9.0. To compile a CUDA kernel for an A100, one might use the following commands:

$ module load cudatoolkit/12.8
$ nvcc -O3 -arch=sm_80 -o myapp myapp.cu

The compute capability should also be specified when building codes using CMake. For example, for LAMMPS and GROMACS one would use -DGPU_ARCH=sm_80 and -DGMX_CUDA_TARGET_SM=80 for an A100 GPU, respectively.

CUDA Multi-Process Service (MPS)

Certain codes that use GPUs may benefit from CUDA MPS (see ORNL docs), which enables multiple processes to concurrently share the resources on a single GPU. To use MPS simply add this directive to your Slurm script:

#SBATCH --gpu-mps

In most cases users will see no speed-up. Codes where the individual MPI processes underutilize the GPU should see a performance gain.
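As a minimal sketch, a Slurm script for an MPI code whose ranks share a single GPU through MPS might look like the following. The job name, task count, modules, and executable are placeholders:

#!/bin/bash
#SBATCH --job-name=mps-job        # illustrative job name
#SBATCH --nodes=1
#SBATCH --ntasks=4                # several MPI ranks will share the single GPU below
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --gpu-mps                 # enable the CUDA Multi-Process Service
#SBATCH --time=01:00:00

module purge
# load your compiler, MPI and CUDA modules here
srun ./myapp                      # placeholder for your MPI+GPU executable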
GPU Nodes for PLI

Members of Princeton Language and Intelligence (PLI) have exclusive access to 336 H100 SXM GPUs (42 nodes at 8 GPUs per node). The PLI portion of the Della cluster is designed for working with large AI models as described in this article. Each H100 GPU provides 80 GB of GPU memory and support for the FP8 numerical format. The GPUs within a node are connected in an all-to-all configuration with the high-speed NVLink interconnect. In addition to the standard network fabric, there is a dedicated Infiniband (NDR) network for internode GPU-GPU communication. There are 96 Intel CPU-cores and 1 TB of CPU memory per node. To see the availability of these nodes:

$ shownodes -p pli-c

You must be a member of PLI to run jobs on these nodes ([email protected]). To see if you are a member, run this command:

$ getent group pli

Core PLI Members

To run batch jobs, add the following directive to your Slurm script:

#SBATCH --partition=pli-c

For interactive jobs use, for example:

$ salloc --nodes=1 --ntasks=1 --mem=4G --time=01:01:00 --gres=gpu:1 --partition=pli-c --mail-type=begin

Do not run more than 2 interactive jobs simultaneously.

Campus and Large Campus PLI Members

To run batch jobs, add the following directives to your Slurm script:

#SBATCH --partition=pli
#SBATCH --account=<ACCOUNT>

For interactive jobs use, for example:

$ salloc --nodes=1 --ntasks=1 --mem=4G --time=01:01:00 --gres=gpu:1 --partition=pli --account=<ACCOUNT> --mail-type=begin

For large campus, use pli-lc instead of pli. Users are asked to not run more than 2 interactive jobs simultaneously.

Grace Hopper Superchip

There is one login node and one compute node for experimenting with the GH200 Grace Hopper Superchip. The main novelty of this hardware is the coherent memory between the CPU and GPU.

If you have an account on the Della cluster then run the command below to connect to the login node (remote development with VS Code is not possible at this time):

$ ssh <YourNetID>@della-gh.princeton.edu

The CPU is based on the ARM architecture, so you will need to recompile your code on the login node. Use the "module avail" command to see the available environment modules. To run on the compute node, add the following directive to your Slurm script:

#SBATCH --partition=grace

For interactive jobs use, for example:

$ salloc --nodes=1 --ntasks=1 --mem=4G --time=01:01:00 --gres=gpu:1 --partition=grace

Note that these nodes are provided on an experimental basis. If you encounter issues then please request support.

Globus

Research Computing has multiple Globus endpoints. An endpoint is known as a "Collection" in the Globus app. For Della /scratch/gpfs, use "Princeton Della /scratch/gpfs" as shown below (replace "aturing" with your NetID).

Running Software using the Previous Operating System

The operating system on most of the nodes of Della was upgraded from SDL 7 to SDL 8 in the winter of 2022. Users should reinstall or recompile their codes on the della8 login node. When this is not possible, we provide a compatibility tool for effectively running software under the old operating system (SDL 7). This involves prepending the command you want to run with /usr/licensed/bin/run7.
Below are a few examples:

$ /usr/licensed/bin/run7 cat /etc/os-release
$ /usr/licensed/bin/run7 bash
Singularity> source /home/aturing/env.sh
Singularity> solar -i 42 -d /scratch/gpfs/aturing/output

Visualization Nodes

The Della cluster has two dedicated nodes for visualization and post-processing tasks, called della-vis1 and della-vis2.

Hardware Details

The della-vis1 node features 80 CPU-cores, 1 TB of memory and an A100 GPU with 40 GB of memory. The della-vis2 node features 28 CPU-cores, 256 GB of memory and four P100 GPUs with 16 GB of memory per GPU. Both nodes have internet access.

How to Use the Visualization Nodes

Users can connect via SSH with one of the following commands (VPN required if connecting from off-campus):

$ ssh <YourNetID>@della-vis1.princeton.edu
$ ssh <YourNetID>@della-vis2.princeton.edu

To work with graphical applications on the visualization nodes, however, see our guide to working with visualizations and graphical user-interface (GUI) applications.

Note that there is no job scheduler on della-vis1 or della-vis2, so please be considerate of other users when using these resources. To ensure that the systems remain a shared resource, there are limits in place preventing one individual from using all of the resources. You can check your activity with the command "htop -u $USER".

In addition to visualization, the nodes can be used for tasks that are incompatible with the Slurm job scheduler, or for work that is not appropriate for the Della login nodes (such as downloading large amounts of data from the internet).

Running Third-Party Software

If you are running third-party software whose characteristics (e.g., memory usage) you are unfamiliar with, please check your job after 5-15 minutes using "top" or "ps -ef" on the compute nodes being used. If the memory usage is growing rapidly, or is close to exceeding the per-processor memory limit, you should terminate your job before it causes the system to hang or crash. You can determine which node(s) your job is running on using the "scontrol show job <jobnumber>" command.

Maintenance Window

Della will be down for routine maintenance on the second Tuesday of every month from approximately 6 AM to 2 PM. This includes the associated filesystems /scratch/gpfs and /projects. Please mark your calendar. Jobs submitted close to downtime will remain in the queue unless they can be scheduled to finish before downtime (see more). Users will receive an email when the cluster is returned to service.

Wording of Acknowledgement of Support and/or Use of Research Computing Resources

"The author(s) are pleased to acknowledge that the work reported on in this paper was substantially performed using the Princeton Research Computing resources at Princeton University, which is a consortium of groups led by the Princeton Institute for Computational Science and Engineering (PICSciE) and Research Computing."

"The simulations presented in this article were performed on computational resources managed and supported by Princeton Research Computing, a consortium of groups including the Princeton Institute for Computational Science and Engineering (PICSciE) and Research Computing at Princeton University."