GPU Computing

OUTLINE

 

Introduction

GPUs, or graphics processing units, were originally used to process data for computer displays. While they are still used for this purpose, beginning in around 2008, the capabilities of GPUs were extended to make them excellent hardware accelerators for scientific computing. GPUs are used alongside a CPU to make quick work of numerically-intensive operations.

In a scientific code, a GPU is used in tandem with a CPU. That is, the CPU executes the main program with the GPU being used at times to carry out specific functions. A CPU is always needed to run a code that uses a GPU.

While a CPU has ones or tens of processing cores, a GPU has thousands. In both cases the processing cores can be used in parallel. For certain workloads like image processing, training artificial neural networks and solving differential equations, a GPU-enabled code can vastly outperform a CPU code. Algorithms that require lots of logic such as "if" statements tend to perform better on the CPU.

Consider a simple code that reads in a matrix (or 2-dimensional array of numbers) from a file, computes the inverse of the matrix and then writes the inverse to a file. Such a code can of course be ran using only a CPU. However, if the matrix is large then the speed of the computation would be limited by the capacity of the CPU. A GPU can be used in this case to run the code faster.

CPU and GPU
In scientific computing, a GPU is used as an accelerator or a piece of auxiliary hardware that is used in tandem with a CPU to quickly carry out numerically-intensive operations.

There are typically three main steps required to execute a function (a.k.a. kernel) on a GPU in a scientific code: (1) copy the input data from the CPU memory to the GPU memory, (2) load and execute the GPU kernel on the GPU and (3) copy the results from the GPU memory to CPU memory. Returning to our example above, the matrix would be read by the CPU and then transferred to the GPU where the matrix inverse would be carried out. The result would then be copied back to the CPU where it could be written to a file. This example illustrates the idea that a GPU is an accelerator or a piece of auxiliary hardware that can be used in tandem with a CPU to quickly carry out a specific numerically-intensive task.

If a CPU has 32 cores and a GPU has 3456 cores then one may be tempted to think that a GPU will outperform a CPU for operations that require more than 32 instructions. This turns out to be incorrect for a few reasons. First, in order to use the GPU, as explained above the data must be copied from the CPU to the GPU and then later from the GPU to the CPU. These two transfers take time which decreases the overall performance. A good GPU programmer will try to minimize this penalty by overlapping computation on the CPU with the data transfers. Second, to get the maximum performance out of the GPU, one must overload the accelerator (GPU) with operations. It turns out that the threshold number of operations for doing this is an order of magnitude larger than the number of GPU cores. These two reasons explain why the breakeven point for an algorithm carried out on the CPU versus the GPU must be determined empirically.

As with the CPU, a GPU can perform calculations in single precision (32-bit) faster than in double precision (64-bit). Additionally, in recent years, manufacturers have incorporated specialized units on the GPU called Tensor Cores (NVIDIA) or Matrix Cores (AMD) which can be used to perform certain operations in less than single precision (e.g., half precision) yielding even greater performance. This is particularly beneficial to researchers training artificial neural networks or, more generally, cases where matrix-matrix multiplications and related operations dominate the computation time. Modern GPUs, with or without these specialized units, can be used in conjunction with a CPU to accelerate scientific codes.

Watch a YouTube video by "Mythbusters" illustrating the difference between a single-core CPU and a GPU.

 

Hardware Resources

The following table shows infomation about the GPUs across the clusters:

Cluster Number of Nodes GPUs per Node GPU Model FP64 Performance
per GPU (TFLOPS)
Memory per GPU (GB)
adroit 1 4 NVIDIA V100 7 32
adroit 1 4 NVIDIA A100 9.7 40
adroit 1 4 NVIDIA A100 9.7 80
della-gpu 4 8 NVIDIA H100 34 80
della-gpu 69 4 NVIDIA A100 9.7 80
della-gh 1 1 NVIDIA GH200 34 96
della-gpu 20 2 NVIDIA A100 9.7 40
della-gpu 2 28 NVIDIA A100 (MIG) 1.4 10
della-milan 1 2 AMD MI210 11.5 64
stellar 6 2 NVIDIA A100 9.7 40
traverse 46 4 NVIDIA V100 7.8 32

The login nodes of della-gpu and traverse have a GPU. Tigressdata and jupyter.rc each provide a P100 GPU. The visualization nodes of Stellar and Della also have GPUs. The adroit-vis node offers two A100 GPUs each with 80 GB of memory.

 

Getting Started with GPUs and Slurm

See our Intro to GPUs workshop for example GPU jobs in Python, R, PyTorch, TensorFlow, MATLAB and Julia. Additional information is found on our Slurm webpage.

 

GPU Utilization Dashboard

To see the GPU utilization every 10 minutes over the last hour on della-gpu and traverse, use the following command:

$ gpudash

To see only the nodes where your jobs are running:

$ gpudash -u $USER

Below is an example for the user u8434:

gpudash command

Consider adding the following alias to your ~/.bashrc file:

alias mygpus='gpudash -u $USER'

To see the number of GPUs that are currently available (i.e., "FREE") on any GPU clulster:

$ shownodes -p gpu

To see how many GPU nodes are available:

$ sinfo -p gpu

In the output of the command above, "idle" means that all of the GPUs in the node are available, "mix" means that at least one of the GPUs is allocated and "alloc" means that all of the GPUs are allocated. Often times none of the GPU nodes are idle.

If you are finding that your queue times are longer than expected then read the job priority page for helpful information.

 

Measuring GPU Utilization in Real Time

To see how effectively your job is using the GPU, first find the node where the job is running:

$ squeue -u $USER

The rightmost column labeled "NODELIST(REASON)" gives the name of the node where your job is running. If your job is queued instead of running then the node name is not available and you will need to wait. Once you have the node name then SSH to that node:

$ ssh della-iXXgYY

In the command above, you must replace XX and YY with the actual values (e.g., ssh della-l03g12). Once on the compute node, run watch -n 1 gpustat. This will show you a percentage value indicating how effectively your code is using the GPU. The temperature and amount of GPU memory being used are also available (e.g., 1813 / 16280 MB). You could also use nvidia-smi instead of gpustat. Press Ctrl+C to exit from watch. Use the exit command to leave the compute node and return to the head node.

Note that GPU utilization as measured using nvidia-smi is only a measure of the fraction of the time that a GPU kernel is running on the GPU. It says nothing about how many CUDA cores are being used or how efficiently the GPU kernels have been written. However, for codes used by large communities, one can generally associate GPU utilization with overall GPU efficiency. For a more accurate measure of GPU utilization, use Nsight Systems or Nsight Compute to measure the occupancy (see the "Profiling" section below).

 

GPU Job Statistics

Run the "jobstats" command on a given JobID to see GPU job summary information for running and completed jobs:

$ jobstats 1234567

================================================================================
                              Slurm Job Statistics
================================================================================
         Job ID: 1234567
  NetID/Account: aturing/math
       Job Name: sys_logic_ordinals
          State: COMPLETED
          Nodes: 2
      CPU Cores: 48
     CPU Memory: 256GB (5.3GB per CPU-core)
           GPUs: 4
  QOS/Partition: della-gpu/gpu
        Cluster: della
     Start Time: Fri Mar 4, 2022 at 1:56 AM
       Run Time: 18:41:56
     Time Limit: 4-00:00:00

                              Overall Utilization
================================================================================
  CPU utilization  [|||||                                          10%]
  CPU memory usage [|||                                             6%]
  GPU utilization  [||||||||||||||||||||||||||||||||||             68%]
  GPU memory usage [|||||||||||||||||||||||||||||||||              66%]

                              Detailed Utilization
================================================================================
  CPU utilization per node (CPU time used/run time)
      della-i14g2: 1-21:41:20/18-16:46:24 (efficiency=10.2%)
      della-i14g3: 1-18:48:55/18-16:46:24 (efficiency=9.5%)
  Total used/runtime: 3-16:30:16/37-09:32:48, efficiency=9.9%

  CPU memory usage per node - used/allocated
      della-i14g2: 7.9GB/128.0GB (335.5MB/5.3GB per core of 24)
      della-i14g3: 7.8GB/128.0GB (334.6MB/5.3GB per core of 24)
  Total used/allocated: 15.7GB/256.0GB (335.1MB/5.3GB per core of 48)

  GPU utilization per node
      della-i14g2 (GPU 0): 65.7%
      della-i14g2 (GPU 1): 64.5%
      della-i14g3 (GPU 0): 72.9%
      della-i14g3 (GPU 1): 67.5%

  GPU memory usage per node - maximum used/total
      della-i14g2 (GPU 0): 26.5GB/40.0GB (66.2%)
      della-i14g2 (GPU 1): 26.5GB/40.0GB (66.2%)
      della-i14g3 (GPU 0): 26.5GB/40.0GB (66.2%)
      della-i14g3 (GPU 1): 26.5GB/40.0GB (66.2%)

                                     Notes
================================================================================
  * For additional job metrics including metrics plotted against time:
    https://mydella.princeton.edu/pun/sys/jobstats  (VPN required off-campus)

One can see various GPU metrics as a function of time (e.g., utilization and memory) for running and completed jobs via the Jobstats web interface.

Want to use the Jobstats job monitoring platform at your institution? See the Jobstats GitHub repo.

 

Profiling

The starting point for profiling a Python code that uses a GPU (this includes PyTorch and TensorFlow) is to use line_profiler (see demonstration webpage or video).

NVIDIA provides Nsight Systems for profiling GPU codes. It produces a timeline and can handle MPI but produces a different set of profiling data for each MPI process.

To look closely at the behavior of specific GPU kernels, NVIDIA provides Nsight Compute.

It is recommended to use the full path for nsys. This will avoid potential conflict with older versions which can become available when loading a cudatoolkit module. For ncu, load an appropriate cudatoolkit module (preferred the latest versions).

$ /usr/local/bin/nsys --version
NVIDIA Nsight Systems version 2023.2.3.1004-33186433v0
$ module load cudatoolkit/12.2
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2023.2.2.0 (build 33188574) (public-release)

The corresponding GUIs for these tools are not found the the compute nodes. You will need to use the login or visualization nodes. Or download the file to your local machine. Consider using either mydella or mystellar to start a graphical desktop (VPN required from off-campus). To start a graphical desktop, choose "Interactive Apps" then "Desktop of Della/Stellar Vis Nodes". Once the session starts, click on the black terminal icon and then run either "nsys-ui" or "ncu-ui". See example Slurm scripts and more.

 

CUDA Multi-Process Service (MPS)

Certain MPI codes that use GPUs may benefit from CUDA MPS (see ORNL docs), which enables multiple processes to concurrently share the resources on a single GPU. This is available on della-gpu and traverse. To use MPS simply add this directive to your Slurm script:

#SBATCH --gpu-mps

In most cases users will see no speed-up. Codes where the individual MPI processes underutilize the GPU should see a performance gain.

 

How to Improve Your GPU Utilization

Recall that there are typically three main steps to executing a function on a GPU in a scientific code: (1) copy the input data from the CPU memory to the GPU memory, (2) load and execute the GPU kernel on the GPU and (3) copy the results from the GPU memory to CPU memory. Effective GPU utilization requires minimizing data transfer between the CPU and GPU while at the same time maintaining a sufficiently high transfer rate to keep the GPU busy with intensive computations. The algorithm running on the GPU must also be amenable to the GPU.

When the GPU is underutilized the reason is often that data is not being sent to it fast enough. In some cases this is due to hardware limitations such as slow interconnects while in others it is due to poorly written CPU code or users not taking advantage of the data loading/transfer functionality of their software.

If you are experiencing poor GPU utilization then try writing to the mailing list for your code and asking for ways to improve performance. In some cases just making a single change in your input file can lead to excellent performance. If you are running a deep learning code such as PyTorch or TensorFlow then try using the specialized classes and functions for loading data (PyTorch or TensorFlow). One can also try varying the batch size if this does not affect the overall performance of the model (e.g., accuracy or RMSE). These two changes can be sufficient to increase the data transfer rate and keep the GPU busy. Keep in mind that an NVIDIA A100 GPU has either 40 or 80 GB memory. If you exceed this value then you will encounter a CUDA out of memory error which will cause the code to crash.

If you are unable to find a way to reach an acceptable level of GPU utilization then please run your jobs on CPU nodes.

 

Zero GPU Utilization

Below are three common reasons why a user may encounter 0% GPU utilization:

  1. Is your code GPU-enabled? Only codes that have been explicitly written to use GPUs can take advantage of them. Please consult the documentation for your software. If your code is not GPU-enabled then please remove the --gres Slurm directive when submitting jobs.
  2. Make sure your software environment is properly configured. In some cases certain libraries must be available for your code to run on GPUs. The solution can be to load an environment module or to install a specific software dependency. If your code uses CUDA then CUDA Toolkit 11 or higher should be used on Della. Please check your software environment against the installation directions for your code.
  3. Please do not create salloc sessions for long periods of time. For example, allocating a GPU for 24 hours is wasteful unless you plan to work intensively during the entire period. For interactive work, please consider using the MIG GPUs.

 

Low GPU Utilization: Potential Solutions

If you encounter low GPU utilization (e.g., less than 15%) then please investigate the reasons for the low utilization. Common reasons include:

  1. Misconfigured application scripts. Be sure to read the documentation of the software to make sure that you are using it properly. This includes creating the appropriate software environment.
  2. Using an A100 GPU when a MIG GPU would be sufficient. Some codes do not have enough work to keep an A100 GPU busy. If you encounter this on the Della cluster then consider using a MIG GPU.
  3. Training deep learning models while only using a single CPU-core. Codes such as PyTorch and TensorFlow show performance benefits when multiple CPU-cores are used for the data loading.
  4. Using too many GPUs for a job. You can find the optimal number of GPUs and CPU-cores by performing a scaling analysis.
  5. Writing job output to the /tigress or /projects storage systems. Actively running jobs should be writing output files to /scratch/gpfs which is a much faster filesystem. See Data Storage for more.

 

Common Mistakes

The most common mistake is running a CPU-only code on a GPU node. Only codes that have been explicitly written to run on a GPU can take advantage of a GPU. Read the documentation for the code that you are using to see if it can use a GPU.

Another common mistake is to run a code that is written to work for a single GPU on multiple GPUs. TensorFlow, for example, will only take advantage of more than one GPU if your script is explicitly written to do so. Note that in all cases, whether your code actually used the GPU or not, your fairshare value will be reduced in proportion to the resources you requested in your Slurm script. This means that the priority of your next job will be decreased accordingly. Because of this, and to not waste resources, it is very important to make sure that you only request GPUs when you can efficiently utilize them.

 

How to Improve Your GPU Knowledge and Skills

Check out this comprehensive resources list.

A good starting point for getting better with GPUs at Princeton is the Intro to GPU Programming workshop material as well as the A100 GPU workshop. These workshops cover the following:

  • What is a GPU and how does it compare to a CPU?
  • CUDA Toolkit
  • Running a simple Slurm GPU job with Python, R, PyTorch, TensorFlow, MATLAB and Julia
  • GPU tools for measuring utilization, code profiling and debugging
  • Using the CUDA libraries
  • OpenACC
  • Writing simple CUDA kernels
  • CUDA-aware MPI, GPU Direct, CUDA Multi-Process Service, Intel oneAPI and Sycl

 

GPU Hackathon

Princeton has held an annual GPU hackathon since the summer of 2019. This multi-day GPU hackathon hosted by Princeton Research Computing and sponsored by Oak Ridge National Laboratory (ORNL) and NVIDIA aims to reduce the barrier to entry. Participants will work alongside experienced mentors from industry and from various national laboratories to migrate their code to GPUs and/or optimize codes already running on GPUs.

See the website for the 2024 Princeton Open Hackathon or read an article about the 2023 event.

 

Getting Help

For help with GPU computing please send an email to [email protected] or attend a help session.