GPU Computing

Introduction

Graphics processing units (GPUs) were originally used to process data for computer displays. While they are still used for this purpose, beginning around 2008 their capabilities were extended to make them excellent hardware accelerators for scientific computing.

In a scientific code, a GPU is used in tandem with a CPU. That is, the main line of execution is carried out on the CPU with the GPU being used at certain times to execute specific functions. A CPU is always needed to run a code that uses a GPU.

While a CPU has tens of processing cores, a GPU has thousands. In both cases the processing cores can be used in parallel. For certain numerical operations such as linear algebra and grid-based methods, the GPU vastly outperforms the CPU. Algorithms that require lots of logic such as "if" statements tend to perform better on the CPU.

Consider a code that reads in two matrices from a file, performs a matrix-matrix multiplication and then prints the output to the display. Such a code can of course be run using only a CPU. However, if the matrices are large then the speed of the computation will be limited by the small number of cores in the CPU. A GPU can be used in this case to run the code faster.

There are typically three main steps to executing a function (a.k.a. kernel) on a GPU in a scientific code: (1) copy the input data from the CPU memory to the GPU memory, (2) load and execute the GPU kernel on the GPU and (3) copy the results from the GPU memory to CPU memory. Returning to our example above, the matrices would be read by the CPU and then transferred to the GPU where the matrix-matrix multiplication would be carried out. The result would then be copied back to the CPU where it could be sent to the terminal. This example illustrates the idea that a GPU is an accelerator or a piece of auxiliary hardware that can be used in tandem with a CPU to quickly carry out a specific numerically-intensive task.

If a CPU has 32 cores and a GPU has 3456 cores then one may be tempted to think that a GPU will outperform a CPU for operations that require more than 32 instructions. This turns out to be incorrect for multiple reasons. First, as explained above, in order to use the GPU the data must be copied from the CPU to the GPU and then later from the GPU to the CPU. These two transfers take time, which hurts the overall performance. A good GPU programmer will try to minimize this penalty by overlapping computation on the CPU with the data transfers. Second, to get the maximum performance out of the GPU, one must saturate the accelerator with enqueued operations. It turns out that the threshold number of operations for doing this is an order of magnitude larger than the number of execution units. These two reasons explain why the breakeven point for an algorithm performed on the CPU versus the GPU must be determined empirically.
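To illustrate why the data transfers shift the breakeven point, here is a minimal Python sketch of a cost model for an n x n matrix-matrix multiplication. Everything in it is an assumption made for illustration: the function names, the model itself (peak arithmetic rate plus a fixed-bandwidth transfer), the CPU rate of 1 TFLOPS and the link bandwidth of 16 GB/s. Only the 9.7 TFLOPS figure is taken from the A100 entry in the hardware table on this page. The model ignores kernel launch overhead and the saturation effect described above, so it is no substitute for measuring the breakeven point on real hardware.

```python
def matmul_times(n, cpu_tflops, gpu_tflops, link_gb_per_s, bytes_per_value=8):
    """Return (cpu_seconds, gpu_seconds) for an n x n matrix-matrix product.

    Simplified model: both devices run at their peak rate, and the GPU
    additionally pays for moving two input matrices in and one result out.
    """
    flops = 2.0 * n**3                            # multiplies and adds
    transfer_bytes = 3 * n * n * bytes_per_value  # 2 inputs + 1 output
    cpu_seconds = flops / (cpu_tflops * 1e12)
    gpu_seconds = (flops / (gpu_tflops * 1e12)
                   + transfer_bytes / (link_gb_per_s * 1e9))
    return cpu_seconds, gpu_seconds


def breakeven_size(cpu_tflops, gpu_tflops, link_gb_per_s, max_n=100_000):
    """Return the smallest n for which the modeled GPU time beats the CPU time.

    Returns None if the GPU never wins for n <= max_n.
    """
    for n in range(1, max_n + 1):
        cpu_s, gpu_s = matmul_times(n, cpu_tflops, gpu_tflops, link_gb_per_s)
        if gpu_s < cpu_s:
            return n
    return None


# Assumed values: 1 TFLOPS CPU, 9.7 TFLOPS GPU (FP64), 16 GB/s CPU-GPU link.
n_star = breakeven_size(cpu_tflops=1.0, gpu_tflops=9.7, link_gb_per_s=16.0)
print("modeled breakeven matrix size:", n_star)
```

Even in this idealized model the breakeven size is far larger than the number of CPU cores, because below it the transfer time dominates the time saved on arithmetic.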

As with the CPU, a GPU can perform calculations in single precision faster than in double precision. Additionally, specialized units on the GPU called Tensor Cores (NVIDIA) or Matrix Cores (AMD) can be used to perform certain operations in less than single precision (e.g., half precision) yielding even greater performance. This is particularly beneficial to researchers training artificial neural networks.
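The speed gained from lower precision comes at the cost of accuracy. The following Python sketch shows that cost on the CPU by storing a value in IEEE 754 half, single and double precision using the standard struct module and comparing the rounding errors. It does not run on a GPU and says nothing about speed; the helper names are hypothetical.

```python
import struct

# struct format codes for IEEE 754 half, single and double precision.
FORMATS = {"half": "e", "single": "f", "double": "d"}


def round_trip(value, fmt):
    """Store value in the given format and read it back as a Python float."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def rounding_errors(value):
    """Return the absolute rounding error of value in each precision."""
    return {name: abs(round_trip(value, fmt) - value)
            for name, fmt in FORMATS.items()}


errors = rounding_errors(0.1)
for name, err in errors.items():
    print(f"{name:>6}: absolute error {err:.3e}")
```

Whether the larger error of half precision is acceptable depends on the application; neural network training often tolerates it, while many scientific codes do not.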

 

Hardware Resources

The following table shows information about the GPUs across the clusters:

Cluster      Number of Nodes  GPUs per Node  GPU Model    FP64 Performance per GPU (TFLOPS)  Memory per GPU (GB)
adroit       1                4              NVIDIA V100  7                                  32
adroit       1                4              NVIDIA A100  9.7                                40
della-gpu    20               2              NVIDIA A100  9.7                                40
della-milan  1                1              AMD MI100    11.5                               32
stellar      6                2              NVIDIA A100  9.7                                40
tigergpu     80               4              NVIDIA P100  4.7                                16
traverse     46               4              NVIDIA V100  7.8                                32

The login nodes of della-gpu and traverse each have a GPU while the login node of tigergpu does not. The tigergpu cluster is expected to be replaced with a new cluster in early 2022. Tigressdata and jupyter.rc each provide a P100 GPU. The visualization nodes of Stellar also have GPUs.

 

Getting Started with GPUs and Slurm

See our Intro to GPUs workshop for example GPU jobs with Python, R, PyTorch, TensorFlow, MATLAB and Julia. Additional information is found on our Slurm webpage.

 

GPU Utilization Dashboard

To see the GPU utilization every 10 minutes over the last hour on della-gpu, tigergpu and traverse, use the following command:

$ gpudash

To see only the nodes where your jobs are running:

$ gpudash -u $USER

Consider adding the following alias to your ~/.bashrc file:

alias mygpus='gpudash -u $USER'

To see the number of GPUs that are currently available (i.e., "FREE") on any GPU cluster:

$ shownodes -p gpu

To see how many GPU nodes are available:

$ sinfo -p gpu

In the output of the command above, "idle" means that all of the GPUs in the node are available, "mix" means that at least one of the GPUs is allocated and "alloc" means that all of the GPUs are allocated. Often none of the GPU nodes are idle.
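If you want to count nodes by state in a script, one option is to ask sinfo for just the compact state and the node count, for example with sinfo -p gpu -h -o "%t %D", and tally the result. The Python sketch below assumes that output format (check man sinfo on your cluster before relying on it). The function name is hypothetical and the sample text is made up for illustration; it is not real cluster output.

```python
import string
from collections import Counter


def tally_node_states(sinfo_text):
    """Sum node counts by state from lines of the form '<state> <count>'."""
    totals = Counter()
    for line in sinfo_text.splitlines():
        fields = line.split()
        if len(fields) != 2 or not fields[1].isdigit():
            continue  # skip blank or unexpected lines
        # Drop trailing flag characters such as '*' (node not responding).
        state = fields[0].rstrip(string.punctuation)
        totals[state] += int(fields[1])
    return dict(totals)


# Made-up sample text in the assumed "%t %D" format.
sample = """\
mix 14
alloc 5
idle 1
"""
print(tally_node_states(sample))
```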

If you are finding that your queue times are longer than expected then read the job priority page for helpful information.

 

Measuring GPU Utilization in Real Time

To see how effectively your job is using the GPU, first find the node where the job is running:

$ squeue -u $USER

The rightmost column labeled "NODELIST(REASON)" gives the name of the node where your job is running. If your job is queued instead of running then the node name is not available and you will need to wait. Once you have the node name then SSH to that node:

$ ssh tiger-iXXgYY

In the command above, you must replace XX and YY with the actual values (e.g., ssh tiger-i19g1). Once on the compute node, run watch -n 1 gpustat. This will show you a percentage value indicating how effectively your code is using the GPU. The temperature and amount of GPU memory being used are also available (e.g., 1813 / 16280 MB). You could also use nvidia-smi instead of gpustat. Press Ctrl+C to exit from watch. Use the exit command to leave the compute node and return to the head node.
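To record these numbers from a script instead of watching them, nvidia-smi can print selected fields as CSV. The Python sketch below is one way to parse that output, assuming the query shown in QUERY; the function names and the sample line are hypothetical (the memory values reuse the example above and the utilization value is invented). The parser assumes every field is an integer, so it will fail on GPUs that report a field as not available.

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_query(text):
    """Parse lines of 'utilization, memory used, memory total' into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        gpus.append({"util_percent": util,
                     "mem_used": used,     # in the units nvidia-smi reports
                     "mem_total": total})
    return gpus


def query_gpus():
    """Run nvidia-smi on the current node (needs an NVIDIA GPU and driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_query(result.stdout)


# Invented sample line; on a compute node call query_gpus() instead.
sample = "87, 1813, 16280\n"
print(parse_gpu_query(sample))
```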

 

stats.rc.princeton.edu

One can view GPU metrics as a function of time for running and completed jobs via stats.rc as described on the Job Stats page. This includes GPU utilization, memory, temperature and power.

 

Profiling

The starting point for profiling a Python code that uses a GPU (this includes PyTorch and TensorFlow) is to use line_profiler (see demonstration webpage or video).
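As a minimal sketch of how line_profiler is typically used: decorate the functions of interest with @profile and run the script with kernprof -l -v script.py, which supplies the decorator and prints per-line timings. The fallback below lets the same script also run without kernprof. The function normalize is a hypothetical stand-in for the data-preparation or training-step functions of a real GPU script.

```python
try:
    profile  # supplied by kernprof when the script is run with "kernprof -l"
except NameError:
    def profile(func):
        """No-op fallback so the script also runs without kernprof."""
        return func


@profile
def normalize(values):
    """Scale values so that they sum to one."""
    total = sum(values)
    scaled = [v / total for v in values]
    return scaled


result = normalize([1.0, 2.0, 5.0])
print(result)
```

The per-line timings show whether the time goes to the GPU calls or to CPU work around them, which indicates where to look next.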

If you are using PyTorch or TensorFlow on A100 GPUs then consider profiling your code with dlprof by NVIDIA. dlprof provides suggestions on how to improve performance.

NVIDIA provides Nsight Systems for profiling GPU codes. It produces a timeline and can handle MPI but produces a different set of profiling data for each MPI process.

To look closely at the behavior of specific GPU kernels, NVIDIA provides Nsight Compute.

 

CUDA Multi-Process Service (MPS)

Certain MPI codes that use GPUs may benefit from CUDA MPS (see ORNL docs), which enables multiple processes to concurrently share the resources on a single GPU. This is available on della-gpu and traverse. To use MPS simply add this directive to your Slurm script:

#SBATCH --gpu-mps

In most cases users will see no speed-up. Codes where the individual MPI processes underutilize the GPU should see a performance gain.

 

How to Improve Your GPU Utilization

Recall that there are typically three main steps to executing a function on a GPU in a scientific code: (1) copy the input data from the CPU memory to the GPU memory, (2) load and execute the GPU kernel on the GPU and (3) copy the results from the GPU memory to CPU memory. Effective GPU utilization requires minimizing data transfer between the CPU and GPU while at the same time maintaining a sufficiently high transfer rate to keep the GPU busy with intensive computations.

When the GPU is underutilized the reason is often that data is not being sent to it fast enough. In some cases this is due to hardware limitations such as slow interconnects while in others it is due to poorly written CPU code or users not taking advantage of the data loading/transfer functionality of their software.

If you are experiencing poor GPU utilization then try writing to the mailing list for your code and asking for ways to improve performance. In some cases just making a single change in your input file can lead to excellent performance. If you are running a deep learning code such as PyTorch or TensorFlow then try using the specialized classes and functions for loading data (PyTorch or TensorFlow). One can also try varying the batch size. These two changes can be sufficient to increase the data transfer rate and keep the GPU busy. Keep in mind that an NVIDIA P100 GPU has 16 GB memory. If you exceed this value then you will encounter a CUDA out of memory error which will cause the code to crash.
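The data loading classes mentioned above keep the GPU busy largely by preparing upcoming batches while the current one is being processed. The following Python sketch illustrates that prefetching idea with a background thread and a bounded queue. It is not the PyTorch or TensorFlow API; all names are hypothetical, and the list sums stand in for GPU work.

```python
import queue
import threading


def prefetch(batches, depth=2):
    """Yield items from batches, loading up to depth of them ahead of use."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the data

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()   # blocks only if loading has fallen behind
        if batch is done:
            return
        yield batch


def load_batches(n_batches, batch_size):
    """Stand-in for reading and preprocessing data on the CPU."""
    for i in range(n_batches):
        yield [i * batch_size + j for j in range(batch_size)]


# The consumer loop stands in for the GPU computation on each batch.
results = [sum(batch) for batch in prefetch(load_batches(4, 3))]
print(results)
```

In a real framework the batch size and the number of loader workers play the roles of batch_size and depth here, which is why varying them can change GPU utilization.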

If you are unable to find a way to reach an acceptable level of GPU utilization then please move your work to the CPU clusters such as TigerCPU or Della.

 

Common Mistakes

The most common mistake is running a CPU-only code on a GPU node. Only codes that have been explicitly written to run on a GPU can take advantage of a GPU. Read the documentation for the code that you are using to see if it can use a GPU.

Another common mistake is to run a code that is written to work for a single GPU on multiple GPUs. TensorFlow, for example, will only take advantage of more than one GPU if your script is explicitly written to do so. Note that in all cases, whether your code actually used the GPU or not, your fairshare value will be reduced in proportion to the resources you requested in your Slurm script. This means that the priority of your next job will be decreased accordingly. Because of this, and to not waste resources, it is very important to make sure that you only request GPUs when you can efficiently utilize them.
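A quick sanity check at the top of a job script is to compare the number of GPUs the job can see with the number the code is written to use. The Python sketch below assumes that the scheduler exposes the allocated GPUs through the CUDA_VISIBLE_DEVICES environment variable, which is common for Slurm GPU jobs but should be verified on your cluster. The function name is hypothetical.

```python
import os


def visible_gpu_count(env=None):
    """Return the number of GPUs listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is not set, since the count is then
    unknown from the environment alone.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([device for device in value.split(",") if device.strip()])


count = visible_gpu_count()
if count is None:
    print("CUDA_VISIBLE_DEVICES is not set; cannot tell how many GPUs are visible")
elif count > 1:
    print(f"{count} GPUs visible; make sure the code is written to use all of them")
else:
    print(f"{count} GPU(s) visible")
```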

 

How to Improve Your GPU Knowledge and Skills

A good starting point for getting better with GPUs is the Intro to GPU Programming workshop material and the links therein. This workshop covers the following:

  • What is a GPU and how does it compare to a CPU?
  • CUDA Toolkit
  • Running a simple Slurm GPU job with Python, R, PyTorch, TensorFlow, MATLAB and Julia
  • GPU tools for measuring utilization, code profiling and debugging
  • Using the CUDA libraries
  • OpenACC
  • Writing simple CUDA kernels
  • CUDA-aware MPI, GPU Direct, CUDA Multi-Process Service, Intel oneAPI and SYCL

 

GPU Hackathon

Princeton has held an annual GPU hackathon since the summer of 2019. This multi-day event, hosted by Princeton Research Computing and sponsored by Oak Ridge National Laboratory (ORNL) and NVIDIA, aims to lower the barrier to entry for GPU programming. Participants work alongside experienced mentors from industry and from various national laboratories to migrate their codes to GPUs and/or optimize codes already running on GPUs.

The 2020 Princeton hackathon consisted of nine participating teams of 3-6 developers each, covering a range of disciplines. Access to computing clusters is provided for the duration of the hackathon.

 

Getting Help

For help with GPU computing please send an email to cses@princeton.edu or attend a help session.