- Hardware Resources
- Getting Started with GPUs and Slurm
- GPU Utilization Dashboard
- Measuring GPU Utilization in Real Time
- stats.rc.princeton.edu and jobstats
- CUDA Multi-Process Service (MPS)
- How to Improve Your GPU Utilization
- Common Mistakes
- How to Improve Your GPU Knowledge and Skills
- GPU Hackathon
- Getting Help
GPUs, or graphics processing units, were originally used to process data for computer displays. While they are still used for this purpose, beginning in around 2008, the capabilities of GPUs were extended to make them excellent hardware accelerators for scientific computing. GPUs are used alongside a CPU to make quick work of numerically-intensive operations.
In a scientific code, a GPU is used in tandem with a CPU. That is, the CPU executes the main program with the GPU being used at certain times to carry out specific functions. A CPU is always needed to run a code that uses a GPU.
While a CPU has ones or tens of processing cores, a GPU has thousands. In both cases the processing cores can be used in parallel. For certain workloads like image processing, training artificial neural networks and solving differential equations, a GPU-enabled code can vastly outperform a CPU code. Algorithms that require lots of logic such as "if" statements tend to perform better on the CPU.
Consider a simple code that reads in a matrix (or 2-dimensional array of numbers) from a file, computes the inverse of the matrix and then writes the inverse to a file. Such a code can of course be ran using only a CPU. However, if the matrix is large then the speed of the computation would be limited by the capacity of the CPU. A GPU can be used in this case to run the code faster.
There are typically three main steps required to execute a function (a.k.a. kernel) on a GPU in a scientific code: (1) copy the input data from the CPU memory to the GPU memory, (2) load and execute the GPU kernel on the GPU and (3) copy the results from the GPU memory to CPU memory. Returning to our example above, the matrix would be read by the CPU and then transferred to the GPU where the matrix inverse would be carried out. The result would then be copied back to the CPU where it could be written to a file. This example illustrates the idea that a GPU is an accelerator or a piece of auxiliary hardware that can be used in tandem with a CPU to quickly carry out a specific numerically-intensive task.
If a CPU has 32 cores and a GPU has 3456 cores then one may be tempted to think that a GPU will outperform a CPU for operations that require more than 32 instructions. This turns out to be incorrect for a few reasons. First, in order to use the GPU, as explained above the data must be copied from the CPU to the GPU and then later from the GPU to the CPU. These two transfers take time which decreases the overall performance. A good GPU programmer will try to minimize this penalty by overlapping computation on the CPU with the data transfers. Second, to get the maximum performance out of the GPU, one must overload the accelerator (GPU) with operations. It turns out that the threshold number of operations for doing this is an order of magnitude larger than the number of GPU cores. These two reasons explain why the breakeven point for an algorithm carried out on the CPU versus the GPU must be determined empirically.
As with the CPU, a GPU can perform calculations in single precision (32-bit) faster than in double precision (64-bit). Additionally, in recent years, manufacturers have incorporated specialized units on the GPU called Tensor Cores (NVIDIA) or Matrix Cores (AMD) which can be used to perform certain operations in less than single precision (e.g., half precision) yielding even greater performance. This is particularly beneficial to researchers training artificial neural networks or, more generally, cases where matrix-matrix multiplications and related operations dominate the computation time. Modern GPUs, with or without these specialized units, can be used in conjunction with a CPU to accelerate scientific codes.
The following table shows infomation about the GPUs across the clusters:
|Cluster||Number of Nodes||GPUs per Node||GPU Model||FP64 Performance
per GPU (TFLOPS)
|Memory per GPU (GB)|
The login nodes of della-gpu and traverse have a GPU while tigergpu does not. The tigergpu cluster is expected to be replaced with a new cluster in early 2022. Tigressdata and jupyter.rc each provide a P100 GPU. The visualization nodes of Stellar also have GPUs. The adroit-vis node offers four K80 GPUs each with 12 GB of memory.
To see the GPU utilization every 10 minutes over the last hour on della-gpu, tigergpu and traverse, use the following command:
To see only the nodes where your jobs are running:
$ gpudash -u $USER
Consider adding the following alias to your ~/.bashrc file:
alias mygpus='gpudash -u $USER'
To see the number of GPUs that are currently available (i.e., "FREE") on any GPU clulster:
$ shownodes -p gpu
To see how many GPU nodes are available:
$ sinfo -p gpu
In the output of the command above, "idle" means that all of the GPUs in the node are available, "mix" means that at least one of the GPUs is allocated and "alloc" means that all of the GPUs are allocated. Often times none of the GPU nodes are idle.
If you are finding that your queue times are longer than expected then read the job priority page for helpful information.
To see how effectively your job is using the GPU, first find the node where the job is running:
$ squeue -u $USER
The rightmost column labeled "NODELIST(REASON)" gives the name of the node where your job is running. If your job is queued instead of running then the node name is not available and you will need to wait. Once you have the node name then SSH to that node:
$ ssh tiger-iXXgYY
In the command above, you must replace XX and YY with the actual values (e.g., ssh tiger-i19g1). Once on the compute node, run watch -n 1 gpustat. This will show you a percentage value indicating how effectively your code is using the GPU. The temperature and amount of GPU memory being used are also available (e.g., 1813 / 16280 MB). You could also use nvidia-smi instead of gpustat. Press Ctrl+C to exit from watch. Use the exit command to leave the compute node and return to the head node.
Note that GPU utilization as measured using nvidia-smi is only a measure of the fraction of the time that a GPU kernel is running on the GPU. It says nothing about how many CUDA cores are being used or how efficiently the GPU kernels have been written. However, for codes used by large communities, one can generally associate GPU utilization with overall GPU efficiency. For a more accurate measure of GPU utilization, use Nsight Systems or Nsight Compute to measure the occupancy (see the "Profiling" section below).
One can view GPU metrics as a function of time for running and completed jobs via stats.rc as described on the Job Stats page. This includes GPU utilization and memory.
One can also use the jobstats command:
$ jobstats 40274323 ... GPU Memory utilization, per node(GPU) - maximum used/total della-i14g8(GPU#0): 10.3GB/40.0GB (25.8%) GPU Utilization, per node(GPU) - average in percents della-i14g8(GPU#0): 54.5/100
If you are using PyTorch or TensorFlow on A100 GPUs then consider profiling your code with dlprof by NVIDIA. dlprof provides suggestions on how to improve performance.
NVIDIA provides Nsight Systems for profiling GPU codes. It produces a timeline and can handle MPI but produces a different set of profiling data for each MPI process.
To look closely at the behavior of specific GPU kernels, NVIDIA provides Nsight Compute.
Certain MPI codes that use GPUs may benefit from CUDA MPS (see ORNL docs), which enables multiple processes to concurrently share the resources on a single GPU. This is available on della-gpu and traverse. To use MPS simply add this directive to your Slurm script:
In most cases users will see no speed-up. Codes where the individual MPI processes underutilize the GPU should see a performance gain.
Recall that there are typically three main steps to executing a function on a GPU in a scientific code: (1) copy the input data from the CPU memory to the GPU memory, (2) load and execute the GPU kernel on the GPU and (3) copy the results from the GPU memory to CPU memory. Effective GPU utilization requires minimizing data transfer between the CPU and GPU while at the same time maintaining a sufficiently high transfer rate to keep the GPU busy with intensive computations.
When the GPU is underutilized the reason is often that data is not being sent to it fast enough. In some cases this is due to hardware limitations such as slow interconnects while in others it is due to poorly written CPU code or users not taking advantage of the data loading/transfer functionality of their software.
If you are experiencing poor GPU utilization then try writing to the mailing list for your code and asking for ways to improve performance. In some cases just making a single change in your input file can lead to excellent performance. If you are running a deep learning code such as PyTorch or TensorFlow then try using the specialized classes and functions for loading data (PyTorch or TensorFlow). One can also try varying the batch size. These two changes can be sufficient to increase the data transfer rate and keep the GPU busy. Keep in mind that an NVIDIA P100 GPU has 16 GB memory. If you exceed this value then you will encounter a CUDA out of memory error which will cause the code to crash.
If you are unable to find a way to reach an acceptable level of GPU utilization then please move your work to the CPU clusters such as TigerCPU or Della.
The most common mistake is running a CPU-only code on a GPU node. Only codes that have been explicitly written to run on a GPU can take advantage of a GPU. Read the documentation for the code that you are using to see if it can use a GPU.
Another common mistake is to run a code that is written to work for a single GPU on multiple GPUs. TensorFlow, for example, will only take advantage of more than one GPU if your script is explicitly written to do so. Note that in all cases, whether your code actually used the GPU or not, your fairshare value will be reduced in proportion to the resources you requested in your Slurm script. This means that the priority of your next job will be decreased accordingly. Because of this, and to not waste resources, it is very important to make sure that you only request GPUs when you can efficiently utilize them.
Check out this comprehensive resources list.
- What is a GPU and how does it compare to a CPU?
- CUDA Toolkit
- Running a simple Slurm GPU job with Python, R, PyTorch, TensorFlow, MATLAB and Julia
- GPU tools for measuring utilization, code profiling and debugging
- Using the CUDA libraries
- Writing simple CUDA kernels
- CUDA-aware MPI, GPU Direct, CUDA Multi-Process Service, Intel oneAPI and Sycl
Princeton has held an annual GPU hackathon since the summer of 2019. This multi-day GPU hackathon hosted by Princeton Research Computing and sponsored by Oak Ridge National Laboratory (ORNL) and NVIDIA aims to reduce the barrier to entry. Participants will work alongside experienced mentors from industry and from various national laboratories to migrate their code to GPUs and/or optimize codes already running on GPUs.
The 2020 Princeton hackathon consisted of nine participating teams of 3-6 developers each, covering a range of disciplines. Access to computing clusters is provided for the duration of the hackathon.