PyTorch on the HPC Clusters

 

Installation

PyTorch is a popular deep learning library for training artificial neural networks. The installation procedure depends on the cluster. If you are new to installing Python packages, see our Python page before continuing. Before installing, make sure you have approximately 3 GB of free space in /home/<YourNetID> by running the checkquota command.

Della (GPU)

The GPU nodes on Della feature the NVIDIA A100 GPU, which benefits from the most recent version of the cudatoolkit package. The procedure below requires 9.3 GB of space, so see the bottom of the checkquota page for tips on dealing with large Conda environments:

$ ssh <YourNetID>@della-gpu.princeton.edu
$ module load anaconda3/2020.11
$ conda create --name torch-env pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
$ conda activate torch-env

Be sure to include conda activate torch-env in your Slurm script. Instead of installing via conda, one could also use the latest container from NVIDIA. See the docs on AMP for doing mixed-precision training with the A100. For more ways to optimize your PyTorch jobs see "PyTorch Performance Tuning Guide" from GTC 2021.
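
To verify the installation, a quick check such as the one below can be run (a minimal sketch; torch.cuda.is_available() only returns True when executed on a node with a GPU, e.g., inside a GPU job or an interactive session):

import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # True when a GPU is visible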

TigerGPU or Adroit (GPU)

The procedure below requires 7.1 GB of space:

$ module load anaconda3/2020.11
$ conda create --name torch-env pytorch torchvision torchaudio cudatoolkit=10.2 --channel pytorch
$ conda activate torch-env

Or maybe you want a few additional packages like matplotlib and tensorboard:

$ module load anaconda3/2020.11
$ conda create --name torch-env pytorch torchvision torchaudio cudatoolkit=10.2 matplotlib tensorboard --channel pytorch
$ conda activate torch-env

Be sure to include conda activate torch-env in your Slurm script.

Traverse

LATEST RELEASE

$ module load anaconda3/2020.11
$ CHNL="https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda"
$ conda create --name torch-env --channel ${CHNL} pytorch torchvision
$ conda activate torch-env
# accept the license agreement if asked

Be sure to include conda activate torch-env and #SBATCH --gpus-per-node=1 in your Slurm script.

EARLY ACCESS VERSION

If you would benefit from a newer version of PyTorch in pre-release form then use the early access channel instead: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access

TigerCPU, Della, Stellar or Adroit (CPU)

$ module load anaconda3/2020.11
$ conda create --name torch-env pytorch torchvision torchaudio cpuonly --channel pytorch
$ conda activate torch-env

Be sure to include conda activate torch-env in your Slurm script.

In addition to Anaconda, Intel offers a version of PyTorch that has been optimized for Intel hardware as part of their AI Analytics Toolkit.

 

Example GPU Job

The example below shows how to run a simple PyTorch script on one of the clusters. We will train a simple CNN on the MNIST data set. Begin by connecting to a head node on one of the clusters. Then clone the repo:

# ssh to a cluster
$ cd /scratch/gpfs/<YourNetID>  # or /scratch/network/<YourNetID> on Adroit
$ git clone https://github.com/PrincetonUniversity/install_pytorch.git
$ cd install_pytorch

This will create a folder called install_pytorch which contains the files needed to run this example. The compute nodes do not have internet access so we must obtain the data while on the head node:

$ python download_mnist.py

Inspect the PyTorch script called mnist_classify.py. Use a text editor like vim or emacs to enter your email address in job.slurm or delete the four lines concerned with email. Submit the job to the batch scheduler:

$ sbatch job.slurm

The Slurm script used for the job is below:

#!/bin/bash
#SBATCH --job-name=torch-test    # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G                 # total memory per node (4 GB per cpu-core is default)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send mail when job begins
#SBATCH --mail-type=end          # send mail when job ends
#SBATCH --mail-type=fail         # send mail if job fails
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.11
conda activate torch-env

python mnist_classify.py --epochs=3

You can monitor the status of the job with squeue -u $USER. Once the job runs, you'll have a slurm-xxxxx.out file in the install_pytorch directory. This log file contains both PyTorch and Slurm output.

 

Data Loading using Multiple CPU-cores

Watch this video on our YouTube channel for a demonstration.

Even when using a GPU, some operations are still carried out on the CPU. Some of these operations, such as data loading, have been written to take advantage of multiple CPU-cores. Try different values for --cpus-per-task in combination with a DataLoader to see if you get a speed-up:

#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=<T>      # cpu-cores per task (>1 if multi-threaded tasks)

On each TigerGPU node there is a 7:1 ratio of CPU-cores to GPUs. Try a set of runs where you vary <T> from 1 to 7 to find the optimal value. Almost all PyTorch scripts show a significant performance improvement when using a DataLoader. In that case, try setting num_workers equal to <T>. For the MNIST example above, with <T> equal to 4 and num_workers=4, there is a significant speed-up. Watch this video to learn about writing a custom DataLoader.
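
The sketch below shows what this looks like in the data-loading portion of a script for the MNIST example. The batch size and the "data" directory are illustrative, and the number of workers is read from the SLURM_CPUS_PER_TASK environment variable so that it stays in sync with --cpus-per-task:

import os
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# match the number of data-loading workers to --cpus-per-task
num_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", 1))

train_dataset = datasets.MNIST("data", train=True, download=False,
                               transform=transforms.ToTensor())
train_loader = DataLoader(train_dataset,
                          batch_size=64,           # illustrative value
                          shuffle=True,
                          num_workers=num_workers,
                          pin_memory=True)         # faster host-to-GPU copies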

 

GPU Utilization

To see how effectively your job is using the GPU, run the following command after submitting the job:

$ squeue -u $USER

The rightmost column labeled "NODELIST(REASON)" gives the name of the node where your job is running. SSH to that node:

$ ssh tiger-iXXgYY

In the command above, you must replace XX and YY with the actual values (e.g., ssh tiger-i19g1). Once on the compute node run watch -n 1 gpustat. This will show you a percentage value indicating how effectively your code is using the GPU. The memory allocated to the GPU is also available. For the MNIST example above, in going from 1 to 8 data-loading workers the GPU utilization went from 18 to 55%. You should repeat this analysis with your actual research code to ensure that the GPU is being utilized. Be sure to learn about DataLoader and try increasing the value of cpus-per-task in tandem with num_workers to use multiple CPU-cores to prepare the data and keep the GPU busy. See the YouTube video above. You may also try varying the batch size.

Type Ctrl+C to exit the watch command. Use the exit command to leave the compute node and return to the head node.

For jobs that run for more than 10 minutes you can check utilization by looking at the TigerGPU utilization dashboard. See the bottom of that page for tips on improving utilization.

 

Distributed Training or Using Multiple GPUs

Most models can be trained in a reasonable amount of time using a single GPU. However, if you are effectively using the GPU as determined by the procedure above then you may consider running on multiple GPUs. In general this will lead to shorter training times but because more resources are required the queue time will increase. For any job submitted to the cluster you should choose the required resources (number of GPUs, number of CPU-cores, memory) that minimize the "time to finish" which is the time the job spends running on the compute nodes plus the time spent waiting in the queue. Do not assume that using all four GPUs on a node is the best choice, for instance.

The starting point for training PyTorch models on multiple GPUs is DistributedDataParallel which is the successor to DataParallel. In this approach, a copy of the model is assigned to each GPU where it operates on a different mini-batch. Keep in mind that by default the batch size is reduced when multiple GPUs are used. Be sure to use a DataLoader with multiple workers and the appropriate batch size to keep each GPU busy as discussed above.
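
A minimal sketch of the code changes is shown below. It assumes the processes are launched with a utility such as torchrun, which sets the RANK, WORLD_SIZE and LOCAL_RANK environment variables for each process; MyModel is a hypothetical model class. See the PyTorch DDP documentation for the full recipe, including the use of a DistributedSampler with the DataLoader:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes a launcher such as torchrun set RANK, WORLD_SIZE and LOCAL_RANK
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda(local_rank)           # MyModel is a hypothetical nn.Module
model = DDP(model, device_ids=[local_rank])
# build the optimizer and DataLoader (with a DistributedSampler) as usual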

For large models that do not fit in memory, there is the model parallel approach. In this case the model itself is distributed over multiple GPUs.
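
A minimal sketch of the idea, with a toy two-layer model split across two GPUs (the layer sizes are illustrative):

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    # toy model parallelism: the first layer lives on GPU 0, the second on GPU 1
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(1024, 1024).to("cuda:0")
        self.layer2 = nn.Linear(1024, 10).to("cuda:1")

    def forward(self, x):
        x = torch.relu(self.layer1(x.to("cuda:0")))
        return self.layer2(x.to("cuda:1"))   # activations are copied between GPUs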

Also take a look at PyTorch Lightning and Horovod.

For hyperparameter tuning consider using a job array. This will allow you to run multiple jobs with one sbatch command. Each job within the array trains the network using a different set of parameters.
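
One common pattern, sketched below, is to add a directive such as #SBATCH --array=0-3 to the Slurm script and have the Python script choose its hyperparameters from the SLURM_ARRAY_TASK_ID environment variable that Slurm sets for each array task (the learning-rate values are illustrative):

import os

# each array task trains with a different learning rate (values are illustrative)
learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
lr = learning_rates[task_id]
print(f"Array task {task_id}: training with lr={lr}")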

 

Building from Source

The directions for building PyTorch from source are here. The procedure below shows how those directions would be carried out on TigerGPU:

$ module load anaconda3/2020.11
$ conda create --name torch-env numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
$ conda activate torch-env
$ conda install --channel pytorch magma-cuda102
$ module load cudatoolkit/10.2 cudnn/cuda-10.2/7.6.5 rh/devtoolset/8
$ git clone --recursive https://github.com/pytorch/pytorch
$ cd pytorch
$ export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
$ python setup.py install

 

Containers

Instead of installing PyTorch through Anaconda, one can obtain it by downloading a container. The advantage of using containers is that they often include performance optimizations, additional software, pre-trained models and/or data. Try browsing NVIDIA GPU Cloud (NGC) for useful containers such as PyTorch and TensorRT. Below is an example of running PyTorch via a container on TigerGPU:

$ cd /home/<YourNetID>/software  # or another directory
# consider using tag for latest version in line below
$ singularity pull docker://nvcr.io/nvidia/pytorch:21.04-py3
$ cd /scratch/gpfs/<YourNetID>  # or /scratch/network on adroit
$ git clone https://github.com/PrincetonUniversity/install_pytorch
$ cd install_pytorch
$ singularity exec $HOME/software/pytorch_21.04-py3.sif python3 download_mnist.py
$ sbatch job.slurm.ngc  # edit email address

The PyTorch container from NGC includes Torchvision, Apex and more. For more see Containers on the HPC Clusters.

 

Working Interactively with Jupyter on TigerGPU

See our Jupyter page and this YouTube video for running PyTorch in a Jupyter notebook on TigerGPU.

 

NVIDIA Apex

If you are running on Traverse or the V100 node of Adroit then you can take advantage of the Tensor Cores in those GPUs. The Apex library allows for automatic mixed-precision (AMP) training and distributed training. AMP has been part of the PyTorch core since version 1.6. For earlier versions of PyTorch you will need to install Apex from Anaconda or from source. When performing the installation from source make sure you use the same CUDA toolkit that was used for PyTorch.

$ ssh <YourNetID>@traverse.princeton.edu  # or adroit but not tigergpu for AMP
$ module load anaconda3/2020.11 rh/devtoolset/8 cudatoolkit/10.2
$ conda activate torch-env
$ cd software  # or another directory
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ export TORCH_CUDA_ARCH_LIST="6.0;7.0"
$ CUDA_HOME=/usr/local/cuda-10.2 pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

The speed-up comes from using the Tensor Cores on the GPU applied to matrix multiplications and convolutions. However, to use fp16 the dimension of each matrix must be a multiple of 8. Read about the constraints.

For simple PyTorch codes these are the necessary changes:

from apex import amp
...
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
...
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
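
If you are running PyTorch 1.6 or later, the native torch.cuda.amp module provides equivalent functionality without installing Apex. Below is a rough sketch of a mixed-precision training step, where model, optimizer, loss_fn and train_loader are defined as usual:

import torch

scaler = torch.cuda.amp.GradScaler()

for data, target in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # forward pass runs in mixed precision
        output = model(data)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()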

You cannot do mixed-precision operations on TigerGPU with its older P100 GPUs.

 

PyTorch Geometric

PyTorch Geometric is a geometric deep learning extension library for PyTorch. First build a Conda environment containing PyTorch as described above then follow the steps below. Make sure that your version of PyTorch matches that of the packages below (e.g., 1.8):

$ conda activate torch-env
$ module load rh/devtoolset/8
$ CUDA="cu102"
$ URL="https://pytorch-geometric.com/whl/torch"
$ VERSION="1.8.0"
$ pip install torch-scatter -f ${URL}-${VERSION}+${CUDA}.html
$ pip install torch-sparse -f ${URL}-${VERSION}+${CUDA}.html
$ pip install torch-cluster -f ${URL}-${VERSION}+${CUDA}.html
$ pip install torch-spline-conv -f ${URL}-${VERSION}+${CUDA}.html
$ pip install torch-geometric
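
After installation, a quick check such as the one below (a sketch) confirms that the compiled extensions import cleanly and that the versions line up:

import torch
import torch_scatter           # compiled extension installed above
import torch_geometric

print(torch.__version__)       # should match VERSION above (e.g., 1.8.0)
print(torch.version.cuda)      # should match the CUDA tag (e.g., 10.2 for cu102)
print(torch_geometric.__version__)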

 

TensorBoard

A useful tool for tracking the training progress of a PyTorch model is TensorBoard. This can be run on the head node in non-intensive cases. TensorBoard is available via Conda (see installation instructions for TigerGPU above). See the directions for setting up TensorBoard.
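
From a PyTorch script, training metrics are written with torch.utils.tensorboard. The minimal sketch below logs a placeholder loss value once per epoch (the log directory name is illustrative):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/mnist")   # illustrative log directory
for epoch in range(3):
    train_loss = 1.0 / (epoch + 1)             # placeholder for your real training loss
    writer.add_scalar("Loss/train", train_loss, epoch)
writer.close()

Point TensorBoard at the same directory (tensorboard --logdir=runs) to view the curves.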

 

Profiling and Performance Tuning

For performance tuning tips see this presentation by Szymon Migacz from NVIDIA GTC 2021. See the PyTorch Performance Tuning page by the same author.

For profiling, in almost all cases you should start with line_profiler (see Python Profiling). Other tools also exist. If you are running on a GPU then you can use the NVIDIA profilers nvprof or nsys to profile your code. For the MNIST example on this page, the Slurm script would be modified as follows:

/usr/local/cuda-10.1/bin/nvprof python mnist_classify.py --epochs=3

The most expensive GPU and CPU operations are shown below:

GPU activities:
   Time(%)      Time     Calls       Avg  Name
    33.16%  1.47105s      2811  523.32us  maxwell_scudnn_128x32_stridedB_medium_nn
     8.98%  398.53ms      2844  140.13us  maxwell_scudnn_128x64_relu_interior_nn
     8.57%  380.15ms      2814  135.09us  maxwell_scudnn_128x64_stridedB_splitK_interior_nn
...
API calls:   
    42.25%  3.43542s        26  132.13ms  5.1100us  3.43199s  cudaMalloc
    31.10%  2.52877s    234163  10.799us  5.0390us  2.4572ms  cudaLaunchKernel
     6.37%  517.94ms   1188571     435ns     254ns  1.7315ms  cudaGetDevice
...

"API calls" refers to operations on the CPU. We see that memory allocation dominates the work carried out on the CPU. [CUDA memcpy HtoD] and [CUDA memcpy HtoD] refer to data transfer between the CPU or Host (H) and the GPU or Device (D).

 

Reproducibility

You may find variation in your results from run to run as described in the PyTorch docs. Here is a set of calls for setting the seeds:

import numpy as np
import torch

seed = 12345
np.random.seed(seed)
torch.manual_seed(seed)
torch.random.manual_seed(seed)
torch.cuda.manual_seed(seed)

Note that even when the random number generators are seeded using the above code you may still see variation across identical runs.
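
The PyTorch docs linked above also describe cuDNN settings that reduce this variation, typically at some cost in speed; a sketch:

import torch

torch.backends.cudnn.deterministic = True   # request deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False      # disable nondeterministic autotuning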

 

Using PyCharm on TigerGPU

This video shows how to launch PyCharm on a TigerGPU compute node and use its debugger. While it was made using TensorFlow as the example application, the procedure also applies to PyTorch.

While debugging you may benefit from using unbuffered output of print statements. This can be achieved with:

print(<message>, flush=True)

And in the Slurm script add the -u option:

python -u <myscript.py> 

 

More examples

More PyTorch example scripts are found here: https://github.com/pytorch/examples

 

How to Learn PyTorch

See the material and companion website (English and Chinese) of Prof. Alf Canziani of NYU. PyTorch also offers a 60-minute introduction and tutorials.

There is also a free book.

 

Where to Store Your Files

You should run your jobs out of /scratch/gpfs/<YourNetID> on the HPC clusters. These filesystems are very fast and provide vast amounts of storage. Do not run jobs out of /tigress or /projects. That is, you should never be writing the output of actively running jobs to those filesystems. /tigress and /projects are slow and should only be used for backing up the files that you produce on /scratch/gpfs. Your /home directory on all clusters is small and it should only be used for storing source code and executables.

The commands below give you an idea of how to properly run a PyTorch job:

$ ssh <YourNetID>@tigergpu.princeton.edu
$ cd /scratch/gpfs/<YourNetID>
$ mkdir myjob && cd myjob
# put PyTorch script and Slurm script in myjob
$ sbatch job.slurm

If the run produces data that you want to backup then copy or move it to /tigress:

$ cp -r /scratch/gpfs/<YourNetID>/myjob /tigress/<YourNetID>

For large transfers consider using rsync instead of cp. Most users only do backups to /tigress every week or so. While /scratch/gpfs is not backed up, files are never removed. However, important results should be transferred to /tigress or /projects.

The diagram below gives an overview of the filesystems:

HPC clusters and the filesystems that are available to each. Users should write job output to /scratch/gpfs.

 

Getting Help

If you encounter any difficulties while installing or running PyTorch on one of our HPC clusters then please send an email to cses@princeton.edu or attend a help session.