OUTLINE
- Installation
- Example Job
- Data Loading using Multiple CPU-cores
- GPU Utilization
- Distributed Training or Using Multiple GPUs
- Building from Source
- Containers
- Working Interactively with Jupyter on TigerGPU
- NVIDIA Apex
- PyTorch Geometric
- TensorBoard
- Profiling and Performance Tuning
- Reproducibility
- Using PyCharm on TigerGPU
- More Examples
- How to Learn PyTorch
- Where to Store Files
- Getting Help
Installation
PyTorch is a popular deep learning library for training artificial neural networks. The installation procedure depends on the cluster. If you are new to installing Python packages then see our Python page before continuing. Before installing make sure you have approximately 3 GB of free space in /home/<YourNetID> by running the checkquota command.
TigerGPU or Adroit (GPU)
$ module load anaconda3/2020.11
$ conda create --name torch-env pytorch torchvision torchaudio cudatoolkit=10.2 --channel pytorch
$ conda activate torch-env
Or maybe you want a few additional packages like matplotlib and tensorboard:
$ module load anaconda3/2020.11
$ conda create --name torch-env pytorch torchvision torchaudio cudatoolkit=10.2 matplotlib tensorboard --channel pytorch
$ conda activate torch-env
Be sure to include conda activate torch-env in your Slurm script.
Traverse
LATEST RELEASE
$ module load anaconda3/2020.11
$ CHNL="https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda"
$ conda create --name torch-env --channel ${CHNL} pytorch torchvision
$ conda activate torch-env  # accept the license agreement if asked
Be sure to include conda activate torch-env and #SBATCH --gpus-per-node=1 in your Slurm script.
EARLY ACCESS VERSION
If you would benefit from a newer version of PyTorch in pre-release form then use the early access channel instead: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
TigerCPU, Della, Perseus or Adroit (CPU)
$ module load anaconda3/2020.11
$ conda create --name torch-env pytorch torchvision torchaudio cpuonly --channel pytorch
$ conda activate torch-env
Be sure to include conda activate torch-env in your Slurm script.
In addition to Anaconda, Intel offers a version of PyTorch that has been optimized for Intel hardware as part of their AI Analytics Toolkit.
Example Job
The example below shows how to run a simple PyTorch script on one of the clusters. We will train a simple CNN on the MNIST data set. Begin by connecting to a head node on one of the clusters. Then clone the repo:
# ssh to a cluster
$ cd /scratch/gpfs/<YourNetID>  # or /scratch/network/<YourNetID> on Adroit
$ git clone https://github.com/PrincetonUniversity/install_pytorch.git
$ cd install_pytorch
This will create a folder called install_pytorch which contains the files needed to run this example. The compute nodes do not have internet access so we must obtain the data while on the head node:
$ python download_mnist.py
Inspect the PyTorch script called mnist_classify.py. Use a text editor like vim or emacs to enter your email address in job.slurm or delete the four lines concerned with email. Submit the job to the batch scheduler:
$ sbatch job.slurm
The Slurm script used for the job is below:
#!/bin/bash
#SBATCH --job-name=torch-test    # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G                 # total memory per node (4 GB per cpu-core is default)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send mail when job begins
#SBATCH --mail-type=end          # send mail when job ends
#SBATCH --mail-type=fail         # send mail if job fails
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.11
conda activate torch-env

python mnist_classify.py --epochs=3
You can monitor the status of the job with squeue -u $USER. Once the job runs, you'll have a slurm-xxxxx.out file in the install_pytorch directory. This log file contains both PyTorch and Slurm output.
Data Loading using Multiple CPU-cores
Watch this video on our YouTube channel for a demonstration.
Even when using a GPU, some operations are still carried out on the CPU. Some of these operations, such as data loading, have been written to take advantage of multiple CPU-cores. Try different values for --cpus-per-task in combination with a DataLoader to see if you get a speed-up:
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=<T>      # cpu-cores per task (>1 if multi-threaded tasks)
On each TigerGPU node there is a 7:1 ratio of CPU-cores to GPUs. Try a set of runs where you vary <T> from 1 to 7 to find the optimal value. Almost all PyTorch scripts show a significant performance improvement when using a DataLoader; in this case set num_workers equal to <T>. For the MNIST example above, with <T> equal to 4 and num_workers=4, there is a significant speed-up (see the sketch below). Watch this video to learn about writing a custom DataLoader.
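Below is a minimal sketch of matching the number of data-loading workers to --cpus-per-task, assuming the Slurm environment variable SLURM_CPUS_PER_TASK is available; the dataset path, batch size and transform are placeholders for illustration:

import os
import torch
from torchvision import datasets, transforms

# match the number of data-loading workers to --cpus-per-task
num_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", 1))

# MNIST must already be downloaded on the head node (compute nodes have no internet)
train_set = datasets.MNIST("data", train=True, download=False,
                           transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set,
                                           batch_size=64,
                                           shuffle=True,
                                           num_workers=num_workers,
                                           pin_memory=True)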
GPU Utilization
To see how effectively your job is using the GPU, after submitting the job run the following command:
$ squeue -u $USER
The rightmost column labeled "NODELIST(REASON)" gives the name of the node where your job is running. SSH to that node:
$ ssh tiger-iXXgYY
In the command above, you must replace XX and YY with the actual values (e.g., ssh tiger-i19g1). Once on the compute node run watch -n 1 gpustat. This will show you a percentage value indicating how effectively your code is using the GPU. The memory allocated to the GPU is also available. For the MNIST example above, in going from 1 to 8 data-loading workers the GPU utilization went from 18 to 55%. You should repeat this analysis with your actual research code to ensure that the GPU is being utilized. Be sure to learn about DataLoader and try increasing the value of cpus-per-task in tandem with num_workers to use multiple CPU-cores to prepare the data and keep the GPU busy. See the YouTube video above. You may also try varying the batch size.
Type Ctrl+C to exit the watch command. Use the exit command to leave the compute node and return to the head node.
For jobs that run for more than 10 minutes you can check utilization by looking at the TigerGPU utilization dashboard. See the bottom of that page for tips on improving utilization.
Distributed Training or Using Multiple GPUs
Most models can be trained in a reasonable amount of time using a single GPU. However, if you are effectively using the GPU as determined by the procedure above then you may consider running on multiple GPUs. In general this will lead to shorter training times, but because more resources are required the queue time will increase. For any job submitted to the cluster you should choose the required resources (number of GPUs, number of CPU-cores, memory) that minimize the "time to finish", which is the time the job spends running on the compute nodes plus the time spent waiting in the queue. Do not assume that using all four GPUs on a node is the best choice, for instance.
The starting point for training PyTorch models on multiple GPUs is DistributedDataParallel which is the successor to DataParallel. In this approach, a copy of the model is assigned to each GPU where it operates on a different mini-batch. Keep in mind that by default the batch size is reduced when multiple GPUs are used. Be sure to use a DataLoader with multiple workers and the appropriate batch size to keep each GPU busy as discussed above.
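A minimal sketch of the DistributedDataParallel pattern is shown below; MyModel is a placeholder for your network, and the process-group setup assumes MASTER_ADDR and MASTER_PORT are provided by your Slurm launch method:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # assumes MASTER_ADDR and MASTER_PORT are set in the environment
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def train(rank, world_size):
    setup(rank, world_size)
    torch.cuda.set_device(rank)
    model = MyModel().to(rank)            # MyModel is a placeholder for your network
    ddp_model = DDP(model, device_ids=[rank])
    # ... build a DataLoader with a DistributedSampler, then run the training loop ...
    dist.destroy_process_group()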
For large models that do not fit in memory, there is the model parallel approach. In this case the model itself is distributed over multiple GPUs.
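As a rough illustration of model parallelism, a toy model split across two GPUs might look like the sketch below; the layer sizes and device assignments are arbitrary:

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Toy model with its layers split across two GPUs."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 512).to("cuda:0")
        self.part2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = torch.relu(self.part1(x.to("cuda:0")))
        return self.part2(x.to("cuda:1"))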
Also take a look at PyTorch Lightning and Horovod.
For hyperparameter tuning, consider using a job array. This allows you to run multiple jobs with one sbatch command. Each job within the array trains the network using a different set of parameters, as sketched below.
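For example, with a directive such as #SBATCH --array=0-3, each array task can pick its hyperparameters from the SLURM_ARRAY_TASK_ID environment variable inside the training script; the learning-rate values below are only illustrative:

import os

# each Slurm array task gets a different learning rate (illustrative values)
learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
lr = learning_rates[task_id]
print(f"Array task {task_id} training with lr={lr}")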
Building from Source
The directions for building PyTorch from source are here. The procedure below shows how those directions would be carried out on TigerGPU:
$ module load anaconda3/2020.11
$ conda create --name torch-env numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
$ conda activate torch-env
$ conda install --channel pytorch magma-cuda102
$ module load cudatoolkit/10.2 cudnn/cuda-10.2/7.6.5 rh/devtoolset/8
$ git clone --recursive https://github.com/pytorch/pytorch
$ cd pytorch
$ export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
$ python setup.py install
Containers
Instead of installing PyTorch through Anaconda, one can obtain it by downloading a container. The advantage of using containers is that they often include additional software, pre-trained models and/or data. Try browsing NVIDIA GPU Cloud (NGC) and elsewhere for useful containers such as TensorRT. Below is an example of running PyTorch via a container on TigerGPU:
$ cd /home/<YourNetID>/software  # or another directory
# consider using the tag for the latest version in the line below
$ singularity pull docker://nvcr.io/nvidia/pytorch:20.09-py3
$ cd /scratch/gpfs/<YourNetID>  # or /scratch/network on Adroit
$ git clone https://github.com/PrincetonUniversity/install_pytorch
$ cd install_pytorch
$ singularity exec $HOME/software/pytorch_20.09-py3.sif python3 download_mnist.py
$ sbatch job.slurm.ngc  # edit email address
The PyTorch container from NGC includes Torchvision, Apex and more. For more see Containers on the HPC Clusters.
Working Interactively with Jupyter on TigerGPU
See our Jupyter page and this YouTube video for running PyTorch in a Jupyter notebook on TigerGPU.
NVIDIA Apex
If you are running on Traverse or the V100 node of Adroit then you can take advantage of the Tensor Cores in those GPUs. The Apex library allows for automatic mixed-precision (AMP) training and distributed training. AMP has been part of the PyTorch core since version 1.6. For earlier versions of PyTorch you will need to install Apex from Anaconda or from source. When performing the installation from source make sure you use the same CUDA toolkit that was used for PyTorch.
$ ssh <YourNetID>@traverse.princeton.edu  # or adroit but not tigergpu for AMP
$ module load anaconda3/2020.11 rh/devtoolset/8 cudatoolkit/10.2
$ conda activate torch-env
$ cd software  # or another directory
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ export TORCH_CUDA_ARCH_LIST="6.0;7.0"
$ CUDA_HOME=/usr/local/cuda-10.2 pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
The speed-up comes from using the Tensor Cores on the GPU applied to matrix multiplications and convolutions. However, to use fp16 the dimension of each matrix must be a multiple of 8. Read about the constraints.
For simple PyTorch codes these are the necessary changes:
from apex import amp
...
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
...
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
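Since AMP is included in the PyTorch core from version 1.6, the same effect can be achieved without Apex using torch.cuda.amp. Here is a minimal sketch, assuming model, optimizer, loss_fn and train_loader already exist in your script:

import torch

scaler = torch.cuda.amp.GradScaler()

for data, target in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(data.cuda())
        loss = loss_fn(output, target.cuda())
    scaler.scale(loss).backward()   # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()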
You cannot do mixed-precision operations on TigerGPU with its older P100 GPUs.
PyTorch Geometric
PyTorch Geometric is a geometric deep learning extension library for PyTorch. First build a Conda environment containing PyTorch as described above then follow the steps below. Make sure that your version of PyTorch matches that of the packages below (e.g., 1.8):
$ conda activate torch-env
$ module load rh/devtoolset/8
$ CUDA="cu102"
$ URL="https://pytorch-geometric.com/whl/torch"
$ VERSION="1.8.0"
$ pip install torch-scatter -f ${URL}-${VERSION}+${CUDA}.html
$ pip install torch-sparse -f ${URL}-${VERSION}+${CUDA}.html
$ pip install torch-cluster -f ${URL}-${VERSION}+${CUDA}.html
$ pip install torch-spline-conv -f ${URL}-${VERSION}+${CUDA}.html
$ pip install torch-geometric
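To check the installation, a minimal sketch with a single graph convolution layer could look like this; the three-node graph and feature sizes are a toy example:

import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# toy graph: 3 nodes with 16 features each and 2 undirected edges
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.randn(3, 16)
data = Data(x=x, edge_index=edge_index)

conv = GCNConv(16, 8)
out = conv(data.x, data.edge_index)
print(out.shape)  # torch.Size([3, 8])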
TensorBoard
A useful tool for tracking the training progress of a PyTorch model is TensorBoard. This can be run on the head node in non-intensive cases. TensorBoard is available via Conda (see installation instructions for TigerGPU above). See the directions for setting up TensorBoard.
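In a PyTorch script, training metrics are typically logged with SummaryWriter. Below is a minimal sketch; the log directory is illustrative and train_one_epoch, model, train_loader and optimizer are assumed to come from your own script:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/mnist")   # log directory is illustrative

for epoch in range(3):
    # train_one_epoch is an assumed helper returning the average training loss
    train_loss = train_one_epoch(model, train_loader, optimizer)
    writer.add_scalar("Loss/train", train_loss, epoch)

writer.close()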
Profiling and Performance Tuning
For performance tuning tips see this presentation by Szymon Migacz from NVIDIA GTC 2021. See the PyTorch Performance Tuning page by the same author.
For profiling, in almost all cases you should start with line_profiler (see Python Profiling). Other tools also exist. If you are running on a GPU then you can use the NVIDIA profilers nvprof or nsys to profile your code. For the MNIST example on this page, the Slurm script would be modified as follows:
/usr/local/cuda-10.1/bin/nvprof python mnist_classify.py --epochs=3
The most expensive GPU and CPU operations are shown below:
                  Time(%)      Time    Calls       Avg  Name
GPU activities:    33.16%  1.47105s     2811  523.32us  maxwell_scudnn_128x32_stridedB_medium_nn
                    8.98%  398.53ms     2844  140.13us  maxwell_scudnn_128x64_relu_interior_nn
                    8.57%  380.15ms     2814  135.09us  maxwell_scudnn_128x64_stridedB_splitK_interior_nn
                    ...
API calls:         42.25%  3.43542s       26  132.13ms  5.1100us  3.43199s  cudaMalloc
                   31.10%  2.52877s   234163  10.799us  5.0390us  2.4572ms  cudaLaunchKernel
                    6.37%  517.94ms  1188571     435ns     254ns  1.7315ms  cudaGetDevice
                    ...
"API calls" refers to operations on the CPU. We see that memory allocation dominates the work carried out on the CPU. [CUDA memcpy HtoD] and [CUDA memcpy HtoD] refer to data transfer between the CPU or Host (H) and the GPU or Device (D).
Reproducibility
You may find variation in your results from run to run as described in the PyTorch docs. Here is a set of calls for setting the seed:
import numpy as np
import torch

seed = 12345
np.random.seed(seed)
torch.manual_seed(seed)
torch.random.manual_seed(seed)
torch.cuda.manual_seed(seed)
Note that even when the random number generators are seeded using the above code you may still see variation across identical runs.
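If stricter reproducibility is needed, the cuDNN backend can additionally be made deterministic, at some cost in speed; this sketch follows the settings described in the PyTorch reproducibility docs:

import torch

torch.backends.cudnn.deterministic = True   # force deterministic cuDNN algorithms
torch.backends.cudnn.benchmark = False      # disable algorithm auto-tuning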
Using PyCharm on TigerGPU
This video shows how to launch PyCharm on a TigerGPU compute node and use its debugger. While it was made using TensorFlow as the example application, the procedure also applies to PyTorch.
While debugging you may benefit from using unbuffered output of print statements. This can be achieved with:
print(<message>, flush=True)
And in the Slurm script add the -u option:
python -u <myscript.py>
More Examples
More PyTorch example scripts can be found here: https://github.com/pytorch/examples
How to Learn PyTorch
See the material and companion website (English and Chinese) of Prof. Alfredo Canziani of NYU. PyTorch also offers a 60-minute introduction.
There is also a free book.
Where to Store Your Files
You should run your jobs out of /scratch/gpfs/<YourNetID> on the HPC clusters. These filesystems are very fast and provide vast amounts of storage. Do not run jobs out of /tigress or /projects. That is, you should never be writing the output of actively running jobs to those filesystems. /tigress and /projects are slow and should only be used for backing up the files that you produce on /scratch/gpfs. Your /home directory on all clusters is small and it should only be used for storing source code and executables.
The commands below give you an idea of how to properly run a PyTorch job:
$ ssh <YourNetID>@tigergpu.princeton.edu
$ cd /scratch/gpfs/<YourNetID>
$ mkdir myjob && cd myjob
# put PyTorch script and Slurm script in myjob
$ sbatch job.slurm
If the run produces data that you want to backup then copy or move it to /tigress:
$ cp -r /scratch/gpfs/<YourNetID>/myjob /tigress/<YourNetID>
For large transfers consider using rsync instead of cp. Most users only do back-ups to /tigress every week or so. While /scratch/gpfs is not backed up, files are never removed from it. However, important results should be transferred to /tigress or /projects.
The diagram below gives an overview of the filesystems:
[Diagram: overview of the HPC filesystems]
Getting Help
If you encounter any difficulties while installing or running PyTorch on one of our HPC clusters then please send an email to cses@princeton.edu or attend a help session.