A popular deep learning framework

Installation

PyTorch is a popular deep learning library for training artificial neural networks. The installation procedure depends on the cluster. If you are new to installing Python packages, see our Python page before continuing. Before installing, make sure you have several GB of free space in /home/<YourNetID> by running the checkquota command.

Della (GPU/PLI), Tiger (GPU), Stellar (GPU) and Adroit (GPU)

Follow the directions below to install a GPU-enabled version of PyTorch (9 GB of space are required – see the checkquota page for tips on dealing with large Conda environments):

For version 2.x (recommended):

$ ssh <YourNetID>@della-gpu.princeton.edu  # also adroit-vis, stellar-vis1/2, tiger3-vis
$ module load anaconda3/2024.10
$ conda create --name torch-env python=3.12 -y
$ conda activate torch-env
(torch-env) $ pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu128
(torch-env) $ pip3 install ipykernel  # if using Jupyter OnDemand (e.g., MyDella)

For version 2.x using conda with some extra packages (make sure the login node has a GPU):

$ ssh <YourNetID>@della-gpu.princeton.edu  # also adroit-vis, stellar-vis1/2, tiger3-vis
$ module load anaconda3/2024.10
$ conda create --name torch-env "pytorch==2.6*=cuda12*" torchvision matplotlib ipykernel -c conda-forge -y
$ conda activate torch-env

In the instructions above, the ipykernel package is included so that Jupyter via Open OnDemand can be used.

For version 2.x using conda with some extra packages for LLMs (make sure the login node has a GPU):

$ ssh <YourNetID>@della-gpu.princeton.edu  # also adroit-vis, stellar-vis1/2, tiger3-vis
$ module load anaconda3/2024.10
$ conda create --name llm-env "pytorch==2.6*=cuda12*" torchvision transformers flash-attn ipykernel ipywidgets matplotlib -c conda-forge -y
$ conda activate llm-env

In the instructions above, the ipykernel package is included so that Jupyter via Open OnDemand can be used. See the Hugging Face getting started materials by Princeton Language and Intelligence and the Wintersession 2025 materials. Also, see our KB page for Hugging Face.
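
As a quick sanity check of llm-env, a minimal sketch along these lines can be run on a node with a GPU and internet access (the model name "gpt2" is only an example; substitute your own):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small example model, downloaded on first use
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).to("cuda")

inputs = tok("Deep learning on the GPU is", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))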

For version 1.x:

$ ssh <YourNetID>@della-gpu.princeton.edu  # also adroit-vis, stellar-vis1, tiger3-vis
$ module load anaconda3/2024.10
$ CONDA_OVERRIDE_CUDA="11.2" conda create --name torch-env "pytorch==1.13*=cuda11*" torchvision -c conda-forge
$ conda activate torch-env

Be sure to include conda activate torch-env in your Slurm script. Instead of installing via conda, one could also use the latest container from NVIDIA. See the docs on AMP for doing mixed-precision training with the A100. For more ways to optimize your PyTorch jobs see the "Performance Tuning Guide".

CPU-Only Installation

$ module load anaconda3/2024.10
$ conda create --name torch-env pytorch torchvision cpuonly --channel pytorch
$ conda activate torch-env

Be sure to include conda activate torch-env in your Slurm script.

In addition to Anaconda, Intel offers a version of PyTorch that has been optimized for Intel hardware as part of their AI Analytics Toolkit.

Troubleshooting Your PyTorch Installation

If you find that PyTorch is not using the GPU then test your installation by running the commands below on a login node or visualization node that has a GPU:

Adroit

$ ssh <YourNetID>@adroit-vis.princeton.edu
$ module load anaconda3/2024.10
$ conda activate torch-env
(torch-env) $ python -c "import torch; print(torch.cuda.is_available())"
True

Della

$ ssh <YourNetID>@della-gpu.princeton.edu
$ module load anaconda3/2024.10
$ conda activate torch-env
(torch-env) $ python -c "import torch; print(torch.cuda.is_available())"
True

Stellar

$ ssh <YourNetID>@stellar-vis1.princeton.edu  # or vis2
$ ...

Tiger

$ ssh <YourNetID>@tiger3-vis.princeton.edu
$ ...

If the result of the test above is “False” then something is wrong with your installation. Try following the installation directions again. Make sure you SSH to adroit-vis, della-gpu, stellar-vis1/2, or tiger3-vis.
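
For more diagnostic detail, a short check like the one below (run on a node with a GPU) also shows which CUDA version and device PyTorch sees:

import torch

print(torch.__version__)                   # PyTorch version
print(torch.version.cuda)                  # CUDA version PyTorch was built against
print(torch.cuda.is_available())           # should print True on a GPU node
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g., the model of the first GPU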

 

Example GPU Job

The example below shows how to run a simple PyTorch script on one of the clusters. We will train a simple CNN on the MNIST data set. Begin by connecting to a head node on one of the clusters. Then clone the repo:

# ssh to a cluster
$ cd /scratch/gpfs/<YourNetID>  # or /scratch/network/<YourNetID> on Adroit
$ git clone https://github.com/PrincetonUniversity/install_pytorch.git
$ cd install_pytorch

This will create a folder called install_pytorch which contains the files needed to run this example. The compute nodes do not have internet access so we must obtain the data while on the head node:

$ python download_mnist.py

Inspect the PyTorch script called mnist_classify.py. Use a text editor like vim or emacs to enter your email address in job.slurm or delete the four lines concerned with email. Submit the job to the batch scheduler:

$ sbatch job.slurm

The Slurm script used for the job is below:

#!/bin/bash
#SBATCH --job-name=torch-test    # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=4        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G                 # total memory per node (4 GB per cpu-core is default)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send mail when job begins
#SBATCH --mail-type=end          # send mail when job ends
#SBATCH --mail-type=fail         # send mail if job fails
#SBATCH --mail-user=<YourNetID>@princeton.edu
module purge
module load anaconda3/2024.10
conda activate torch-env
python mnist_classify.py --epochs=3

You can monitor the status of the job with squeue --me. Once the job runs, you'll have a slurm-xxxxx.out file in the install_pytorch directory. This log file contains both PyTorch and Slurm output.

Data Loading using Multiple CPU-cores

Watch this video on our YouTube channel for a demonstration. For multi-GPU training see this workshop. For Hugging Face see the materials by Princeton Language and Intelligence.

Even when using a GPU, some operations are still carried out on the CPU. Some of these operations, such as data loading, have been written to take advantage of multiple CPU-cores. Try different values for --cpus-per-task in combination with a DataLoader to see if you get a speed-up:

#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=<T>      # cpu-cores per task (>1 if multi-threaded tasks)

Almost all PyTorch scripts show a significant performance improvement when using a DataLoader. In this case try setting num_workers equal to <T>. Watch this video to learn about writing a custom DataLoader or read this PyTorch webpage.
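
A minimal sketch of wiring num_workers to the Slurm setting is shown below; the TensorDataset is only a stand-in for your own dataset:

import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# placeholder dataset: 10,000 random samples with 32 features and 10 classes
dataset = TensorDataset(torch.randn(10000, 32), torch.randint(0, 10, (10000,)))

# SLURM_CPUS_PER_TASK is set by Slurm inside the job; default to 1 elsewhere
num_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", 1))

loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=num_workers, pin_memory=True)

for x, y in loader:
    pass  # the training step would go here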

Consider these external data loading libraries: ffcv and NVIDIA DALI.

GPU Utilization

To see how effectively your job is using the GPU, after submitting the job run the following command:

$ squeue --me

The rightmost column labeled "NODELIST(REASON)" gives the name of the node where your job is running. SSH to that node:

$ ssh della-lXXgYY

In the command above, you must replace XX and YY with the actual values (e.g., ssh della-l03g12). Once on the compute node, run watch -n 1 gpustat. This will show you a percentage value indicating how effectively your code is using the GPU, as well as the memory allocated on the GPU. For the MNIST example above, going from 1 to 8 data-loading workers increased the GPU utilization from 18% to 55%. You should repeat this analysis with your actual research code to ensure that the GPU is being utilized. Be sure to learn about DataLoader and try increasing the value of cpus-per-task in tandem with num_workers to use multiple CPU-cores to prepare the data and keep the GPU busy. See the YouTube video above.

Type Ctrl+C to exit the watch command. Use the exit command to leave the compute node and return to the head node.

You can view GPU utilization as a function of time using either the "gpudash" command or Job Stats.

Distributed Training or Using Multiple GPUs

Most models can be trained in a reasonable amount of time using a single GPU. However, if you are effectively using the GPU as determined by the procedure above then you may consider running on multiple GPUs. In general this will lead to shorter training times but because more resources are required the queue time will increase. For any job submitted to the cluster you should choose the required resources (number of GPUs, number of CPU-cores, memory) that minimize the "time to finish" which is the time the job spends running on the compute nodes plus the time spent waiting in the queue. Do not assume that using all of the GPUs on a node is the best choice, for instance. For more, see how to conduct a scaling analysis.

The starting point for training PyTorch models on multiple GPUs is DistributedDataParallel which is the successor to DataParallel. See this workshop for examples. Be sure to use a DataLoader with multiple workers to keep each GPU busy as discussed above.  For Hugging Face see the materials by Princeton Language and Intelligence and the Wintersession 2025 materials.
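
The sketch below outlines the usual single-node, multi-GPU DDP pattern when launched with torchrun (one process per GPU); MyModel and make_dataset are placeholders for your own model and dataset:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def main():
    dist.init_process_group(backend="nccl")   # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, ...
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(MyModel().cuda(local_rank), device_ids=[local_rank])  # MyModel is a placeholder
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    dataset = make_dataset()                  # placeholder for your own dataset
    sampler = DistributedSampler(dataset)     # gives each rank a distinct shard of the data
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=4)

    for epoch in range(3):
        sampler.set_epoch(epoch)              # reshuffle across ranks each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Inside the Slurm job, launch one process per GPU with, for example, torchrun --nproc_per_node=2 your_script.py.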

Also take a look at PyTorch Lightning and see an example for this in our multi-GPU training workshop.

For large models that do not fit in memory, there is the model parallel approach. In this case the model itself is distributed over multiple GPUs.
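
A toy sketch of manual model parallelism is shown below, where different layers live on different GPUs and the activations are moved between them (real workloads typically use libraries that automate this):

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Toy example: the first half of the model lives on GPU 0, the second on GPU 1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 1024).to("cuda:0")
        self.part2 = nn.Linear(1024, 10).to("cuda:1")

    def forward(self, x):
        x = torch.relu(self.part1(x.to("cuda:0")))   # compute on GPU 0
        return self.part2(x.to("cuda:1"))            # move activations to GPU 1

# requires a node with at least 2 GPUs
model = TwoGPUModel()
out = model(torch.randn(32, 1024))   # output lives on cuda:1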

For hyperparameter tuning, consider using a job array. This will allow you to run multiple jobs with one sbatch command. Each job within the array trains the network using a different set of parameters.

Building from Source

Here are the directions for building PyTorch from source. The procedure below shows how those directions would be carried out on Della (GPU):

ssh <YourNetID>@della-gpu.princeton.edu
module load anaconda3/2024.10
conda create --name torch-src python=3.12 -y
conda activate torch-src
cd software  # or another location
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
conda install cmake ninja
pip install -r requirements.txt
pip install mkl-static mkl-include
conda install -c pytorch magma-cuda126
export USE_XPU=0
export USE_ROCM=0
export _GLIBCXX_USE_CXX11_ABI=1
export CMAKE_PREFIX_PATH="${CONDA_PREFIX:-'$(dirname $(which conda))/../'}:${CMAKE_PREFIX_PATH}"
module load gcc-toolset/13
module load cudatoolkit/12.6
conda install cudnn -c conda-forge
python setup.py develop

    

Containers

Instead of installing PyTorch through Anaconda, one can obtain it by downloading a container. The advantage of using containers is that they often include performance optimizations, additional software, pre-trained models and/or data. Try browsing NVIDIA GPU Cloud (NGC) for useful containers such as PyTorch and TensorRT. Below is an example of running PyTorch via a container on Della:

$ cd /home/<YourNetID>/software  # or another directory
# consider using tag for latest version in line below
$ singularity pull docker://nvcr.io/nvidia/pytorch:23.09-py3
$ cd /scratch/gpfs/<YourNetID>  # or /scratch/network on adroit
$ git clone https://github.com/PrincetonUniversity/install_pytorch
$ cd install_pytorch
$ singularity exec $HOME/software/pytorch_23.09-py3.sif python3 download_mnist.py
$ sbatch job.slurm.ngc  # edit email address

The PyTorch container from NGC includes Torchvision, Apex and more. For more see Containers on the HPC Clusters.

Working Interactively with Jupyter on a GPU Node

This is best done using Jupyter in Open OnDemand. If for some reason you want to do this using salloc then see this YouTube video for running PyTorch on a GPU compute node.

Automatic Mixed Precision

Since version 1.6, the automatic mixed precision functionality of the NVIDIA Apex library has been included in PyTorch as torch.cuda.amp. Mixed-precision training requires a GPU with Tensor Cores, such as the V100 or A100. For earlier versions of PyTorch you will need to install Apex from Anaconda or from source. When performing the installation from source, make sure you use the same CUDA toolkit that was used to build PyTorch.

$ ssh <YourNetID>@traverse.princeton.edu  # not tigergpu for AMP
$ module load anaconda3/2024.6 cudatoolkit/11.1
$ conda activate torch-env
$ cd software  # or another directory
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ export TORCH_CUDA_ARCH_LIST="7.0;8.0"
$ pip install -v --no-cache-dir \
--global-option="--cpp_ext" --global-option="--cuda_ext" ./

The speed-up comes from using the Tensor Cores on the GPU applied to matrix multiplications and convolutions. However, to use fp16 the dimension of each matrix must be a multiple of 8. Read about the constraints.

For simple PyTorch codes, these are the necessary changes when using Apex directly:

from apex import amp
...
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
...
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
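
With the built-in torch.cuda.amp module (PyTorch 1.6 and later), the analogous changes look roughly like this; model, optimizer, loss_fn and loader are assumed to already exist in your script:

import torch

scaler = torch.cuda.amp.GradScaler()

for x, y in loader:                      # your existing training loop
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # run the forward pass in mixed precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()        # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)               # unscale gradients and step the optimizer
    scaler.update()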

You cannot do mixed-precision training with P100 GPUs since they lack Tensor Cores.

PyTorch Geometric

PyTorch Geometric is a geometric deep learning extension library for PyTorch. First build a Conda environment containing PyTorch as described above, then follow the steps below:

$ conda activate torch-env
(torch-env) $ conda install pyg -c pyg
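
A minimal sketch for checking the installation is shown below; it builds a tiny three-node graph and applies a single GCNConv layer:

import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# a tiny graph: 3 nodes with 8 features each and 2 undirected edges
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.randn(3, 8)
data = Data(x=x, edge_index=edge_index)

conv = GCNConv(in_channels=8, out_channels=4)
out = conv(data.x, data.edge_index)
print(out.shape)   # torch.Size([3, 4])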

TensorBoard

A useful tool for tracking the training progress of a PyTorch model is TensorBoard. This can be run on the head node in non-intensive cases. TensorBoard can be added to the Conda environments described above. See the directions for setting up TensorBoard.
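
A minimal sketch of logging a scalar from a PyTorch script is shown below (the log directory name and loss values are placeholders):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/mnist")       # directory that TensorBoard will read
for epoch in range(3):
    train_loss = 1.0 / (epoch + 1)                 # placeholder value for illustration
    writer.add_scalar("Loss/train", train_loss, epoch)
writer.close()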

Profiling and Performance Tuning

For performance tuning tips see this presentation by Szymon Migacz from NVIDIA GTC 2021. See the PyTorch Performance Tuning page by the same author.

For profiling, in almost all cases you should start with line_profiler (see Python Profiling). Other tools also exist. If you are running on a GPU then you can use the NVIDIA profiler nsys to profile your code.

Reproducibility

You may find variation in your results from run to run as described in the PyTorch docs. Here is a set of calls for setting the seed:

import numpy as np
import torch

seed = 12345
np.random.seed(seed)
torch.manual_seed(seed)
torch.random.manual_seed(seed)
torch.cuda.manual_seed(seed)

Note that even when the random number generators are seeded using the above code you may still see variation across identical runs.
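
For stricter (and typically slower) determinism, PyTorch also provides the settings sketched below; see the PyTorch reproducibility docs for the caveats and the extra environment variables that some operations require:

import torch

torch.use_deterministic_algorithms(True)   # raise an error if a nondeterministic op is used
torch.backends.cudnn.benchmark = False     # disable cuDNN autotuning of convolution algorithms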

Using PyCharm on TigerGPU

This video shows how to launch PyCharm on a TigerGPU compute node and use its debugger. While it was made using TensorFlow as the example application, the procedure also applies to PyTorch.

While debugging you may benefit from using unbuffered output of print statements. This can be achieved with:

print(<message>, flush=True)

And in the Slurm script add the -u option:

python -u <myscript.py> 

More examples

More PyTorch example scripts can be found here: https://github.com/pytorch/examples

How to Learn PyTorch

See the course material and companion website (in Chinese) of Prof. Alfredo Canziani of NYU. PyTorch also offers a 60-minute introduction and tutorials.

There is also a free book.

Getting Help

If you encounter any difficulties while installing or running PyTorch on one of our HPC clusters then please send an email to [email protected] or attend a help session.