PyTorch on the Research Computing Clusters




PyTorch is a popular deep learning library for training artificial neural networks. The installation procedure depends on the cluster. If you are new to installing Python packages then see our Python page before continuing. Before installing make sure you have approximately 3 GB of free space in /home/<YourNetID> by running the checkquota command.

Della (GPU), Stellar (GPU) and Adroit (GPU)

The GPU nodes on Della (and one node of Adroit) feature the NVIDIA A100 GPU, which benefits from the most recent version of the cudatoolkit. The procedure below requires 9.3 GB of space, so see the bottom of the checkquota page for tips on dealing with large Conda environments.

For version 2.x:

$ ssh <YourNetID>  # also adroit-vis, stellar-vis1, stellar-vis2
$ module load anaconda3/2024.2
$ conda create --name torch-env pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia
$ conda activate torch-env

For version 1.x:

$ ssh <YourNetID>  # also adroit-vis, stellar-vis1, stellar-vis2
$ module load anaconda3/2024.2
$ CONDA_OVERRIDE_CUDA="11.2" conda create --name torch-env "pytorch==1.13*=cuda11*" torchvision -c conda-forge
$ conda activate torch-env

Be sure to include conda activate torch-env in your Slurm script. Instead of installing via conda, one could also use the latest container from NVIDIA. See the docs on AMP for doing mixed-precision training with the A100. For more ways to optimize your PyTorch jobs see the "Performance Tuning Guide".
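After activating the environment, a quick check from Python can confirm which build was installed. The snippet below is a minimal sketch rather than part of the official procedure, and the helper name torch_report is hypothetical. On a node without a GPU, cuda_available is expected to be False even when a CUDA build is installed.

```python
def torch_report():
    """Summarize the PyTorch installation visible to this interpreter."""
    try:
        import torch
    except ImportError:
        # most likely the Conda environment has not been activated
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_build": torch.version.cuda,            # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),  # False on nodes without a GPU
        "gpu_count": torch.cuda.device_count(),
    }


print(torch_report())
```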

We are potentially seeing performance issues with PyTorch 2.0 when data is being read from GPFS.

You can install the transformers library by Hugging Face with this additional command:

$ conda activate torch-env
(torch-env) $ pip install transformers


PyTorch is available for the Power architecture with NVIDIA GPUs from an MIT Conda channel:

$ module load anaconda3/2023.9
$ CHNL=""
$ conda create --name torch-env --channel ${CHNL} "pytorch==1.12.1=cuda11*" torchvision
$ conda activate torch-env

Be sure to include conda activate torch-env and #SBATCH --gpus-per-node=1 in your Slurm script. For those doing distributed training with DDP, see the example on GitHub.

If you encounter problems during the installation then consider using only the MIT and conda-forge channels (i.e., omit defaults).


PyTorch is also available from two IBM channels. For the latest stable release use:

$ CHNL=""

For an early access release, use this channel:

$ CHNL=""

Note that the latest version of PyTorch provided by IBM tends to lag behind the MIT version. In all cases, you can enter the URL into your web browser to see what is available.


See the directions below.

Tiger, Della, Stellar or Adroit (CPU)

$ module load anaconda3/2024.2
$ conda create --name torch-env pytorch torchvision torchaudio cpuonly --channel pytorch
$ conda activate torch-env

Be sure to include conda activate torch-env in your Slurm script.

In addition to Anaconda, Intel offers a version of PyTorch that has been optimized for Intel hardware as part of their AI Analytics Toolkit.


Example GPU Job

The example below shows how to run a simple PyTorch script on one of the clusters. We will train a simple CNN on the MNIST data set. Begin by connecting to a head node on one of the clusters. Then clone the repo:

# ssh to a cluster
$ cd /scratch/gpfs/<YourNetID>  # or /scratch/network/<YourNetID> on Adroit
$ git clone
$ cd install_pytorch

This will create a folder called install_pytorch which contains the files needed to run this example. The compute nodes do not have internet access so we must obtain the data while on the head node:

$ python

Inspect the PyTorch script. Use a text editor like vim or emacs to enter your email address in job.slurm or delete the four lines concerned with email. Submit the job to the batch scheduler:

$ sbatch job.slurm

The Slurm script used for the job is below:

#!/bin/bash
#SBATCH --job-name=torch-test    # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G                 # total memory per node (4 GB per cpu-core is default)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send mail when job begins
#SBATCH --mail-type=end          # send mail when job ends
#SBATCH --mail-type=fail         # send mail if job fails
#SBATCH --mail-user=<YourNetID>

module purge
module load anaconda3/2024.2
conda activate torch-env

python --epochs=3

You can monitor the status of the job with squeue -u $USER. Once the job runs, you'll have a slurm-xxxxx.out file in the install_pytorch directory. This log file contains both PyTorch and Slurm output.


Data Loading using Multiple CPU-cores

Watch this video on our YouTube channel for a demonstration. For multi-GPU training see this workshop.

Even when using a GPU there are still operations carried out on the CPU. Some of these operations, such as data loading, have been written to take advantage of multiple CPU-cores. Try different values for --cpus-per-task in combination with using a DataLoader to see if you get a speed-up:

#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=<T>      # cpu-cores per task (>1 if multi-threaded tasks)

Almost all PyTorch scripts show a significant performance improvement when using a DataLoader. In this case try setting num_workers equal to <T>. Watch this video to learn about writing a custom DataLoader or read this PyTorch webpage.
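One way to keep num_workers in step with --cpus-per-task is to read the value from the environment instead of hard-coding it. The sketch below assumes that Slurm exports SLURM_CPUS_PER_TASK (it does so when --cpus-per-task is set in the job script). The helper names pick_num_workers and make_loader are hypothetical.

```python
import os


def pick_num_workers(default=1):
    """Return the number of DataLoader workers matching the Slurm allocation."""
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    try:
        return max(1, int(value))
    except (TypeError, ValueError):
        # variable is unset (e.g., on the head node) or not an integer
        return default


def make_loader(dataset, batch_size=64):
    """Build a DataLoader that uses all of the allocated CPU-cores."""
    from torch.utils.data import DataLoader  # imported here so the helper above has no dependencies

    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=pick_num_workers(),
        pin_memory=True,  # speeds up host-to-GPU copies
    )
```

Whether the best value is exactly <T> depends on the dataset and the transforms, so it is worth timing a few settings.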

Consider these external data loading libraries: ffcv and NVIDIA DALI.


GPU Utilization

To see how effectively your job is using the GPU, after submitting the job run the following command:

$ squeue -u $USER

The rightmost column labeled "NODELIST(REASON)" gives the name of the node where your job is running. SSH to that node:

$ ssh della-lXXgYY

In the command above, you must replace XX and YY with the actual values (e.g., ssh della-l03g12). Once on the compute node run watch -n 1 gpustat. This will show you a percentage value indicating how effectively your code is using the GPU. The memory allocated to the GPU is also available. For the MNIST example above, in going from 1 to 8 data-loading workers the GPU utilization went from 18 to 55%. You should repeat this analysis with your actual research code to ensure that the GPU is being utilized. Be sure to learn about DataLoader and try increasing the value of cpus-per-task in tandem with num_workers to use multiple CPU-cores to prepare the data and keep the GPU busy. See the YouTube video above.

Type Ctrl+C to exit the watch command. Use the exit command to leave the compute node and return to the head node.

You can view GPU utilization as a function of time using either the "gpudash" command or Job Stats.
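If the utilization is low, it can help to measure how much of the training loop is spent waiting for data. The following is a rough sketch using only the standard library; the function name data_wait_fraction is hypothetical. Because GPU kernels are launched asynchronously, the step callable should end with torch.cuda.synchronize() for the timing to be meaningful.

```python
import time


def data_wait_fraction(batches, step):
    """Return the fraction of wall time spent waiting for the next batch.

    batches: any iterable of batches (e.g., a DataLoader)
    step:    callable that performs one training step on a batch
    """
    wait = 0.0
    start = time.perf_counter()
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            break
        wait += time.perf_counter() - t0  # time blocked on data loading
        step(batch)
    total = time.perf_counter() - start
    return wait / total if total > 0 else 0.0
```

A value close to 1 suggests the job is limited by data loading, in which case more workers are more likely to help than a faster GPU.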


Distributed Training or Using Multiple GPUs

Most models can be trained in a reasonable amount of time using a single GPU. However, if you are effectively using the GPU as determined by the procedure above then you may consider running on multiple GPUs. In general this will lead to shorter training times but because more resources are required the queue time will increase. For any job submitted to the cluster you should choose the required resources (number of GPUs, number of CPU-cores, memory) that minimize the "time to finish" which is the time the job spends running on the compute nodes plus the time spent waiting in the queue. Do not assume that using all of the GPUs on a node is the best choice, for instance. For more, see how to conduct a scaling analysis.
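The "time to finish" comparison amounts to simple arithmetic. The sketch below uses placeholder numbers that are purely illustrative and not measurements from any cluster. The function name best_config is hypothetical.

```python
def best_config(configs):
    """Return the label with the smallest time to finish.

    configs maps a label to (queue_hours, run_hours).
    """
    return min(configs, key=lambda label: sum(configs[label]))


# Placeholder numbers for illustration only; replace with your own measurements.
hypothetical = {
    "1 GPU": (0.5, 10.0),
    "2 GPUs": (2.0, 5.5),
    "4 GPUs": (9.0, 3.0),
}

print(best_config(hypothetical))
```

With these made-up numbers the two-GPU job finishes first even though the four-GPU job has the shortest run time.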

The starting point for training PyTorch models on multiple GPUs is DistributedDataParallel which is the successor to DataParallel. See this workshop for examples. Be sure to use a DataLoader with multiple workers to keep each GPU busy as discussed above.
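As a rough orientation, a DistributedDataParallel training loop has the shape sketched below. This is a minimal sketch and not the workshop code. It assumes one process per GPU started by a launcher such as torchrun, which exports RANK, LOCAL_RANK and WORLD_SIZE. The helper names ddp_env and train, the optimizer and the loss function are illustrative choices.

```python
import os


def ddp_env():
    """Read the rank variables that the launcher exports for each process."""
    return {
        "rank": int(os.environ.get("RANK", "0")),
        "local_rank": int(os.environ.get("LOCAL_RANK", "0")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
    }


def train(make_model, dataset, epochs=1, batch_size=64, num_workers=4):
    """Train with one process per GPU (call this from every process)."""
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    local_rank = ddp_env()["local_rank"]
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # rendezvous via environment variables

    model = DDP(make_model().cuda(), device_ids=[local_rank])
    sampler = DistributedSampler(dataset)  # gives each rank its own shard of the data
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                        num_workers=num_workers, pin_memory=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle differently each epoch
        for inputs, targets in loader:
            inputs = inputs.cuda(non_blocking=True)
            targets = targets.cuda(non_blocking=True)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()  # gradients are averaged across ranks here
            optimizer.step()

    dist.destroy_process_group()
```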

Also take a look at PyTorch Lightning and see an example for this in our multi-GPU training workshop.

For large models that do not fit in memory, there is the model parallel approach. In this case the model itself is distributed over multiple GPUs.

For hyperparameter tuning consider using a job array. This will allow you to run multiple jobs with one sbatch command. Each job within the array trains the network using a different set of parameters.
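Inside the training script, each array task can select its own parameters from the index that Slurm provides in SLURM_ARRAY_TASK_ID. The sketch below shows one possible mapping. The grid values and the helper name params_for_task are hypothetical.

```python
import itertools
import os


def params_for_task(grid, task_id):
    """Map an array index to one combination from a hyperparameter grid."""
    names = sorted(grid)  # fixed order so every task sees the same enumeration
    combos = list(itertools.product(*(grid[name] for name in names)))
    if not 0 <= task_id < len(combos):
        raise IndexError(f"task {task_id} is outside 0..{len(combos) - 1}")
    return dict(zip(names, combos[task_id]))


# Hypothetical grid: 2 x 3 = 6 combinations, so the job would use --array=0-5
grid = {"lr": [0.1, 0.01], "batch_size": [32, 64, 128]}
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
print(params_for_task(grid, task_id))
```

Sorting the parameter names keeps the index-to-parameters mapping stable, so a failed task can be rerun with the same index.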


Building from Source

The directions for building PyTorch from source are here. The procedure below shows how those directions would be carried out on TigerGPU:

$ module load anaconda3/2024.2
$ conda create --name torch-env numpy ninja pyyaml mkl mkl-include setuptools \
cmake cffi typing_extensions future six requests dataclasses
$ conda activate torch-env
$ conda install --channel pytorch magma-cuda102
$ module load cudatoolkit/10.2 cudnn/cuda-10.2/7.6.5 rh/devtoolset/8
$ git clone --recursive
$ cd pytorch
$ export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
$ python install


$ module load anaconda3/2023.9 cudatoolkit/11.3 cudnn/cuda-11.x/8.2.0
$ module load openmpi/cuda-11.0/gcc/4.0.4/64 nccl/cuda-11.1/2.7.8
$ conda create --name torch191 python=3.8 astunparse numpy ninja pyyaml \
setuptools cmake cffi typing_extensions future six requests dataclasses
$ conda activate torch191
$ git clone
$ cd pytorch/
$ git checkout tags/v1.9.1
$ python install



Instead of installing PyTorch through Anaconda, one can obtain it by downloading a container. The advantage of using containers is that they often include performance optimizations, additional software, pre-trained models and/or data. Try browsing NVIDIA GPU Cloud (NGC) for useful containers such as PyTorch and TensorRT. Below is an example of running PyTorch via a container on TigerGPU:

$ cd /home/<YourNetID>/software  # or another directory
# consider using tag for latest version in line below
$ singularity pull docker://
$ cd /scratch/gpfs/<YourNetID>  # or /scratch/network on adroit
$ git clone
$ cd install_pytorch
$ singularity exec $HOME/software/pytorch_23.09-py3.sif python3
$ sbatch job.slurm.ngc  # edit email address

The PyTorch container from NGC includes Torchvision, Apex and more. For more see Containers on the HPC Clusters.


Working Interactively with Jupyter on a GPU Node

This is best done using Jupyter in Open OnDemand. If for some reason you want to do this using salloc then see this YouTube video for running PyTorch on a GPU compute node.


Automatic Mixed Precision

Since version 1.6, automatic mixed precision (AMP), which was previously provided by the NVIDIA Apex library, has been included in PyTorch as torch.cuda.amp. Mixed precision training requires either the V100 or A100 GPU. For earlier versions of PyTorch you will need to install Apex from Anaconda or from source. When performing the installation from source make sure you use the same CUDA toolkit that was used for PyTorch.

$ ssh <YourNetID>  # not tigergpu for AMP
$ module load anaconda3/2024.2 cudatoolkit/11.1
$ conda activate torch-env
$ cd software  # or another directory
$ git clone
$ cd apex
$ export TORCH_CUDA_ARCH_LIST="7.0;8.0"
$ pip install -v --no-cache-dir \
--global-option="--cpp_ext" --global-option="--cuda_ext" ./

The speed-up comes from using the Tensor Cores on the GPU applied to matrix multiplications and convolutions. However, to use fp16 the dimension of each matrix must be a multiple of 8. Read about the constraints.

For simple PyTorch codes these are the necessary changes:

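from apex import amp
# wrap the model and optimizer once, after both have been created
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
# in the training loop, replace loss.backward() with the scaled version
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()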

You cannot do mixed-precision operations with P100 GPUs.
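For PyTorch 1.6 and later, the same idea can be expressed with the built-in torch.cuda.amp module instead of Apex. The following is a minimal sketch of one training step. The function name train_step_amp and its arguments are hypothetical, and newer PyTorch releases also expose this functionality under torch.amp.

```python
def train_step_amp(model, optimizer, scaler, loss_fn, inputs, targets):
    """Run one mixed-precision training step and return the loss value."""
    import torch

    optimizer.zero_grad()
    # run the forward pass in fp16 where it is numerically safe to do so
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # unscale gradients; skip the step on inf/nan
    scaler.update()                # adjust the scale factor for the next step
    return loss.item()


# Usage sketch: create the scaler once, before the training loop.
#   scaler = torch.cuda.amp.GradScaler()
#   for inputs, targets in loader:
#       train_step_amp(model, optimizer, scaler, loss_fn,
#                      inputs.cuda(), targets.cuda())
```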


PyTorch Geometric

PyTorch Geometric is a geometric deep learning extension library for PyTorch. First build a Conda environment containing PyTorch as described above then follow the steps below:

$ conda activate torch-env
(torch-env) $ conda install pyg -c pyg



A useful tool for tracking the training progress of a PyTorch model is TensorBoard. This can be run on the head node in non-intensive cases. TensorBoard is available via Conda (see installation instructions for TigerGPU above). See the directions for setting up TensorBoard.


Profiling and Performance Tuning

For performance tuning tips see this presentation by Szymon Migacz from NVIDIA GTC 2021. See the PyTorch Performance Tuning page by the same author.

For profiling, in almost all cases you should start with line_profiler (see Python Profiling). Other tools also exist. If you are running on a GPU then you can use the NVIDIA profiler nvprof or nsys to profile your code. For the MNIST example on this page, the Slurm script would be modified as follows:

/usr/local/cuda-10.1/bin/nvprof python --epochs=3

The most expensive GPU and CPU operations are shown below:

GPU activities:
   Time(%)      Time     Calls       Avg  Name
    33.16%  1.47105s      2811  523.32us  maxwell_scudnn_128x32_stridedB_medium_nn
     8.98%  398.53ms      2844  140.13us  maxwell_scudnn_128x64_relu_interior_nn
     8.57%  380.15ms      2814  135.09us  maxwell_scudnn_128x64_stridedB_splitK_interior_nn
API calls:   
    42.25%  3.43542s        26  132.13ms  5.1100us  3.43199s  cudaMalloc
    31.10%  2.52877s    234163  10.799us  5.0390us  2.4572ms  cudaLaunchKernel
     6.37%  517.94ms   1188571     435ns     254ns  1.7315ms  cudaGetDevice

"API calls" refers to operations on the CPU. We see that memory allocation dominates the work carried out on the CPU. [CUDA memcpy HtoD] and [CUDA memcpy DtoH] refer to data transfer between the CPU or Host (H) and the GPU or Device (D).



You may find variation in your results from run to run as described in the PyTorch docs. Here is a set of functions for setting the seed:

seed = 12345

Note that even when the random number generators are seeded using the above code you may still see variation across identical runs.
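For reference, seeding code of this kind commonly looks like the sketch below. It is an assumption about typical practice, not a reproduction of the listing referenced above, and the helper name set_seed is hypothetical. The imports are guarded so the helper also runs in environments where NumPy or PyTorch is missing.

```python
import random


def set_seed(seed=12345):
    """Seed the Python, NumPy and PyTorch random number generators."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed in this environment
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
        # trade some speed for more repeatable cuDNN kernels
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch is not installed in this environment


set_seed(12345)
```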


Using PyCharm on TigerGPU

This video shows how to launch PyCharm on a TigerGPU compute node and use its debugger. While it was made using TensorFlow as the example application, the procedure also applies to PyTorch.

While debugging you may benefit from using unbuffered output of print statements. This can be achieved with:

print(<message>, flush=True)

And in the Slurm script add the -u option:

python -u <> 


More examples

More PyTorch example scripts are found here:


How to Learn PyTorch

See the material and companion website (English and Chinese) of Prof. Alf Canziani of NYU. PyTorch also offers a 60-minute introduction and tutorials.

There is also a free book.


Getting Help

If you encounter any difficulties while installing or running PyTorch on one of our HPC clusters then please send an email to [email protected] or attend a help session.