PyTorch: a popular deep learning framework

Installation

PyTorch is a popular deep learning library for training artificial neural networks. The installation procedure depends on the cluster. If you are new to installing Python packages, see our Python page before continuing. Before installing, make sure you have several GB of free space in /home/<YourNetID> by running the checkquota command.

Della (GPU/PLI), Tiger (GPU), Stellar (GPU) and Adroit (GPU)

Follow the directions below to install a GPU-enabled version of PyTorch (9 GB of space is required; see the checkquota page for tips on dealing with large Conda environments).

For version 2.x (recommended):

$ ssh <YourNetID>@della-gpu.princeton.edu  # also adroit-vis, stellar-vis1/2, tiger3-vis
$ module load anaconda3/2024.6
$ conda create --name torch-env pytorch torchvision pytorch-cuda=12.4 -c pytorch -c nvidia
$ conda activate torch-env

For version 1.x:

$ ssh <YourNetID>@della-gpu.princeton.edu  # also adroit-vis, stellar-vis1, tiger3-vis
$ module load anaconda3/2024.6
$ CONDA_OVERRIDE_CUDA="11.2" conda create --name torch-env "pytorch==1.13*=cuda11*" torchvision -c conda-forge
$ conda activate torch-env

Be sure to include conda activate torch-env in your Slurm script. Instead of installing via conda, one could also use the latest container from NVIDIA. See the docs on AMP for doing mixed-precision training with the A100. For more ways to optimize your PyTorch jobs, see the "Performance Tuning Guide".

You can install the transformers library by Hugging Face with this additional command:

$ conda activate torch-env
(torch-env) $ pip install transformers

See the Hugging Face getting started materials by Princeton Language and Intelligence. Also see our KB page for Hugging Face. (A quick smoke test of transformers appears at the end of this section.)

Della, Stellar or Adroit (CPU)

$ module load anaconda3/2024.6
$ conda create --name torch-env pytorch torchvision torchaudio cpuonly --channel pytorch
$ conda activate torch-env

Be sure to include conda activate torch-env in your Slurm script.

In addition to Anaconda, Intel offers a version of PyTorch that has been optimized for Intel hardware as part of its AI Analytics Toolkit.
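Before moving on, it is worth confirming that the installation actually works. The check below is a minimal sketch, assuming the torch-env environment is active; on a CPU-only cluster or a node without a GPU, torch.cuda.is_available() will simply print False:

import torch

print(torch.__version__)          # version that was installed
print(torch.cuda.is_available())  # True if a GPU and its driver are visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first GPU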
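If you installed the transformers library, a quick smoke test is to run a pipeline. This is a hypothetical example rather than code from any repository used on this page; the default model is downloaded on first use, so run it once on the head node since the compute nodes do not have internet access:

from transformers import pipeline

# downloads a small default model on first use (requires internet access)
classifier = pipeline("sentiment-analysis")
print(classifier("Installing PyTorch on the cluster was easy."))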
Example GPU Job

The example below shows how to run a simple PyTorch script on one of the clusters. We will train a simple CNN on the MNIST data set. Begin by connecting to a head node on one of the clusters, then clone the repo:

# ssh to a cluster
$ cd /scratch/gpfs/<YourNetID>  # or /scratch/network/<YourNetID> on Adroit
$ git clone https://github.com/PrincetonUniversity/install_pytorch.git
$ cd install_pytorch

This will create a folder called install_pytorch containing the files needed to run this example. The compute nodes do not have internet access, so the data must be obtained while on the head node:

$ python download_mnist.py

Inspect the PyTorch script called mnist_classify.py. Use a text editor like vim or emacs to enter your email address in job.slurm, or delete the four lines concerned with email. Submit the job to the batch scheduler:

$ sbatch job.slurm

The Slurm script used for the job is below:

#!/bin/bash
#SBATCH --job-name=torch-test    # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G                 # total memory per node (4 GB per cpu-core is default)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send mail when job begins
#SBATCH --mail-type=end          # send mail when job ends
#SBATCH --mail-type=fail         # send mail if job fails
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2024.6
conda activate torch-env

python mnist_classify.py --epochs=3

You can monitor the status of the job with squeue -u $USER. Once the job runs, you will have a slurm-xxxxx.out file in the install_pytorch directory. This log file contains both PyTorch and Slurm output.

Data Loading using Multiple CPU-cores

Watch this video on our YouTube channel for a demonstration. For multi-GPU training see this workshop. For Hugging Face see the materials by Princeton Language and Intelligence.

Even when using a GPU, some operations are still carried out on the CPU. Some of these operations, such as data loading, have been written to take advantage of multiple CPU-cores. Try different values for --cpus-per-task in combination with a DataLoader to see if you get a speed-up:

#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=<T>      # cpu-cores per task (>1 if multi-threaded tasks)

Almost all PyTorch scripts show a significant performance improvement when using a DataLoader. In this case, try setting num_workers equal to <T> (a minimal sketch follows this section). Watch this video to learn about writing a custom DataLoader or read this PyTorch webpage.

Consider these external data loading libraries: ffcv and NVIDIA DALI.
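As a concrete illustration, below is a minimal sketch of wiring num_workers to the Slurm allocation. The "data" directory, batch size and choice of MNIST are placeholders chosen to match the example above, not code from the install_pytorch repository:

import os

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# match the number of data-loading workers to --cpus-per-task
num_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

train_set = datasets.MNIST("data", train=True, download=False,
                           transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=64, shuffle=True,
                          num_workers=num_workers, pin_memory=True)

With pin_memory=True, host-to-device copies can additionally be made asynchronous by passing non_blocking=True when moving each batch to the GPU.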
GPU Utilization

To see how effectively your job is using the GPU, run the following command after submitting the job:

$ squeue -u $USER

The rightmost column, labeled "NODELIST(REASON)", gives the name of the node where your job is running. SSH to that node:

$ ssh della-lXXgYY

In the command above, replace XX and YY with the actual values (e.g., ssh della-l03g12). Once on the compute node, run watch -n 1 gpustat. This shows a percentage value indicating how effectively your code is using the GPU, along with the amount of allocated GPU memory. For the MNIST example above, going from 1 to 8 data-loading workers increased GPU utilization from 18% to 55%. You should repeat this analysis with your actual research code to ensure that the GPU is being utilized. Be sure to learn about DataLoader and try increasing the value of cpus-per-task in tandem with num_workers to use multiple CPU-cores to prepare the data and keep the GPU busy. See the YouTube video above.

Type Ctrl+C to exit the watch command. Use the exit command to leave the compute node and return to the head node.

You can view GPU utilization as a function of time using either the "gpudash" command or Job Stats.

Distributed Training or Using Multiple GPUs

Most models can be trained in a reasonable amount of time using a single GPU. However, if you are using the GPU effectively, as determined by the procedure above, then you may consider running on multiple GPUs. In general this will lead to shorter training times, but because more resources are required the queue time will increase. For any job submitted to the cluster, you should choose the resources (number of GPUs, number of CPU-cores, memory) that minimize the "time to finish", which is the time the job spends running on the compute nodes plus the time spent waiting in the queue. Do not assume, for instance, that using all of the GPUs on a node is the best choice. For more, see how to conduct a scaling analysis.

The starting point for training PyTorch models on multiple GPUs is DistributedDataParallel, the successor to DataParallel (a minimal sketch appears at the end of this section). See this workshop for examples. Be sure to use a DataLoader with multiple workers to keep each GPU busy, as discussed above. For Hugging Face see the materials by Princeton Language and Intelligence.

Also take a look at PyTorch Lightning and see an example of its use in our multi-GPU training workshop.

For large models that do not fit in memory, there is the model parallel approach, in which the model itself is distributed over multiple GPUs.

For hyperparameter tuning, consider using a job array. This allows you to run multiple jobs with one sbatch command, where each job within the array trains the network using a different set of parameters.
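For instance, a training script can select its hyperparameters from the array index. The snippet below is a hypothetical sketch: the learning-rate values are illustrative, and Slurm only sets SLURM_ARRAY_TASK_ID for jobs submitted with the --array option (e.g., #SBATCH --array=0-3):

import os

# each element of the job array runs this script with a different task id
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])

learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]  # one value per array task
lr = learning_rates[task_id]
print(f"array task {task_id}: training with learning rate {lr}")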
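Returning to DistributedDataParallel, here is the minimal sketch promised above. It assumes the script is launched with torchrun (which sets the LOCAL_RANK environment variable for each process) and uses the NCCL backend; the one-layer model is a placeholder:

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")     # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(32, 10).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    # ... train as usual, using a DataLoader with a DistributedSampler
    #     so that each process sees a different shard of the data ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

A two-GPU run on a single node would then be launched with something like torchrun --nproc_per_node=2 train_ddp.py, where train_ddp.py is the hypothetical script above.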
Building from Source

Here are the directions for building PyTorch from source. The procedure below shows how those directions would be carried out on TigerGPU:

$ module load anaconda3/2024.6
$ conda create --name torch-env numpy ninja pyyaml mkl mkl-include setuptools \
    cmake cffi typing_extensions future six requests dataclasses
$ conda activate torch-env
$ conda install --channel pytorch magma-cuda102
$ module load cudatoolkit/10.2 cudnn/cuda-10.2/7.6.5 rh/devtoolset/8
$ git clone --recursive https://github.com/pytorch/pytorch
$ cd pytorch
$ export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
$ python setup.py install

Containers

Instead of installing PyTorch through Anaconda, one can obtain it by downloading a container. The advantage of using containers is that they often include performance optimizations, additional software, pre-trained models and/or data. Try browsing NVIDIA GPU Cloud (NGC) for useful containers such as PyTorch and TensorRT. Below is an example of running PyTorch via a container on TigerGPU:

$ cd /home/<YourNetID>/software  # or another directory
# consider using the tag for the latest version in the line below
$ singularity pull docker://nvcr.io/nvidia/pytorch:23.09-py3
$ cd /scratch/gpfs/<YourNetID>  # or /scratch/network on Adroit
$ git clone https://github.com/PrincetonUniversity/install_pytorch
$ cd install_pytorch
$ singularity exec $HOME/software/pytorch_23.09-py3.sif python3 download_mnist.py
$ sbatch job.slurm.ngc  # edit email address

The PyTorch container from NGC includes Torchvision, Apex and more. For more, see Containers on the HPC Clusters.

Working Interactively with Jupyter on a GPU Node

This is best done using Jupyter in Open OnDemand. If for some reason you want to do this using salloc, see this YouTube video for running PyTorch on a GPU compute node.

Automatic Mixed Precision

Since version 1.6, the mixed-precision functionality of the NVIDIA Apex library has been included in PyTorch as torch.cuda.amp. Mixed-precision training requires either the V100 or A100 GPU. For earlier versions of PyTorch you will need to install Apex from Anaconda or from source. When performing the installation from source, make sure you use the same CUDA toolkit that was used for PyTorch:

$ ssh <YourNetID>@traverse.princeton.edu  # not tigergpu for AMP
$ module load anaconda3/2024.6 cudatoolkit/11.1
$ conda activate torch-env
$ cd software  # or another directory
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ export TORCH_CUDA_ARCH_LIST="7.0;8.0"
$ pip install -v --no-cache-dir \
    --global-option="--cpp_ext" --global-option="--cuda_ext" ./

The speed-up comes from using the Tensor Cores on the GPU for matrix multiplications and convolutions. However, to use fp16 the dimension of each matrix must be a multiple of 8. Read about the constraints.

For simple PyTorch codes these are the necessary changes when using Apex:

from apex import amp
...
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
...
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

You cannot do mixed-precision operations with P100 GPUs.
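For PyTorch 1.6 and later, the corresponding changes with the native torch.cuda.amp API look roughly like the self-contained sketch below; the model, data and optimizer are placeholders:

import torch

model = torch.nn.Linear(32, 10).cuda()    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()      # scales the loss to avoid fp16 underflow

inputs = torch.randn(64, 32).cuda()       # placeholder batch
targets = torch.randint(0, 10, (64,)).cuda()

optimizer.zero_grad()
with torch.cuda.amp.autocast():           # forward pass runs in mixed precision
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()             # backward pass on the scaled loss
scaler.step(optimizer)                    # unscale gradients, then take a step
scaler.update()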
PyTorch Geometric

PyTorch Geometric is a geometric deep learning extension library for PyTorch. First build a Conda environment containing PyTorch as described above, then follow the steps below:

$ conda activate torch-env
(torch-env) $ conda install pyg -c pyg

TensorBoard

A useful tool for tracking the training progress of a PyTorch model is TensorBoard, which can be run on the head node in non-intensive cases. TensorBoard is available via Conda (see the installation instructions for TigerGPU above). See the directions for setting up TensorBoard.

Profiling and Performance Tuning

For performance tuning tips, see this presentation by Szymon Migacz from NVIDIA GTC 2021, and the PyTorch Performance Tuning page by the same author.

For profiling, in almost all cases you should start with line_profiler (see Python Profiling). Other tools also exist. If you are running on a GPU, you can use the NVIDIA profilers nvprof or nsys to profile your code. For the MNIST example on this page, the Slurm script would be modified as follows:

/usr/local/cuda-10.1/bin/nvprof python mnist_classify.py --epochs=3

The most expensive GPU and CPU operations are shown below:

GPU activities:
  Time(%)      Time  Calls       Avg  Name
   33.16%  1.47105s   2811  523.32us  maxwell_scudnn_128x32_stridedB_medium_nn
    8.98%  398.53ms   2844  140.13us  maxwell_scudnn_128x64_relu_interior_nn
    8.57%  380.15ms   2814  135.09us  maxwell_scudnn_128x64_stridedB_splitK_interior_nn
    ...
API calls:
  Time(%)      Time    Calls       Avg       Min       Max  Name
   42.25%  3.43542s       26  132.13ms  5.1100us  3.43199s  cudaMalloc
   31.10%  2.52877s   234163  10.799us  5.0390us  2.4572ms  cudaLaunchKernel
    6.37%  517.94ms  1188571     435ns     254ns  1.7315ms  cudaGetDevice
    ...

"API calls" refers to operations on the CPU. We see that memory allocation dominates the work carried out on the CPU. [CUDA memcpy HtoD] and [CUDA memcpy DtoH] refer to data transfer between the CPU or Host (H) and the GPU or Device (D).

Reproducibility

You may find variation in your results from run to run, as described in the PyTorch docs. Here is a set of calls for setting the seeds:

import numpy as np
import torch

seed = 12345
np.random.seed(seed)
torch.manual_seed(seed)       # seeds the CPU generator (torch.random)
torch.cuda.manual_seed(seed)  # seeds the current GPU

Note that even when the random number generators are seeded using the code above, you may still see variation across identical runs.

Using PyCharm on TigerGPU

This video shows how to launch PyCharm on a TigerGPU compute node and use its debugger. While it was made using TensorFlow as the example application, the procedure also applies to PyTorch.

While debugging, you may benefit from unbuffered output of print statements. This can be achieved with:

print(<message>, flush=True)

And in the Slurm script add the -u option:

python -u <myscript.py>

More examples

More PyTorch example scripts can be found here: https://github.com/pytorch/examples

How to Learn PyTorch

See the material and companion website (English and Chinese) of Prof. Alfredo Canziani of NYU. PyTorch also offers a 60-minute introduction and tutorials. There is also a free book.

Getting Help

If you encounter any difficulties while installing or running PyTorch on one of our HPC clusters, please send an email to [email protected] or attend a help session.