TensorFlow on the HPC Clusters

Installation

TensorFlow is a popular deep learning library for training artificial neural networks. The installation instructions depend on the version and cluster. This page covers version 2.x. If you are new to installing Python packages then see our Python page before continuing. The installation of TensorFlow requires about 3 GB of space in your /home/<YourNetID> directory. Run the checkquota command before installing to see if you have sufficient space.

Della-GPU

The GPU nodes on Della feature the new A100 GPUs, so it is important to use the latest NVIDIA libraries. The procedure below uses conda and pip, but one could also use the latest NVIDIA container. Currently, if you install tensorflow-gpu using conda alone it will not be able to recognize the GPUs. Below is a working procedure that uses conda and pip (it requires 9 GB of space, so see the bottom of the checkquota page for tips on dealing with large Conda environments if necessary):

$ ssh <YourNetID>@della-gpu.princeton.edu
$ module load anaconda3/2020.11
$ conda create --name tf2-gpu python=3.8 cudatoolkit=11 cudnn=8 --channel nvidia
$ conda activate tf2-gpu
$ pip install tensorflow-gpu==2.4
$ conda deactivate

Be sure to include conda activate tf2-gpu and #SBATCH --gres=gpu:1 in your Slurm script.

TigerGPU or Adroit (GPU)

TensorFlow 2.x is available on Anaconda Cloud (it requires at least 6.6 GB of space):

$ module load anaconda3/2020.11
$ conda create --name tf2-gpu tensorflow-gpu <package-2> <package-3> ... <package-N>
$ conda activate tf2-gpu

For instance, to build an environment with the GPU version of TensorFlow, Matplotlib and Tensorboard:

$ module load anaconda3/2020.11
$ conda create --name tf2-gpu tensorflow-gpu matplotlib tensorboard
$ conda activate tf2-gpu

Be sure to include conda activate tf2-gpu and #SBATCH --gres=gpu:1 in your Slurm script.

Traverse

LATEST RELEASE

TensorFlow 2.x is available for IBM's POWER architecture and NVIDIA GPUs via IBM's Conda channel, Watson Machine Learning Community Edition. Follow these installation directions:

$ module load anaconda3/2020.11
$ CHNL="https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda"
$ conda create --name=tf2-gpu --channel ${CHNL} tensorflow-gpu
$ conda activate tf2-gpu

Be sure to include conda activate tf2-gpu and #SBATCH --gres=gpu:1 in your Slurm script.

EARLY ACCESS VERSION

If you would benefit from a newer version of TensorFlow in pre-release form then use the early access channel instead: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access

TigerCPU, Della, Stellar or Adroit (CPU)

TensorFlow 2.x is available on Anaconda Cloud:

$ module load anaconda3/2020.11
$ conda create --name tf2-cpu tensorflow <package-2> <package-3> ... <package-N>
$ conda activate tf2-cpu

Be sure to include conda activate tf2-cpu in your Slurm script. See tips for running on CPUs.
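
For CPU-only runs it can help to match TensorFlow's thread pools to the number of CPU-cores allocated by Slurm. Below is a minimal sketch, assuming the settings are placed at the top of your script before any TensorFlow operations run (the inter-op value is only a starting point):

import os
import tensorflow as tf

# match TensorFlow's thread pools to the Slurm allocation (--cpus-per-task)
num_threads = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
tf.config.threading.set_intra_op_parallelism_threads(num_threads)
tf.config.threading.set_inter_op_parallelism_threads(1)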

 

Example Job

Test the installation by running a short job. First, download the necessary data. The compute nodes do not have internet access so we do the download on the head node:

$ python -c "import tensorflow as tf; tf.keras.datasets.mnist.load_data()"

The above command will download mnist.npz into the directory ~/.keras/datasets. To run the example follow these commands:

$ git clone https://github.com/PrincetonUniversity/slurm_mnist.git
$ cd slurm_mnist

Use a text editor like vim or emacs to enter your email address in job.slurm or delete the four lines concerned with email. Then submit the job:

$ sbatch job.slurm

You can monitor the status of the job with:

$ squeue -u $USER

Once the job runs, you'll have a slurm-xxxxx.out file in the slurm_mnist directory. This log file contains both TensorFlow and Slurm output.

Here is the Slurm script (job.slurm):

#!/bin/bash
#SBATCH --job-name=tf2-test      # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G                 # total memory per node (default is 4 GB per CPU-core)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.11
conda activate tf2-gpu

python mnist_classify.py
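
The script mnist_classify.py trains a small network on the MNIST data. A minimal sketch of such a script (not necessarily the exact contents of the repository) looks like:

import tensorflow as tf

# load_data() reads the cached mnist.npz that was downloaded on the head node
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=128)
model.evaluate(x_test, y_test, verbose=2)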

 

Multithreaded Data Loading

Even when using a GPU there are still operations that are carried out on the CPU. Some of these operations have been written to take advantage of multithreading. Try different values of --cpus-per-task to see if you get a speed-up:

#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=<T>      # cpu-cores per task (>1 if multi-threaded tasks)

On TigerGPU, there are seven CPU-cores for every one GPU. Try doing a set of runs where you vary <T> from 1 to 7 to find the optimal value. If you are running on CPUs only then see this page.

Multithreading should give a substantial speed-up for large input pipelines via tf.data where the computations take place on the (multicore) CPU only. See this video for an introduction.
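
For illustration, here is a hedged sketch of a tf.data pipeline that uses multiple CPU-cores for preprocessing and overlaps data preparation with training (the preprocessing function and batch size are placeholders):

import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # use tf.data.experimental.AUTOTUNE before TF 2.4

def preprocess(image, label):
    # example CPU-side preprocessing
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(10000)
           .map(preprocess, num_parallel_calls=AUTOTUNE)  # parallelize across CPU-cores
           .batch(128)
           .prefetch(AUTOTUNE))  # overlap data preparation with model execution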

 

GPU Utilization

To see how effectively your job is using the GPU, after submitting the job run the following command:

$ squeue -u $USER

The rightmost column labeled "NODELIST(REASON)" gives the name of the node where your job is running. Connect to this node:

$ ssh tiger-iXXgYY

In the command above, you must replace XX and YY with the actual values (e.g., ssh tiger-i19g1). Once on the compute node, run watch -n 1 gpustat. This will show you a percentage value indicating how effectively your code is using the GPU, along with the amount of GPU memory that has been allocated. TensorFlow by default takes all available GPU memory. You should measure the GPU utilization for your research code to ensure that the GPU is being used effectively. Be sure to learn about tf.data and try increasing the value of cpus-per-task to use multiple CPU-cores to prepare the data and keep the GPU busy.

Type Ctrl+C to exit the watch command. Type exit to leave the compute node and return to the head node.

For jobs that run for more than 10 minutes you can check utilization by looking at the TigerGPU utilization dashboard. See the bottom of that page for tips on improving utilization.
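
As noted above, TensorFlow claims all available GPU memory by default. If that behavior is a problem (for example, when debugging), memory growth can be enabled at the top of your script. A minimal sketch:

import tensorflow as tf

# allocate GPU memory as needed instead of claiming it all up front
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)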

Common Mistakes

TensorFlow will run GPU-enabled operations on the GPU by default. However, if you request more than one GPU in your Slurm script then TensorFlow will use one GPU and ignore the others unless you explicitly make the appropriate changes to your TensorFlow script. The changes that need to be made are covered in the next section. A second common mistake is to accidentally install tensorflow with conda instead of tensorflow-gpu. The former package cannot use GPUs.

 

Distributed Training or Using Multiple GPUs

Most models can be trained in a reasonable amount of time using a single GPU. However, if you are effectively using the GPU as determined by the procedure above then you may consider running on multiple GPUs. In general this will lead to shorter training times but because more resources are required the queue time will increase. For any job submitted to the cluster you should choose the required resources (number of GPUs, number of CPU-cores, memory) that minimize the "time to finish" which is the time the job spends running on the compute nodes plus the time spent waiting in the queue. Do not assume that using all four GPUs on a node is the best choice, for instance.

Models that are built using tf.keras can be made to run on multiple GPUs quite easily. This is done by using a data-parallel approach where a copy of the model is assigned to each GPU and each copy operates on a different mini-batch. Using multiple GPUs is also easy for models defined through tf.estimator. TensorFlow offers ways to use multiple GPUs with the subclassing API as well (see tf.distribute and the tutorials).
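
A hedged sketch of the data-parallel approach using tf.distribute.MirroredStrategy (the model is a placeholder; see the TensorFlow distributed training tutorials for complete examples):

import tensorflow as tf

# one model replica per GPU visible to the job (e.g., #SBATCH --gres=gpu:2)
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # the model and optimizer must be created inside the strategy scope
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(...) then splits each batch across the replicas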

TensorFlow offers an approach for using multiple GPUs on multiple nodes. Horovod can also be used.

For hyperparameter tuning consider using a job array. This will allow you to run multiple jobs with one sbatch command. Each job within the array trains the network using a different set of parameters.
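
One possible pattern, shown in the sketch below, is to have each array task (e.g., #SBATCH --array=0-3) select its hyperparameters from the SLURM_ARRAY_TASK_ID environment variable; the search space here is purely illustrative:

import os

# hypothetical search space; each Slurm array task trains with one value
learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
lr = learning_rates[task_id]
print(f"Array task {task_id}: training with learning rate {lr}")
# ... build and train the model with this learning rate ...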

 

Containers

Instead of installing TensorFlow through Anaconda, one can obtain it by downloading a container. The advantage of using containers is that they often include performance optimizations, additional software, trained models and/or data. Try browsing NVIDIA GPU Cloud for useful containers such as TensorFlow and TensorRT. Below is an example of running TensorFlow via a container on TigerGPU:

$ cd /home/<YourNetID>/software  # or another directory
# consider using tag for latest version in line below
$ singularity pull docker://nvcr.io/nvidia/tensorflow:21.06-tf2-py3
$ cd /scratch/gpfs/<YourNetID>  # or /scratch/network on adroit
$ git clone https://github.com/PrincetonUniversity/slurm_mnist
$ cd slurm_mnist
$ singularity exec $HOME/software/tensorflow_21.06-tf2-py3.sif python3 download_mnist.py
$ sbatch job.slurm.ngc  # edit email address

See our Singularity page for more info as well as this post by NVIDIA.

 

Suppressing INFO Statements

Add these lines to the top of your Python script to prevent INFO statements from appearing in the output:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

The default value is '0' while the max is '3' where even error messages are suppressed.

 

TensorBoard

TensorBoard comes included in a Conda installation of TensorFlow. It can be used to view your graph, monitor training progress and more. It can be used on the head node of a cluster in non-intensive cases:

(tf2-gpu) [aturing@tigergpu ~]$ tensorboard --logdir /scratch/gpfs/<YourNetID>/myproj/log
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.1.0 at http://localhost:6006/ (Press CTRL+C to quit)

On your laptop or local machine, establish an ssh tunnel combined with port forwarding:

$ ssh -N -f -L 6006:127.0.0.1:6006 <YourNetID>@tigergpu.princeton.edu

Lastly, on your laptop point a web browser at the URL given when tensorboard was launched, for example: http://localhost:6006/

This procedure can be modified by launching tensorboard on the compute node where your job is running. If your job is running on node tiger-h12g10, for instance, then launch tensorboard with:

$ ssh tiger-h12g10
$ tensorboard --logdir /scratch/gpfs/<YourNetID>/myproj/log --bind_all

And set up the tunnel with this modification:

$ ssh -N -f -L 6006:tiger-h12g10:6006 <YourNetID>@tigergpu.princeton.edu

You will still point your web browser at http://localhost:6006/ assuming 6006 was the port you were assigned.

If you find that the above does not work then try specifying a port between 6006 and 6106 and repeat the procedure using that port, for example:

$ tensorboard --logdir /scratch/gpfs/<YourNetID>/myproj/log --bind_all --port 6042
$ ssh -N -f -L 6042:tiger-h12g10:6042 <YourNetID>@tigergpu.princeton.edu
http://localhost:6042/  # web browser url

When you are done using tensorboard, terminate the ssh tunnel on your local machine by running lsof -i tcp:6006 to get the PID and then kill -9 <PID> (e.g., kill -9 6010).

TensorBoard can be used for intensive cases on Tigressdata. Learn more about using graphics on the HPC clusters.

If you are working with TensorFlow in a Jupyter notebook then you can load the extension with %load_ext tensorboard and launch it inline with the %tensorboard magic command.
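
To generate the event files that tensorboard reads, models built with tf.keras can use the TensorBoard callback. A minimal sketch (the log directory is a placeholder):

import tensorflow as tf

log_dir = "/scratch/gpfs/<YourNetID>/myproj/log"  # placeholder; pass the same path to --logdir
tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

# model.fit(x_train, y_train, epochs=5, callbacks=[tb_callback])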

 

Profiling

Profiling your TensorFlow script is the first step toward resolving performance bottlenecks and reducing your training time.

line_profiler

An excellent starting point for profiling any Python script is line_profiler. This will provide high-level profiling data such as the amount of time spent on each line of your script.
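
A typical workflow, assuming line_profiler has been pip-installed into your Conda environment, is to decorate the functions of interest and run the script under kernprof:

# demo.py -- decorate the functions you want profiled line by line
@profile                      # name injected by kernprof at run time; no import needed
def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    slow_sum(10_000_000)

# profile with:  kernprof -l -v demo.py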

dlprof

The NVIDIA profiler dlprof was designed for profiling deep learning frameworks such as TensorFlow and PyTorch. It provides detailed information, including how the Tensor Cores are being used on the A100 GPUs, and it suggests various improvements. If possible you should use the TensorFlow container from NGC since it comes with all the needed software (see Containers above). If you don't use the container then make sure you install tensorboard using pip and not conda. You will also need to add code to your Python script. Prefer Chrome over Firefox when choosing a web browser for tensorboard. The commands below illustrate the process assuming TensorFlow was installed as suggested above (read the dlprof documentation for details):

$ conda activate tf2-gpu
$ pip install nvidia-pyindex
$ pip install nvidia-dlprof
$ pip install tensorboard

See these presentation files by NVIDIA for an overview of the profiler and how to use it. There is also a video demonstration.

The last line of your Slurm script should be something like the following:

dlprof --mode=tensorflow2 --reports=detail python mnist_classify.py --epochs=1

Note that your job will run significantly slower under dlprof. However, you only need to run for a short time to acquire the needed data. Once your job is complete, launch tensorboard:

$ tensorboard --logdir ./event_files

See the tensorboard directions above for viewing the profiling data in your web browser, and consult the presentation files and video linked above for help interpreting the results.

TF Profiler

See this guide for profiling a code that uses the Keras API. A more general procedure is described here.
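
For reference, the programmatic interface looks roughly like the sketch below (the log directory is a placeholder); the profile_batch argument of the Keras TensorBoard callback is an alternative:

import tensorflow as tf

# profile a bounded region of the program; view the trace in TensorBoard's Profile tab
tf.profiler.experimental.start("/scratch/gpfs/<YourNetID>/myproj/log")  # placeholder path
# ... run a few training steps here ...
tf.profiler.experimental.stop()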

nsys and ncu

NVIDIA Nsight Systems (nsys) can be used to produce a timeline of the execution of your code, while NVIDIA Nsight Compute (ncu) can be used for detailed profiling of individual GPU kernels.

 

Performance Tuning

See this presentation from NVIDIA GTC 2021 for performance tips.

 

Using PyCharm on TigerGPU

This video shows how to launch PyCharm on a TigerGPU compute node and use its debugger on an actively running TensorFlow script.

While debugging you may benefit from using unbuffered output of print statements. This can be achieved with:

print(<message>, flush=True)

And in the Slurm script add the -u option:

python -u <myscript.py>

If you are debugging a TensorFlow script then it can be useful to generate the same numerical results for each run. One can achieve this by setting the seeds of the two random number generators:

tf.random.set_seed(1234)
np.random.seed(4242)

 

Inference with TensorRT

Some TigerGPU users are making use of TensorRT, an SDK for high-performance deep learning inference. You can either use the container from NVIDIA with Singularity or build from source as described below.

Below are build directions for TigerGPU:

#!/bin/bash
module purge
module load anaconda3/2020.11
conda create --name trt72-env python=3.7 -y
conda activate trt72-env

cd $HOME/software  # then obtain the software from nvidia website
tar zxf TensorRT-7.2.1.6.CentOS-7.6.x86_64-gnu.cuda-11.1.cudnn8.0.tar.gz
cd TensorRT-7.2.1.6

module load cudatoolkit/11.1 cudnn/cuda-11.0/8.0.2
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/$USER/software/TensorRT-7.2.1.6/lib

cd python; pip install tensorrt-7.2.1.6-cp37-none-linux_x86_64.whl
cd ../uff; pip install uff-0.6.9-py2.py3-none-any.whl 
cd ../graphsurgeon; pip install graphsurgeon-0.4.5-py2.py3-none-any.whl 
cd ../onnx_graphsurgeon; pip install onnx_graphsurgeon-0.2.6-py2.py3-none-any.whl

To test the installation:

$ cd /home/$USER/software/TensorRT-7.2.1.6/data/mnist
$ python download_pgms.py
$ cd ../../samples/sampleMNIST
$ make

Then run the test on a compute node:

$ salloc -N 1 -n 1 -t 5 --gres=gpu:1
$ module purge
$ module load anaconda3/2020.11 cudatoolkit/11.1 cudnn/cuda-11.0/8.0.2
$ conda activate trt72-env
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/$USER/software/TensorRT-7.2.1.6/lib
$ cd /home/$USER/software/TensorRT-7.2.1.6/bin
$ ./sample_mnist
$ exit

The Cascade Lake nodes on Della support Intel Vector Neural Network Instructions (VNNI), also known as DL Boost. The idea is to cast the FLOAT32 weights of your trained model to the INT8 data type for faster inference.

 

R and Julia

See this post by Danny Simpson on using TensorFlow 1.x with R. For version 2.x see TensorFlow for R. Julia programmers should look at TensorFlow.jl for version 1.x or move to Flux.

 

Building from Source

At times it may be necessary to build TensorFlow from source. The procedure below can be used on Della-GPU:

$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow
$ git checkout r2.4
$ module load anaconda3/2020.11
$ conda create --name tfsrc python=3.8 -y
$ conda activate tfsrc
$ pip install numpy==1.19 wheel
$ pip install keras_preprocessing --no-deps
$ ./configure

Please specify the location of python. [Default is /scratch/gpfs/aturing/CONDA/envs/tfsrc/bin/python3]: 
Please input the desired Python library path to use.  Default is [/scratch/gpfs/aturing/CONDA/envs/tfsrc/lib/python3.8/site-packages]
Do you wish to build TensorFlow with ROCm support? [y/N]: N
Do you wish to build TensorFlow with CUDA support? [y/N]: y
Do you wish to build TensorFlow with TensorRT support? [y/N]: N
Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 10]: 11.1
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 8.2.0
Please specify the locally installed NCCL version you want to use. [Leave empty to use http://github.com/nvidia/nccl]:
Please specify the comma-separated list of base paths to look for CUDA libraries and headers. [Leave empty to use the default]: 
/usr/local/cudnn/cuda-11.3/8.2.0/lib64,/usr/local/cudnn/cuda-11.3/8.2.0/include,/usr/local/cuda-11.1/lib64,/usr/local/cuda-11.1/include,/usr/local/cuda-11.1/bin,/usr/local/cuda-11.1
Please specify a list of comma-separated CUDA compute capabilities you want to build with [...]: 8.0
Do you want to use clang as CUDA compiler? [y/N]: N
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -Wno-sign-compare]: 
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: N

$ export TMP=/tmp
$ bazel build --config=cuda //tensorflow/tools/pip_package:build_pip_package
$ ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ pip install /tmp/tensorflow_pkg/tensorflow-2.4.2-cp38-cp38-linux_x86_64.whl
$ conda deactivate

# clean the cache
$ chmod -R 744 ~/.cache/bazel
$ rm -rf ~/.cache/bazel

The procedure below was found to work on Della (CPU) in 2020:

# ssh della
$ module load anaconda3/2020.11
# anaconda3 provides us with pip six numpy wheel setuptools mock
$ pip install -U --user keras_applications --no-deps
$ pip install -U --user keras_preprocessing --no-deps

$ cd /tmp/bazel/
$ wget https://github.com/bazelbuild/bazel/releases/download/2.0.0/bazel-2.0.0-installer-linux-x86_64.sh
$ chmod +x bazel-2.0.0-installer-linux-x86_64.sh
$ ./bazel-2.0.0-installer-linux-x86_64.sh --prefix=/tmp/bazel
$ export PATH=/tmp/bazel/bin:$PATH

$ cd ~/sw
$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow
$ ./configure  # took defaults or answered no
$ module load rh/devtoolset/7
$ CC=`which gcc` BAZEL_LINKLIBS=-l%:libstdc++.a bazel build --verbose_failures //tensorflow/tools/pip_package:build_pip_package
$ ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ pip install --user /tmp/tensorflow_pkg/tensorflow-2.1.0-cp37-cp37m-linux_x86_64.whl
$ cd
$ python
>>> import tensorflow as tf

 

How to Learn TensorFlow

See the tutorials and guides. See examples on GitHub.

 

Where to Store Your Files

You should run your jobs out of /scratch/gpfs/<YourNetID> on the HPC clusters. These filesystems are very fast and provide vast amounts of storage. Do not run jobs out of /tigress or /projects. That is, you should never be writing the output of actively running jobs to those filesystems. /tigress and /projects are slow and should only be used for backing up the files that you produce on /scratch/gpfs. Your /home directory on all clusters is small and it should only be used for storing source code and executables.

The commands below give you an idea of how to properly run a TensorFlow job:

$ ssh <YourNetID>@tigergpu.princeton.edu
$ cd /scratch/gpfs/<YourNetID>
$ mkdir myjob && cd myjob
# put TensorFlow script and Slurm script in myjob
$ sbatch job.slurm

If the run produces data that you want to backup then copy or move it to /tigress:

$ cp -r /scratch/gpfs/<YourNetID>/myjob /tigress/<YourNetID>

For large transfers consider using rsync instead of cp. Most users only do backups to /tigress every week or so. While /scratch/gpfs is not backed up, files are never removed. However, important results should be transferred to /tigress or /projects.

The diagram below gives an overview of the filesystems:

[Diagram: HPC clusters and the filesystems that are available to each. Users should write job output to /scratch/gpfs.]

 

Getting Help

If you encounter any difficulties while running TensorFlow on one of our HPC clusters then please send an email to cses@princeton.edu or attend a help session.

 

Acknowledgements

Kyle Felker has made improvements to this page.