TensorFlow on the HPC Clusters

Installation

TensorFlow is a popular deep learning library for training artificial neural networks. The installation instructions depend on the version and cluster. This page covers version 2.x. If you are new to installing Python packages then see our Python page before continuing. The installation of TensorFlow requires about 3 GB of space in your /home/<YourNetID> directory. Run the checkquota command before installing to see if you have sufficient space.

TigerGPU or Adroit (GPU)

TensorFlow 2.x is available on Anaconda Cloud:

$ module load anaconda3/2020.11
$ conda create --name tf2-gpu tensorflow-gpu <package-2> <package-3> ... <package-N>
$ conda activate tf2-gpu

For instance, to build an environment with the GPU version of TensorFlow, Matplotlib and Tensorboard:

$ module load anaconda3/2020.11
$ conda create --name tf2-gpu tensorflow-gpu matplotlib tensorboard
$ conda activate tf2-gpu

Be sure to include conda activate tf2-gpu and #SBATCH --gres=gpu:1 in your Slurm script.

Traverse

LATEST RELEASE

TensorFlow 2.x is available for IBM's POWER architecture and NVIDIA GPUs via IBM's Conda channel, Watson Machine Learning Community Edition. Follow these installation directions:

$ module load anaconda3/2020.11
$ CHNL="https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda"
$ conda create --name=tf2-gpu --channel ${CHNL} tensorflow-gpu
$ conda activate tf2-gpu

Be sure to include conda activate tf2-gpu and #SBATCH --gres=gpu:1 in your Slurm script.

EARLY ACCESS VERSION

If you would benefit from a newer version of TensorFlow in pre-release form then use the early access channel instead: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access

TigerCPU, Della, Perseus or Adroit (CPU)

TensorFlow 2.x is available on Anaconda Cloud:

$ module load anaconda3/2020.11
$ conda create --name tf2-cpu tensorflow <package-2> <package-3> ... <package-N>
$ conda activate tf2-cpu

Be sure to include conda activate tf2-cpu in your Slurm script. See tips for running on CPUs.

 

Example Job

Test the installation by running a short job. First, download the necessary data. The compute nodes do not have internet access so we do the download on the head node:

$ python -c "import tensorflow as tf; tf.keras.datasets.mnist.load_data()"

The above command will download mnist.npz into the directory ~/.keras/datasets. To run the example follow these commands:

$ git clone https://github.com/PrincetonUniversity/slurm_mnist.git
$ cd slurm_mnist
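
Before submitting, it can be worth confirming that the earlier download succeeded, since the compute nodes cannot fetch the data themselves. The sketch below only checks for the file in the cache location mentioned above (~/.keras/datasets/mnist.npz); the function names are illustrative and not part of the slurm_mnist repository.

```python
from pathlib import Path


def mnist_cache_path(home=None):
    """Return the path where Keras caches mnist.npz (location taken from the text above)."""
    home = Path(home) if home is not None else Path.home()
    return home / ".keras" / "datasets" / "mnist.npz"


def mnist_is_cached(home=None):
    """True if the dataset file is already present, so the job needs no internet access."""
    return mnist_cache_path(home).is_file()
```

Running python -c "from check_cache import mnist_is_cached; print(mnist_is_cached())" on the head node (assuming the code is saved as check_cache.py, a hypothetical file name) should print True once the download has completed.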

Use a text editor like vim or emacs to enter your email address in job.slurm or delete the four lines concerned with email. Then submit the job:

$ sbatch job.slurm

You can monitor the status of the job with:

$ squeue -u $USER

Once the job runs, you'll have a slurm-xxxxx.out file in the slurm_mnist directory. This log file contains both TensorFlow and Slurm output.

Here is the Slurm script (job.slurm):

#!/bin/bash
#SBATCH --job-name=tf2-test      # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G                 # total memory per node (default is 4 GB per CPU-core)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.11
conda activate tf2-gpu

python mnist_classify.py

 

Multithreaded Data Loading

Even when using a GPU there are still operations that are carried out on the CPU. Some of these operations have been written to take advantage of multithreading. Try different values of --cpus-per-task to see if you get a speed-up:

#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=<T>      # cpu-cores per task (>1 if multi-threaded tasks)

On TigerGPU, there are seven CPU-cores for every one GPU. Try doing a set of runs where you vary <T> from 1 to 7 to find the optimal value. If you are running on CPUs only then see this page.

Multithreading should give a substantial speed-up for large input pipelines via tf.data where the computations take place on the (multicore) CPU only. See this video for an introduction.
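
As a minimal sketch of how the value of --cpus-per-task can be passed to a tf.data pipeline, the code below reads the SLURM_CPUS_PER_TASK environment variable (set by Slurm when --cpus-per-task is given) and uses it for num_parallel_calls. The preprocessing step and the function names are placeholders chosen for illustration; your own pipeline will differ.

```python
import os


def cores_for_this_task(env=None):
    """Number of CPU-cores granted via --cpus-per-task (falls back to 1 if unset)."""
    env = os.environ if env is None else env
    return max(1, int(env.get("SLURM_CPUS_PER_TASK", "1")))


def build_input_pipeline(images, labels, batch_size=64, env=None):
    """Sketch of a tf.data pipeline whose map step runs on the allocated CPU-cores."""
    import tensorflow as tf  # imported here so the helper above works without TensorFlow

    def preprocess(image, label):
        # Placeholder preprocessing: scale pixel values to the range [0, 1].
        return tf.cast(image, tf.float32) / 255.0, label

    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    dataset = dataset.map(preprocess, num_parallel_calls=cores_for_this_task(env))
    dataset = dataset.batch(batch_size)
    # Prefetch so the CPU prepares the next batch while the GPU works on the current one.
    return dataset.prefetch(tf.data.experimental.AUTOTUNE)
```

With this approach the Python script does not need to be edited when you change <T> in the Slurm script, which makes the set of timing runs described above easier to carry out.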

 

GPU Utilization

To see how effectively your job is using the GPU, after submitting the job run the following command:

$ squeue -u $USER

The rightmost column labeled "NODELIST(REASON)" gives the name of the node where your job is running. Connect to this node:

$ ssh tiger-iXXgYY

In the command above, you must replace XX and YY with the actual values (e.g., ssh tiger-i19g1). Once on the compute node, run watch -n 1 gpustat. This shows the GPU utilization as a percentage along with the amount of GPU memory in use. Note that TensorFlow by default takes all available GPU memory, so the memory value says little about how much your model actually needs. You should measure the GPU utilization for your research code to ensure that the GPU is being utilized effectively. Be sure to learn about tf.data and try increasing the value of --cpus-per-task to use multiple CPU-cores to prepare the data and keep the GPU busy.

Type Ctrl+C to exit the watch command. Type exit to leave the compute node and return to the head node.

For jobs that run for more than 10 minutes you can check utilization by looking at the TigerGPU utilization dashboard. See the bottom of that page for tips on improving utilization.
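
The same utilization and memory numbers can also be collected from a script instead of watching them interactively. The sketch below assumes that nvidia-smi is on the PATH of the compute node and uses its CSV query output; the function names and dictionary keys are illustrative choices.

```python
import subprocess

# Ask nvidia-smi for utilization (%) and memory (MiB) as bare comma-separated numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Turn nvidia-smi CSV lines into a list of dicts, one per GPU."""
    stats = []
    for line in text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        stats.append({"util_percent": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats


def query_gpu_stats():
    """Run nvidia-smi on the current node and parse its output."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)
```

Calling query_gpu_stats() periodically while the job runs gives a simple record of whether the GPU stays busy over the course of training.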

Common Mistakes

TensorFlow will run GPU-enabled operations on the GPU by default. However, if you request more than one GPU in your Slurm script then TensorFlow will use one GPU and ignore the others unless you explicitly make the appropriate changes to your TensorFlow script. The changes that need to be made are covered in the next section. A second common mistake is to accidentally Conda install tensorflow instead of tensorflow-gpu. The former package cannot use GPUs.
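
Both mistakes can be caught early by comparing the number of GPUs allocated to the job with the number that TensorFlow can see. The sketch below assumes that Slurm exports CUDA_VISIBLE_DEVICES for jobs that request GPUs (the usual configuration) and that TensorFlow 2.1 or later is installed, since it relies on tf.config.list_physical_devices. The function names are illustrative.

```python
import os


def gpus_allocated_by_slurm(env=None):
    """Count the GPUs listed in CUDA_VISIBLE_DEVICES (assumed to be set by Slurm)."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "").strip()
    return len([device for device in value.split(",") if device]) if value else 0


def gpus_seen_by_tensorflow():
    """Count the GPUs TensorFlow can use; 0 on a GPU node suggests the CPU-only package."""
    import tensorflow as tf  # imported here so the helper above works without TensorFlow
    return len(tf.config.list_physical_devices("GPU"))


def report():
    """Print a short diagnostic; intended to be called at the top of a training script."""
    allocated = gpus_allocated_by_slurm()
    visible = gpus_seen_by_tensorflow()
    print(f"Slurm allocated {allocated} GPU(s); TensorFlow sees {visible}", flush=True)
    if allocated > 0 and visible == 0:
        print("TensorFlow cannot see the GPU: check that tensorflow-gpu is installed")
```

Note that seeing several GPUs does not mean they are all being used; that still requires the changes to the TensorFlow script mentioned above.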

 

Distributed Training or Using Multiple GPUs

Most models can be trained in a reasonable amount of time using a single GPU. However, if you are effectively using the GPU as determined by the procedure above then you may consider running on multiple GPUs. In general this will lead to shorter training times but because more resources are required the queue time will increase. For any job submitted to the cluster you should choose the required resources (number of GPUs, number of CPU-cores, memory) that minimize the "time to finish" which is the time the job spends running on the compute nodes plus the time spent waiting in the queue. Do not assume that using all four GPUs on a node is the best choice, for instance.

Models that are built using tf.keras can be made to run on multiple GPUs quite easily. This is done by using a data parallel approach where a copy of the model is assigned to each GPU and each copy operates on a different mini-batch. Using multiple GPUs is also easy for models defined through tf.estimator. TensorFlow offers ways to use multiple GPUs with the subclassing API as well (see tf.distribute and tutorials).
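
A minimal sketch of the data parallel approach for a tf.keras model is shown below, using tf.distribute.MirroredStrategy on a single node. The small MNIST-style network and the per-GPU batch size of 64 are placeholders for illustration and are not taken from the slurm_mnist example.

```python
def global_batch_size(per_replica_batch, num_replicas):
    """Each GPU (replica) processes its own mini-batch, so the global batch scales with GPUs."""
    return per_replica_batch * num_replicas


def build_mirrored_model(per_replica_batch=64):
    """Build and compile a small tf.keras model under MirroredStrategy."""
    import tensorflow as tf  # imported here so the helper above works without TensorFlow

    strategy = tf.distribute.MirroredStrategy()  # uses all GPUs visible to the job
    with strategy.scope():
        # Variables created inside the scope are mirrored across the GPUs.
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )
    batch = global_batch_size(per_replica_batch, strategy.num_replicas_in_sync)
    return model, batch
```

The returned batch size would then be used when batching the dataset passed to model.fit, so that each GPU receives a mini-batch of the intended size. Remember to request the matching number of GPUs with #SBATCH --gres=gpu:<N>.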

TensorFlow offers an approach for using multiple GPUs on multiple nodes. Horovod can also be used.

For hyperparameter tuning consider using a job array. This will allow you to run multiple jobs with one sbatch command. Each job within the array trains the network using a different set of parameters.
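
One way to give each job in the array its own set of parameters is to index into a grid using SLURM_ARRAY_TASK_ID, which Slurm sets for every job in an array. The sketch below uses a hypothetical search space of learning rates and batch sizes; replace it with the parameters of your own model.

```python
import itertools
import os

# Hypothetical search space, for illustration only.
LEARNING_RATES = [1e-2, 1e-3, 1e-4]
BATCH_SIZES = [32, 64]


def params_for_task(task_id):
    """Map a Slurm array index to one (learning_rate, batch_size) combination."""
    grid = list(itertools.product(LEARNING_RATES, BATCH_SIZES))
    if not 0 <= task_id < len(grid):
        raise IndexError(f"task id {task_id} is outside 0..{len(grid) - 1}")
    return grid[task_id]


def params_from_environment(env=None):
    """Read SLURM_ARRAY_TASK_ID and return the parameters for this job."""
    env = os.environ if env is None else env
    return params_for_task(int(env.get("SLURM_ARRAY_TASK_ID", "0")))
```

With six combinations in the grid, the Slurm script would contain #SBATCH --array=0-5 and the training script would call params_from_environment() once at start-up.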

 

Containers

Instead of installing TensorFlow through Anaconda, one can obtain it by downloading a container. The advantage of using containers is that they often include additional software, trained models and/or data. Try browsing Docker Hub, NVIDIA GPU Cloud and elsewhere for useful containers such as TensorRT. Below is an example of running TensorFlow via a container on TigerGPU:

$ cd /home/<YourNetID>/software  # or another directory
# consider using tag for latest version in line below
$ singularity pull docker://nvcr.io/nvidia/tensorflow:20.09-tf2-py3
$ cd /scratch/gpfs/<YourNetID>  # or /scratch/network on adroit
$ git clone https://github.com/PrincetonUniversity/slurm_mnist
$ cd slurm_mnist
$ singularity exec $HOME/software/tensorflow_20.09-tf2-py3.sif python3 download_mnist.py
$ sbatch job.slurm.ngc  # edit email address

See our Singularity page for more info as well as this post by NVIDIA.

 

Suppressing INFO Statements

Add these lines to the top of your Python script, before tensorflow is imported, to prevent INFO statements from appearing in the output:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

The default value is '0', which shows all messages. The maximum is '3', which suppresses even error messages.

 

TensorBoard

TensorBoard comes included in a Conda installation of TensorFlow. It can be used to view your graph, monitor training progress and more. It can be used on the head node of a cluster in non-intensive cases:

(tf2-gpu) [aturing@tigergpu ~]$ tensorboard --logdir /scratch/gpfs/<YourNetID>/myproj/log
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.1.0 at http://localhost:6006/ (Press CTRL+C to quit)

On your laptop or local machine, establish an ssh tunnel combined with port forwarding:

$ ssh -N -f -L 6006:127.0.0.1:6006 <YourNetID>@tigergpu.princeton.edu

Lastly, on your laptop point a web browser at the URL given when tensorboard was launched, for example: http://localhost:6006/

This procedure can be modified by launching tensorboard on the compute node where your job is running. If your job is running on node tiger-h12g10, for instance, then launch tensorboard with:

$ ssh tiger-h12g10
$ tensorboard --logdir /scratch/gpfs/<YourNetID>/myproj/log --bind_all

And set up the tunnel with this modification:

$ ssh -N -f -L 6006:tiger-h12g10:6006 <YourNetID>@tigergpu.princeton.edu

You will still point your web browser at http://localhost:6006/ assuming 6006 was the port you were assigned.

If you find that the above does not work then try specifying a port between 6006 and 6106 and repeat the procedure using that port, for example:

$ tensorboard --logdir /scratch/gpfs/<YourNetID>/myproj/log --bind_all --port 6042
$ ssh -N -f -L 6042:tiger-h12g10:6042 <YourNetID>@tigergpu.princeton.edu
http://localhost:6042/  # web browser url
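
Rather than guessing a port, you can ask the operating system which ports in that range are free on the machine where tensorboard will run. The sketch below is one way to do this with the standard library; the function name is illustrative, and a port that is free at the time of the check could in principle be taken before tensorboard starts.

```python
import socket


def find_free_port(start=6006, end=6106, host=""):
    """Return the first port in [start, end] that can be bound on this machine, or None."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))  # an empty host means all interfaces
            except OSError:
                continue  # port is in use; try the next one
            return port
    return None
```

The returned value would then be passed to tensorboard via --port and used on both sides of the ssh tunnel in place of 6042.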

When you are done using tensorboard, terminate the ssh tunnel on your local machine by running lsof -i tcp:6006 to get the PID and then kill -9 <PID> (e.g., kill -9 6010).

TensorBoard can be used for intensive cases on Tigressdata. Learn more about using graphics on the HPC clusters.

If you are working with TensorFlow in a Jupyter notebook then you can use the magic command: %tensorboard

 

Profiling

Profiling your TensorFlow script is the first step toward resolving performance bottlenecks and reducing your training time. See this guide for profiling a code that uses the Keras API. A more general procedure is described here.

One should also consider using line_profiler to profile specific functions in your script. NVIDIA Nsight Systems (nsys) can be used to produce a timeline of the execution of your code.

 

Using PyCharm on TigerGPU

This video shows how to launch PyCharm on a TigerGPU compute node and use its debugger on an actively running TensorFlow script.

While debugging you may benefit from using unbuffered output of print statements. This can be achieved with:

print(<message>, flush=True)

And in the Slurm script add the -u option:

python -u <myscript.py>

If you are debugging a TensorFlow script then it can be useful to generate the same numerical results for each run. One can achieve this by setting the seeds of two random number generators:

import numpy as np
import tensorflow as tf

tf.random.set_seed(1234)
np.random.seed(4242)

 

Inference with TensorRT

Some TigerGPU users are making use of TensorRT, an SDK for high-performance deep learning inference. You can either use the container from NVIDIA with Singularity or build from source as described below.

Below are build directions for TigerGPU:

#!/bin/bash
module purge
module load anaconda3/2020.11
conda create --name trt72-env python=3.7 -y
conda activate trt72-env

cd $HOME/software  # then obtain the software from nvidia website
tar zxf TensorRT-7.2.1.6.CentOS-7.6.x86_64-gnu.cuda-11.1.cudnn8.0.tar.gz
cd TensorRT-7.2.1.6

module load cudatoolkit/11.1 cudnn/cuda-11.0/8.0.2
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/$USER/software/TensorRT-7.2.1.6/lib

cd python; pip install tensorrt-7.2.1.6-cp37-none-linux_x86_64.whl
cd ../uff; pip install uff-0.6.9-py2.py3-none-any.whl 
cd ../graphsurgeon; pip install graphsurgeon-0.4.5-py2.py3-none-any.whl 
cd ../onnx_graphsurgeon; pip install onnx_graphsurgeon-0.2.6-py2.py3-none-any.whl

Then test the installation:

$ cd /home/$USER/software/TensorRT-7.2.1.6/data/mnist
$ python download_pgms.py
$ cd ../../samples/sampleMNIST
$ make

Then run the test on a compute node:

$ salloc -N 1 -n 1 -t 5 --gres=gpu:1
$ module purge
$ module load anaconda3/2020.11 cudatoolkit/11.1 cudnn/cuda-11.0/8.0.2
$ conda activate trt72-env
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/$USER/software/TensorRT-7.2.1.6/lib
$ cd /home/$USER/software/TensorRT-7.2.1.6/bin
$ ./sample_mnist
$ exit

For inference on CPUs, the Cascade Lake nodes on Della are capable of Intel Vector Neural Network Instructions (VNNI) (a.k.a. DL Boost), which accelerate INT8 arithmetic. The idea is to cast the FLOAT32 weights of your trained model to the INT8 data type.
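
To make the idea of casting FLOAT32 weights to INT8 concrete, the sketch below applies a generic symmetric linear quantization with a single scale factor to a list of weights. It is an illustration of the concept only and is not the procedure used by Intel's or NVIDIA's tools, which choose scale factors more carefully (for example per layer or per channel, using calibration data).

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one shared scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [int(round(w / scale)) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 representation."""
    return [q * scale for q in quantized]
```

Each recovered weight differs from the original by at most half of the scale factor, which is the precision given up in exchange for faster integer arithmetic.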

 

R and Julia

See this post by Danny Simpson on using TensorFlow 1.x with R. For version 2.x see TensorFlow for R. Julia programmers should look at TensorFlow.jl for version 1.x or move to Flux.

 

Building from Source

At times it may be necessary to build TensorFlow from source. The procedure below was found to work on Della:

# ssh della
$ module load anaconda3/2020.11
# anaconda3 provides us with pip six numpy wheel setuptools mock
$ pip install -U --user keras_applications --no-deps
$ pip install -U --user keras_preprocessing --no-deps

$ mkdir -p /tmp/bazel && cd /tmp/bazel
$ wget https://github.com/bazelbuild/bazel/releases/download/2.0.0/bazel-2.0.0-installer-linux-x86_64.sh
$ chmod +x bazel-2.0.0-installer-linux-x86_64.sh
$ ./bazel-2.0.0-installer-linux-x86_64.sh --prefix=/tmp/bazel
$ export PATH=/tmp/bazel/bin:$PATH

$ cd ~/sw
$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow
$ ./configure  # took defaults or answered no
$ module load rh/devtoolset/7
$ CC=`which gcc` BAZEL_LINKLIBS=-l%:libstdc++.a bazel build --verbose_failures //tensorflow/tools/pip_package:build_pip_package
$ ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ pip install --user /tmp/tensorflow_pkg/tensorflow-2.1.0-cp37-cp37m-linux_x86_64.whl
$ cd
$ python
>>> import tensorflow as tf

 

How to Learn TensorFlow

See the tutorials and guides. See examples on GitHub.

 

Where to Store Your Files

You should run your jobs out of /scratch/gpfs/<YourNetID> on the HPC clusters. These filesystems are very fast and provide vast amounts of storage. Do not run jobs out of /tigress or /projects. That is, you should never be writing the output of actively running jobs to those filesystems. /tigress and /projects are slow and should only be used for backing up the files that you produce on /scratch/gpfs. Your /home directory on all clusters is small and it should only be used for storing source code and executables.

The commands below give you an idea of how to properly run a TensorFlow job:

$ ssh <YourNetID>@tigergpu.princeton.edu
$ cd /scratch/gpfs/<YourNetID>
$ mkdir myjob && cd myjob
# put TensorFlow script and Slurm script in myjob
$ sbatch job.slurm

If the run produces data that you want to back up then copy or move it to /tigress:

$ cp -r /scratch/gpfs/<YourNetID>/myjob /tigress/<YourNetID>

For large transfers consider using rsync instead of cp. Most users only back up to /tigress every week or so. While /scratch/gpfs is not backed up, files are never removed. However, important results should be transferred to /tigress or /projects.

The diagram below gives an overview of the filesystems:

HPC clusters and the filesystems that are available to each. Users should write job output to /scratch/gpfs.

 

Getting Help

If you encounter any difficulties while running TensorFlow on one of our HPC clusters then please send an email to cses@princeton.edu or attend a help session.

 

Acknowledgements

Kyle Felker has made improvements to this page.