TensorFlow on the Research Computing Clusters




TensorFlow is a popular deep learning library for training artificial neural networks. The installation instructions depend on the version and cluster. This page covers version 2.x. If you are new to installing Python packages then see our Python page before continuing. Run the checkquota command before installing to see if you have sufficient space.

Della (GPU) and Adroit (GPU)

Below are installation directions for using Conda (preferred) and Pip. One could also use the latest NVIDIA container or build from source. XLA support and additional optimizations have been added to version 2.8.1 via Conda (see directions below).


The Conda installation requires 8 GB of space. If necessary, see the bottom of the checkquota page for tips on dealing with large Conda environments. Run the commands below to install a GPU version of TensorFlow:

$ ssh <YourNetID>@della-gpu.princeton.edu  # or adroit-vis
$ module load anaconda3/2023.9
$ conda create --name tf2-gpu tensorflow-gpu --channel conda-forge
$ conda activate tf2-gpu

Or to install the software with additional packages such as matplotlib, pandas and ipykernel to use with OnDemand (e.g., mydella):

$ module load anaconda3/2023.9
$ conda create --name tf2-gpu tensorflow-gpu matplotlib pandas ipykernel --channel conda-forge
$ conda activate tf2-gpu

If you encounter the following:

Collecting package metadata (current_repodata.json): done
Solving environment: unsuccessful attempt using repodata from current_repodata.json,
retrying with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: / Terminated

Then change the default solver to libmamba as explained here.

Be sure to perform the install on a login node that has a GPU such as della-gpu, adroit-vis, stellar-vis1 or stellar-vis2.

In all cases, be sure to include conda activate tf2-gpu and #SBATCH --gres=gpu:1 in your Slurm script as shown in the Example Job below.

If you need a version of TensorFlow that is newer than provided by the above approach then follow the pip directions below.


We recommend that users follow the Conda directions above to install TensorFlow because Conda takes care of the dependencies. However, if desired one may use pip. When using pip there are two steps to installing and running TensorFlow. First, carry out the installation which requires 2 GB of space:

$ ssh <YourNetID>@della-gpu.princeton.edu  # or adroit
$ module load anaconda3/2023.9
$ conda create --name tf2-gpu python=3.10
$ conda activate tf2-gpu
$ pip install tensorflow==2.14 --no-cache-dir  # or a newer version
$ conda deactivate

Second, you must load the following environment modules in your Slurm scripts and for interactive work:

# adroit
module load cudatoolkit/11.7
module load cudnn/cuda-11.5/8.3.2
# della
module load cudatoolkit/11.7
module load cudnn/cuda-11.x/8.2.0

If you fail to load the modules then you will see an error such as:

Could not load dynamic library 'libcudnn.so.8'

The two modules must also be loaded in your Slurm script.

A Note about XLA

In some cases TensorFlow will need to use a compiler suite (e.g., for XLA). You can make the compilers available by loading this module:

module load nvhpc/22.5


TensorFlow is available for the Power architecture with NVIDIA GPUs from an MIT Conda channel:

$ module load anaconda3/2023.3
$ CHNL="https://opence.mit.edu/#/"
$ conda create --name=tf2-gpu --channel ${CHNL} tensorflow
$ conda activate tf2-gpu

Be sure to include conda activate tf2-gpu and #SBATCH --gres=gpu:1 in your Slurm script.


TensorFlow is also available from two IBM channels. For the latest stable release use:

$ CHNL="https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda"
$ conda create --name=tf2-gpu --channel ${CHNL} tensorflow-gpu

For an early access release, use this channel:

$ CHNL="https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access"

Note that the latest version of TensorFlow provided by IBM tends to lag behind the MIT version. In all cases, you can enter the URL into your web browser to see what is available.

Tiger, Della (CPU), Stellar or Adroit (CPU)

TensorFlow 2.x for multicore CPUs can be installed as follows:

$ module load anaconda3/2023.3
$ conda create --name tf2-cpu tensorflow <package-2> <package-3> ... <package-N> --channel conda-forge
$ conda activate tf2-cpu

Be sure to include conda activate tf2-cpu in your Slurm script. See tips for running on CPUs.


Example Job

Test the installation of the GPU version of TensorFlow by running a short job. First, download the necessary data. The compute nodes do not have internet access so we do the download on the login node:

$ python -c "import tensorflow as tf; tf.keras.datasets.mnist.load_data()"

The above command will download mnist.npz into the directory ~/.keras/datasets. To run the example follow these commands:
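Because the compute nodes cannot reach the internet, it can help to confirm that the file is in place before submitting the job. The sketch below assumes the cache location named above (~/.keras/datasets/mnist.npz); the helper names are hypothetical and not part of the slurm_mnist repository.

```python
from pathlib import Path


def mnist_cache_path(home=None):
    """Return the expected location of mnist.npz under the Keras cache."""
    base = Path(home) if home is not None else Path.home()
    return base / ".keras" / "datasets" / "mnist.npz"


def mnist_is_cached(home=None):
    """Return True if mnist.npz exists and is not empty."""
    path = mnist_cache_path(home)
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    status = "found" if mnist_is_cached() else "missing"
    print(f"{mnist_cache_path()}: {status}")
```

If the file is reported as missing, rerun the download command on the login node before calling sbatch.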

$ git clone https://github.com/PrincetonUniversity/slurm_mnist.git
$ cd slurm_mnist

Use a text editor like vim or emacs to enter your email address in job.slurm or delete the three lines concerned with email. Then submit the job:

$ sbatch job.slurm

You can monitor the status of the job with:

$ squeue -u $USER

Once the job runs, you'll have a slurm-xxxxx.out file in the slurm_mnist directory. This log file contains both TensorFlow and Slurm output.

Here is the Slurm script (job.slurm):

#!/bin/bash
#SBATCH --job-name=tf2-test      # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G                 # total memory per node (default is 4 GB per CPU-core)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2023.3
conda activate tf2-gpu

python mnist_classify.py
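
A quick way to confirm that a job like the one above actually received a GPU is to compare what Slurm granted with what TensorFlow detects. This is a minimal sketch, assuming Slurm exports CUDA_VISIBLE_DEVICES for jobs that request --gres=gpu:1; both function names are hypothetical.

```python
import os


def gpus_granted_by_slurm(env=None):
    """Return the GPU indices listed in CUDA_VISIBLE_DEVICES (may be empty)."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "").strip()
    return [item.strip() for item in value.split(",") if item.strip()]


def gpus_seen_by_tensorflow():
    """Return the GPU devices TensorFlow can use (requires TensorFlow)."""
    # Imported here so that the helper above also works without TensorFlow.
    import tensorflow as tf
    return tf.config.list_physical_devices("GPU")
```

Printing both results near the top of a training script should show one entry each for a single-GPU job. An empty list from gpus_seen_by_tensorflow() usually points to a CPU-only installation or missing CUDA modules.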


Multithreaded Data Loading

Even when using a GPU there are still operations that are carried out on the CPU. Some of these operations have been written to take advantage of multithreading. Try different values of --cpus-per-task to see if you get a speed-up:

#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=<T>      # cpu-cores per task (>1 if multi-threaded tasks)

On TigerGPU, there are seven CPU-cores for every one GPU. Try doing a set of runs where you vary <T> from 1 to 7 to find the optimal value. If you are running on CPUs only then see this page.
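
Rather than hard-coding <T> a second time in the Python script, the value requested in the Slurm script can be read at run time. The sketch below relies on the SLURM_CPUS_PER_TASK environment variable, which Slurm sets when --cpus-per-task is given; the function name and the fallback of 1 are assumptions made for illustration.

```python
import os


def cpus_for_this_task(env=None, default=1):
    """Return the --cpus-per-task value Slurm exported, or `default` if unset."""
    env = os.environ if env is None else env
    raw = env.get("SLURM_CPUS_PER_TASK")
    if raw is None:
        return default
    try:
        return max(1, int(raw))
    except ValueError:
        # Fall back if the variable holds something unexpected.
        return default


num_threads = cpus_for_this_task()
print(f"Using {num_threads} CPU-core(s) for data preparation")
```

The returned value can be passed wherever the script needs a thread or worker count, so that changing <T> in the Slurm script is the only edit required between runs.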

Multithreading should give a substantial speed-up for large input pipelines via tf.data where the ETL takes place on the (multicore) CPU only. See this video for an introduction and this graphical illustration. You will likely want code that looks like this:

train_dataset = datasets['train'].map(preprocess_data, num_parallel_calls=tf.data.AUTOTUNE) \
                                 .cache() \
                                 .shuffle(SHUFFLE_SIZE) \
                                 .batch(BATCH_SIZE) \
                                 .prefetch(tf.data.AUTOTUNE)


GPU Utilization

To see how effectively your job is using the GPU, after submitting the job run the following command:

$ squeue -u $USER

The rightmost column labeled "NODELIST(REASON)" gives the name of the node where your job is running. Connect to this node:

$ ssh della-lXXgYY

In the command above, you must replace XX and YY with the actual values (e.g., ssh della-l03g12). Once on the compute node run watch -n 1 gpustat. This will show you a percentage value indicating how effectively your code is using the GPU. The memory allocated to the GPU is also available. TensorFlow by default takes all available GPU memory. You should measure the GPU utilization for your code to ensure that the GPU is being utilized effectively. Be sure to learn about tf.data (see section above) and try increasing the value of cpus-per-task to use multiple CPU-cores to prepare the data and keep the GPU busy.

Type Ctrl+C to exit the watch command. Type exit to leave the compute node and return to the head node.

You can use commands like "jobstats" and "gpudash" to monitor GPU utilization as explained on the GPU Computing page.

Common Mistakes

TensorFlow will run GPU-enabled operations on the GPU by default. However, if you request more than one GPU in your Slurm script then TensorFlow will use one GPU and ignore the others unless you explicitly make the appropriate changes to your TensorFlow script (see next section).


Distributed Training or Using Multiple GPUs

Most models can be trained in a reasonable amount of time using a single GPU. However, if you are effectively using the GPU as determined by the procedure above then you may consider running on multiple GPUs. In general this will lead to shorter training times but because more resources are required the queue time will increase. For any job submitted to the cluster you should choose the required resources (number of GPUs, number of CPU-cores, memory) that minimize the "time to finish" which is the time the job spends running on the compute nodes plus the time spent waiting in the queue. Do not assume that using all four GPUs on a node is the best choice, for instance.
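
The "time to finish" comparison is simple arithmetic, sketched below. The queue and run times are hypothetical numbers chosen for illustration only; they are not measurements from any cluster, and the function names are likewise made up.

```python
def time_to_finish(queue_hours, run_hours):
    """Time waiting in the queue plus time running on the compute nodes."""
    return queue_hours + run_hours


def best_request(options):
    """Return the GPU count with the smallest time to finish.

    `options` maps a GPU count to a (queue_hours, run_hours) pair.
    """
    return min(options, key=lambda gpus: time_to_finish(*options[gpus]))


# Hypothetical estimates: more GPUs run faster but wait longer in the queue.
options = {1: (0.5, 10.0), 2: (2.0, 5.5), 4: (12.0, 3.0)}

for gpus, (queue, run) in options.items():
    print(f"{gpus} GPU(s): {time_to_finish(queue, run):.1f} hours")
print("best choice:", best_request(options), "GPU(s)")
```

With these made-up numbers the two-GPU request finishes first (7.5 hours), even though the four-GPU run has the shortest run time.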

Models that are built using tf.keras can be made to run on multiple GPUs quite easily (see an example from a Princeton workshop). This is done by using a data parallel approach where a copy of the model is assigned to each GPU and each copy operates on a different mini-batch. Using multiple GPUs is also easy for models defined through tf.estimator. TensorFlow offers ways to use multiple GPUs with the subclassing API as well (see tf.distribute).
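
The data parallel approach for a tf.keras model might look like the sketch below, which uses tf.distribute.MirroredStrategy. The small MNIST-style model and the per-GPU batch size of 64 are assumptions for illustration, not the code from the workshop example.

```python
def global_batch_size(per_replica_batch_size, num_replicas):
    """Each model copy gets its own mini-batch, so the total scales with copies."""
    return per_replica_batch_size * num_replicas


def build_mirrored_model(per_replica_batch_size=64):
    """Build a small Keras model replicated on every GPU visible to the job."""
    import tensorflow as tf  # imported here so the helper above needs no TensorFlow

    strategy = tf.distribute.MirroredStrategy()
    batch_size = global_batch_size(per_replica_batch_size,
                                   strategy.num_replicas_in_sync)

    # Variables created inside the scope are mirrored across the GPUs.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(28, 28)),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )
    return model, batch_size
```

The returned batch size is the one to pass to the input pipeline, since Keras splits each global batch across the model copies. Remember to raise the --gres=gpu count in the Slurm script to match.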

TensorFlow offers an approach for using multiple GPUs on multiple nodes. Horovod can also be used.

For hyperparameter tuning consider using a job array. This will allow you to run multiple jobs with one sbatch command. Each job within the array trains the network using a different set of parameters.
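
Inside the training script, each array job can pick its own parameters from the SLURM_ARRAY_TASK_ID environment variable that Slurm sets for array jobs. The grid below is hypothetical, as is the function name; this is a sketch of the idea rather than a recommended search strategy.

```python
import itertools
import os


def hyperparameters_for_task(task_id, grid):
    """Map an array index to one combination of values from the grid."""
    names = sorted(grid)
    combos = list(itertools.product(*(grid[name] for name in names)))
    if not 0 <= task_id < len(combos):
        raise IndexError(f"task_id {task_id} is outside 0..{len(combos) - 1}")
    return dict(zip(names, combos[task_id]))


# Hypothetical grid: 2 learning rates x 3 batch sizes = 6 combinations.
GRID = {"learning_rate": [1e-2, 1e-3], "batch_size": [32, 64, 128]}

# Defaults to 0 so the script can also be tested outside of a job array.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
params = hyperparameters_for_task(task_id, GRID)
print(f"task {task_id}: {params}")
```

With this grid the Slurm script would include #SBATCH --array=0-5 so that each of the six combinations is trained by one job.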



Containers

Instead of installing TensorFlow through Anaconda, one can obtain it by downloading a container. The advantage of using containers is that they often include performance optimizations, additional software, trained models and/or data. Try browsing NVIDIA GPU Cloud for useful containers such as TensorFlow and TensorRT. Below is an example of running TensorFlow via a container on TigerGPU:

$ cd /home/<YourNetID>/software  # or another directory
# consider using tag for latest version in line below
$ singularity pull docker://nvcr.io/nvidia/tensorflow:22.09-tf2-py3
$ cd /scratch/gpfs/<YourNetID>  # or /scratch/network on adroit
$ git clone https://github.com/PrincetonUniversity/slurm_mnist
$ cd slurm_mnist
$ singularity exec $HOME/software/tensorflow_22.09-tf2-py3.sif python3 download_mnist.py
$ sbatch job.slurm.ngc  # edit email address

See our Singularity page for more info as well as this post by NVIDIA. Note that the information about using multiple data loaders described above still applies here.


Suppressing INFO Statements

Add these lines to the top of your Python script to prevent INFO statements from appearing in the output:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

The default value is '0', which shows all messages. The maximum is '3', which suppresses even error messages.
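
The four levels can be summarized in a small helper. This is a sketch with a hypothetical function name; the one requirement it illustrates is that the variable must be set before TensorFlow is imported.

```python
import os

# Meaning of each TF_CPP_MIN_LOG_LEVEL value.
LOG_LEVELS = {
    "0": "show all messages (default)",
    "1": "hide INFO",
    "2": "hide INFO and WARNING",
    "3": "hide INFO, WARNING and ERROR",
}


def set_tf_log_level(level):
    """Set TF_CPP_MIN_LOG_LEVEL and return a description of its effect."""
    level = str(level)
    if level not in LOG_LEVELS:
        raise ValueError(f"level must be one of {sorted(LOG_LEVELS)}")
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = level
    return LOG_LEVELS[level]


# Call this before "import tensorflow" for the setting to take effect.
print(set_tf_log_level(1))
```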



TensorBoard

TensorBoard comes included in a Conda installation of TensorFlow. It can be used to view your graph, monitor training progress and more. The example below shows "aturing" as the user.

Step 1: Submit the job:

aturing@della-gpu:~$ sbatch job.slurm

When the job is running (not waiting in the queue), find the node where it is running:

aturing@della-gpu:~$ squeue --me
49227561   gputest   tf2-test  aturing  R   0:39    1     della-l09g6

Step 2: SSH to the compute node where the job is running and load the environment:

aturing@della-gpu:~$ ssh della-l09g6  # replace della-l09g6 with your node (see above)
aturing@della-l09g6:~$ module load anaconda3/2023
aturing@della-l09g6:~$ conda activate tf2-gpu

Step 3: Launch TensorBoard (specify a port around 9100 and do not use anything around 6006):

(tf2-gpu) aturing@della-l09g6:~$ tensorboard --logdir /scratch/gpfs/$USER --bind_all --port 9100

On Adroit, replace /scratch/gpfs with /scratch/network.

Step 4: Open a second terminal on your local machine (e.g., laptop). Run the command below in the second terminal (using the correct port) to establish an SSH tunnel combined with port forwarding:

$ ssh -N -f -L 9100:della-l09g6:9100 <YourNetID>@della-gpu.princeton.edu

Step 5: Lastly, on your laptop point a web browser at the URL below, using the correct port (replace 9100 if you used a different port):

http://localhost:9100/


If you find that the above does not work then try specifying a port between 9100 and 9200 and repeat the procedure using that port. (You can also try our get_free_port command.)

When you are done using TensorBoard, terminate the SSH tunnel on your local machine by running lsof -i tcp:9100 to get the PID and then kill -9 <PID> (e.g., kill -9 7210).

If you are working with TensorFlow in a Jupyter notebook then you can use the magic command: %tensorboard

You can save time when making the tunnel by using a shell function.
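
The same shortcut can be sketched in Python for those who prefer it to a shell function. The helper below only assembles the tunnel command from Step 4 and the matching local URL; the function names are hypothetical and the default login host is taken from the example above.

```python
import shlex


def tunnel_command(node, port, netid, login_host="della-gpu.princeton.edu"):
    """Return the ssh command (as a list) that forwards `port` to `node`."""
    if not 1024 <= port <= 65535:
        raise ValueError("port must be between 1024 and 65535")
    return ["ssh", "-N", "-f", "-L", f"{port}:{node}:{port}",
            f"{netid}@{login_host}"]


def tensorboard_url(port):
    """Return the URL to open in a local browser once the tunnel is up."""
    return f"http://localhost:{port}/"


cmd = tunnel_command("della-l09g6", 9100, "aturing")
print(shlex.join(cmd))
print(tensorboard_url(9100))
```

The list form can be passed to subprocess.run on the local machine, or the printed line can be pasted into a terminal.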



Profiling

Profiling your TensorFlow script is the first step toward resolving performance bottlenecks and reducing your training time.


line_profiler

An excellent starting point for profiling any Python script is line_profiler. This will provide high-level profiling data such as the amount of time spent on each line of your script.
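
A typical line_profiler workflow is to decorate the functions of interest with @profile and run the script through kernprof. The sketch below uses a toy function in place of real training code; the fallback decorator is an assumption added so that the script still runs with plain python.

```python
# kernprof injects the name `profile` when it runs the script. Define a
# no-op fallback so the same file also runs without kernprof.
try:
    profile
except NameError:
    def profile(func):
        return func


@profile
def normalize(values):
    """Toy stand-in for a function whose lines you want timed."""
    total = sum(values)
    return [v / total for v in values]


if __name__ == "__main__":
    print(normalize([1, 2, 5]))
```

Assuming the file is saved as myscript.py (a hypothetical name), running kernprof -l -v myscript.py prints the time spent on each line of the decorated function.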

TF Profiler

See this guide for profiling a code that uses the Keras API. A more general procedure is described here.

nsys and ncu

NVIDIA Nsight Systems (nsys) can be used to produce a timeline of the execution of your code.


Performance Tuning

See this presentation from NVIDIA GTC 2021 for performance tips.


Using PyCharm on TigerGPU

This video shows how to launch PyCharm on a TigerGPU compute node and use its debugger on an actively running TensorFlow script.

While debugging you may benefit from using unbuffered output of print statements. This can be achieved with:

print(<message>, flush=True)

And in the Slurm script add the -u option:

python -u <myscript.py>

If you are debugging a TensorFlow script then it can be useful to generate the same numerical results for each run. One can achieve this by setting the seeds of two random number generators.
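
The text above does not name the two generators. The sketch below assumes they are NumPy's and TensorFlow's, and it also seeds Python's built-in random module. The function name and the seed value are arbitrary choices.

```python
import random


def set_seeds(seed=42):
    """Seed Python's generator and, when installed, NumPy's and TensorFlow's."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed in this environment
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass  # TensorFlow is not installed in this environment


# Call once at the top of the script, before building the model or datasets.
set_seeds(42)
```

Some GPU operations may still produce small run-to-run differences, so treat seeding as a debugging aid rather than a guarantee of identical results.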



Inference with TensorRT

Some TigerGPU users are making use of TensorRT, an SDK for high-performance deep learning inference. You can either use the container from NVIDIA with Singularity or build from source as described below.

Below are build directions which can be used as a guide:

module purge
module load anaconda3/2023.3
conda create --name trt72-env python=3.7 -y
conda activate trt72-env

cd $HOME/software  # then obtain the software from nvidia website
tar zxf TensorRT-
cd TensorRT-

module load cudatoolkit/11.3 cudnn/cuda-11.x/8.2.0
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/$USER/software/TensorRT-

cd python; pip install tensorrt-
cd ../uff; pip install uff-0.6.9-py2.py3-none-any.whl 
cd ../graphsurgeon; pip install graphsurgeon-0.4.5-py2.py3-none-any.whl 
cd ../onnx_graphsurgeon; pip install onnx_graphsurgeon-0.2.6-py2.py3-none-any.whl

Then test the installation:

$ cd /home/$USER/software/TensorRT-
$ python download_pgms.py
$ cd ../../samples/sampleMNIST
$ make

Then run the test on a compute node:

$ salloc -N 1 -n 1 -t 5 --gres=gpu:1
$ module purge
$ module load anaconda3/2023.3 cudatoolkit/11.3 cudnn/cuda-11.x/8.2.0
$ conda activate trt72-env
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/$USER/software/TensorRT-
$ cd /home/$USER/software/TensorRT-
$ ./sample_mnist
$ exit

The Cascade Lake nodes on Della are capable of Intel Vector Neural Net Instructions (VNNI) (a.k.a. DL Boost). The idea is to cast the FLOAT32 weights of your trained model to the INT8 data type.
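
The idea of casting FLOAT32 weights to INT8 can be illustrated with a simplified symmetric quantization in plain Python. This is only a sketch of the arithmetic under the assumption of a single scale per tensor; it is not the procedure used by any particular Intel or TensorFlow tool, and the example weights are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


# Made-up weights for illustration.
weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("int8 values:", q)
print("largest error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each restored value differs from the original by at most half of the scale, which is the precision given up in exchange for faster integer arithmetic.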


R and Julia

See this post by Danny Simpson on using TensorFlow 1.x with R. For version 2.x see TensorFlow for R. Julia programmers should look at TensorFlow.jl for version 1.x or move to Flux.


Building from Source

At times it may be necessary to build TensorFlow from source. The procedure below can be used as a guide on, for example, Della-GPU:

$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow
$ git checkout r2.4
$ module load anaconda3/2023.3
$ conda create --name tfsrc python=3.8 -y
$ conda activate tfsrc
$ pip install numpy==1.19 wheel
$ pip install keras_preprocessing --no-deps
$ ./configure

Please specify the location of python. [Default is /scratch/gpfs/aturing/CONDA/envs/tfsrc/bin/python3]: 
Please input the desired Python library path to use.  Default is [/scratch/gpfs/aturing/CONDA/envs/tfsrc/lib/python3.8/site-packages]
Do you wish to build TensorFlow with ROCm support? [y/N]: N
Do you wish to build TensorFlow with CUDA support? [y/N]: y
Do you wish to build TensorFlow with TensorRT support? [y/N]: N
Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 10]: 11.1
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]: 8.2.0
Please specify the locally installed NCCL version you want to use. [Leave empty to use http://github.com/nvidia/nccl]:
Please specify the comma-separated list of base paths to look for CUDA libraries and headers. [Leave empty to use the default]: 
Please specify a list of comma-separated CUDA compute capabilities you want to build with [...]: 8.0
Do you want to use clang as CUDA compiler? [y/N]: N
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -Wno-sign-compare]: 
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: N

$ export TMP=/tmp
$ bazel build --config=cuda //tensorflow/tools/pip_package:build_pip_package
$ ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ pip install /tmp/tensorflow_pkg/tensorflow-2.4.2-cp38-cp38-linux_x86_64.whl
$ conda deactivate

# clean the cache
$ chmod -R 744 ~/.cache/bazel
$ rm -rf ~/.cache/bazel

The procedure below was found to work on Della (CPU) in 2020:

# ssh della
$ module load anaconda3/2023.3
# anaconda3 provides us with pip six numpy wheel setuptools mock
$ pip install -U --user keras_applications --no-deps
$ pip install -U --user keras_preprocessing --no-deps

$ mkdir -p /tmp/bazel && cd /tmp/bazel
$ wget https://github.com/bazelbuild/bazel/releases/download/2.0.0/bazel-2.0.0-installer-linux-x86_64.sh
$ chmod +x bazel-2.0.0-installer-linux-x86_64.sh
$ ./bazel-2.0.0-installer-linux-x86_64.sh --prefix=/tmp/bazel
$ export PATH=/tmp/bazel/bin:$PATH

$ cd ~/sw
$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow
$ ./configure  # took defaults or answered no
$ module load rh/devtoolset/7
$ CC=`which gcc` BAZEL_LINKLIBS=-l%:libstdc++.a bazel build --verbose_failures //tensorflow/tools/pip_package:build_pip_package
$ ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ pip install --user /tmp/tensorflow_pkg/tensorflow-2.1.0-cp37-cp37m-linux_x86_64.whl
$ cd
$ python
>>> import tensorflow as tf


How to Learn TensorFlow

See the tutorials and guides. See examples on GitHub.


Getting Help

If you encounter any difficulties while running TensorFlow on one of our HPC clusters then please send an email to [email protected] or attend a help session.



Kyle Felker and Naser Mahfouz have made improvements to this page.