Python on the HPC Clusters


This guide presents an overview of installing Python packages and running Python scripts on the HPC clusters. Angle brackets < > denote command-line values that you should replace with a value specific to your work. Commands preceded by the $ character are to be run on the command line.

Quick Start

If you don't want to spend the time to read this entire page (not recommended) then try the following procedure to install your package(s) (below we assume Python 3):

$ module load anaconda3/2020.11
$ conda create --name myenv <package-1> <package-2> ... <package-N>
$ conda activate myenv

Each package and its dependencies will be installed locally in ~/.conda. Consider replacing myenv with an environment name that is more specific to your work. On the command line, use conda deactivate to leave the active environment and return to the base environment. Below is a sample Slurm script (job.slurm):

#!/bin/bash
#SBATCH --job-name=py-job        # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=all          # send email when job begins, ends and fails
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.11
conda activate myenv

python myscript.py

If the installation was successful then your job can be submitted to the cluster with sbatch job.slurm. If the installation failed and packages were downloaded then you should remove those packages before proceeding (see contents of ~/.conda). If for some reason you are trying to install a Python 2 package then use module load anaconda/<version> instead of anaconda3/<version> in the directions above. Note that Python 2 has been unsupported since January 1, 2020.

See step-by-step directions for uploading files and running a Python script on Adroit.

 

Introduction

When you first log in to one of the clusters, the system Python is available, but it is almost always not what you want. To see the system Python, run these commands:

$ python --version
Python 2.7.5

$ which python
/usr/bin/python

$ python3 --version
Python 3.6.8

$ which python3
/usr/bin/python3

We see that python corresponds to version 2, and that both python and python3 are installed in a system directory.

On the Princeton HPC clusters we offer the Anaconda Python distribution as a replacement for the system Python. In addition to Python's vast standard library, Anaconda provides hundreds of additional packages that are ideal for scientific computing. In fact, many of these packages are optimized for our hardware. To make Anaconda Python available, run the following command:

$ module load anaconda3/2020.11

Let's inspect our newly loaded Python by using the same commands as above:

$ python --version
Python 3.8.5

$ which python
/usr/licensed/anaconda3/2020.11/bin/python

$ python3 --version
Python 3.8.5

$ which python3
/usr/licensed/anaconda3/2020.11/bin/python3

We now have an updated version of Python and related tools. The new python and python3 commands are identical; both are symbolic links to python3.8. To see all the pre-installed Anaconda packages and their versions, use the conda list command:

$ conda list
# packages in environment at /usr/licensed/anaconda3/2020.11:
#
# Name                    Version                   Build  Channel
_ipyw_jlab_nb_ext_conf    0.1.0                    py38_0  
_libgcc_mutex             0.1                        main  
alabaster                 0.7.12                     py_0  
anaconda                  2020.11                  py38_0  
anaconda-client           1.7.2                    py38_0  
anaconda-navigator        1.9.12                   py38_0  
anaconda-project          0.8.4                      py_0  
argh                      0.26.2                   py38_0  
asn1crypto                1.3.0                    py38_0  
astroid                   2.4.2                    py38_0  
...

There are 316 packages pre-installed and ready to be used with a simple import statement. If the packages you need are on the list or are found in the Python standard library then you can begin your work. Otherwise, keep reading to learn how to install packages.
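To confirm that a pre-installed package is usable, import it and inspect its version and location. For example, with numpy, which ships with the Anaconda distribution:

```python
# Confirm that a pre-installed Anaconda package is available by importing it.
import numpy as np

print(np.__version__)  # version of the pre-installed package
print(np.__file__)     # the path shows which installation provides it
```

On the clusters, the printed path should point into the system Anaconda tree (or into your own environment, once you create one).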

The Anaconda Python distribution is a system library. This means that you can use any of its packages but you cannot make any modifications to them (such as an upgrade) and you cannot install new ones in their location. You can, however, install whatever packages you want in your home directory. This allows you to utilize both the pre-installed Anaconda packages and the new ones that you install yourself. The two most popular package managers for installing Python packages are conda and pip.
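From within Python you can see both locations at once: the read-only system installation and the per-user directory where pip install --user places packages. This is a sketch; the exact paths vary by cluster:

```python
# Inspect the system installation prefix (read-only on the clusters) and the
# per-user site-packages directory used by `pip install --user`.
import site
import sys

print(sys.prefix)                  # e.g. /usr/licensed/anaconda3/<version>
print(site.getusersitepackages())  # e.g. ~/.local/lib/python3.8/site-packages
```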

 

checkquota

Python packages can require many gigabytes of storage. By default they are installed in your /home directory, which typically has a quota of around 10-20 GB. Be sure to run the checkquota command before installing.

 

Package and Environment Managers

conda

Unlike pip, conda is both a package manager and an environment manager. It is also language-agnostic, which means that in addition to Python packages it can manage, for example, R and Fortran packages. Conda looks to the main channel of Anaconda Cloud to handle installation requests, but there are numerous other channels that can be searched, such as bioconda, intel, r and conda-forge. Conda always installs pre-built binary files. The software it provides often has performance advantages over other managers due to leveraging, for instance, the Intel Math Kernel Library (MKL). Below is a typical session where an environment is created and one or more packages are installed into it:

$ module load anaconda3/2020.11
$ conda create --name myenv <package-1> <package-2> ... <package-N>
$ conda activate myenv

Note that you should specify all the packages that you need in one line so that the dependencies can be satisfied simultaneously, although installing additional packages into the environment at a later time is possible. To exit a conda environment, run conda deactivate. If you try to install directly into the base environment using conda install <package>, it will fail with: EnvironmentNotWritableError: The current user does not have write permissions to the target environment. The solution is to create an environment and do the install in the same command (as shown above).

Common conda commands

View the help menu:

$ conda -h

To view the help menu for the install command:

$ conda install --help

Search the conda-forge channel for the fenics package:

$ conda search fenics --channel conda-forge

List all the installed packages for the present environment (consider adding --explicit):

$ conda list

Create the myenv environment and install pairtools into that environment:

$ conda create --name myenv pairtools

Create an environment called myenv and install Python version 3.6 and beaver:

$ conda create --name myenv python=3.6 beaver

Create an environment called biowork-env and install blast from the bioconda channel:

$ conda create --name biowork-env --channel bioconda blast

Install the pandas package into an environment that was previously created:

$ conda activate biowork-env
(biowork-env)$ conda install pandas

List the available environments:

$ conda env list

Remove the bigdata-env environment:

$ conda remove --name bigdata-env --all

Much more can be done with conda as a package manager or environment manager.

 

pip

pip stands for "pip installs packages". It is a package manager for Python packages only. pip installs packages that are hosted on the Python Package Index or PyPI.

You will typically want to use pip within a Conda environment after installing packages via conda to get packages that are not available on Anaconda Cloud. For example:

$ module load anaconda3/2020.11
$ conda create --name sklearn-env scikit-learn pandas matplotlib
$ pip install multiregex

You should avoid installing conda packages after doing pip installs within a Conda environment.

Do not use the pip3 command even if the directions you are following tell you to do so (use pip instead). pip will search for a pre-compiled version of the package you want, called a wheel. If it fails to find one for your platform then it will attempt to build the package from source. It can take pip several minutes to build a large package from source. One often needs to load various environment modules in addition to anaconda3 before doing a pip install. For instance, if your package uses GPUs then you will probably need to do module load cudatoolkit, or if it uses the Message Passing Interface (MPI) for parallelization then module load openmpi. To see all available software modules, run module avail.

Common pip commands

View the help menu:

$ pip -h

The help menu for the install command:

$ pip install --help

Search the Python Package Index PyPI for a given package (e.g., jax); note that pip search may fail because PyPI has disabled its search API, in which case search on pypi.org in a web browser instead:

$ pip search jax

List all installed packages:

$ pip list

Install pairtools and pyblast (note that a Python version cannot be installed with pip; use conda for that):

$ pip install pairtools pyblast

Install a set of packages listed in a text file

$ pip install -r requirements.txt

To see detailed information about an installed package such as sphinx:

$ pip show sphinx

Upgrade the sphinx package:

$ pip install --upgrade sphinx

Uninstall the pairtools package:

$ pip uninstall pairtools

See the pip documentation for more.

 

Isolated Python Environments with virtualenv

Oftentimes you will want to create isolated Python environments. This is useful, for instance, when you have two packages that require different versions of a third package. Using environments saves you the trouble of repeatedly upgrading or downgrading the third package. We recommend using virtualenv to create isolated Python environments. To get started, virtualenv must first be installed:

$ module load anaconda3/2020.11
$ pip install --user virtualenv

Note that like pip, virtualenv is an executable, not a library. To create an isolated environment do:

$ mkdir myenv
$ virtualenv myenv
$ source myenv/bin/activate

Consider replacing myenv with a more suitable name for your work. Now you can install Python packages in isolation from other Python environments:

$ pip install slingshot bell
$ deactivate

Note that the --user option is omitted since the packages will be installed locally in the virtual environment. To leave the environment, run deactivate on the command line.

Make sure you source the environment in your Slurm script as in this example:

#!/bin/bash
#SBATCH --job-name=py-job        # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=all          # send email when job begins, ends and fails
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.11
source </path/to>/myenv/bin/activate

python myscript.py

As an alternative to virtualenv, you may consider using the built-in Python 3 module venv. pip in combination with virtualenv serve as powerful package and environment managers. There are also combined managers such as pipenv and pyenv that you may consider.

 

pip vs. conda

If your package exists on PyPI and Anaconda Cloud then how do you decide which to install from? You should almost always favor conda over pip. This is because conda packages are pre-compiled and their dependencies are automatically handled. While pip installs will often download a binary wheel (pre-compiled), the user frequently needs to take action to satisfy the dependencies. Furthermore, many scientific conda packages are linked against the Intel Math Kernel Library which leads to improved performance over pip installs on our systems. One disadvantage of conda packages is that they tend to lag behind pip packages in terms of versioning. In many cases, the decision of conda versus pip will be answered by reading the installation instructions for the software you would like to use. Write to cses@princeton.edu for a recommendation on the installation procedure or if you encounter problems while trying to run your Python script.

 

Installing Python Packages from Source

In some cases you will be provided with the source code for your package. To install from source do:

$ python setup.py install --prefix=</path/to/install/location>

For help menu use python setup.py --help-commands. Be sure to update the appropriate environment variables in your ~/.bashrc file:

export PATH=</path/to/install/location>/bin:$PATH
export PYTHONPATH=</path/to/install/location>/lib/python<version>/site-packages:$PYTHONPATH
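After updating PYTHONPATH, you can check which copy of a package Python will actually import. The snippet below uses the standard-library json module only as a placeholder for your own package name:

```python
# Report the file that a given module would be imported from. Substitute the
# name of your package for "json", which is used here only as a placeholder.
import importlib.util

spec = importlib.util.find_spec("json")
print(spec.origin)  # filesystem path of the module that would be imported
```

If the printed path is not the install location you intended, revisit the PYTHONPATH setting in your ~/.bashrc.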

 

Packaging and Distributing Your Own Python Package

Both PyPI and Anaconda allow registered users to store their packages on their platforms. You must follow the platform's instructions for doing so, but once done anyone can install your package with pip install or conda install. This makes it very easy to enable someone else to use your research software. See this guide for practical examples of the process.

 

Where to Store Your Files

You should run your jobs out of /scratch/gpfs/<YourNetID> on the HPC clusters. These filesystems are very fast and provide vast amounts of storage. Do not run jobs out of /tigress or /projects. That is, you should never be writing the output of actively running jobs to those filesystems. /tigress and /projects are slow and should only be used for backing up the files that you produce on /scratch/gpfs. Your /home directory on all clusters is small and it should only be used for storing source code and executables.

The commands below give you an idea of how to properly run a Python job:

$ ssh <YourNetID>@della.princeton.edu
$ cd /scratch/gpfs/<YourNetID>
$ mkdir myjob && cd myjob
# put Python script and Slurm script in myjob
$ sbatch job.slurm

If the run produces data that you want to backup then copy or move it to /tigress or /projects, for example:

$ cp -r /scratch/gpfs/<YourNetID>/myjob /tigress/<YourNetID>

For large transfers consider using rsync instead of cp. Most users only back up to /tigress every week or so. While /scratch/gpfs is not backed up, files there are never removed. Even so, important results should be transferred to /tigress or /projects.

The diagram below gives an overview of the filesystems:

HPC clusters and the filesystems that are available to each. Users should write job output to /scratch/gpfs.

 

 

Jupyter Notebooks on the HPC Clusters

Please see our page for Jupyter on the HPC Clusters, which also covers OnDemand Jupyter.

Multiprocessing

The multiprocessing module enables single-node parallelism for Python scripts. The script below uses multiprocessing to perform an embarrassingly parallel mapping over a short list:

import os
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    num_cores = int(os.getenv('SLURM_CPUS_PER_TASK'))
    with Pool(num_cores) as p:
        print(p.map(f, [1, 2, 3, 4, 5, 6, 7, 8]))

The script above can also be used to parallelize a for loop. Below is an appropriate Slurm script for this code:

#!/bin/bash
#SBATCH --job-name=multipro      # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=4        # number of processes
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)

module purge
module load anaconda3/2020.11

srun python myscript.py

The output of the Python script is:

[1, 4, 9, 16, 25, 36, 49, 64]

The Python script extracts the number of cores from the Slurm environment variable. This eliminates the potential problems that could arise if the two values were set independently in the Slurm script and Python script.
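A slightly more defensive variant of the script supplies a default of one core when SLURM_CPUS_PER_TASK is unset, so the same code also runs outside of Slurm (e.g., during interactive testing). This is a sketch, not a required change:

```python
import os
from multiprocessing import Pool

def f(x):
    return x * x

if __name__ == '__main__':
    # Fall back to a single core when the Slurm variable is not defined.
    num_cores = int(os.getenv('SLURM_CPUS_PER_TASK', '1'))
    with Pool(num_cores) as p:
        print(p.map(f, [1, 2, 3, 4, 5, 6, 7, 8]))
```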

Oftentimes the best way to carry out a large number of independent Python jobs is to use a Slurm job array rather than the multiprocessing module.

 

Debugging Python

Learn more about debugging Python code on the Princeton HPC clusters.

This video explains how to run the PyCharm debugger on a TigerGPU node. The same procedure can be used for the other clusters. PyCharm for Linux is available on jetbrains.com. While the video uses the Community Edition, you can get the professional edition for free by supplying your "dot edu" email address.

 

Profiling Python

The most highly recommended tool for profiling Python is line_profiler, which makes it easy to see how much time is spent on each line within a function as well as the number of calls.
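A minimal line_profiler workflow looks like the sketch below (assuming the package has been installed, e.g., with pip install line_profiler). The profile decorator is injected by the kernprof driver at run time, so it is guarded here to keep the script runnable on its own:

```python
# myscript.py -- decorate the function to be profiled line by line.
try:
    profile  # defined by kernprof when run under the profiler
except NameError:
    def profile(func):  # no-op fallback for normal runs
        return func

@profile
def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    print(slow_sum(10000))
```

Run it with kernprof -l -v myscript.py to print the per-line timings of slow_sum.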

The built-in cProfile module provides a simple way to profile your code:

python -m cProfile -s tottime myscript.py

However, many users find cProfile's output, which includes every function call, overwhelming; a line-level profiler such as line_profiler is often easier to interpret.
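To limit cProfile's output to the functions that matter, you can run it programmatically and print only the top entries with the built-in pstats module, for example:

```python
# Profile a function with cProfile and print only the five most expensive
# entries, sorted by total time, using pstats.
import cProfile
import io
import pstats

def work():
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('tottime').print_stats(5)
print(stream.getvalue())
```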

PyCharm can be used for profiling. By default it uses cProfile. If you are working with multithreaded code then you should install and use yappi.

Within Jupyter notebooks one may use %time and %timeit for doing measurements.

Arm MAP may be used to profile some Python scripts that call compiled code. See our MAP guide for specific instructions.

 

Building Python from Source

The procedure below shows how to build Python from source:

$ cd $HOME/software  # or another location
$ wget https://www.python.org/ftp/python/3.8.5/Python-3.8.5.tgz
$ tar zxf Python-3.8.5.tgz
$ cd Python-3.8.5
$ module load rh/devtoolset/8
$ ./configure --help
$ ./configure --enable-optimizations --prefix=$HOME/software/python385
$ make -j 10
$ make test  # some tests fail
$ make install
$ cd python385/bin
$ ./python3

 

Common Package Installation Examples

FEniCS

FEniCS is an open-source computing platform for solving partial differential equations. To install:

$ module load anaconda3/2020.11
$ conda create --name fenics-env -c conda-forge fenics
$ conda activate fenics-env

Make sure you include conda activate fenics-env in your Slurm script. For better performance one may consider installing from source.

CuPy on Traverse

CuPy is available via Anaconda Cloud on all our clusters. For Traverse use the IBM WML channel:

$ module load anaconda3/2020.11
$ CHNL="https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda"
$ conda create --name cupy-env --channel ${CHNL} cupy

Be sure to include module load anaconda3/2020.11 in your Slurm script.

JAX

JAX is Autograd and XLA, brought together for high-performance machine learning research. See the Intro to ML Libraries repo for build directions.

PyStan

Here are the directions for installing PyStan:

$ module load anaconda3/2020.11
$ conda create --name stan-env pystan
$ conda activate stan-env

To compile models, your Slurm script will need to include the rh module, which provides a newer compiler suite:

#!/bin/bash
#SBATCH --job-name=myjob         # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --time=01:00:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=all          # send email when job begins, ends and fails
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load rh/devtoolset/8
module load anaconda3/2020.11
conda activate stan-env

python myscript.py

Try varying the value of cpus-per-task to see if you get a speed-up. Note that the more resources you request, the longer the queue time.

Deeplabcut

$ ssh -Y <YourNetID>@tigergpu.princeton.edu
$ module load hdf5/gcc/1.8.16 anaconda3/2019.10 cudatoolkit/9.2 cudnn/cuda-9.2/7.3.1
$ export HDF5_DIR=/usr/local/hdf5/gcc/1.8.16
$ conda create --name deeplabcut-env python=3.6 wxPython=4.0.3
$ conda activate deeplabcut-env
$ pip install deeplabcut tensorflow-gpu==1.8

Note that some warnings will be produced when deeplabcut is imported in Python. You will need to add the “export HDF5_DIR …” command to your ~/.bashrc file. By using ssh -Y, the error Cannot load backend 'TkAgg' will not occur.

Lenstools

$ module load anaconda3/2020.11
$ conda create --name lenstools-env numpy scipy pandas matplotlib astropy
$ conda activate lenstools-env
$ module load rh/devtoolset/8 openmpi/gcc/3.1.5/64 gsl/2.4 
$ export MPICC=$(which mpicc)
$ pip install mpi4py
$ pip install emcee==2.2.1
$ pip install lenstools

Note that you will receive warnings when lenstools is imported in Python.

SMC++

SMC++ infers population history from whole-genome sequence data. In this case pip is used to avoid a glibc conflict.

$ module load anaconda3/2020.11
$ pip install --user virtualenv
$ mkdir myenv
$ virtualenv myenv
$ source myenv/bin/activate
$ pip install cython numpy
$ pip install git+https://github.com/popgenmethods/smcpp
$ smc++ --help

Dedalus

Dedalus can be used to solve differential equations using spectral methods.

$ module load anaconda3/2020.11
$ conda create --name dedalus-env python=3.6
$ conda activate dedalus-env
$ conda config --add channels conda-forge
$ conda install nomkl cython docopt matplotlib pathlib scipy
$ module load openmpi/gcc/1.10.2/64 fftw/gcc/openmpi-1.10.2/3.3.4 hdf5/gcc/openmpi-1.10.2/1.10.0
$ export FFTW_PATH=$FFTW3DIR
$ export HDF5_DIR=$HDF5DIR
$ export MPI_PATH=/usr/local/openmpi/1.10.2/gcc/x86_64
$ export MPICC=$(which mpicc)
$ pip install mpi4py
$ CC=mpicc pip install --upgrade --no-binary :all: h5py
$ hg clone https://bitbucket.org/dedalus-project/dedalus
$ cd dedalus
$ pip install -r requirements.txt
$ python setup.py build
$ python setup.py install

TensorFlow

See our guide for TensorFlow on the HPC clusters.

PyTorch

See our guide for PyTorch on the HPC clusters.

mpi4py

MPI for Python (mpi4py) provides bindings of the Message Passing Interface (MPI) standard for the Python programming language. It can be used to parallelize Python scripts. See our guide for installing mpi4py on the HPC clusters.

 

FAQ

1. Why does pip install <package> fail with an error mentioning a Read-only file system?

After loading the anaconda3 module, pip will be available as part of Anaconda Python which is a system package. By default pip will try to install the files in the same locations as the Anaconda packages. Because you don't have write access to this directory the install will fail. One needs to add --user as discussed above.

2. What should I do if I try to install a Python package and the install fails with: error: Disk quota exceeded?

You have three options. First, consider removing files within your home directory to make space available. Second, run the checkquota command and follow the link at the bottom to request more space. Lastly, for pip installations, see the question toward the bottom of this FAQ about setting --target to a path on /scratch/gpfs/<YourNetID>. For conda installs, learn about the --prefix option.

3. Why do I get the following error message when I try to run pip on Della: -bash: pip: command not found?

You need to do module load anaconda3 before using pip or any of the Anaconda packages. You also need to load this module before using Python itself.

4. I read that it is a good idea to update conda before installing a package. Why do I get an error message when I try to perform the update?

conda is a system executable. You do not have permission to update it. If you try to update it you will get this error: EnvironmentNotWritableError: The current user does not have write permissions to the target environment. The current version is sufficient to install any package.

5. When I run conda list on the base environment I see the package that I need but it is not the right version. How can I get the right version?

One solution is to create a conda environment and install the version you need there. For example, the version of NumPy on Tiger is 1.16.2. If you need version 1.16.5 for your work then do: conda create --name myenv numpy=1.16.5.

6. Is it okay if I combine virtualenv and conda?

This is highly discouraged. While in principle it can work, most users find it just causes problems. Try to stay within one environment manager. Note that if you create a conda environment you can use pip to install packages.

7. Can I combine conda and pip?

Yes, and this tends to work well. A typical session may look like this:

$ module load anaconda3/2020.11
$ conda create --name myenv python=3.6
$ conda activate myenv
$ pip install scitools

Note that --user is omitted when using pip within a conda environment. See the bullet points at the bottom of this page for tips on using this approach.

8. How do I install a Python package in a custom location using pip or conda?

For pip, first do pip install --target=</path/to/install/location> <package> then update the PYTHONPATH environment variable in your ~/.bashrc file with export PYTHONPATH=$PYTHONPATH:/path/to/install/location. For conda, you use the --prefix option. For instance, to install cupy on /scratch/gpfs/<YourNetID>:

$ module load anaconda3/2020.11
$ conda create --prefix /scratch/gpfs/$USER/py-gpu cupy

Be sure to have these two lines in your Slurm script: module load anaconda3/2020.11 and conda activate /scratch/gpfs/$USER/py-gpu. Note that /scratch/gpfs is not backed up.

9. I tried to install some packages but now none of my Python tools are working. Is it possible to delete all my Python packages and start over?

Yes. Packages installed by pip are in ~/.local/lib, while conda packages and environments are in ~/.conda. If you made any environments with virtualenv you should remove those as well. Removing these directories will give you a clean start, but be sure to examine their contents first; it may be wise to selectively remove sub-directories instead. You may also need to remove the ~/.cache directory, and you may need to modify your .bashrc file if you added or changed environment variables.

10. How are my pip packages built? Which optimization flags are used? Do I have to be careful with vectorization on Della where several different CPUs are in play?

After loading the anaconda3 module, run this command: python3.8-config --cflags. To force a package to be built from source with certain optimization flags do, for example: CFLAGS="-O1" pip install numpy -vvv --no-binary=numpy

11. What is the Intel Python distribution and how do I get started with it?

Intel provides its own implementation of Python as well as numerous packages optimized for Intel hardware. You may find significant performance benefits from these packages. To create a conda environment with Intel Python and a number of Intel-optimized numerics packages:

$ module load anaconda3/2020.11
$ conda create --name my-intel --channel intel python numpy scipy

12. The installation directions that I am following say to use pip3. Is this okay?

Do not use pip3 for installing Python packages. pip3 is a component of the system Python and it will not work properly with Anaconda. Always do module load anaconda3 and then use pip for installing packages.

 

Getting Help

If you encounter any difficulties while using Python on one of our HPC clusters then please send an email to cses@princeton.edu or attend a help session.