Using R on the HPC Clusters


QUICK FIX

If you encounter difficulties while trying to install a common R package then try this:

$ module load rh/devtoolset/8
$ R
> install.packages("<package-name>")   # e.g., install.packages("dplyr")

Read the "Installing R Packages" section below to understand why the rh environment module needs to be loaded.

VERSION 3.6 vs. 4.0

As of September 2020, the default version of R on all HPC clusters is 4.0. To continue using the previous version (3.6), run the following command on the command line and/or in your Slurm script:

module load R/3.6.3

On Tiger, there is no such module so please move your work to version 4.

 

Running RStudio via Your Web Browser

RStudio is available through two web portals. You will need to use a VPN to connect from off-campus (GlobalProtect VPN is recommended). If you have an account on Adroit or Della then browse to https://myadroit.princeton.edu or https://mydella.princeton.edu. To begin a session, click on "Interactive Apps" and then "RStudio Server". For more details see this tutorial from DSS. Complete this form if you need an account on Adroit.

OnDemand RStudio

While most packages can be installed through RStudio, at times you will need to perform the installation from the head node of Adroit or Della following the Quick Fix directions at the top of this page. To get to the head node from the OnDemand main menu, click on "Clusters" and then "<Name> Cluster Shell Access". This will take you to a black terminal window.

Uploading Files

From the MyAdroit/MyDella main menu choose "Files" then "/scratch/network/<YourNetID>" on MyAdroit or "/scratch/gpfs/<YourNetID>" on MyDella. Choose "New Dir" to make a directory with a name you create. Double click on the newly created directory to open it. Choose "Upload" to transfer your files from your local computer to Adroit/Della. If you need to edit a file after uploading then choose "Edit". You can also create new files. See a video demonstration of uploading files. Learn more about the different locations to store your files.

Internet Access is Not Available During Running Sessions

RStudio runs on the compute nodes, which do not have Internet access. This means that you will not be able to download files, clone a repo from GitHub, etc. The one exception is package installation: many, but not all, R packages can still be installed from within a session. If you need Internet access then in the main OnDemand menu, click on "Clusters" and then "<Name> Cluster Shell Access". This will present you with a black terminal screen on the head node where you can run commands which need Internet access. Any files or packages that you download while on the head node will be available on the compute nodes where your OnDemand session runs.

Using Packages like sf and lwgeom

When presented with the OnDemand form, for the "R Version" choose "default with geos 3.7.2".

 

Running RStudio on Nobel

If you have an X server like XQuartz or MobaXterm running on your laptop then follow the commands below to run RStudio:

$ ssh -Y <YourNetID>@nobel.princeton.edu
$ module load rh/devtoolset/8  # run this command if you will install packages
$ rstudio

If you encounter an error message like the one below then it may be because you are over your quota:

X11 connection rejected because of wrong authentication.
(xstata-se:42220): > Gtk-WARNING **: 20:05:35.756: cannot open display: localhost:12.0

Try removing unnecessary files or complete this form to request more space from OIT. Research Computing does not maintain the filesystems of Nobel. Also, make sure that you satisfy the X server requirements described on this page.

Most users have a 5 GB quota. Run the following command to see how much storage you are using:

$ du -sh ~/.

To see which folders are taking up the most space:

$ du -h --max-depth=1 ~/. | sort -hr

You can remove individual files with:

$ rm <file1> <file2> <file3>

Or remove entire directories with:

$ rm -rf <directory1> <directory2> <directory3>

Your home directory on Nobel, which is also known as the H: drive, has this absolute path: /n/homeserver2/user2a/<YourNetID>.

 

Installing R Packages

R packages may be distributed in source form or as compiled binaries. Packages that come in source form must be compiled before they can be installed in your /home directory. The recommended tool suite for doing this is the GNU Compiler Collection (GCC) and specifically g++, which is the C++ compiler. To provide a stable environment for building software on our HPC clusters, the default version of GCC is kept the same for years at a time. To see the current version of g++, run the following command on one of the HPC clusters (e.g., Della):

$ g++ --version
g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)

While most R packages will compile with the current long-term version of GCC, some require a newer version. A newer version is made available by loading one of the latest Red Hat Developer Toolset (rh/devtoolset) modules:

$ module load rh/devtoolset/8
$ g++ --version
g++ (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)

After loading the rh module, start R and run install.packages() as shown above in the Quick Fix. Note that the C and Fortran compilers and related tools are also updated by this method which is important for some R packages. Popular packages that require the rh module to be loaded for installation are dplyr, tidyverse, rstan, devtools, rvest, argparse and lubridate. There are many more. In fact, it is a good idea to always load the rh module before installing R packages.

 

Before You Install

Make sure you have enough disk space before installing. This can be done by running the checkquota command:

$ checkquota

          Storage/size quota filesystem report for user: ceisgrub
Filesystem               Mount                 Used   Limit  MaxLim Comment
Adroit home              /home                8.3GB   9.3GB    10GB 
Adroit scratch           /scratch                 0       0       0 
Adroit scratch network   /scratch/network     8.2GB       0       0 

          Storage number of files used report for user: ceisgrub
Filesystem               Mount                 Used   Limit  MaxLim Comment
Adroit home              /home                52.9K    975K    1.0M 
Adroit scratch           /scratch                 1       0       0 
Adroit scratch network   /scratch/network     39.3K       0       0 

For quota increase requests please use this website:

         https://forms.rc.princeton.edu/quota

The difference between Limit and Used in the /home row is your available space. In the example above the user has 9.3 - 8.3 = 1 GB available. Most packages require fewer than 0.1 GB. However, if you are installing many packages then disk space should be a concern. If you require more space then follow the link at the bottom of the output of checkquota to request more.
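The subtraction described above can also be scripted. The sketch below is only an illustration and is not part of checkquota: available_gb is a hypothetical helper name, and the Used and Limit values are copied by hand from the /home row of the report.

```shell
# Hypothetical helper (not part of checkquota): available space = Limit - Used.
# Both arguments are plain numbers in GB, copied by hand from the /home row.
available_gb() {
    awk -v used="$1" -v limit="$2" 'BEGIN { printf "%.1f\n", limit - used }'
}

# Values from the example report above: Used = 8.3 GB, Limit = 9.3 GB
available_gb 8.3 9.3    # prints 1.0
```

If the result is close to zero, free up space or request a quota increase before installing.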

 

Installing the First Package

After connecting to one of the clusters via ssh, load the rh module, start R and then install a package. The first time you do this, answer 'yes' to the first two questions and then choose a CRAN mirror when prompted, such as USA (OH) if it is listed or another nearby mirror (the list and its numbering change over time). Below is a full example session on Della:

$ ssh <YourNetID>@della.princeton.edu
$ module load rh/devtoolset/8
$ R

R version 3.6.2 (2019-12-12) -- "Dark and Stormy Night"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
...

> install.packages("lubridate")
Installing package into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
Warning in install.packages("lubridate") :
  'lib = "/usr/lib64/R/library"' is not writable
Would you like to use a personal library instead? (yes/No/cancel) yes
Would you like to create a personal library
‘~/R/x86_64-redhat-linux-gnu-library/3.6’
to install packages into? (yes/No/cancel) yes
--- Please select a CRAN mirror for use in this session ---
Secure CRAN mirrors 

 1: 0-Cloud [https]                   2: Algeria [https]                
 3: Australia (Canberra) [https]      4: Australia (Melbourne 1) [https]
 5: Australia (Melbourne 2) [https]   6: Australia (Perth) [https]      
 7: Austria [https]                   8: Belgium (Ghent) [https]        
 9: Brazil (BA) [https]              10: Brazil (PR) [https]            
11: Brazil (RJ) [https]              12: Brazil (SP 1) [https]          
13: Brazil (SP 2) [https]            14: Bulgaria [https]               
15: Chile (Santiago) [https]         16: China (Hong Kong) [https]      
17: China (Lanzhou) [https]          18: China (Shanghai) [https]       
19: Colombia (Cali) [https]          20: Czech Republic [https]         
21: Denmark [https]                  22: Ecuador (Cuenca) [https]       
23: Ecuador (Quito) [https]          24: Estonia [https]                
25: France (Lyon 1) [https]          26: France (Lyon 2) [https]        
27: France (Marseille) [https]       28: France (Montpellier) [https]   
29: Germany (Erlangen) [https]       30: Germany (Göttingen) [https]    
31: Germany (Münster) [https]        32: Germany (Regensburg) [https]   
33: Greece [https]                   34: Hungary [https]                
35: Iceland [https]                  36: Indonesia (Jakarta) [https]    
37: Ireland [https]                  38: Italy (Padua) [https]          
39: Japan (Tokyo) [https]            40: Japan (Yonezawa) [https]       
41: Korea (Busan) [https]            42: Korea (Gyeongsan-si) [https]   
43: Korea (Seoul 1) [https]          44: Korea (Ulsan) [https]          
45: Malaysia [https]                 46: Mexico (Mexico City) [https]   
47: Morocco [https]                  48: Norway [https]                 
49: Philippines [https]              50: Russia [https]                 
51: Spain (Madrid) [https]           52: Sweden [https]                 
53: Switzerland [https]              54: Turkey (Denizli) [https]       
55: Turkey (Mersin) [https]          56: UK (Bristol) [https]           
57: UK (London 1) [https]            58: USA (CA 1) [https]             
59: USA (IA) [https]                 60: USA (KS) [https]               
61: USA (MI 1) [https]               62: USA (MI 2) [https]             
63: USA (OR) [https]                 64: USA (TN) [https]               
65: USA (TX 1) [https]               66: Uruguay [https]                
67: (other mirrors)                  

Selection: 64

Your desired package and its dependencies will be built and installed. To help with organization, you can make different libraries and install your packages into the library of your choosing. After your first session, you will only be asked to select the CRAN mirror when installing a package.

 

GSL, GDAL and GEOS Environment Modules

In addition to the rh module, some R packages require a newer version of the GNU Scientific Library (GSL). This can be accomplished with:

$ module load rh/devtoolset/8
$ module load gsl/2.6
$ R
> install.packages("<package-name>")

Other common environment modules to load are gdal and geos. gdal is required for installing rgdal, for instance. Here is a sample session:

$ module load rh/devtoolset/8
$ module load gdal/2.2.4
$ module load geos/3.7.2
$ R
> install.packages("<package-name>")

IMPORTANT: If you built a package with the gsl or gdal modules loaded then you will need to add module load gsl or module load gdal, respectively, to your Slurm script. The same is true for geos. One does not need to include the rh module in the Slurm script.

One example that requires the gdal module to be loaded is the sf package. If one fails to load the module then the following error will result:

configure: error: sf is not compatible with GDAL versions below 2.0.1

The geojsonio package also requires the gdal module.

 

Custom Modules

You can create your own environment modules which can then be loaded for an OnDemand session. For example, one can create a Conda environment of R packages and load the module for this environment. See the directions for creating custom modules. To get this to work on MyAdroit, add your module files here:

/home/<YourNetID>/Modules/modulefiles/

Specify the name of the module in the appropriate field when creating the OnDemand session.

 

RStan

The RStan package compiles models from source at run time. For this reason it is necessary to make a modern compiler suite available. This can be done by including this line in your Slurm script:

module load rh/devtoolset/8

Failure to load this module will result in an error such as:

g++: error: unrecognized command line option ‘-std=gnu++14’
make: *** [file396ab327a56f.o] Error 1

This must also be done for packages that depend on RStan such as brms. These directions apply to Adroit and Della. Tiger is not configured to work with RStan since that cluster is designed for multinode parallel jobs.
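As a minimal sketch, a Slurm script for an RStan job on Adroit or Della might look like the one written below. The #SBATCH values mirror the serial example in the batch scheduler section of this page, the run time is an arbitrary guess, and the names rstan.slurm and myscript.R are placeholders. The commands create the file from the terminal with a here-document; creating it in a text editor works equally well.

```shell
# Write a Slurm script for an RStan job (rstan.slurm is a placeholder name).
cat > rstan.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=R-rstan       # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task
#SBATCH --mem-per-cpu=4G         # memory per cpu-core
#SBATCH --time=00:10:00          # total run time limit (HH:MM:SS); adjust for your model

module purge
module load rh/devtoolset/8      # newer g++ so the model can be compiled at run time
Rscript myscript.R
EOF

# On the cluster, submit the job with: sbatch rstan.slurm
```

The module is loaded after module purge and before Rscript so that the newer compiler is on the path when the model is built.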

 

Development Versions 

You can install the released version of a package such as furrr from CRAN with:

> install.packages("furrr")

In certain cases, such as when you need the bleeding-edge changes and bug fixes, you should install the development version from GitHub with:

# install.packages("devtools")
> devtools::install_github("DavisVaughan/furrr")

 

Using Conda

One can create an isolated Conda environment composed of R packages and R itself. You can search for these packages on anaconda.org. For instance, to create a Conda environment that includes rmapshaper and other packages:

$ module load anaconda3/2020.11
$ conda create --name mshpr-env --channel conda-forge r-dplyr r-rmapshaper r-sf
$ conda activate mshpr-env
$ R
> q()

Note that a Conda environment composed of R packages comes with its own R executable. Be sure to load the anaconda3/2020.11 module and activate the environment in your Slurm script.
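A Slurm script that follows this advice could be sketched as follows. The module and environment names are the ones used above; the file name conda.slurm, the script name myscript.R and the resource values are placeholders chosen for illustration. The commands write the file from the terminal with a here-document.

```shell
# Write a Slurm script that runs R from the Conda environment
# (conda.slurm is a placeholder name).
cat > conda.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=R-conda       # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task
#SBATCH --mem-per-cpu=4G         # memory per cpu-core
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)

module purge
module load anaconda3/2020.11
conda activate mshpr-env         # the environment created above

Rscript myscript.R               # uses the R executable from the environment
EOF

# On the cluster, submit the job with: sbatch conda.slurm
```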

 

Submitting Jobs to the Batch Scheduler

The following Slurm script could be used to run a serial R job:

#!/bin/bash
#SBATCH --job-name=R-serial      # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=all          # send email on start, end and fault
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
Rscript myscript.R

If you built a package with the gsl or gdal modules loaded then you will need to add module load gsl or module load gdal, respectively, before the Rscript command in the script above.

Follow the commands below to run your first R script on Della, for example:

$ ssh <YourNetID>@della.princeton.edu
$ cd /scratch/gpfs/<YourNetID>
$ git clone https://github.com/PrincetonUniversity/hpc_beginning_workshop
$ cd hpc_beginning_workshop/RC_example_jobs/serial_R
# edit email address in job.slurm
$ sbatch job.slurm

There is a similar example to that above for the Adroit cluster here.

 

Example of Installing Packages, Uploading Files and Running a Job

1. Install the required R packages

Connect via VPN then browse to myadroit or mydella. Choose "Clusters" then "<Name> Cluster Shell Access", where <Name> is Adroit or Della. This will open a black terminal screen. Run these commands (for your specific R packages):

$ module load rh/devtoolset/8
$ R
> install.packages(c("dplyr", "lubridate"))
# answer "yes" twice then choose OH as the mirror by entering the appropriate number
> q()

2. Upload your files

Return to your browser tab with the MyAdroit/MyDella main menu or Dashboard. Choose "Files" then "/scratch/network/<YourNetID>" on MyAdroit or "/scratch/gpfs/<YourNetID>" on MyDella. Choose "New Dir" to make a directory with a name you create (below this is referred to as <JobDirectory>). Double click on the newly created directory to open it. Choose "Upload" to transfer your R script, data files and Slurm script (job.slurm) from your local computer to Adroit/Della. If you need to edit a file after uploading then choose "Edit". You can also create new files.

3. Submit the job

Return to the tab with the black terminal. Run these commands:

$ cd /scratch/network/<YourNetID>/<JobDirectory>  # or /scratch/gpfs for della
$ sbatch job.slurm

To monitor the status of the job use:

$ squeue -u $USER

Once the job is complete you can download the files using the MyAdroit/MyDella GUI. To learn more about Slurm and the Linux command line see this guide.

 

Where to Run R Jobs

Adroit and Della are ideal for R jobs. The TigerCPU cluster was designed for parallel jobs that require multiple nodes. The scheduler on that cluster has been configured to give serial or single-node jobs the lowest priority. In some cases, squeue will classify the reason that the small job is pending as (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions). This indicates that the required resources are being used for large jobs. Your serial or single-node job will eventually run, however. If you only have an account on Tiger and you want to run several small R jobs then please write to cses@princeton.edu to request an account on Della. Be sure to explain the situation.

 

Where to Store Your Files

You should run your jobs out of /scratch/gpfs/<YourNetID> on the HPC clusters. These filesystems are very fast and provide vast amounts of storage. Do not run jobs out of /tigress or /projects. That is, you should never be writing the output of actively running jobs to those filesystems. /tigress and /projects are slow and should only be used for backing up the files that you produce on /scratch/gpfs. Your /home directory on all clusters is small and it should only be used for storing source code and executables. The commands below give you an idea of how to properly run an R job:

$ ssh <YourNetID>@della.princeton.edu
$ cd /scratch/gpfs/<YourNetID>
$ mkdir myjob && cd myjob
# put R script and Slurm script in myjob
$ sbatch job.slurm

If the run produces data that you want to back up then copy or move it to /tigress:

$ cp -r /scratch/gpfs/<YourNetID>/myjob /tigress/<YourNetID>

For large transfers consider using rsync instead of cp. Most users only do backups to /tigress every week or so. While /scratch/gpfs is not backed up, files are never removed. However, important results should be transferred to /tigress or /projects. The diagram below gives an overview of the filesystems:

HPC clusters and the filesystems that are available to each. Users should write job output to /scratch/gpfs.

 

Optimizing Performance

The performance of numerically intensive packages such as RStan can be improved through compiler optimizations and vectorization. If you are an advanced user, before installing such a package, you may consider turning on these optimizations by creating a ~/.R/Makevars file containing these lines:

CC = gcc
CXX = g++
FC = gfortran
CFLAGS = -O3 -ffast-math -march=native -fwhole-program -fpic -m64
CXXFLAGS = -O3 -ffast-math -march=native -fwhole-program -fpic -m64
FFLAGS = -O3 -ffast-math -march=native -fwhole-program -fpic -m64

After installing such a package with the Makevars settings above, you must remove or rename the Makevars file to prevent the optimizations from creating incompatibilities with packages to be installed at a later time.
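One way to move the file aside once the optimized package is installed is sketched below. The function name disable_makevars and the .optimized suffix are arbitrary choices for this illustration, not an established convention.

```shell
# Hypothetical helper: rename ~/.R/Makevars so that later package installs
# fall back to R's default compiler flags. The ".optimized" suffix is an
# arbitrary choice for this sketch.
disable_makevars() {
    local makevars="$HOME/.R/Makevars"
    if [ -f "$makevars" ]; then
        mv "$makevars" "$makevars.optimized"
        echo "Renamed $makevars to $makevars.optimized"
    else
        echo "No Makevars file found at $makevars"
    fi
}

disable_makevars
```

To reuse the optimized settings for a later install, rename the file back to Makevars before starting R.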

There may be times when you will need to specify a language standard. This can be done by adding a line to Makevars such as:

CXX14STD = -std=c++14

 

Getting R Packages onto a Secure VM

Most VMs are unable to reach the Internet and are unreachable from the Internet for security purposes. In such cases one cannot directly install packages. One solution to this is to set up a similar environment on another machine which has Internet access and then copy it over (or have someone copy it). We generally suggest getting an account on Adroit, one of our cluster machines, where you can use the head node of the cluster to create the environment. You can request an account here.

Once you have an account on Adroit, and are connected to one of the University VPN services, you can SSH from your computer directly to Adroit. You'll then install the R packages. Here is an example session:

$ ssh <YourNetID>@adroit.princeton.edu
$ mkdir mylibs
$ export R_LIBS_USER=/home/<YourNetID>/mylibs
$ module load rh/devtoolset/8
$ R
> .libPaths()  # "/home/<YourNetID>/mylibs" should appear in the list
> install.packages(c("dplyr", "ggplot2", "lubridate", "caret"))
...
Selection: 56  # choose a mirror such as "USA (OH) [https]"
...

When finished, quit R and return to the command line. Then use 'tar' to compress the mylibs directory:

$ tar cvzf mylibs.tar.gz mylibs

On the VM

Transfer mylibs.tar.gz to the VM and unpack it. In some cases this will need to be done by a member of Research Computing. To unpack the file use:

$ tar xzf mylibs.tar.gz

Then do:

$ export R_LIBS_USER=<path/to>/mylibs
$ R

You should be able to load the libraries in R. If you encounter problems use .libPaths() to check the paths that are searched for R libraries. You can also look inside the mylibs directory to check for the existence of certain packages and their dependencies.

 

Rmpi and HPC R

For directions on building Rmpi and approaches to parallelizing R scripts see this workshop. If you are using Rmpi on Della and you find that jobs hang, try adding this line to the end of your R script:

Rmpi::mpi.quit()

 

FAQ

1. I tried to install an R package but the installation failed with this error message: for loop initial declarations are only allowed in C99 mode. What should I do?

This problem can be solved by loading a newer version of GCC. To do this, before starting R, run this command on the command line: module load rh/devtoolset/8. Read the content above for the explanation for this solution.

2. Nothing is working properly and I want to delete all my R packages and start over. How do I do this?

To delete all of your R packages and R files: rm -rf ~/R ~/.R ~/.rstudio ~/.Rhistory ~/.Rprofile ~/.RData. You may also need to remove lines from your ~/.bashrc file if you added or modified environment variables.

3. How do I see where the R packages are installed?

The paths to system and user packages can be seen with this R command: > .libPaths(). To specifically see the path to user packages use: > Sys.getenv("R_LIBS_USER").

4. How do I see my installed packages?

All base and user packages can be listed with the R command:

> installed.packages()

5. How do I see the default packages?

Default packages can be listed with the R command:

> getOption("defaultPackages")

6. I have a list of packages. How do I install them all at once?

> install.packages(c("<package-name-1>", "<package-name-2>", ...))

7. How do I remove a package?

A package can be removed with the command:

> remove.packages("<package-name>")

8. I have the source code for a package I want to install. How do I perform the installation?

For Rmpi, for instance, use this command on the command line:

$ R CMD INSTALL -l ~/.local/lib --no-test-load Rmpi_0.6-9.tar.gz

Then start R and do:

> library("Rmpi", lib.loc="/home/<YourNetID>/.local/lib")

Or one could set an R environment variable in ~/.bashrc:

export R_LIBS=~/.local/lib:$R_LIBS

9. How can I solve the following error?

Installing package into ‘/home/ceisgrub/R/x86_64-redhat-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
Error in structure(.External(.C_dotTclObjv, objv), class = "tclObj") :
  [tcl] grab failed: window not viewable.

R is trying to display a list of mirrors via X11 forwarding so try unsetting DISPLAY before starting R:

$ unset DISPLAY
$ module load rh/devtoolset/8
$ R
> install.packages("<package-name>")

10. Which BLAS/LAPACK library is R using?

This information is available by running the sessionInfo() command.

11. How should I deal with this error: "ERROR: failed to lock directory '/home/aturing/R/x86_64-redhat-linux-gnu-library/4.0' for modifying. Try removing '/home/aturing/R/x86_64-redhat-linux-gnu-library/4.0/00LOCK-data.table'"?

Follow the suggestion which says to remove a specific directory. This can be done with: rm -rf /home/aturing/R/x86_64-redhat-linux-gnu-library/4.0/00LOCK-data.table. Be sure to use the path from your own case. If you are using RStudio on MyAdroit or MyDella then this command must be run on the command line. From the OnDemand main menu, click on "Clusters" and then "<Name> Cluster Shell Access". This will take you to a black terminal window where you can run the command.
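If you are unsure which lock directories have been left behind, they can be listed before anything is removed. The sketch below assumes your personal library lives under ~/R, as in the sessions shown on this page; list_r_locks is a hypothetical helper name.

```shell
# Hypothetical helper: list leftover 00LOCK* directories under a library root.
# Defaults to ~/R, where the personal library is created in the sessions above.
list_r_locks() {
    local root="${1:-$HOME/R}"
    [ -d "$root" ] || return 0     # nothing to list if the root does not exist
    # Personal libraries sit at <root>/<platform>-library/<R version>/<package>
    find "$root" -maxdepth 3 -type d -name '00LOCK*'
}

list_r_locks
```

Each path that is printed can then be removed with rm -rf as described above, once no installation is still running.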

12. How should I deal with this error: "Error: protect(): protection stack overflow."?

This error can occur when working with large data files. We are not aware of a solution for RStudio on MyAdroit or MyDella but if you submit a job to the Slurm scheduler, call Rscript in this way:

Rscript --max-ppsize=500000 myscript.R

13. How do I use a version of R from Anaconda instead of the system R in RStudio?

On the head node, activate your environment then use the "which R" command to get the path. Enter the full path into the PATH field (do not use ~ or $HOME) when creating the RStudio session via MyAdroit/MyDella.

14. What does "Status code 502" mean when using OnDemand RStudio?

This error can arise when a user tries to run two commands at once. Try letting each command finish before running the next. For instance, if the script is executing then do not try to also save the R file. The error can also be an indication of requesting insufficient RAM. Try starting a new session with more RAM.

 

Getting Help from Data & Statistical Services (DSS)

For help on using R with data analysis please see the DSS website. DSS offers online tutorials and training for performing data analysis with R as well as one-on-one appointments.

 

Getting Help from Research Computing

If you encounter any difficulties while working with R on the HPC clusters then please send an email to cses@princeton.edu or attend a walk-in help session.