Using R on the Research Computing Clusters


VERSION 4.4 IS NOW THE DEFAULT

As of May 2024, the default version of R on Della is 4.4. To continue using an earlier version such as 4.3, run the following command on the command line and/or in your Slurm script:

module load R/4.3.1

To see the available versions of R, run the following command:

$ module avail R

We recommend always loading an R environment module in your Slurm scripts even when you are using the default version. This will allow your scripts to be used without changes after the default version of R is updated. For example:

module purge
module load R/4.4.0
Rscript myscript.R

See the Slurm section for more on running batch jobs.

 

ONDEMAND RSTUDIO CAN NO LONGER ACCESS THE INTERNET (12/14/2022)

Internet access has been disabled in RStudio sessions due to security and compliance reasons. Previously, it was possible to install R packages and carry out other network operations in RStudio but this is no longer the case. The correct way to install R packages is to do so on the command line (see Installing R Packages below) before starting a session. If your work happens to require internet access in RStudio then use one of the visualization nodes by choosing "Interactive Apps" in the OnDemand main menu and then either "RStudio Server on Della Vis1" or "RStudio Server on Della Vis2". Keep in mind that the visualization nodes are shared between all users.

If you attempt to install R packages from within RStudio you will encounter this error:

> install.packages("microbenchmark")
Warning in install.packages :
  unable to access index for repository https://cran.rstudio.com/src/contrib:
  cannot open URL 'https://cran.rstudio.com/src/contrib/PACKAGES'

 

Running RStudio via Your Web Browser

Princeton Virtual Desktop

If you are most comfortable with Microsoft Windows and only need a single CPU-core, consider running RStudio using the Princeton Virtual Desktop. Choose "Student Labs" then "RStudio". Central OIT maintains this service, so please open a Support Ticket with OIT if you encounter issues.

Learn More about Princeton Virtual Desktop.

 

Research Computing OnDemand

RStudio is available through two web portals. You will need to use a VPN to connect from off-campus (GlobalProtect VPN is recommended). If you have an account on Adroit or Della then browse to https://myadroit.princeton.edu or https://mydella.princeton.edu. To begin a session, click on "Interactive Apps" and then "RStudio Server". For more details see this tutorial from DSS. Complete this form if you need an account on Adroit.

Note that on Adroit and Della, when you save any user data from your RStudio sessions (e.g. your session information, code history, etc.), those files are placed in your /scratch/network (on Adroit) or /scratch/gpfs (on Della) folder.

OnDemand RStudio

All R packages must be installed from the command line on the login node of Adroit or Della. To get to the command line on the login node from the OnDemand main menu, click on "Clusters" and then "<Name> Cluster Shell Access". This will take you to a black terminal window where you can install packages by running the appropriate commands, for example:

$ module load R/4.4.0  # use R/4.3.1 on Adroit
$ R
> install.packages("dplyr")

See below for more details on installing packages on the command line.

Uploading Files

From the MyAdroit/MyDella main menu choose "Files" then "/scratch/network/<YourNetID>" on MyAdroit or "/scratch/gpfs/<YourNetID>" on MyDella. Choose "New Dir" to make a directory with a name you create. Double click on the newly created directory to open it. Choose "Upload" to transfer your files from your local computer to Adroit/Della. If you need to edit a file after uploading then choose "Edit". You can also create new files. See a video demonstration of uploading files and many other operations in OnDemand. Learn more about the different locations to store your files.

Internet Access is Not Available During Running Sessions

RStudio runs on the compute nodes which do not have Internet access. This means that you will not be able to install R packages, download files, clone a repo from GitHub, etc. If you need internet access then in the main OnDemand menu, click on "Clusters" and then "<Name> Cluster Shell Access". This will present you with a black terminal screen on the head node where you can run commands which need internet access. Any files or packages that you download while on the head node will be available on the compute nodes where your OnDemand session runs. If your work happens to require internet access in RStudio then use one of the visualization nodes on Della by choosing "Interactive Apps" in the OnDemand main menu and then either "RStudio Server on Della Vis2" or "RStudio Server on Della Vis3". Keep in mind that the visualization nodes are shared between all users.

 

Running RStudio on Nobel (not recommended)

We suggest that you use RStudio via MyAdroit or MyDella and not on Nobel. However, if you have an X server like XQuartz or MobaXterm running on your laptop (for X11 forwarding) then follow the commands below to run RStudio on Nobel:

$ ssh -Y <YourNetID>@nobel.princeton.edu
$ rstudio

If you encounter an error message like that below then it may be because you are over your quota:

X11 connection rejected because of wrong authentication.
(xstata-se:42220): > Gtk-WARNING **: 20:05:35.756: cannot open display: localhost:12.0

Try removing unnecessary files or complete this form to request more space from OIT. Research Computing does not maintain the filesystems of Nobel. Also, make sure that you satisfy the X server requirements described on this page.

Most users have a 5 GB quota. Run the following command to see how much storage you are using:

$ du -sh ~/.

To see which folders are taking up the most space:

$ du -h --max-depth=1 ~/. | sort -hr

You can remove individual files with:

$ rm <file1> <file2> <file3>

Or remove entire directories with:

$ rm -rf <directory1> <directory2> <directory3>

Your home directory on Nobel is also known as the "H: drive".

 

Installing R Packages

R packages may be distributed in source form or as compiled binaries. Packages that come in source form must be compiled before they can be installed in your /home directory. The recommended tool suite for doing this is the GNU Compiler Collection (GCC) and specifically g++, which is the C++ compiler. To provide a stable environment for building software on our HPC clusters, the default version of GCC is kept the same for years at a time. To see the current version of g++, run the following command on one of the login nodes:

$ g++ --version
g++ (GCC) 8.5.0 20210514 (Red Hat 8.5.0-18)

 

Before You Install

Make sure you have enough disk space before installing. This can be done by running the checkquota command:

$ checkquota
          Storage/size quota filesystem report for user: ceisgrub
Filesystem               Mount                 Used   Limit  MaxLim Comment
Adroit home              /home                8.3GB   9.3GB    10GB 
Adroit scratch           /scratch                 0       0       0 
Adroit scratch network   /scratch/network     8.2GB       0       0 
          Storage number of files used report for user: ceisgrub
Filesystem               Mount                 Used   Limit  MaxLim Comment
Adroit home              /home                52.9K    975K    1.0M 
Adroit scratch           /scratch                 1       0       0 
Adroit scratch network   /scratch/network     39.3K       0       0 
For quota increase requests please use this website:
         https://forms.rc.princeton.edu/quota

The difference between Limit and Used in the /home row is your available space. In the example above the user has 9.3 - 8.3 = 1 GB available. Most packages require less than 0.1 GB. However, if you are installing many packages then disk space can become a concern. If you require more space then follow the link at the bottom of the output of checkquota to request more.
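The arithmetic above can be sketched as a small shell function. This is only an illustration: available_gb is a hypothetical helper name, and the two numbers are typed in by hand from the /home row rather than parsed from the checkquota output.

```shell
# Sketch: available space = Limit - Used, with both values (in GB) typed in
# by hand from the /home row of checkquota. Nothing is parsed automatically.
available_gb() {
    limit_gb="$1"
    used_gb="$2"
    awk -v limit="$limit_gb" -v used="$used_gb" 'BEGIN { printf "%.1f\n", limit - used }'
}

available_gb 9.3 8.3    # values from the example output above; prints 1.0
```

To see how much of that space your personal R library is using, the du command can be pointed at it (for example, du -sh ~/R once the library has been created).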

 

Installing the First Package

After connecting to one of the clusters via ssh, start R and then install a package. The first time you do this you will want to answer 'yes' to the first two questions and then choose the value for USA (OH) when asked to select a CRAN mirror. Below is a full example session on Della:

$ ssh <YourNetID>@della.princeton.edu
$ module load R/4.4.0
$ R
R version 4.4.0 (2024-04-24) -- "Puppy Cup"
Copyright (C) 2024 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
  Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> install.packages("argparse")
Installing package into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
Warning in install.packages("argparse") :
  'lib = "/usr/lib64/R/library"' is not writable
Would you like to use a personal library instead? (yes/No/cancel) yes
Would you like to create a personal library
‘/home/jdh4/R/x86_64-redhat-linux-gnu-library/4.4’
to install packages into? (yes/No/cancel) yes
--- Please select a CRAN mirror for use in this session ---
Secure CRAN mirrors
 1: 0-Cloud [https]
 2: Australia (Canberra) [https]
 3: Australia (Melbourne 1) [https]
 4: Australia (Melbourne 2) [https]
 5: Australia (Perth) [https]
 6: Austria [https]
 7: Belgium (Brussels) [https]
 8: Brazil (PR) [https]
...
68: UK (London 1) [https]
69: USA (IA) [https]
70: USA (MI) [https]
71: USA (MO) [https]
72: USA (OH) [https]
73: USA (OR) [https]
74: USA (TN) [https]
75: United Arab Emirates [https]
76: Uruguay [https]
77: (other mirrors)
Selection: 72

Your desired package and its dependencies will be built and installed. To help with organization, you can make different libraries and install your packages into the library of your choosing. After your first session, you will only be asked to select the CRAN mirror when installing a package.
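One way to keep separate libraries is to point the R_LIBS_USER environment variable at a directory of your choosing before starting R. The sketch below assumes a hypothetical helper name (use_r_library) and a hypothetical directory name; adapt both to your own layout.

```shell
# Hypothetical helper: create a library directory and point R at it
# for the current shell session.
use_r_library() {
    lib_dir="$1"
    mkdir -p "$lib_dir" || return 1
    export R_LIBS_USER="$lib_dir"
}

# Example (the directory name is an assumption):
#   $ use_r_library "$HOME/R/project-a-libs"
#   $ R
#   > .libPaths()                    # the new directory should appear in the list
#   > install.packages("argparse")   # installs into the first library in .libPaths()
```

The setting only lasts for the current shell session, so the same export is needed in any Slurm script that uses packages from that library.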

 

GSL Environment Module

Some R packages require a newer version of the GNU Scientific Library (GSL). To build such a package, load the gsl environment module before starting R:

$ module load R/4.4.0
$ module load gsl/2.6
$ R
> install.packages("<package-name>")

IMPORTANT: If you built a package with the gsl module loaded then you will need to add module load gsl/<version> to your Slurm script. If using OnDemand then enter this into the "Additional environment module(s) to load" field.

You do not need to load any modules to install sf, rgdal, rstan, brms, lwgeom, geojsonio and terra.

 

Installing ncdf4 and hdf5r

To install ncdf4, you need to load two environment modules before starting R. The full session appears as follows:

$ ssh <YourNetID>@della.princeton.edu
$ module load R/4.4.0
$ module load hdf5/gcc/1.10.6 netcdf/gcc/hdf5-1.10.6/4.7.4
$ R
> install.packages("ncdf4")

You must include the two modules for OnDemand RStudio sessions via the "Additional environment module(s) to load" field. If using sbatch then include the two modules in the Slurm script. The procedure above can be used for hdf5r (in this case include hdf5/gcc/1.10.6 and omit netcdf/gcc/hdf5-1.10.6/4.7.4).

If you fail to load the appropriate modules then you will encounter an error like this:

Error, nc-config not found or not executable.  This is a script that comes with the
netcdf library, version 4.1-beta2 or later, and must be present for configuration
to succeed.
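Since the error refers to nc-config not being found, a quick check before starting R is to confirm that the program is on your PATH after loading the modules. The function name require_on_path below is hypothetical; the sketch relies only on the standard command -v shell builtin.

```shell
# Returns 0 and prints the location if the named program is on PATH;
# otherwise prints a hint to stderr and returns 1.
require_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $(command -v "$1")"
    else
        echo "$1 not found; load the required modules before starting R" >&2
        return 1
    fi
}

# Example, run after the two module load commands shown above:
#   $ require_on_path nc-config
```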

 

Using gdal, geos and proj

Several R packages commonly used for creating maps (rgdal, rgeos and maptools) will no longer be available beginning in November 2023. Users should consider using alternative packages such as sf or terra as described here.

We provide environment modules for the following packages on Della:

gdal/3.7.1
geos/3.12.0
proj/9.2.1

The modules above may be useful for working with the sf package.

 

Development Versions 

You can install the released version of a package such as furrr from CRAN with:

> install.packages("furrr")

In certain cases, such as when you need the bleeding-edge changes and bug fixes, you may prefer to install the development version from GitHub with:

# install.packages("devtools")
> devtools::install_github("DavisVaughan/furrr")

 

Using Conda

One can create an isolated Conda environment composed of R packages and R itself. You can search for these packages on anaconda.org. For instance, to create a Conda environment that includes rmapshaper and other packages:

$ module load anaconda3/2023.9
$ conda create --name mshpr-env r-dplyr r-rmapshaper r-sf --channel conda-forge
$ conda activate mshpr-env
$ R
> library(rmapshaper)
> q()

Note that a Conda environment composed of R packages comes with its own R executable. Be sure to load the anaconda3/2023.9 module and activate the environment in your Slurm script. See the Custom Modules section below for using such an environment with OnDemand RStudio.
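As a rough sketch of the Slurm requirement above, the commands below write a job script that loads the anaconda3 module and activates the environment before calling Rscript. The file name job_conda.slurm, the job name and the resource values are assumptions carried over from the serial example on this page; myscript.R stands in for your own script.

```shell
# Sketch: write a Slurm script that uses the R executable from the Conda
# environment. Resource values are placeholders to adjust for your job.
cat > job_conda.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=R-conda       # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task
#SBATCH --mem-per-cpu=4G         # memory per cpu-core
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)

module purge
module load anaconda3/2023.9
conda activate mshpr-env
Rscript myscript.R
EOF

# Submit with: sbatch job_conda.slurm
```

Note that no R environment module is loaded here, since the environment supplies its own R.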

 

Custom Modules

You can create your own environment modules which can then be loaded for an OnDemand session. For example, one can create a Conda environment of R packages and load the module for this environment. See the directions for creating custom modules. To get this to work on MyAdroit, add your module files here:

/home/<YourNetID>/Modules/modulefiles/

Specify the name of the module in the appropriate field when creating the OnDemand session. Here is an example module (replace aturing with your NetID):

#%Module
proc ModulesHelp { } {
   puts stderr "This module adds the mshpr-env Conda environment to your path"
}
module-whatis "This module adds the mshpr-env Conda environment to your path\n"
set basedir "/home/aturing/.conda/envs/mshpr-env"
prepend-path PATH "${basedir}/bin"
prepend-path LD_LIBRARY_PATH "${basedir}/lib"

 

Submitting Jobs to the Batch Scheduler

The following Slurm script could be used to run a serial R job:

#!/bin/bash
#SBATCH --job-name=R-serial      # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=all          # send email on start, end and fault
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load R/4.4.0  
Rscript myscript.R

If you built a package with the gsl module loaded then you will need to add module load gsl/<version> before the Rscript command in the script above. The same applies for other modules.

For example, run the commands below to submit your first R job on Della:

$ ssh <YourNetID>@della.princeton.edu
$ cd /scratch/gpfs/<YourNetID>
$ git clone https://github.com/PrincetonUniversity/hpc_beginning_workshop
$ cd hpc_beginning_workshop/serial_R
# edit email address in job.slurm
$ sbatch job.slurm

There is a similar example to that above for the Adroit cluster here.

 

Example of Installing Packages, Uploading Files and Running a Job

1. Install the required R packages

Connect via VPN if off-campus. Browse to myadroit or mydella and choose "Clusters" then "Adroit/Della Cluster Shell Access". This will open a black terminal screen. Run these commands (for your specific R packages):

$ R
> install.packages(c("dplyr", "lubridate"))
# answer "yes" twice regarding a personal library if asked
# choose OH as the mirror by entering the appropriate number
> q()

2. Upload your files

Return to your browser tab with the OnDemand main menu. Choose "Files" then "/scratch/network/<YourNetID>" on MyAdroit or "/scratch/gpfs/<YourNetID>" on MyDella. Choose "New Dir" to make a directory with a name you create (below this is referred to as <JobDirectory>). Double click on the newly created directory to open it. Choose "Upload" to transfer your R script, data files and Slurm script (job.slurm) from your local computer to Adroit/Della. If you need to edit a file after uploading then choose "Edit". You can also create new files. See a video demonstration of uploading files and many other operations in OnDemand.

3. Submit the job

Return to the tab with the black terminal. Run these commands:

$ cd /scratch/network/<YourNetID>/<JobDirectory>  # or /scratch/gpfs for della
$ sbatch job.slurm

To monitor the status of the job:

$ squeue --me

Once the job is complete you can download the files using the MyAdroit/MyDella GUI. To learn more about Slurm and the Linux command line see this guide.

 

Where to Run R Jobs

Adroit and Della are ideal for R jobs. The Tiger and Stellar clusters were designed for parallel jobs that require multiple nodes, which makes them a poor fit for R. The scheduler on these clusters has been configured to give serial or single-node jobs the lowest priority. In some cases, squeue will report the reason that a small job is pending as (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions). This indicates that the required resources are being used for large jobs. Your serial or single-node job will eventually run, however. If you only have an account on Tiger and you want to run several small R jobs then please write to [email protected] to request an account on Della. Be sure to explain the situation.

 

Optimizing Performance

The performance of numerically intensive packages such as RStan can be improved through compiler optimizations and vectorization. If you are an advanced user, before installing such a package, you may consider turning on these optimizations by creating a ~/.R/Makevars file containing these lines:

CC = gcc
CXX = g++
FC = gfortran
CFLAGS = -O3 -ffast-math -march=native -fwhole-program -fpic -m64
CXXFLAGS = -O3 -ffast-math -march=native -fwhole-program -fpic -m64
FFLAGS = -O3 -ffast-math -march=native -fwhole-program -fpic -m64

After installing such a package with the Makevars settings above, you must remove or rename the Makevars file to prevent the optimizations from creating incompatibilities with packages to be installed at a later time.
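Renaming the file can be scripted so the step is not forgotten. The helper name disable_makevars and the .optimized suffix below are arbitrary choices for illustration, not a site convention.

```shell
# Hypothetical helper: set the optimized Makevars aside once the package
# has been installed. Defaults to ~/.R unless another directory is given.
disable_makevars() {
    r_dir="${1:-$HOME/.R}"
    if [ -f "$r_dir/Makevars" ]; then
        mv "$r_dir/Makevars" "$r_dir/Makevars.optimized"
        echo "renamed $r_dir/Makevars to Makevars.optimized"
    else
        echo "no Makevars file in $r_dir"
    fi
}

# Example: run after the optimized package has finished installing
#   $ disable_makevars
```

Keeping the renamed copy makes it easy to restore the settings later if you need to rebuild the same package.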

There may be times when you will need to specify a language standard. This can be done by adding a line to Makevars such as:

CXX14STD = -std=c++14

 

Getting R Packages onto a Secure VM

Most VMs are unable to reach the Internet and are unreachable from the Internet for security purposes. In such cases one cannot directly install packages. One solution is to set up a similar environment on another machine which has Internet access and then copy it over (or have someone copy it). We generally suggest getting an account on Adroit, one of our cluster machines, where you can use the head node of the cluster to create the environment. You can request an account here.

Once you have an account on Adroit, and are connected to one of the University VPN services, you can SSH from your computer directly to Adroit. You'll then install the R packages. Here is an example session:

$ ssh <YourNetID>@adroit.princeton.edu
$ mkdir mylibs
$ export R_LIBS_USER=/home/<YourNetID>/mylibs
$ R
> .libPaths()  # "/home/<YourNetID>/mylibs" should appear in the list
> install.packages(c("dplyr", "ggplot2", "lubridate", "caret"))
...
Selection: 72  # choose a mirror such as "USA (OH) [https]"
...

When finished, quit R and return to the command line. Then use 'tar' to compress the mylibs directory:

$ tar cvzf mylibs.tar.gz mylibs

On the VM

Transfer mylibs.tar.gz to the VM and unpack it. In some cases this will need to be done by a member of Research Computing. To unpack the file use:

$ tar xzf mylibs.tar.gz

Then do:

$ export R_LIBS_USER=<path/to>/mylibs
$ R

You should be able to load the libraries in R. If you encounter problems use .libPaths() to check the paths that are searched for R libraries. You can also look inside the mylibs directory to check for the existence of certain packages and their dependencies.
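Looking inside the library directory can be done with a short loop. The sketch below relies on the fact that each installed R package is a subdirectory containing a DESCRIPTION file; the function name list_r_packages is hypothetical.

```shell
# List the packages found in an R library directory (default: mylibs).
# A subdirectory counts as a package if it contains a DESCRIPTION file.
list_r_packages() {
    lib_dir="${1:-mylibs}"
    for pkg_dir in "$lib_dir"/*/; do
        if [ -f "${pkg_dir}DESCRIPTION" ]; then
            basename "$pkg_dir"
        fi
    done
}

# Example: check that a package and one of its dependencies were copied over
#   $ list_r_packages mylibs | grep -x -e dplyr -e rlang
```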

 

Parallel R

For directions on building Rmpi and approaches to parallelizing R scripts see this workshop repository. If you are using Rmpi on Della and you find that jobs hang, try adding this line to the end of your R script:

Rmpi::mpi.quit()

 

FAQ

1. I tried to install an R package but the installation failed with this error message: for loop initial declarations are only allowed in C99 mode. What should I do?

This problem can be solved by loading a newer version of GCC. To do this, before starting R, run this command on the command line: module load gcc-toolset/12

2. Nothing is working properly and I want to delete all my R packages and start over. How do I do this?

To delete all of your R packages and related files, run this command:

$ rm -rf ~/R ~/.R ~/.rstudio* ~/.Rhistory ~/.Rprofile ~/.RData

You may also need to remove lines from your ~/.bashrc file if you added or modified environment variables.

3. How do I see where the R packages are installed?

The paths to system and user packages can be seen with this R command:

> .libPaths()

To specifically see the path to user packages use:

> Sys.getenv("R_LIBS_USER")

4. How do I see my installed packages?

All base and user packages can be listed with the R command:

> installed.packages()

5. How do I see the default packages?

Default packages can be listed with the R command:

> getOption("defaultPackages")

6. I have a list of packages. How do I install them all at once?

> install.packages(c("<package-name-1>", "<package-name-2>", ...))

7. How do I remove a package?

A package can be removed with the command:

> remove.packages("<package-name>")

8. I have the source code for a package I want to install. How do I perform the installation?

For Rmpi, for instance, use this command on the command line:

$ R CMD INSTALL -l ~/.local/lib --no-test-load Rmpi_0.7-1.tar.gz

Then start R and do:

> library("Rmpi", lib.loc="/home/<YourNetID>/.local/lib")

Or one could set an R environment variable in ~/.bashrc:

export R_LIBS=~/.local/lib:$R_LIBS

9. How can I solve the following error?

Installing package into ‘/home/ceisgrub/R/x86_64-redhat-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
Error in structure(.External(.C_dotTclObjv, objv), class = "tclObj") :
  [tcl] grab failed: window not viewable.

R is trying to display a list of mirrors via X11 forwarding so try unsetting DISPLAY before starting R:

$ unset DISPLAY
$ R
> install.packages("<package-name>")

10. Which BLAS/LAPACK library is R using?

This information is available by running the following command:

> sessionInfo()

11. How should I deal with this error: "ERROR: failed to lock directory '/home/aturing/R/x86_64-redhat-linux-gnu-library/4.0' for modifying. Try removing '/home/aturing/R/x86_64-redhat-linux-gnu-library/4.0/00LOCK-data.table'"?

Follow the suggestion which says to remove a specific directory. This can be done with:

$ rm -rf /home/aturing/R/x86_64-redhat-linux-gnu-library/4.0/00LOCK-data.table

Be sure to use the path from your own case. If you are using RStudio on MyAdroit or MyDella then this command must be run on the command line. From the OnDemand main menu, click on "Clusters" and then "<Name> Cluster Shell Access". This will take you to a black terminal window where you can run the command.
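Before removing anything, it can help to list every leftover lock directory in the library. The sketch below uses find with the -maxdepth option (available in GNU and BSD find); the function name find_r_locks is hypothetical and nothing is deleted.

```shell
# List leftover 00LOCK directories at the top level of an R library.
# This only prints paths; removal is still a separate, deliberate step.
find_r_locks() {
    lib_dir="$1"
    find "$lib_dir" -maxdepth 1 -type d -name '00LOCK*'
}

# Example, using the library path from the error message above:
#   $ find_r_locks /home/aturing/R/x86_64-redhat-linux-gnu-library/4.0
```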

12. How should I deal with this error: "Error: protect(): protection stack overflow."?

This error can occur when working with large data files. We are not aware of a solution for RStudio on MyAdroit or MyDella, but if you submit a job to the Slurm scheduler you can call Rscript in this way:

Rscript --max-ppsize=500000 myscript.R

13. What does "Status code 502" mean when using OnDemand RStudio?

This error can arise when a user tries to run two commands at once. Try letting each command finish before running the next. For instance, if the script is executing then do not try to also save the R file. The error can also be an indication of requesting insufficient RAM. Try starting a new session with more RAM via the field "Memory allocated for the job, in GBs".

14. When using OnDemand, I am presented with a prompt reading "Sign in to RStudio" with a username and password field. The line above the prompt reads: "Error: Temporary server error, please try again".

Have you modified the JavaScript settings of your web browser? Or have you installed a plugin recently? Try using a different web browser.

15. In OnDemand, after I click on “Connect to RStudio Server” it takes a very long time to connect. How do I solve this?

This is typically explained by past suspended sessions. Try stopping all OnDemand sessions and then run the command below on the command line:

$ rm -rf /home/<YourNetID>/.local/share/rstudio/sessions/active

 

Getting Help

For Data Analysis

For help on using R with data analysis please see the DSS website. DSS offers online tutorials and training for performing data analysis with R as well as one-on-one appointments.

For R Work on the Research Computing Clusters

If you encounter any difficulties while working with R on the Research Computing clusters then please send an email to [email protected] or attend a walk-in help session.

Additional R Resources at Princeton

To see other centers with R resources on campus, view the exploringr.princeton.edu website.