Available on all clusters via various filesystems

IMPORTANT: Research Computing is not able to grant quota increases on /projects. Users will need to use the /scratch/gpfs filesystems, which are not backed up. For additional storage needs, consider whether TigerData is the right fit for you.

Overview

Research Computing maintains several high-performance computing clusters that enable users to build codes, run jobs, and store the results. A new data management service called TigerData provides a suite of software tools and scalable, tiered storage to enable the organization, description, storage, sharing, and long-term sustainability of research and administrative data.

The schematic diagrams below show the large clusters and the filesystems that are available to each.

Della

Della is a large, multi-purpose cluster composed of both CPU and GPU nodes.

Data Storage for Della

 

Tiger

Tiger is a large CPU cluster for parallel, multinode jobs. It offers a small number of GPU nodes. The /projects and /tigerdata filesystems are not available on the compute nodes of Tiger.

Schematic Diagram of Tiger Cluster

 

Stellar

Stellar is a large CPU cluster for parallel, multinode jobs. It offers a small number of GPU nodes.

Data Storage for Stellar

 

Filesystem Details

Important details about each filesystem are listed below:

  • /home/<YourNetID>
    • The /home directory of a user is for source code, executables, Conda environments, R packages, Julia packages, and small data sets. A short example of creating a Conda environment in /home is sketched after this list.
    • The /home directory of each user is backed up with the exception of the .conda, .cache and .vscode directories.
  • /scratch/gpfs/<YourNetID> (or /scratch/network/<YourNetID> on Adroit)
    • The /scratch/gpfs directory of a user is for job input and output files and for storing intermediate results.
    • The /scratch/gpfs filesystem is a fast, parallel filesystem that is local to each cluster, which makes it ideal for storing job input and output files. However, because /scratch/gpfs is not backed up, you will need to transfer your completed (non-volatile) job files to /projects or /tigerdata for long-term storage. The files belonging to a user in /scratch/gpfs are not purged until many months after the user has left the university. Write to [email protected] with questions about purging. See our Data Transfer page to learn how to move your data.
  • /projects/<ResearchGroup>/
    • Access to these directories is not granted automatically. Your advisor must request that you get access to their /projects directory.
    • Quotas are being reduced on /projects. Increase requests are likely to be denied.
    • /projects is used for the long-term storage of final job output.
    • The /projects storage system is shared by the large clusters via a single, slow connection. It is designed for non-volatile files (i.e., files that do not change over time). For these reasons, one should never write the output of actively running jobs to /projects. Doing so may adversely affect the work of other users and may cause your jobs to run inefficiently. Instead, write your output to /scratch/gpfs/<YourNetID> and then, after the jobs complete, copy or move the output to /projects if a backup is desired. Note that /projects is not available on the compute nodes of Tiger.
    • The filesystem is implemented with GPFS (General Parallel File System), a commercial IBM filesystem designed for high reliability and high-performance computing clusters. The data and files on /projects are backed up, so files can be recovered if needed.
    • Only the latest copy of each file on /projects is kept on the backup server; no previous versions are retained. These active files are kept on the backup server indefinitely. Should a file be deleted from /projects, it will remain on the backup server for 10 days, after which it is removed.
  • /tigerdata/<DirectoryName>
    • TigerData is a new data management service that includes a suite of software tools and scalable, tiered storage to enable the organization, description, storage, sharing, and long-term sustainability of research and administrative data.
    • Visit the TigerData website to learn more.
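As a concrete illustration of the /home item above, here is a minimal sketch of creating a Conda environment that lives in /home. The module name and version are placeholders; run "module avail anaconda3" to see what is installed on your cluster:

$ ssh <YourNetID>@della.princeton.edu
$ module load anaconda3/2024.6            # placeholder module/version; check "module avail"
$ conda create --name myenv python numpy  # environment is typically stored under /home/<YourNetID>/.conda
$ conda activate myenv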

Two additional important points:

  • /tmp (not shown in the figures) is local scratch space that exists on each compute node for high-speed reads and writes. If file I/O is a bottleneck in your code, or if you need to store temporary data, consider using it (see the sketch after this list).
  • Data that has been classified as level 0 (public) or level 1 (unrestricted) is allowed on the Research Computing clusters and can be stored within TigerData. For level 2 (confidential) and level 3 (restricted) data, one must use the Secure Research Infrastructure. Learn more about data classification.
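The /tmp point can be illustrated with a rough sketch (the file names and program below are made up for illustration): inside a Slurm script, stage the input to node-local /tmp, run there, and copy the results back to /scratch/gpfs before the job ends, since /tmp is not shared between nodes and is not permanent:

# inside job.slurm (illustrative only)
mkdir -p /tmp/$USER/myjob
cp /scratch/gpfs/$USER/myjob/input.dat /tmp/$USER/myjob/  # stage input on fast node-local disk
cd /tmp/$USER/myjob
./a.out input.dat > output.dat                            # heavy I/O happens on /tmp
cp output.dat /scratch/gpfs/$USER/myjob/                  # copy results back before the job ends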

Using the Filesystems

Let's say that you just got an account on one of the Research Computing clusters. Here is what you should do next to get started:

  1. Usually the first step is to install software in your /home directory. Most users begin by installing various packages in Python, R, or Julia. By default these packages will be installed in /home/<YourNetID>. If you need to build your code from source then transfer the source code to your /home directory and compile it. Build tools and additional software are available through environment modules. If the software you need is pre-installed, like MATLAB or Stata, then you are ready to proceed to the next step.
  2. With your software ready to be used, the next step is to run a job. The /scratch/gpfs filesystem on each cluster is the right place for storing job files. Create a directory in /scratch/gpfs/<YourNetID> (or /scratch/network/<YourNetID> on Adroit) and put the necessary input files and Slurm script in that directory (a minimal example Slurm script is sketched below).
  3. Submit the job to the scheduler using the sbatch command (see tutorial). If the run produces output that you want to back up then transfer the files to /projects. The commands below illustrate these steps:
$ ssh <YourNetID>@della.princeton.edu
$ cd /scratch/gpfs/<YourNetID>
$ mkdir myjob
$ cd myjob
# put necessary files and Slurm script in myjob
$ sbatch job.slurm
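The job.slurm file referenced above is not shown; a minimal example for a serial Python job might look like the following (the module name, environment name, and resource values are placeholders to adjust for your own code and cluster):

#!/bin/bash
#SBATCH --job-name=myjob         # short name for the job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks
#SBATCH --cpus-per-task=1        # cpu-cores per task
#SBATCH --mem-per-cpu=4G         # memory per cpu-core
#SBATCH --time=01:00:00          # total run time limit (HH:MM:SS)

module purge
module load anaconda3/2024.6     # placeholder module/version; check "module avail"
conda activate myenv             # environment installed in /home as described above

python myscript.py               # reads and writes files in the /scratch/gpfs job directory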

Your files in /scratch/gpfs are not backed up, so if you want to back up the output after a job finishes, copy or move the files to /projects using a command like the following:

$ cp -r /scratch/gpfs/<YourNetID>/myjob /projects/<YourAdvisorsLastName>/<YourDirectory>

In summary, install your software in /home, run jobs on /scratch/gpfs, and transfer final job output to /projects or /tigerdata for storage and backup. 

File Quota

Given the small size of /home, users often run out of space, which can lead to many issues. If you need to request more space then see the checkquota page. There are also directions on that page for finding and removing large files as well as dealing with large Conda environments.
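Before requesting more space, it often helps to see where the space is going. The checkquota command reports your usage, and standard tools like du can identify large directories (output details vary by cluster):

$ checkquota                                        # report usage and quotas on the cluster filesystems
$ du -h --max-depth=1 /home/<YourNetID> | sort -h   # size of each top-level directory in /home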

Note: The checkquota command is not compatible with TigerData. To check your TigerData quota, access the TigerData Web Portal or email [email protected].

Additional Details

The importance of not writing the output of actively running jobs to /projects was emphasized above. Reading files or calling executables from this filesystem is allowed for light workloads. However, in general, one will get better performance when using /scratch/gpfs or /home, so those filesystems should be preferred.

A volatile file is one that changes over time. Actively running jobs tend to create volatile files, such as a log file that records the progress of the run. One must avoid copying volatile files to /projects, since any subsequent change leads to the creation of a new backup. The long-term storage systems are for non-volatile files only. Only after a job has completed should the job output files be transferred from /scratch/gpfs to /projects.

There have been multiple failures of /scratch/gpfs in the past. In some cases data was lost. It is your responsibility to copy important files to /projects for backup. Note that once you have copied the files to the backup system, you can continue using them on /scratch/gpfs, where the I/O performance is optimal.
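A convenient way to back up completed, non-volatile results while continuing to work on /scratch/gpfs is rsync, which transfers only files that are new or have changed (the paths below are illustrative):

$ rsync -av /scratch/gpfs/<YourNetID>/myjob/ /projects/<YourAdvisorsLastName>/<YourDirectory>/myjob/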

Data Management

Imagine that two researchers had to leave their positions in a hurry. Based on their files, whose work would you rather continue?

Researcher 1:

.
├── figure1.pdf
├── figure2.pdf
├── file1.py
├── file2.py
├── file3.py
├── main.tex
├── out_a_10
├── out_a_20
├── out_a_30
├── out_b_1
├── out_b_2
├── out_b_3
├── refs.bib
├── output1.log
└── output2.log

Researcher 2:

.
├── code
│   ├── analysis.py
│   ├── main.py
│   ├── README
│   └── tests
│       └── test_analysis.py
├── data
│   ├── effect_of_length
│   │   ├── length.log
│   │   ├── system1_length_10.csv
│   │   ├── system1_length_20.csv
│   │   └── system1_length_30.csv
│   ├── effect_of_width
│   │   ├── width.log
│   │   ├── system1_width_1.csv
│   │   ├── system1_width_2.csv
│   │   └── system1_width_3.csv
│   └── README
├── manuscript
│   ├── figures
│   │   ├── length.pdf
│   │   └── width.pdf
│   └── text
│       ├── main.tex
│       └── refs.bib
└── README

Researcher 2 has done a much better job of arranging and documenting their files. They are using multiple directories, descriptive file names, and README files. The README file in the code directory may contain notes such as "The software was written by Alan Turing ([email protected]). Contact Alan if there are issues."

Researcher 1 is storing all of their files in one directory, the names of their files are not descriptive, and there are no README files. It would be difficult for a second person to continue with their work.

Good data management means:

  • Create a logical directory structure (a quick way to set up such a skeleton is sketched after this list)
  • Create README files liberally
  • Don't worry about writing polished text inside the files (be practical)
  • It is perfectly fine to create a README file in every directory (although for most projects this is unnecessary)
  • Think of these files as notes to your future self
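As a quick sketch, a skeleton like Researcher 2's can be created in a couple of commands (the project name is illustrative):

$ mkdir -p myproject/{code/tests,data,manuscript/{figures,text}}   # create the directory tree in one step
$ touch myproject/README myproject/code/README myproject/data/README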

There is much more to data management than using a proper directory structure and README files. See Princeton Research Data Services for more.

TigerData

TigerData is a new data management service that provides scalable and robust data management tools to the Princeton University community. 

TigerData currently supports data that does not require the active data movement or heavy manipulation that is typical of work on the computational clusters. Data stored in TigerData is managed according to data management principles for long-term storage, as dictated by research needs. Data is organized by projects (groupings of alike data) and can be accessed via the Research Computing cluster login nodes, NFS or SMB, and Globus.

To learn more or request access, visit the TigerData website.

FAQ