Sharing Data Outside of Princeton

This page describes ways that users of the Princeton HPC clusters can share their data beyond the university. To share data with another Princeton HPC user, see Sharing Data with Other Users.

It is important to keep in mind the distinction between sharing and publishing. Sharing typically means making data available to collaborators while a project is underway. Publishing involves handing the data over to a publisher, which assigns a unique ID to the data and makes it available on a long-term basis. Publishing data often has more advantages than simply sharing it.

 

Common Terminal Tools

The starting point for transferring data from a Princeton system to elsewhere is the set of common terminal tools: scp, sftp, rsync and others. These tools are a good choice for anything from individual files up to entire data sets of tens of gigabytes. If you are on a high-speed network then you will be able to transfer hundreds of gigabytes in a reasonable amount of time (less than a day).

The commands below provide an example of transferring a single file on tiger to a second account at ORNL:

$ ssh <YourNetID>@tiger.princeton.edu
$ cd /scratch/gpfs/<YourNetID>
$ scp file.dat <Username>@dtn.ccs.ornl.gov:/data/
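
For a whole directory rather than a single file, rsync is often more convenient than scp because a rerun only sends what is missing. The sketch below is illustrative only: the source and destination paths are placeholders, and the RUN switch and rsync_command helper are hypothetical conveniences, not part of rsync or of any Princeton tooling. By default the script prints the command so it can be inspected before anything is transferred.

```shell
#!/bin/sh
set -eu

# Placeholder locations (assumptions, not real paths); override with SRC and DEST.
SRC="${SRC:-/scratch/gpfs/<YourNetID>/results/}"
DEST="${DEST:-<Username>@dtn.ccs.ornl.gov:/data/results/}"

# Print the rsync command that would be run.
# -a preserves timestamps and permissions, -v lists each file,
# --partial keeps partially transferred files so a rerun can reuse them.
# The trailing slash on SRC copies the directory contents, not the directory itself.
rsync_command() {
    echo "rsync -av --partial $SRC $DEST"
}

if [ "${RUN:-0}" = "1" ]; then
    # Perform the transfer for real.
    rsync -av --partial "$SRC" "$DEST"
else
    # Default: only show the command.
    rsync_command
fi
```

Setting RUN=1 in the environment performs the transfer; leaving it unset is a safe way to check the paths first.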

To learn more about scp and other common tools see this page. Consider compressing your data using gzip or bzip2 before transferring it.
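
As a rough sketch of the compression step, the script below bundles a directory into a single gzip-compressed archive and records a checksum that the recipient can use to verify the copy. The directory name results, the file run1.csv and the temporary working directory are made up for illustration, and sha256sum is assumed to be available, as it is on most Linux systems.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-in data set created in a temporary directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/results"
printf 'step,energy\n1,-0.5\n' > "$workdir/results/run1.csv"

# Bundle the directory into one gzip-compressed archive.
archive="$workdir/results.tar.gz"
tar -czf "$archive" -C "$workdir" results

# Record a checksum so the recipient can verify the transferred file.
(cd "$workdir" && sha256sum results.tar.gz > results.tar.gz.sha256)

# List the archive contents as a quick sanity check.
tar -tzf "$archive"

# The archive and checksum file would then be sent as in the scp example above, e.g.
#   scp "$archive" "$archive.sha256" <Username>@dtn.ccs.ornl.gov:/data/
# and checked on the receiving side with
#   sha256sum -c results.tar.gz.sha256
```

Compression helps most with text-like data; files that are already compressed (many image and video formats, for example) gain little from it.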

 

Globus

Globus uses multiple streams to transfer data, giving it a significant performance advantage over the common terminal tools and making it the obvious choice for very large transfers. However, Globus can only be used when it is available at both endpoints. Learn more about Globus at Princeton.

In some cases it may be necessary for an external collaborator to be given a temporary account at Princeton in order to make a transfer. In this scenario, a Princeton faculty or staff member would need to request a Research Computing User (RCU) account on their behalf from OIT. Once the account has been created, a second request can be made for a faculty-sponsored account on the Research Computing clusters.

 

DataSpace

DataSpace offers long-term storage and publication options for datasets, visualizations or reports.

 

Rclone

Rclone provides an option for transferring data to cloud storage, such as Dropbox or Google Drive. It is already installed on the login nodes, but users must configure it for their cloud storage provider before using it. The configuration instructions for Dropbox are at https://rclone.org/dropbox/ and those for Google Drive are at https://rclone.org/drive/.
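
Once a remote has been set up, a typical upload might look like the sketch below. The remote name mydropbox, the source path and the upload_results helper are all hypothetical; the remote name is whatever was chosen during the interactive rclone config step. The script uploads only if rclone is installed and the remote is already configured, and otherwise prints a reminder.

```shell
#!/bin/sh
set -eu

# Hypothetical remote name and source directory; override with REMOTE and SRC.
REMOTE="${REMOTE:-mydropbox}"
SRC="${SRC:-/scratch/gpfs/<YourNetID>/results}"

upload_results() {
    # "rclone listremotes" prints each configured remote followed by a colon.
    if command -v rclone >/dev/null 2>&1 && rclone listremotes | grep -qx "$REMOTE:"; then
        # Copy the contents of SRC into the "results" folder on the remote.
        rclone copy "$SRC" "$REMOTE:results" --progress
    else
        echo "Remote '$REMOTE' is not configured; run: rclone config"
    fi
}

upload_results
```

Running rclone lsd followed by the remote name and a colon (for example, mydropbox:) lists the top-level folders on the remote, which is a quick way to confirm that the configuration works before uploading.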

 

Publishing Large Datasets

See this guide by Princeton Research Data Service.

 

tigress-web

See this page to learn how to make files in /tigress or /projects available such that external collaborators can access them using a web browser or the wget command, for instance. Note that there are a few constraints associated with this approach.