Research Software Engineers (RSEs) work on a variety of projects with their partner academic departments. A selection of projects carried out by RSEs is described below.
Josh Carmichael, Research Software Engineer
Chris Langfield, Research Software Engineer
PI: Amit Singer, Program in Applied and Computational Mathematics
ASPIRE (Algorithms for Single Particle Reconstruction) is an open-source Python library that aims to provide a pipeline for processing cryo-EM data. It represents years of cumulative work in mathematics, signal processing, and algorithm design by researchers and students from Professor Amit Singer’s group at Princeton. The ASPIRE RSE team is unifying those efforts into a software framework that can be used by the cryo-EM community at large, theoreticians and experimentalists alike. The package implements significant advances made by Professor Singer and colleagues in the cryo-EM field. These include contributions to image denoising and correction, particle picking, class averaging, and 3D volume estimation. Collaboration with the Flatiron Institute on the FINUFFT non-uniform fast Fourier transform tool has been critical, as this algorithm is at the core of ASPIRE’s code.
Josh’s contributions include refactoring code for generating simulated molecules to include molecules with cyclic symmetry, porting methods for ab initio reconstruction of cyclically symmetric molecules from MATLAB to Python, and adding a Sphinx-Gallery extension to ASPIRE’s documentation to include example scripts demonstrating the functionality of various components of the ASPIRE software package.
Chris has focused on creating a validation system for ASPIRE code against large cryo-EM datasets from the Electron Microscopy Public Archive (EMPIAR). As ASPIRE comes closer to becoming an end-to-end pipeline, this process is crucial to ensure the reliability and accuracy of the package for general use in the field. He has also contributed to streamlining file I/O components, expanding the suite of tests, and improving the numerical code used to manipulate images in ASPIRE’s unique Fourier-space representations.
View ASPIRE’s documentation at: https://computationalcryoem.github.io/ASPIRE-Python
Bill Hasling, Research Software Engineer
Amy Defnet, Research Software Engineer
PI: Laura Condon (University of Arizona)
Co-PI: Reed Maxwell, Civil and Environmental Engineering (Princeton University)
HydroGEN is a web-based machine learning (ML) platform to generate custom hydrologic scenarios on demand. It combines powerful physics-based simulations with ML and observations to provide customizable scenarios from the bedrock through the treetops. Without any prior modeling experience, water managers and planners can directly manipulate state-of-the-art tools to explore scenarios that matter to them. HydroGEN is funded by a National Science Foundation grant as a joint project with Princeton University and University of Arizona.
Created the software architecture for the web-based application in consultation with CyVerse, another partner with the University of Arizona. The implementation is a microservice-based architecture using Docker components and a NATS message bus, designed to be flexible and portable to other data centers if needed. We use Keycloak and OAuth 2.0 for login and to secure REST-based APIs. Deployment is handled with Kubernetes, and the user interface is developed with React and Material Design. Created a flexible model for web-based visualizations and established good software engineering practices for logging, unit testing, code quality, and development/QA/production environments. Created Python-based data ingestion pipelines to collect, clean, and locally store external observation data via several government agencies' APIs. These data will be used as input to models that develop novel approximations of features for selected watersheds. Through the use of database tables, newly available data can be regularly queried and metadata about the locally stored data can be easily obtained.
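The idea of using database tables to find newly available data and to report metadata about what is stored locally can be sketched as follows. This is a minimal illustration only: it assumes SQLite, and the table, column, site, and variable names are hypothetical. The actual HydroGEN schema, database, and agency APIs are not described here.

```python
import sqlite3

def init_db(conn):
    # Hypothetical schema: one row per observation, keyed by site, date, variable.
    # Dates are ISO 8601 strings, so string comparison matches date order.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS observations (
               site_id TEXT NOT NULL,
               obs_date TEXT NOT NULL,
               variable TEXT NOT NULL,
               value REAL,
               PRIMARY KEY (site_id, obs_date, variable)
           )"""
    )

def store(conn, rows):
    # INSERT OR IGNORE keeps repeated ingestion runs from duplicating rows.
    conn.executemany(
        "INSERT OR IGNORE INTO observations VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()

def latest_date(conn, site_id, variable):
    # Metadata query: most recent date already stored locally, or None.
    row = conn.execute(
        "SELECT MAX(obs_date) FROM observations "
        "WHERE site_id = ? AND variable = ?",
        (site_id, variable),
    ).fetchone()
    return row[0]

def new_rows(conn, fetched):
    # Keep only fetched records newer than what is already stored.
    fresh = []
    for site_id, obs_date, variable, value in fetched:
        last = latest_date(conn, site_id, variable)
        if last is None or obs_date > last:
            fresh.append((site_id, obs_date, variable, value))
    return fresh

conn = sqlite3.connect(":memory:")
init_db(conn)
store(conn, [("site-A", "2023-01-01", "streamflow", 1.5)])

# Stand-in for records returned by an external API call.
fetched = [
    ("site-A", "2023-01-01", "streamflow", 1.5),  # already stored
    ("site-A", "2023-01-02", "streamflow", 1.7),  # newly available
]
fresh = new_rows(conn, fetched)
store(conn, fresh)
```

Storing the latest date per site and variable lets each scheduled run request only what is missing, rather than downloading the full record again.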
Vineet Bansal, Senior Research Software Engineer
PI: Mona Singh, Department of Computer Science (Genomics)
Over the years, students at "SinghLab" under Prof. Mona Singh have developed several algorithms for "domain" identification on protein sequences. Domains are segments in the protein chain that have largely evolved independently, internally maintain their structure, and are thus useful units for analyses. Further, multiple students at SinghLab have developed algorithms that are able to identify regions within a domain that are ripe for binding (with ligands). This allows practitioners in the field to target regions of the protein that are most likely to be susceptible to a reaction. Prof. Singh wanted an integrated web application that allowed users to dine a la carte on these several approaches developed over the years. This effort would also help polish and document code developed by graduate students.
We took several independent codebases developed over time, some in Python and others in Perl, and developed an integrated Web Portal for Protein Domain analysis. This allows users to run these algorithms on their protein sequences, with no programming or infrastructure requirements. The web application is hosted at Research Computing at Princeton. In the process of developing this web application, we also streamlined the data-processing pipeline, making it easier for future SinghLab researchers to add their own algorithms for domain identification and ligand-binding scoring.
George Artavanis, Research Software Engineer
Calla E. Chennault (2020-2022), Research Software Engineer
PI: Reed Maxwell, Civil and Environmental Engineering
HydroFrame (hydroframe.org) is a platform for interacting with national hydrologic simulations. The goal of the platform is to simplify user interaction with large, computationally intensive hydrologic models and their massive simulated outputs. Currently, HydroFrame’s Beta Release allows users to subset inputs from national models, generate a model domain and a run script from these inputs, and run a small runoff test. The next version of HydroFrame aims to enhance its model subsetting options and model outputs, as well as provide more extensive and customizable simulations with more flexible analysis/visualization.
Implemented a web endpoint on Princeton hardware which will connect to the existing HydroFrame subsetter and allow users to select from a range of workflows concerning their watershed. After launching the endpoint from the subsetter, users will be able to: launch and run a pre-populated, customizable ParFlow model; interact with model outputs and modify inputs to launch a rerun; review previous runs and their parameter specifications; and launch a Jupyter notebook. This web interface will help to remove initial barriers for use and development of national water models.
Garrett Wright, Senior Research Software Engineer
PI: Amit Singer, Department of Mathematics; Alex Barnett, Flatiron Institute
The non-uniform fast Fourier transform (NUFFT) is a core numerical method underlying ASPIRE's algorithms, and it dominates computational time in current applications. ASPIRE depends directly on external packages to provide portable, validated, high-performance implementations of this method. To facilitate this, we have been collaborating closely with the Flatiron Institute, home to the state-of-the-art open-source FINUFFT implementation. In this collaboration, PACM has contributed directly to the highly optimized CPU package FINUFFT and to the CUDA-based GPU package cuFINUFFT.
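For readers unfamiliar with the method, the sketch below evaluates a one-dimensional type-1 non-uniform discrete Fourier transform by direct summation. It is not FINUFFT's API or algorithm: the function name and the sample data are made up for illustration, and the mode ordering is an assumed convention. A fast library computes an approximation to the same kind of sum to a chosen tolerance at much lower cost than this direct loop.

```python
import cmath

def nudft_type1(points, strengths, n_modes):
    """Direct type-1 non-uniform DFT (illustrative, slow).

    Computes f[k] = sum_j strengths[j] * exp(1j * k * points[j])
    for integer frequencies k = -(n_modes // 2), ..., in increasing order.
    Cost is O(len(points) * n_modes).
    """
    k_min = -(n_modes // 2)
    return [
        sum(c * cmath.exp(1j * k * x) for x, c in zip(points, strengths))
        for k in range(k_min, k_min + n_modes)
    ]

# Irregularly spaced sample locations in [-pi, pi) with complex strengths.
points = [-2.0, -0.3, 0.1, 1.7]
strengths = [1.0 + 0.0j, 0.5j, -0.25 + 0.0j, 2.0 + 0.0j]
modes = nudft_type1(points, strengths, n_modes=8)
```

Because the sample points are not on a regular grid, a standard FFT cannot be applied directly, which is why a dedicated fast algorithm matters when this sum dominates runtime.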
Contributions include refactoring and developing the C/C++/CUDA code to support the following features: build system abstractions, dual precision support, a new API for cuFINUFFT, Python C bindings, pip packaging, creation of CUDA-backed binary distribution wheels, and initial efforts toward automated CI/CD. Early drafts of this work were used by the ASPIRE team at the 2020 Princeton GPU Hackathon to achieve speedups of 2x to 10x as a proof of concept. The results of this collaboration have been fully integrated into ASPIRE-Python as of v0.6.2.
Packages are proudly open source and can be found here:
Troy Comi, Research Software Engineer
PI: Josh Akey, The Lewis-Sigler Institute of Integrative Genomics
Admixture has played a prominent role in shaping patterns of human genomic variation, including gene flow with now-extinct hominins like Neanderthals and Denisovans. IBDmix is a novel probabilistic method that identifies introgressed hominin sequences and, unlike existing approaches, does not use a modern reference population.
I fully refactored the exploratory codebase to use modern C++, a CMake build system, and unit testing with GoogleTest. The original algorithm was replaced with a streaming implementation that leverages a custom stack class tuned for rapid push/pop cycles to limit object creation. Overall, runtimes remained fast while memory usage decreased from O(n) to O(1). Outputs from the original code are used for regression and acceptance tests, which run with GitHub Actions on each push. The entire workflow is encapsulated in a Snakemake pipeline to demonstrate how each component interacts and to reproduce the published dataset.
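The general idea behind moving from O(n) to O(1) memory can be illustrated with a generic streaming pass. The sketch below is not IBDmix's algorithm and does not reproduce its stack-based logic or scoring. The task (finding the highest-scoring contiguous run in a stream of numbers) and all names are stand-ins chosen only to show how a computation can consume values one at a time while keeping a fixed number of variables.

```python
def best_segment(scores):
    """Return (best_sum, start, end) of the highest-scoring contiguous run.

    `scores` may be any iterable, including a generator that reads a file
    line by line. Only a few scalars are kept, so memory use is O(1).
    """
    best_sum, best_start, best_end = float("-inf"), -1, -1
    run_sum, run_start = 0.0, 0
    for i, score in enumerate(scores):
        if run_sum <= 0.0:
            # A non-positive running total cannot help; restart here.
            run_sum, run_start = score, i
        else:
            run_sum += score
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i
    return best_sum, best_start, best_end

def score_stream():
    # Hypothetical per-site scores, produced one at a time.
    yield from [-1.0, 2.0, 3.0, -4.0, 5.0, 1.0, -2.0]

result = best_segment(score_stream())
```

A batch version would first load every score into a list, which is where the O(n) memory cost comes from. The streaming version never holds more than the current value and its running totals.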
Code available at: github.com/PrincetonUniversity/IBDmix
Abhishek Biswas, Research Software Engineer
PI: Danelle Davenport, Department of Molecular Biology
Tissue Analyzer, an ImageJ plugin, is used by researchers in the Davenport Lab to process confocal tissue images and generate cell segmentation masks and cell polarities. The cell polarities can be used by the tool PackAttack2.0 to generate a plot of the polarity orientations for the whole image. However, for certain types of analysis, the researchers wanted to visually show local polarity hotspots for cell patches of various diameters.
Implemented cell polarity visualization over multiple concentric cell patches in PackAttack2.0. The local polarity hot spots in the images of fluorescently labeled cells of the epidermis can now be clearly shown as heatmaps and help answer questions about changes in local cell polarity. The images below show the local cell polarity heatmaps for cell patches of diameter 1, 2 and 4 cells. The high polarity hotspots can be clearly seen in stronger shades of red.
Ian Cosden, Director, Research Software Engineering for Computational & Data Science
Software forms the backbone of much current research in a variety of scientific and engineering domains. The breadth and sophistication of software skills required for modern research software projects is increasing at an unprecedented pace. Despite this fact, an alarming number of researchers who develop software do not have adequate training in software development. Therefore, it is imperative for researchers who develop the software that will drive tomorrow’s critical research discoveries to have access to software engineering training at multiple stages of their career, to not only make them more productive, but also to make their software more robust, reliable, and sustainable. INTERSECT (INnovative Training Enabled by a Research Software Engineering Community of Trainers) provides training on software development and engineering practices to research software developers who already possess an intermediate or advanced level of knowledge. INTERSECT, through training events and open-source material, is building a pipeline of computational researchers trained in best practices for research software development.
Project website: www.intersect-training.github.io
Sangyoon Park, Research Software Engineer (Data-Driven Social Science)
PI: Sylvain Chassang (Princeton University Department of Economics)
Co-PI: Laura Boudreau (Columbia Business School)
Survey participants often feel reluctant to share their true experiences because they are worried about potential retaliation if their responses are identified (e.g., through data leakage). This is especially the case for sensitive survey questions such as those asking about sexual harassment in the workplace. As a result, survey administrators (e.g., company management, researchers) often get an inaccurate picture of reality, which makes it hard to devise an appropriate course of action.
Safely Report is a survey web application that can provide plausible deniability to survey respondents by recording survey responses with noise. For instance, when asking a worker whether they have been harassed by a manager, the application can be set up to record the answer "yes" with a probability of 30% even if the worker responds "no". This makes it nearly impossible to correctly identify which of the responses recorded as "yes" are truthful reports, even if the survey results are leaked. Yet the survey designer can still know the proportion and other statistics of truthful reports because the application tracks the number of cases (but not the cases themselves) where noise injection has happened. Consequently, survey participants feel safer and become more willing to share their true experiences, which has been confirmed by a relevant study.
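The noise-injection mechanism described above can be sketched in a few lines. This is an illustration under stated assumptions, not Safely Report's implementation: the function name is hypothetical, the respondent data are made up, and it assumes that "noise injection" means flipping a "no" answer to "yes" and that only the total number of such flips is stored.

```python
import random

def record_responses(true_answers, noise_prob, rng):
    """Record yes/no answers, flipping a "no" to "yes" with probability noise_prob.

    Returns the recorded answers and the number of flips. Only the count is
    kept, never which respondents were flipped.
    """
    recorded, flips = [], 0
    for answer in true_answers:
        if not answer and rng.random() < noise_prob:
            recorded.append(True)
            flips += 1
        else:
            recorded.append(answer)
    return recorded, flips

rng = random.Random(0)  # seeded only so the sketch is reproducible
true_answers = [True] * 20 + [False] * 80  # hypothetical: 20 of 100 truly "yes"
noise_prob = 0.30
recorded, flips = record_responses(true_answers, noise_prob, rng)

recorded_yes = sum(recorded)
# Exact recovery of the number of truthful "yes" reports from the flip count.
truthful_yes = recorded_yes - flips
# Without the count, the rate can still be estimated from noise_prob alone,
# since P(recorded yes) = p_true + (1 - p_true) * noise_prob.
estimated_rate = (recorded_yes / len(recorded) - noise_prob) / (1 - noise_prob)
```

Any individual recorded "yes" may be either a truthful report or injected noise, which is the source of plausible deniability, while the aggregate statistics remain recoverable.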
Safely Report aims to provide interested researchers with an open-source tool (available under an MIT license) that implements secure survey techniques developed by Sylvain Chassang (Princeton) and Laura Boudreau (Columbia), so that researchers can more easily adapt and use these techniques in their own work. The software supports XLSForm, an Excel-based survey specification standard widely used by researchers to design and conduct complex surveys, so it can integrate well with the existing user base.
Safely Report offers several advantages over existing XLSForm-compliant survey tools:
- New Security Features. Foremost, it supports novel secure survey techniques, which are more difficult to implement in other survey tools.
- Technically Accessible. It is a lightweight Python-based application, so researchers may adapt and deploy it fully on their own.
- Free to Use. It is completely free, unlike some other survey tools that operate under paid plans (e.g., SurveyCTO).
The software is under active development at the moment and is planned to be open sourced in May 2024.