Research Software Engineers (RSEs) work on various projects with their partner academic departments. A selected list of projects executed by RSEs is described below.
Development of ASPIRE Python Package
Josh Carmichael, Research Software Engineer
Chris Langfield, Research Software Engineer
PI: Amit Singer, Program in Applied and Computational Mathematics
ASPIRE (Algorithms for Single Particle Reconstruction) is an open-source Python library which aims to provide a pipeline for processing cryo-EM data. It represents years of cumulative work in mathematics, signal processing, and algorithm design by researchers and students from Professor Amit Singer’s group at Princeton. The ASPIRE RSE team is unifying those efforts into a software framework that can be used by the cryo-EM community at large: theoreticians and experimentalists alike. The package implements significant advances made by Professor Singer and colleagues in the cryo-EM field. This includes contributions to image denoising and correction, particle-picking, class-averaging, and 3D volume estimation. Collaboration with the Flatiron Institute on the FINUFFT non-uniform fast Fourier transform tool has been critical, as this algorithm is at the core of ASPIRE code.
Josh’s contributions include refactoring code for generating simulated molecules to include molecules with cyclic symmetry, porting methods for ab initio reconstruction of cyclically symmetric molecules from MATLAB to Python, and adding a Sphinx-Gallery extension to ASPIRE’s documentation to include example scripts demonstrating the functionality of various components of the ASPIRE software package.
Chris has focused on creating a validation system for ASPIRE code against large cryo-EM datasets from the Electron Microscopy Public Image Archive (EMPIAR). As ASPIRE comes closer to becoming an end-to-end pipeline, this process is crucial to ensure the reliability and accuracy of the package for general use in the field. He has also streamlined file I/O components, expanded the suite of tests, and contributed to the numerical code used to manipulate images in ASPIRE’s unique Fourier-space representations.
View ASPIRE’s documentation at: https://computationalcryoem.github.io/ASPIRE-Python
Bill Hasling, Research Software Engineer
Amy Defnet, Research Software Engineer
PI: Laura Condon (University of Arizona)
Co-PI: Reed Maxwell, Civil and Environmental Engineering (Princeton University)
HydroGEN is a web-based machine learning (ML) platform to generate custom hydrologic scenarios on demand. It combines powerful physics-based simulations with ML and observations to provide customizable scenarios from the bedrock through the treetops. Without any prior modeling experience, water managers and planners can directly manipulate state-of-the-art tools to explore scenarios that matter to them. HydroGEN is funded by a National Science Foundation grant as a joint project between Princeton University and the University of Arizona.
Created the software architecture for the web-based application in consultation with CyVerse, another partner of the University of Arizona. The implementation is a microservice-based architecture using Docker components and a NATS message bus, designed to be flexible and portable to other data centers if needed. We use Keycloak and OAuth 2.0 security for login and secure REST-based APIs. Deployment is handled with Kubernetes, and the user interface is developed with React and Material Design. Created a flexible model for web-based visualizations and established good software engineering practices for logging, unit testing, code quality, and development/QA/production environments.
Created Python-based data ingestion pipelines to collect, clean, and locally store external observations data via several government agencies' APIs. This data will be used as input to models that develop novel approximations of features for selected watersheds. Through the use of database tables, newly available data can be regularly queried and metadata about the locally stored data can be easily obtained.
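The ingestion steps described above (collect, clean, store locally, query metadata) can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions, not the HydroGEN pipeline: the table layout, the field names (site_id, date, value), the rule that negative readings are missing-data codes, and the sample records are all hypothetical, and the call to an agency API is replaced by an in-memory list.

```python
import sqlite3


def clean(record):
    """Return a (site_id, date, value) tuple, or None if the record is unusable."""
    try:
        site_id, date = record["site_id"], record["date"]
        value = float(record["value"])
    except (KeyError, TypeError, ValueError):
        return None
    if value < 0:  # assumption: negative readings are missing-data codes
        return None
    return (site_id, date, value)


def ingest(conn, records):
    """Store cleaned records; re-ingesting a site/date pair overwrites the old row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        "site_id TEXT, date TEXT, value REAL, PRIMARY KEY (site_id, date))"
    )
    rows = [row for row in map(clean, records) if row is not None]
    conn.executemany("INSERT OR REPLACE INTO observations VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def coverage(conn):
    """Metadata query: per-site record count and date range of the local store."""
    return conn.execute(
        "SELECT site_id, COUNT(*), MIN(date), MAX(date) "
        "FROM observations GROUP BY site_id ORDER BY site_id"
    ).fetchall()


# Hypothetical records standing in for an API response.
raw = [
    {"site_id": "A1", "date": "2021-01-01", "value": "12.5"},
    {"site_id": "A1", "date": "2021-01-02", "value": "n/a"},   # dropped: not a number
    {"site_id": "A1", "date": "2021-01-03", "value": "13.1"},
    {"site_id": "B2", "date": "2021-01-01", "value": "-999"},  # dropped: missing code
    {"site_id": "B2", "date": "2021-01-02", "value": "4.0"},
]

conn = sqlite3.connect(":memory:")
stored = ingest(conn, raw)
summary = coverage(conn)
```

Keying the table on site and date makes repeated ingestion runs idempotent, which suits a pipeline that polls for newly available data on a schedule.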
Bei Wang, Senior Research Software Engineer
PI: Peter Elmer, Senior Research Physicist, Princeton Physics Department
The reconstruction of charged particle trajectories (tracking) is a pivotal element of the reconstruction chain in the Compact Muon Solenoid (CMS) experiment, as it measures the direction and momentum of charged particles, which are then used as input for nearly all high-level reconstruction algorithms: vertex reconstruction, b-tagging, lepton reconstruction, jet isolation, and missing transverse momentum reconstruction. Tracking is by far the most time-consuming step of event reconstruction, and its time scales poorly with detector occupancy. The parallel Kalman Filter tracking project mkFit was established in 2014 with the goal of enabling efficient tracking on modern computing architectures. Over the last 6 years, significant progress has been made towards developing a parallelized and vectorized implementation of the combinatoric Kalman Filter algorithm for tracking [1]. This allows the efficient global reconstruction of the entire event within the projected online CPU budget.
Global reconstruction necessarily entails the unpacking and clustering of the hit information from all silicon strip tracker modules before the hits are processed by mkFit. The current CMS High-Level Trigger (HLT), on the other hand, performs hit and track reconstruction on demand, i.e., only for software-selected regions of the detector. To make the unpacking and clustering steps efficient enough that the hits can be processed all at once for the entire detector, a parallelized version of unpacking and clustering was developed for both multi-core CPUs and many-core GPUs [2]. Throughput is further improved by concurrently processing multiple events using nested OpenMP parallelism on CPU or CUDA streams on GPU. The new implementation, along with the earlier work on a parallelized and vectorized implementation of the combinatoric Kalman Filter algorithm, has enabled efficient global reconstruction of the entire event on modern computer architectures.
[1] S. Lantz et al. Speeding up particle track reconstruction using a parallel Kalman filter algorithm, Journal of Instrumentation, Volume 15, Sep 2020
[2] G. Cerati et al. Parallelizing the unpacking and clustering of detector data for reconstruction of charged particle tracks on multi-core CPUs and many-core GPUs, arXiv preprint arXiv:2101.11489, Jan 2021
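To make the clustering step mentioned above more concrete, here is a deliberately simplified, sequential sketch in Python. It is not the CMS or mkFit implementation: the function names, the (strip_index, charge) input layout, and the single charge threshold are assumptions for illustration only. It groups hits on neighbouring strips of one module into clusters, and treats each module independently, which is the property a parallel version can exploit.

```python
def cluster_strips(hits, threshold=0.0):
    """Group hits on neighbouring strips of one module into clusters.

    hits: iterable of (strip_index, charge) pairs with unique strip indices.
    Returns a list of (first_strip, last_strip, total_charge) tuples.
    """
    clusters = []
    current = None  # [first_strip, last_strip, total_charge] being built
    for strip, charge in sorted(hits):
        if charge <= threshold:
            continue  # hypothetical cut: ignore strips at or below threshold
        if current is not None and strip == current[1] + 1:
            # Adjacent to the open cluster: extend it.
            current[1] = strip
            current[2] += charge
        else:
            # Gap found: close the open cluster and start a new one.
            if current is not None:
                clusters.append(tuple(current))
            current = [strip, strip, charge]
    if current is not None:
        clusters.append(tuple(current))
    return clusters


def cluster_event(modules):
    """Cluster every module of one event; modules do not depend on each other."""
    return {module_id: cluster_strips(hits) for module_id, hits in modules.items()}


# Hypothetical event with two modules.
event = {
    "module_7": [(10, 5.0), (11, 7.0), (12, 3.0), (40, 9.0)],
    "module_9": [(3, 2.0), (4, 0.0), (5, 6.0)],
}
clusters = cluster_event(event)
```

Because no cluster spans two modules in this sketch, the per-module loop in cluster_event is the natural place to distribute work across CPU threads or GPU blocks.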
Vineet Bansal, Senior Research Software Engineer
PI: Mona Singh, Department of Computer Science (Genomics)
Over the years, students at "SinghLab" under Prof. Mona Singh have developed several algorithms for "domain" identification on protein sequences. Domains are segments in the protein chain that have largely evolved independently, internally maintain their structure, and are thus useful units for analyses. Further, multiple students at SinghLab have developed algorithms that are able to identify regions within a domain that are ripe for binding (with ligands). This allows practitioners in the field to target regions of the protein that are most likely to be susceptible to a reaction. Prof. Singh wanted an integrated web application that allowed users to dine a la carte on these several approaches developed over the years. This effort would also help polish and document code developed by graduate students.
We took several independent codebases developed over time, some in Python and others in Perl, and developed an integrated Web Portal for Protein Domain analysis. This allows users to run these algorithms on their protein sequences, with no programming or infrastructure requirements. The web application is hosted at Research Computing at Princeton. In the process of developing this web application, we also streamlined the data-processing pipeline, making it easier for future SinghLab researchers to add their own algorithms for domain identification and ligand-binding scoring.
Calla E. Chennault, Research Software Engineer
PI: Reed Maxwell, Civil and Environmental Engineering
HydroFrame (hydroframe.org) is a platform for interacting with national hydrologic simulations. The goal of the platform is to simplify user interaction with large, computationally intensive hydrologic models and their massive simulated outputs. Currently, HydroFrame’s Beta Release allows users to subset inputs from national models, generate a model domain and a run script from these inputs, and run a small runoff test. The next version of HydroFrame aims to enhance its model subsetting options and model outputs, as well as provide more extensive and customizable simulations with more flexible analysis/visualization.
Implemented a web endpoint on Princeton hardware which will connect to the existing HydroFrame subsetter and allow users to select from a range of workflows concerning their watershed. After launching the endpoint from the subsetter, users will be able to: launch and run a pre-populated, customizable ParFlow model; interact with model outputs and modify inputs to launch a rerun; review previous runs and their parameter specifications; and launch a Jupyter notebook. This web interface will help to remove initial barriers for use and development of national water models.
ASPIRE - NUFFT
Garrett Wright, Senior Research Software Engineer
PI: Amit Singer, Department of Mathematics; Alex Barnett, Flatiron Institute
The non-uniform fast Fourier transform (NUFFT) underlies ASPIRE’s algorithms and is a core numerical method that dominates computation time in current applications. ASPIRE depends directly on external packages to provide portable, validated, high-performance implementations of this method. To that end, we have been collaborating closely with the Flatiron Institute, home of the state-of-the-art open-source FINUFFT implementation. In this collaboration, PACM has contributed directly to FINUFFT, a highly optimized CPU package, and to cuFINUFFT, its CUDA-based GPU counterpart.
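For readers unfamiliar with the method, the sketch below evaluates directly the kind of sum that a NUFFT approximates. It uses only the Python standard library and does not call FINUFFT or ASPIRE; the function name, argument names, and the choice of sign and frequency range are assumptions following one common "type 1" (non-uniform points to uniform frequencies) convention.

```python
import cmath


def nudft_type1(points, strengths, n_modes):
    """Directly evaluate F[k] = sum_j c_j * exp(i * k * x_j).

    points: non-uniform locations x_j, e.g. in [-pi, pi).
    strengths: weights c_j, one per point.
    n_modes: number of integer frequencies k, taken as
             -(n_modes // 2), ..., -(n_modes // 2) + n_modes - 1.
    """
    k_min = -(n_modes // 2)
    return [
        sum(c * cmath.exp(1j * k * x) for x, c in zip(points, strengths))
        for k in range(k_min, k_min + n_modes)
    ]


# Three irregularly spaced samples mapped to eight Fourier modes.
modes = nudft_type1([-2.1, 0.3, 1.7], [1.0, 0.5 + 0.5j, -0.25], 8)
```

This direct evaluation costs on the order of (number of points) x (number of modes) operations. A NUFFT library returns the same values to a user-chosen tolerance at much lower cost, which is why it matters so much when the transform dominates run time.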
Contributions include refactoring and developing the C/C++/CUDA code to support the following features: build system abstractions, dual precision support, a new API for cuFINUFFT, Python C bindings, pip packaging, creation of (CUDA-backed) binary distribution wheels, and initial efforts toward automated CI/CD. Early drafts of this work were leveraged by the ASPIRE team at the 2020 Princeton GPU Hackathon to yield speedups of 2x to 10x as a proof of concept. Results of this collaboration are fully integrated in ASPIRE-Python as of v0.6.2.
Packages are proudly open source and can be found here:
Troy Comi, Research Software Engineer
PI: Josh Akey, The Lewis-Sigler Institute of Integrative Genomics
Admixture has played a prominent role in shaping patterns of human genomic variation, including gene flow with now-extinct hominins like Neanderthals and Denisovans. IBDmix is a novel probabilistic method that identifies introgressed hominin sequences and, unlike existing approaches, does not use a modern reference population.
I fully refactored the exploratory codebase to use modern C++, a CMake build system, and unit testing with GoogleTest. The original algorithm was replaced with a streaming implementation that leverages a custom stack class tuned for rapid push/pop cycles to limit object creation. Overall, runtimes remained fast while memory usage decreased from O(n) to O(1). Outputs from the original code serve as regression and acceptance tests, which are run with GitHub Actions on each push. The entire workflow is encapsulated in a Snakemake pipeline to demonstrate how each component interacts and to reproduce the published dataset.
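The move from O(n) to O(1) memory can be illustrated with a generic streaming pattern. The Python sketch below is not IBDmix's algorithm and is not taken from its C++ code: the function name and the idea of scanning per-site scores for the best contiguous run are hypothetical stand-ins. The point is only that a result can be produced while consuming the input one item at a time, keeping a few running values instead of the whole sequence.

```python
def best_segment(scores):
    """Return (start, end, total) of the highest-scoring contiguous run.

    scores: any iterable of numbers. It is consumed one item at a time,
    so memory use stays constant regardless of input length.
    """
    best = (None, None, float("-inf"))
    run_start, run_total = 0, 0.0
    for i, s in enumerate(scores):
        if run_total <= 0:
            # A non-positive running total cannot help: restart the run here.
            run_start, run_total = i, s
        else:
            run_total += s
        if run_total > best[2]:
            best = (run_start, i, run_total)
    return best


# Works on a generator, so the scores never need to be held in a list.
result = best_segment(x for x in [1, -2, 3, 4, -1, 2, -10, 5])
```

Feeding the function a generator (for example, one that yields a score per line of a large file) is what keeps memory flat; building a list first would reintroduce the O(n) cost.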
Code available at: github.com/PrincetonUniversity/IBDmix
Cell Patch Polarity Heatmaps
Abhishek Biswas, Research Software Engineer
PI: Danelle Davenport, Department of Molecular Biology
Tissue Analyzer, an ImageJ plugin, is used by researchers in the Davenport Lab to process confocal tissue images and generate cell segmentation masks and cell polarities. The cell polarities can be used by the tool PackAttack2.0 to generate a plot of the polarity orientations for the whole image. However, for certain types of analysis the researchers wanted to visually show local polarity hotspots for cell patches of various diameters.
Implemented cell polarity visualization over multiple concentric cell patches in PackAttack2.0. The local polarity hot spots in the images of fluorescently labeled cells of the epidermis can now be clearly shown as heatmaps and help answer questions about changes in local cell polarity. The images below show the local cell polarity heatmaps for cell patches of diameter 1, 2 and 4 cells. The high polarity hotspots can be clearly seen in stronger shades of red.
INTERSECT RSE Training
Ian Cosden, Director, Research Software Engineering for Computational & Data Science
Software forms the backbone of much current research in a variety of scientific and engineering domains. The breadth and sophistication of software skills required for modern research software projects are increasing at an unprecedented pace. Despite this fact, an alarming number of researchers who develop software do not have adequate training in software development. Therefore, it is imperative for researchers who develop the software that will drive tomorrow’s critical research discoveries to have access to software engineering training at multiple stages of their career, not only to make them more productive, but also to make their software more robust, reliable, and sustainable. INTERSECT (INnovative Training Enabled by a Research Software Engineering Community of Trainers) provides training on software development and engineering practices to research software developers who already possess an intermediate or advanced level of knowledge. INTERSECT, through training events and open-source material, is building a pipeline of computational researchers trained in best practices for research software development.
Project website: www.intersect-training.github.io