Selected Research Software Engineering Projects

Research Software Engineers (RSEs) work on a variety of projects with their partner academic departments. A selection of these projects is described below.

IRIS-HEP

Bei Wang, Senior Research Software Engineer

PI: Peter Elmer, Senior Research Physicist, Princeton Physics Department

Background

The reconstruction of charged particle trajectories (tracking) is a pivotal element of the reconstruction chain in the Compact Muon Solenoid (CMS) experiment. It measures the direction and momentum of charged particles, which are then used as input for nearly all high-level reconstruction algorithms: vertex reconstruction, b-tagging, lepton reconstruction, jet isolation, and missing transverse momentum reconstruction. Tracking is by far the most time-consuming step of event reconstruction, and its running time scales poorly with detector occupancy. The parallel Kalman Filter tracking project mkFit was established in 2014 with the goal of enabling efficient tracking on modern computing architectures. Over the last 6 years, significant progress has been made towards developing a parallelized and vectorized implementation of the combinatoric Kalman Filter algorithm for tracking [1]. This allows the efficient global reconstruction of the entire event within the projected online CPU budget.
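
For readers unfamiliar with the Kalman filter, the following is a minimal sketch of a single predict-and-update step for a linear model, written in Python with NumPy. It is not taken from mkFit: the function name kalman_step and the matrix names are illustrative assumptions, and a real track fit propagates a multi-parameter track state through the detector geometry and magnetic field.

```python
import numpy as np


def kalman_step(x, P, F, Q, H, R, z):
    """One predict + update step of a linear Kalman filter (illustrative only).

    x, P : current state estimate and its covariance
    F, Q : state transition matrix and process noise covariance
    H, R : measurement matrix and measurement noise covariance
    z    : new measurement (for tracking, a hit position)
    """
    # Predict: propagate the state and its uncertainty to the next step.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q

    # Update: weigh the prediction against the measurement.
    residual = z - H @ x_pred
    S = H @ P_pred @ H.T + R              # covariance of the residual
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ residual
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

In a combinatorial track finder, an update of this kind is typically evaluated for several candidate hits on each detector layer, which is why vectorizing and parallelizing it pays off.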

Contribution

Global reconstruction necessarily entails unpacking and clustering the hit information from all silicon strip tracker modules before the hits are processed by mkFit. The current CMS HLT, on the other hand, performs hit and track reconstruction on demand, i.e., only for software-selected regions of the detector. To make the unpacking and clustering steps efficient enough that the hits can be processed all at once for the entire detector, a parallelized version of unpacking and clustering was developed for both multi-core CPUs and many-core GPUs [2]. Throughput is further improved by processing multiple events concurrently, using nested OpenMP parallelism on CPUs or CUDA streams on GPUs. The new implementation, along with the earlier work on a parallelized and vectorized implementation of the combinatoric Kalman Filter algorithm, has enabled efficient global reconstruction of the entire event on modern computer architectures.
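
The clustering idea can be illustrated with a minimal Python sketch: fired strips are grouped into clusters of contiguous strip numbers, and independent modules are handled concurrently. This is a simplified illustration under stated assumptions, not the CMS implementation described in [2]. The names cluster_strips and cluster_modules, and the input layout (a dict mapping a module id to a sorted list of fired strip numbers), are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def cluster_strips(strips):
    """Group sorted strip numbers into clusters of adjacent strips.

    Returns a list of (first_strip, last_strip) tuples.
    """
    clusters = []
    for strip in strips:
        if clusters and strip == clusters[-1][1] + 1:
            # Strip is adjacent to the current cluster: extend it.
            clusters[-1] = (clusters[-1][0], strip)
        else:
            # Gap found (or first strip): start a new cluster.
            clusters.append((strip, strip))
    return clusters


def cluster_modules(hits_by_module, max_workers=4):
    """Cluster every module independently, one task per module."""
    module_ids = list(hits_by_module)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(cluster_strips,
                           (hits_by_module[m] for m in module_ids))
        return dict(zip(module_ids, results))
```

Python threads are used here only to show the structure of the per-module decomposition; they do not provide the speedup that OpenMP threads or CUDA kernels give in compiled code.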

[1]: S. Lantz et al. Speeding up particle track reconstruction using a parallel Kalman filter algorithm, Journal of Instrumentation, Volume 15, Sep 2020

[2]: G. Cerati et al. Parallelizing the unpacking and clustering of detector data for reconstruction of charged particle tracks on multi-core CPUs and many-core GPUs, arXiv preprint arXiv:2101.11489, Jan 2021


 

ProtDomain

Binding scores for the CTCF protein for different ligand types. Higher bars indicate higher binding affinity at respective positions along the protein.

Vineet Bansal, Senior Research Software Engineer

PI: Mona Singh, Department of Computer Science (Genomics)

Background

Over the years, students at "SinghLab" under Prof. Mona Singh have developed several algorithms for identifying "domains" in protein sequences. Domains are segments of the protein chain that have largely evolved independently and maintain their own internal structure, which makes them useful units of analysis. In addition, several students at SinghLab have developed algorithms that identify regions within a domain that are likely to bind ligands. This allows practitioners to target the regions of a protein most likely to take part in a reaction. Prof. Singh wanted an integrated web application that would let users pick and choose among the approaches developed over the years. The effort would also help polish and document code developed by graduate students.

Contribution

We took several independent codebases developed over time, some in Python and others in Perl, and developed an integrated Web Portal for Protein Domain analysis. This allows users to run these algorithms on their protein sequences, with no programming or infrastructure requirements. The web application is hosted at Research Computing at Princeton. In the process of developing this web application, we also streamlined the data-processing pipeline, making it easier for future SinghLab researchers to add their own algorithms for domain identification and ligand-binding scoring.  

Website: protdomain.princeton.edu


  

HydroFrame 

Calla E. Chennault, Research Software Engineer

PI: Reed Maxwell, Civil and Environmental Engineering

Background

HydroFrame (hydroframe.org) is a platform for interacting with national hydrologic simulations. The goal of the platform is to simplify user interaction with large, computationally intensive hydrologic models and their massive simulated outputs. Currently, HydroFrame’s Beta Release allows users to subset inputs from national models, generate a model domain and a run script from these inputs, and run a small runoff test. The next version of HydroFrame aims to enhance its model subsetting options and model outputs, as well as provide more extensive and customizable simulations with more flexible analysis/visualization.

Contribution

Implemented a web endpoint on Princeton hardware that will connect to the existing HydroFrame subsetter and allow users to select from a range of workflows for their watershed. After launching the endpoint from the subsetter, users will be able to: launch and run a pre-populated, customizable ParFlow model; interact with model outputs and modify inputs to launch a rerun; review previous runs and their parameter specifications; and launch a Jupyter notebook. This web interface will help remove initial barriers to the use and development of national water models.


  

ASPIRE - NUFFT

Garrett Wright, Research Software Engineer

PI: Amit Singer, Department of Mathematics; Alex Barnett, Flatiron Institute

Background

The Non-Uniform Fast Fourier Transform (NUFFT) is a core numerical method underlying ASPIRE's algorithms, and it dominates computational time in current applications. ASPIRE depends directly on external packages to provide portable, validated, high-performance implementations of this method. To that end, we have been collaborating closely with the Flatiron Institute, home of the state-of-the-art open source FINUFFT implementation. In this collaboration, PACM (the Program in Applied and Computational Mathematics) has contributed directly to FINUFFT, a highly optimized CPU package, and to cuFINUFFT, its CUDA-based GPU counterpart.
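
To make the method concrete, the Python sketch below evaluates a type-1 nonuniform discrete Fourier transform directly: f_k = sum over j of c_j * exp(i * k * x_j), for integer frequencies k = -N/2, ..., N/2 - 1. This direct sum costs on the order of N times M operations for M points and N modes; a NUFFT library computes approximately the same result, to a requested tolerance, in close to linear time. The function nudft_type1 is a hypothetical name and does not use the FINUFFT API; the sign convention and mode ordering are assumptions chosen for the illustration.

```python
import numpy as np


def nudft_type1(x, c, n_modes):
    """Direct (slow) type-1 nonuniform DFT, for illustration only.

    x       : nonuniform sample locations in radians
    c       : complex strengths, one per location
    n_modes : number of output Fourier modes N
    Returns f_k for k = -N//2, ..., (N - 1)//2 in increasing order.
    """
    x = np.asarray(x, dtype=float)
    c = np.asarray(c, dtype=complex)
    k = np.arange(-(n_modes // 2), (n_modes + 1) // 2)
    # Each output mode is a full sum over all points: O(N * M) work.
    return np.exp(1j * np.outer(k, x)) @ c
```

When the points happen to lie on a uniform grid, this sum reduces to an ordinary (shifted, unnormalized) inverse FFT, which is a convenient correctness check for the sketch.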

Contribution

Contributions include refactoring and developing the C/C++/CUDA code to support the following features: build system abstractions, dual precision support, a new API for cuFINUFFT, Python C bindings, pip packaging, creation of (CUDA-backed) binary distribution wheels, and initial efforts toward automated CI/CD. Early drafts of this work were leveraged by the ASPIRE team at the 2020 Princeton GPU Hackathon to yield speedups of 2x to 10x as a proof of concept. Results of this collaboration are fully integrated in ASPIRE-Python as of v0.6.2.

Packages are proudly open source and can be found here:

github.com/flatironinstitute/finufft

github.com/flatironinstitute/cufinufft

github.com/ComputationalCryoEM/ASPIRE-Python


  

IBDmix

Troy Comi, Research Software Engineer

PI: Josh Akey, The Lewis-Sigler Institute of Integrative Genomics 

Background

Admixture has played a prominent role in shaping patterns of human genomic variation, including gene flow with now-extinct hominins like Neanderthals and Denisovans. IBDmix is a novel probabilistic method that identifies introgressed hominin sequences and, unlike existing approaches, does not use a modern reference population.

Contribution

I fully refactored the exploratory codebase to use modern C++, a build system (CMake), and unit testing with GoogleTest. The original algorithm was replaced with a streaming implementation that relies on a custom stack class tuned for rapid push/pop cycles to limit object creation. Overall, runtimes were kept fast while memory usage decreased from O(n) to O(1). Outputs from the original code are used for regression and acceptance tests, which are run with GitHub Actions on each push. The entire workflow is encapsulated in a Snakemake pipeline to demonstrate how each component interacts and to reproduce the published dataset.
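
The memory reduction comes from processing values as they arrive instead of storing the whole input first. As a generic illustration only (this is not the IBDmix algorithm, and best_segment_streaming is a hypothetical name), the Python sketch below finds the highest-scoring contiguous segment in a stream of per-site scores while keeping a fixed number of variables, regardless of the input length.

```python
def best_segment_streaming(scores):
    """Find the contiguous segment with the largest total score.

    `scores` may be any iterable, including a generator that reads a file
    one record at a time; only a handful of variables are kept in memory.
    Returns (best_sum, start_index, end_index) with an inclusive end index,
    or (-inf, None, None) for an empty input.
    """
    best_sum, best_start, best_end = float("-inf"), None, None
    run_sum, run_start = 0.0, 0
    for i, score in enumerate(scores):
        if run_sum <= 0:
            # A non-positive running total cannot help: restart here.
            run_sum, run_start = score, i
        else:
            run_sum += score
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i
    return best_sum, best_start, best_end
```

A batch version would first load every score into a list, which takes O(n) memory; the streaming version visits each value once and then discards it.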

Code available at: github.com/PrincetonUniversity/IBDmix


  

Cell Patch Polarity Heatmaps

Abhishek Biswas, Research Software Engineer

PI: Danelle Davenport, Department of Molecular Biology 

Background

Tissue Analyzer, an ImageJ plugin, is used by researchers in the Davenport Lab to process confocal tissue images and generate cell segmentation masks and cell polarities. The cell polarities can be used by the tool PackAttack2.0 to generate a plot of the polarity orientations for the whole image. However, for certain types of analysis, the researchers wanted to show local polarity hotspots visually for cell patches of various diameters.

Contribution

Implemented cell polarity visualization over multiple concentric cell patches in PackAttack2.0. The local polarity hot spots in the images of fluorescently labeled cells of the epidermis can now be clearly shown as heatmaps and help answer questions about changes in local cell polarity. The images below show the local cell polarity heatmaps for cell patches of diameter 1, 2 and 4 cells. The high polarity hotspots can be clearly seen in stronger shades of red.

Cell Polarity Heatmap Win 1
Cell Polarity Heatmap Win 2
Cell Polarity Heatmap Win 4

  

Development of ASPIRE Python Package

Junchao Xia, Research Software Engineer

PI: Amit Singer, Program in Applied and Computational Mathematics 

Background

Reconstructing a 3D CryoEM density map of a biomolecule from its noisy 2D images involves many computational problems. Over the past 10 years, Professor Amit Singer’s group has developed the ASPIRE Matlab package as a vehicle for new ideas in numerical algorithms covering CTF estimation, particle picking, denoising, 2D and 3D classification, and ab initio 3D reconstruction. The original Matlab package includes more than 100,000 lines of code but was developed in a very free style. A reusable and sustainable Python package is urgently needed, both for the group's long-term research goals and for use by external groups.

Contribution

Complete redevelopment of ASPIRE in Python is a multi-year project. My major contributions from 2019 to 2020 were: 1) Redesigned the framework of the whole package in an object-oriented (OOP) style, in collaboration with other RSEs and researchers. 2) Converted and reorganized the Matlab Cov2D code into Python for denoising 2D images. 3) Added several new methods (Fast FB, direct and fast PSWF) for image expansion and unified the image expansion process. 4) Brought in the orientation estimation submodule for ab initio reconstruction. 5) Unified the pipeline for preprocessing raw images. 6) Added a GPU computing option for the Cov2D denoising and APPLE particle picking submodules.

Code available at: github.com/ComputationalCryoEM/ASPIRE-Python


  

INTERSECT RSE Training

Ian Cosden, Director, Research Software Engineering for Computational & Data Science

Software forms the backbone of much current research in a variety of scientific and engineering domains. The breadth and sophistication of the software skills required for modern research software projects are increasing at an unprecedented pace. Despite this, an alarming number of researchers who develop software do not have adequate training in software development. It is therefore imperative that researchers who develop the software that will drive tomorrow’s critical research discoveries have access to software engineering training at multiple stages of their careers, not only to make them more productive but also to make their software more robust, reliable, and sustainable. INTERSECT (INnovative Training Enabled by a Research Software Engineering Community of Trainers) provides training on software development and engineering practices to research software developers who already possess an intermediate or advanced level of knowledge. Through training events and open-source material, INTERSECT is building a pipeline of computational researchers trained in best practices for research software development.

Project website: www.intersect-training.github.io