RSE Projects 

Research Software Engineers (RSEs) work on a variety of projects with their partner academic departments. A selection of these projects is described below.

 

Solving real-world GitHub issues with AI: SWE-agent

RSE: Kilian Lieret

PI: Karthik Narasimhan, Department of Computer Science & Princeton Language and Intelligence

Background
SWE-agent Demo

Language model (LM) agents are increasingly being employed to automate complex tasks in digital environments. Just as specialized software tools assist human users with intricate work, LM agents also benefit from tailored interfaces that optimize their interactions with computers. SWE-agent is a system designed to solve real-life GitHub issues iteratively, mimicking the way human developers work. It enables LMs to autonomously navigate repositories, modify code, and execute commands. It was the first open-source AI agent to achieve a non-negligible score on SWE-bench, a software engineering benchmark based on more than 2000 real GitHub issues.

Contribution

As a developer and maintainer of the SWE-agent package, Kilian has reduced average execution time by a factor of 2, introduced more than 100 integration tests, added a graphical user interface, and helped to extend the functionality to cybersecurity challenges. He has been researching and developing new strategies to further improve the performance of the agent and has actively contributed to the most recent publications. He has also helped with the development of the SWE-bench package.

 

SWE-agent is open-source and available at https://github.com/princeton-nlp/SWE-agent/


Development of ASPIRE Python Package

RSE: Josh Carmichael

ASPIRE Figure

PI: Amit Singer, Program in Applied and Computational Mathematics 

Background

ASPIRE (Algorithms for Single Particle Reconstruction) is an open-source Python library that aims to provide a pipeline for processing cryo-EM data. It represents years of cumulative work in mathematics, signal processing, and algorithm design by researchers and students from Professor Amit Singer’s group at Princeton. The ASPIRE RSE team is unifying those efforts into a software framework that can be used by the cryo-EM community at large, theoreticians and experimentalists alike. The package implements significant advances made by Professor Singer and colleagues in the cryo-EM field, including contributions to image denoising and correction, particle-picking, class-averaging, and 3D volume estimation. Collaboration with the Flatiron Institute on the FINUFFT non-uniform fast Fourier transform tool has been critical, as this algorithm is at the core of the ASPIRE code.

Contribution

Josh’s contributions include refactoring code for generating simulated molecules to include molecules with cyclic symmetry, porting methods for ab initio reconstruction of cyclically symmetric molecules from MATLAB to Python, and adding a Sphinx-Gallery extension to ASPIRE’s documentation to include example scripts demonstrating the functionality of various components of the ASPIRE software package. 


View ASPIRE’s documentation at: computationalcryoem.github.io/ASPIRE-Python

This project is supported by the Gordon and Betty Moore Foundation.


HydroGEN

HydroGEN Project Picture

RSE: Amy Defnet, Bill Hasling

PI: Laura Condon (University of Arizona)

Co-PI: Reed Maxwell, Civil and Environmental Engineering (Princeton University)

Background

HydroGEN is a web-based machine learning (ML) platform to generate custom hydrologic scenarios on demand. It combines powerful physics-based simulations with ML and observations to provide customizable scenarios from the bedrock through the treetops. Without any prior modeling experience, water managers and planners can directly manipulate state-of-the-art tools to explore scenarios that matter to them. HydroGEN is funded by a National Science Foundation grant as a joint project between Princeton University and the University of Arizona.

Contribution

Created the software architecture for the web-based application in consultation with CyVerse, another partner at the University of Arizona. The implementation is a microservice-based architecture using Docker components and a NATS message bus, designed to be flexible and portable to other data centers if needed. We use Keycloak and OAuth 2.0 for login and to secure the REST-based APIs. Deployment is handled with Kubernetes, and the user interface is developed with React and Material Design. Created a flexible model for web-based visualizations and established good software engineering practices for logging, unit testing, code quality, and development/QA/production environments.

Created Python-based data ingestion pipelines to collect, clean, and locally store external observation data from several government agencies' APIs. This data will be used as input to models that develop novel approximations of features for selected watersheds. Database tables make it possible to query regularly for newly available data and to easily obtain metadata about the locally stored data.


ProtDomain

Binding scores for the CTCF protein for different ligand types. Higher bars indicate higher binding affinity at respective positions along the protein.


RSE: Vineet Bansal

PI: Mona Singh, Department of Computer Science (Genomics)

Background

Over the years, students at "SinghLab" under Prof. Mona Singh have developed several algorithms for "domain" identification on protein sequences. Domains are segments in the protein chain that have largely evolved independently, internally maintain their structure, and are thus useful units for analyses. Further, multiple students at SinghLab have developed algorithms that identify regions within a domain that are likely to bind ligands. This allows practitioners in the field to target the regions of the protein that are most likely to be susceptible to a reaction. Prof. Singh wanted an integrated web application that would let users pick and choose, à la carte, among the approaches developed over the years. This effort would also help polish and document code developed by graduate students.

Contribution

We took several independent codebases developed over time, some in Python and others in Perl, and developed an integrated Web Portal for Protein Domain analysis. This allows users to run these algorithms on their protein sequences, with no programming or infrastructure requirements. The web application is hosted at Research Computing at Princeton. In the process of developing this web application, we also streamlined the data-processing pipeline, making it easier for future SinghLab researchers to add their own algorithms for domain identification and ligand-binding scoring.  

Website: protdomain.princeton.edu


HydroFrame

HydroFrame Figure

RSE: George Artavanis, Calla E. Chennault (2020-2022)

PI: Reed Maxwell, Civil and Environmental Engineering

Background

HydroFrame (hydroframe.org) is a platform for interacting with national hydrologic simulations. The goal of the platform is to simplify user interaction with large, computationally intensive hydrologic models and their massive simulated outputs. Currently, HydroFrame’s Beta Release allows users to subset inputs from national models, generate a model domain and a run script from these inputs, and run a small runoff test. The next version of HydroFrame aims to enhance its model subsetting options and model outputs, as well as provide more extensive and customizable simulations with more flexible analysis/visualization.

Contribution

Implemented a web endpoint on Princeton hardware which will connect to the existing HydroFrame subsetter and allow users to select from a range of workflows concerning their watershed. After launching the endpoint from the subsetter, users will be able to: launch and run a pre-populated, customizable ParFlow model; interact with model outputs and modify inputs to launch a rerun; review previous runs and their parameter specifications; and launch a Jupyter notebook. This web interface will help to remove initial barriers for use and development of national water models.  


ASPIRE - NUFFT

RSE: Garrett Wright

PI: Amit Singer, Department of Mathematics; Alex Barnett, Flatiron Institute

Background

The non-uniform fast Fourier transform (NUFFT) underlies ASPIRE’s algorithms and is a core numerical method that dominates computational time in current applications. ASPIRE depends directly on external packages to provide portable, validated, high-performance implementations of this method. To facilitate this, we have been collaborating closely with the Flatiron Institute, home to the state-of-the-art open-source FINUFFT implementation. In this collaboration, PACM has contributed directly to FINUFFT, a highly optimized CPU package, and to cuFINUFFT, its CUDA-based GPU counterpart.
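
For orientation, the sum that a type-1 NUFFT approximates can be written out directly. The sketch below evaluates it by brute force in plain Python. The function name `nudft_type1` is hypothetical and this is not the FINUFFT API; it only shows the direct O(N·M) definition that fast libraries evaluate far more quickly, to a requested tolerance.

```python
import cmath

def nudft_type1(x, c, n_modes):
    """Direct (slow) type-1 non-uniform discrete Fourier transform.

    x: sample locations (radians), not required to lie on a regular grid.
    c: complex strengths attached to those locations.
    Returns f[k] = sum_j c[j] * exp(1j * k * x[j]) for the n_modes integer
    frequencies k = -(n_modes // 2), ..., in increasing order.
    """
    k_min = -(n_modes // 2)
    return [
        sum(cj * cmath.exp(1j * k * xj) for xj, cj in zip(x, c))
        for k in range(k_min, k_min + n_modes)
    ]

# Three irregularly spaced sample points with complex strengths.
x = [-2.0, 0.3, 1.7]
c = [1.0 + 0.0j, 0.5j, -0.25 + 0.0j]
f = nudft_type1(x, c, n_modes=8)  # frequencies k = -4, ..., 3
```

The cost of this direct sum grows with the product of the number of points and the number of modes, which is why a fast, approximate transform matters when it sits in the inner loop of a reconstruction pipeline.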

Contribution

Contributions include refactoring and developing the C/C++/CUDA code to support the following features: build system abstractions, dual precision support, a new API for cuFINUFFT, Python C bindings, pip packaging, creation of (CUDA-backed) binary distribution wheels, and initial efforts toward automated CI/CD. Early drafts of this work were leveraged by the ASPIRE team at the 2020 Princeton GPU Hackathon to yield speedups of 2x to 10x as a proof of concept. Results of this collaboration are fully integrated in ASPIRE-Python as of v0.6.2.

Packages are proudly open source and can be found here:

github.com/flatironinstitute/finufft

github.com/flatironinstitute/cufinufft

github.com/ComputationalCryoEM/ASPIRE-Python

This project is supported by the Gordon and Betty Moore Foundation.


IBDmix Figure

IBDmix

RSE: Troy Comi

PI: Josh Akey, The Lewis-Sigler Institute of Integrative Genomics 

Background

Admixture has played a prominent role in shaping patterns of human genomic variation, including gene flow with now-extinct hominins like Neanderthals and Denisovans. IBDmix is a novel probabilistic method that identifies introgressed hominin sequences and, unlike existing approaches, does not use a modern reference population.

Contribution

I fully refactored the exploratory codebase to use modern C++, a CMake build system, and unit testing with GoogleTest. The original algorithm was replaced with a streaming implementation that leverages a custom stack class tuned for rapid push/pop cycles to limit object creation. Overall, runtimes were kept fast while memory usage decreased from O(n) to O(1). Outputs from the original code are used for regression and acceptance tests, which are run with GitHub Actions on each push. The entire workflow is encapsulated in a Snakemake pipeline to demonstrate how each component interacts and to reproduce the published dataset.
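
The drop from O(n) to O(1) memory is the general payoff of a streaming design: each input item is consumed once and only a fixed amount of state is kept. The sketch below is not the IBDmix algorithm (which is written in C++ and is not reproduced here); it is a generic, hypothetical Python illustration that finds the highest-scoring contiguous run in a stream of per-site scores using a handful of scalars.

```python
from typing import Iterable, Tuple

def best_segment(scores: Iterable[float]) -> Tuple[float, int, int]:
    """Return (best_sum, start, end) for the highest-scoring contiguous run.

    Reads `scores` exactly once and keeps only a few scalars, so memory use
    does not grow with the length of the input. `end` is exclusive.
    """
    best_sum, best_start, best_end = float("-inf"), 0, 0
    run_sum, run_start = 0.0, 0
    for i, s in enumerate(scores):
        if run_sum <= 0.0:
            # A non-positive running total cannot help: start a new run here.
            run_sum, run_start = s, i
        else:
            run_sum += s
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_sum, best_start, best_end

def site_scores():
    # Stand-in for values read lazily from a large file (hypothetical data).
    yield from [-1.0, 2.0, 3.0, -4.0, 1.5, 2.5, 1.0, -0.5]

result = best_segment(site_scores())
```

Because the input is a generator, the full list of scores never has to exist in memory at once.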

Code available at: github.com/PrincetonUniversity/IBDmix


Cell Patch Polarity Heatmaps

RSE: Abhishek Biswas

PI: Danelle Davenport, Department of Molecular Biology 

Background

Tissue Analyzer, an ImageJ plugin, is used by researchers in the Davenport Lab to process confocal tissue images and generate cell segmentation masks and cell polarities. The cell polarities can be used by the tool PackAttack2.0 to plot the polarity orientations for the whole image. However, for certain types of analysis, the researchers wanted to show local polarity hotspots visually for cell patches of various diameters.

Contribution

Implemented cell polarity visualization over multiple concentric cell patches in PackAttack2.0. The local polarity hotspots in images of fluorescently labeled epidermal cells can now be clearly shown as heatmaps, helping to answer questions about changes in local cell polarity. The images below show the local cell polarity heatmaps for cell patches of diameter 1, 2, and 4 cells. The high-polarity hotspots appear in stronger shades of red.

Cell Polarity Heatmap Win 1
Cell Polarity Heatmap Win 2
Cell Polarity Heatmap Win 4

 


INTERSECT RSE Training

RSE: Ian Cosden

Software forms the backbone of much current research in a variety of scientific and engineering domains. The breadth and sophistication of software skills required for modern research software projects is increasing at an unprecedented pace. Despite this fact, an alarming number of researchers who develop software do not have adequate training in software development. Therefore, it is imperative for researchers who develop the software that will drive tomorrow’s critical research discoveries to have access to software engineering training at multiple stages of their career, to not only make them more productive, but also to make their software more robust, reliable, and sustainable. INTERSECT (INnovative Training Enabled by a Research Software Engineering Community of Trainers) provides training on software development and engineering practices to research software developers who already possess an intermediate or advanced level of knowledge. INTERSECT, through training events and open-source material, is building a pipeline of computational researchers trained in best practices for research software development.

Project website: www.intersect-training.github.io

 


Safely Report

RSE: Sangyoon Park

PI: Sylvain Chassang (Princeton University Department of Economics) 

Co-PI: Laura Boudreau (Columbia Business School)

Background

Survey participants often feel reluctant to share their true experience because they worry about potential retaliation if their responses are identified (e.g., through data leakage). This is especially the case for sensitive survey questions such as those asking about sexual harassment in the workplace. As a result, survey administrators (e.g., company management, researchers) often get an inaccurate picture of reality, which makes it hard to devise an appropriate course of action.


Safely Report is a survey web application that can provide plausible deniability to survey respondents by recording survey responses with noise. For instance, when asking a worker whether they have been harassed by a manager, the application can be set up to record the answer "yes" with a probability of 30% even if the worker responds "no". This makes it nearly impossible to correctly identify which responses (of all those recorded "yes") are truthful reports, even if the survey results are leaked. Yet the survey designer can still recover the proportion and other statistics of truthful reports because the application tracks the number of cases (but not the cases themselves) where noise injection has happened. Consequently, survey participants feel safer and become more willing to share their true experience, which has been confirmed by a relevant study.
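
The mechanism described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the application's actual code: the function name `record_responses`, the simulated population, and the 10% true rate are hypothetical. Only the 30% noise probability and the idea of counting injected flips without remembering which responses were flipped come from the description.

```python
import random

def record_responses(true_answers, noise_prob, rng):
    """Record yes/no answers with noise.

    Each truthful "no" (False) is recorded as "yes" (True) with probability
    `noise_prob`. Only the number of injected flips is kept, not which
    individual responses were flipped.
    """
    recorded = []
    flips = 0
    for answer in true_answers:
        if not answer and rng.random() < noise_prob:
            recorded.append(True)  # noise injected: "no" stored as "yes"
            flips += 1
        else:
            recorded.append(answer)
    return recorded, flips

rng = random.Random(0)
# Simulated respondents (assumption): roughly 10% would truthfully say "yes".
true_answers = [rng.random() < 0.10 for _ in range(1000)]
recorded, flips = record_responses(true_answers, noise_prob=0.30, rng=rng)

recorded_yes = sum(recorded)
# Aggregate recovery: subtract the number of injected flips.
estimated_true_yes = recorded_yes - flips
```

Any single recorded "yes" may be noise, so it cannot be attributed to a truthful report, while the aggregate count is still recoverable from `recorded_yes` and `flips` alone.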

Safely Report flow chart.

Contribution

Safely Report aims to provide interested researchers with an open source tool (available under an MIT license) that implements secure survey techniques developed by Sylvain Chassang (Princeton) and Laura Boudreau (Columbia), so that researchers can more easily adapt and use these techniques in their own work. The software supports XLSForm, an Excel-based survey specification standard widely used by researchers to design and conduct complex surveys, so it fits well into the workflows of the existing user base.

Safely Report offers several advantages over existing XLSForm-compliant survey tools:

  1. New Security Features. Foremost, it supports the novel secure survey techniques, which are more difficult to implement in other survey tools.
  2. Technically Accessible. It is a lightweight Python-based application, so researchers may adapt and deploy it fully on their own.
  3. Free to Use. It is completely free unlike some other survey tools that operate under paid plans (e.g., SurveyCTO).

The software is under active development at the moment and is planned to be open sourced in May 2024.

 


Line-Segment Tracking

Line-Segment Tracking Schematic

RSE: Andres Rios Tascon

PI: Peter Elmer, Department of Physics

 

Background

Charged particle track reconstruction is one of the most computationally expensive steps in processing the raw data from the CMS experiment at CERN. The High Luminosity upgrade of the Large Hadron Collider (HL-LHC) will produce particle collisions that generate an unprecedented number of charged particles visible in the detector. Their trajectories need to be reconstructed from signals left on arrays of discrete sensors, a problem that grows combinatorially as the number of particles increases. The increased complexity is expected to surpass the projected CPU computing budget, and hence a different approach is needed. Line-Segment Tracking is a new algorithm that is designed with massive parallelism in mind and aims to run on GPUs. This algorithm has already been shown to achieve similar accuracy and better timings compared to existing algorithms.

 

Contribution

Contributions include refactoring, validating the code on edge cases, implementing safety checks and convenience features, and developing a CI workflow to improve code quality and keep a better record of performance changes over time. Work is being done towards integrating the software with the application framework used by the CMS collaboration.

 


Simons Observatory Project

RSE: Ioannis Paraskevakos

PI: Jo Dunkley (Princeton University, Department of Physics)

Background

The Simons Observatory is a ground-based experiment, situated above the Atacama Desert in Chile, that observes the cosmic microwave background (CMB), the heat left over from the early days of the universe. It will make precise and detailed observations of the CMB, enabling discoveries in fundamental physics, cosmology, and astrophysics.

Contributions

The latest algorithms for creating CMB maps from the observed data are compute- and memory-intensive, and they rely on Princeton and other supercomputers to execute in a reasonable amount of time. The RSE will contribute to the software systems that SO uses to create CMB maps. Specifically, the RSE will be responsible for parallelizing the algorithms and workflows so that they execute efficiently and effectively on the computing resources the project has access to.


Multi-tissue Somatic Mutation Detection


RSE: Rob Bierman

PI: Josh Akey, The Lewis-Sigler Institute of Integrative Genomics

Background

Germline mutations are present in the DNA of every cell of the body, but somatic mutations occur throughout a person’s lifetime and exist in only a subset of tissues. Historically, somatic mutations have been identified by comparing a cancerous tissue with a matched normal control. Increasingly complex and massive multi-tissue datasets, however, require a novel probabilistic model for somatic mutation detection.

Contribution

The original code for this project used R, Python, and numerous external dependencies that the user was tasked with managing. I refactored the existing codebase into a Python package with a command-line interface, shipped in a Docker container that manages the external dependencies and increases reproducibility, portability, and ease of use. Lightweight unit tests of the Python code are performed with pytest, while expensive integration tests with external dependencies are performed through a pytest entrypoint of the Docker container. Both sets of tests are run automatically using GitHub Actions. The refactoring also resulted in a 3x runtime speedup.

 


SPECFEM++ - A modular and portable spectral-element code for seismic wave propagation

Research Featured Image Pacific Mantle
(a.) View of the mantle below the Pacific. Warm colors denote slower-than-average seismic wavespeeds; cold colors denote faster-than-average seismic wavespeeds associated with subduction zones. (b.) With three-dimensional tomography, scientists can isolate vertical as well as horizontal slices of the mantle. (Image courtesy of Ebru Bozdağ, Colorado School of Mines, and David Pugmire, Oak Ridge National Laboratory.)

RSE: Rohit Kakodkar

PI: Prof. Jeroen Tromp (Department of Geosciences)

Background

SPECFEM represents a suite of high-performance computational applications used to simulate seismic wave propagation through heterogeneous media and for doing adjoint tomography. Through the years, SPECFEM has been developed as a set of 3 Fortran packages (SPECFEM2D, SPECFEM3D, and SPECFEM3D_GLOBE) with partial support for GPUs (NVIDIA and AMD). This project aims to unify the 3 SPECFEM packages while providing a performance portable backend for current and future architectures. To do this, we intend to develop a performance-portable spectral element method framework, SPECFEM++, that can be used to write spectral element solvers in a dimensionally independent manner. 

Contribution

To achieve the stated goals of SPECFEM++, I’ve implemented a template-based, object-oriented, modular framework in C++, making it easy for potential developers to extend the package by adding new physics or methods. For performance portability, I use the Kokkos programming model, which enables us to describe our parallelism in an architecture-independent manner. The work so far lays solid groundwork for achieving these goals.


Code available at: github.com/PrincetonUniversity/specfem2d_kokkos

 


GenX

Zero Lab Logo

RSE: Luca Bonaldo

PI: Prof. Jesse D. Jenkins, Department of Mechanical and Aerospace Engineering and the Andlinger Center for Energy and Environment (Princeton University)

Background

The global electricity system is undergoing a significant transformation due to national and global efforts to reduce carbon emissions. The deployment of variable renewable energy (VRE), energy storage, and innovative uses for distributed energy resources (DERs) are only some examples of new technologies that are reshaping the electricity sector. In response, researchers at Princeton and MIT have developed GenX, an open-source, highly configurable tool to offer improved decision support capabilities for a changing electricity landscape. GenX takes the perspective of a centralized planner to determine the cost-optimal generation portfolio, energy storage, and transmission investments needed to meet a pre-defined system demand while adhering to various technological and physical grid operation constraints, resource availability limits, and other imposed environmental, market design, and policy constraints.

Zero Lab Projects

Contribution

The software is available on GitHub at github.com/GenXProject/GenX and is under active development to include the latest technologies and policies. Contributions include refactoring parts of the codebase and the documentation, and supporting the maintenance and testing of the software.

 


ModECI Model Description Format (MDF)

ModECI Logo

RSE: David Turner

PI: Jon Cohen, Princeton Neuroscience Institute; Padraig Gleeson, University College London

Background

MDF is an open source, community-supported standard and associated library of tools for expressing computational models in a form that allows them to be exchanged between diverse programming languages and execution environments. The overarching aim is to provide a common format for models across computational neuroscience, cognitive science and machine learning.

It consists of a specification for expressing models in serialized formats (currently JSON, YAML and BSON representations are supported, though others such as HDF5 are planned) and a set of Python tools for implementing a model described using MDF. The serialized formats can be used when importing a model into a supported target environment to execute it; and, conversely, when exporting a model built in a supported environment so that it can be re-used in other environments.

The MDF Python API can be used to create or load an MDF model for inspection and validation. It also includes a basic execution engine for simulating models in the format. However, this is not intended to provide an efficient, general-purpose simulation environment, nor is MDF intended as a programming language. Rather, the primary purpose of the Python API is to facilitate and validate the exchange of models between existing environments that serve different communities. Accordingly, these Python tools include bi-directional support for importing to and exporting from widely used programming environments in a range of disciplines, and for easily extending these to other environments.
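
The export/import round trip described above can be illustrated with a toy example. The dictionary layout below is a hypothetical, simplified model description and NOT the actual MDF schema, and `run` is a stand-in for an execution engine rather than part of the MDF Python API. The sketch only shows the idea of a model written to a language-neutral text format and rebuilt from it.

```python
import json

# Hypothetical, simplified model description; NOT the actual MDF schema.
model = {
    "id": "toy_model",
    "nodes": {
        "input": {"parameters": {"value": 2.0}},
        "scale": {"parameters": {"gain": 3.0}},
    },
    "edges": [{"sender": "input", "receiver": "scale"}],
}

def run(description):
    """Toy 'execution engine': pass the input value through the scale node."""
    value = description["nodes"]["input"]["parameters"]["value"]
    gain = description["nodes"]["scale"]["parameters"]["gain"]
    return value * gain

# Export: serialize the model to a language-neutral text format.
serialized = json.dumps(model, indent=2, sort_keys=True)

# Import: another environment could parse the same text and rebuild the model.
restored = json.loads(serialized)

output = run(restored)
```

Because the serialized text carries the full description, any environment that understands the format can reconstruct and execute the same model.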

Contributions

David contributed to the design and implementation of the JSON schema for MDF. He developed the JSON serialization/deserialization backend and the ONNX execution engine. His most significant contribution was the implementation of the PyTorch-to-MDF import system, which uses torch compilation to automatically convert PyTorch programs into MDF models with little or no code modification. This has enabled virtually automatic support for most torch-compilable models in the MDF framework; so far it has been tested with over 60 models available in the torchvision package. He also implemented the setup for testing, documentation building, packaging, and continuous integration.


Website: github.com/ModECI/MDF


PsyNeuLink

PsyNeuLink Logo

(pronounced: /sīnyoolingk/ - sigh-new-link)

RSE: David Turner

PI: Jon Cohen, Princeton Neuroscience Institute

Background

PsyNeuLink is an open-source software environment written in Python and designed for the needs of neuroscientists, psychologists, computational psychiatrists, and others interested in learning about and building models of the relationship between brain function, mental processes, and behavior.

PsyNeuLink can be used as a "block modeling environment" in which to construct, simulate, document, and exchange computational models of neural mechanisms and/or psychological processes at the subsystem and system levels. A block modeling environment allows components that implement various, possibly disparate functions to be constructed and then linked together into a system to examine how they interact. In PsyNeuLink, components are used to implement the function of brain subsystems and/or psychological processes, the interaction of which can then be simulated at the system level.

The purpose of PsyNeuLink is to make it as easy as possible to create new and/or import existing models, and integrate them to simulate system-level interactions. It provides a suite of core components for implementing models of various forms of processing, learning, and control, and its Library includes examples that combine these components to implement published models. As an open source project, its suite of components is meant to be enhanced and extended, and its library is meant to provide an expanding repository of models, written in a concise, executable, and easy to interpret form, that can be shared, compared, and extended by the scientific community.

Contributions

David’s most significant contribution to the PsyNeuLink project has been the design and implementation of the parameter estimation and optimization system. This system is an implementation of likelihood-free estimation of model parameters using probability density approximation. The system allows users to fit model parameters to their data in a relatively user-friendly programming interface without needing to specify the closed form likelihood for their model. Additionally, he has developed GPU reference implementations of models for benchmarking performance of PsyNeuLink’s compilation system. Finally, he has contributed to the general software design in a collaborative setting with PsyNeuLink’s many developers over the years.
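
The idea behind likelihood-free estimation with probability density approximation can be sketched briefly: simulate the model at candidate parameter values, smooth the simulated outputs with a kernel density estimate, and score the observed data under that estimate. The example below is a minimal standard-library illustration and not PsyNeuLink's implementation or API. The Gaussian stand-in model, the function names `simulate` and `kde_log_likelihood`, the fixed bandwidth, and the grid of candidate values are all assumptions made for illustration.

```python
import math
import random

def simulate(mean, n, rng):
    """Stand-in model: we can sample from it but pretend its likelihood is unknown."""
    return [rng.gauss(mean, 1.0) for _ in range(n)]

def kde_log_likelihood(data, simulated, bandwidth):
    """Log-likelihood of `data` under a Gaussian kernel density estimate
    built from `simulated` model outputs."""
    norm = 1.0 / (len(simulated) * bandwidth * math.sqrt(2.0 * math.pi))
    total = 0.0
    for d in data:
        density = norm * sum(
            math.exp(-0.5 * ((d - s) / bandwidth) ** 2) for s in simulated
        )
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total

rng = random.Random(1)
observed = simulate(mean=2.0, n=200, rng=rng)  # "data" with a known true mean

# Score each candidate parameter by simulating and approximating the density.
candidates = [0.0, 1.0, 2.0, 3.0, 4.0]
scores = {
    m: kde_log_likelihood(observed, simulate(m, 500, rng), bandwidth=0.3)
    for m in candidates
}
best_mean = max(scores, key=scores.get)
```

A practical system would replace the grid with an optimizer and choose the bandwidth from the data, but the structure is the same: the model is only ever sampled, never evaluated in closed form.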

Website: github.com/PrincetonUniversity/PsyNeuLink


cryoDRGN

cryoDRGN Example

RSE: Michal R. Grzadkowski

PI: Ellen D. Zhong, Department of Computer Science

Background

Recent advances in cryogenic microscopy technology have allowed for an unprecedented ability to image molecules of interest in biological specimens. However, reconstructing a molecule’s three-dimensional structure from hundreds of thousands of noisy two-dimensional images — each representing an unknown orientation of the molecule — remains a computational challenge. The DRGN Lab for Molecular Machine Learning at Princeton has introduced cryoDRGN, a novel technique for applying neural networks to the problem of 3D reconstruction that allows for novel insights into protein structure and dynamics.

Contribution

As a maintainer and developer of the cryoDRGN software package, Michal is responsible for incorporating new features and methods without compromising existing functionality. He is also working on improving the cryoDRGN codebase through runtime optimizations, code refactoring, development of unit tests, and expanded documentation.

The cryoDRGN package is open-source and can be accessed at cryodrgn.cs.princeton.edu.

 

 


Automatic Speech Transcription via Audio Analysis and Large Language Models

RSE: Junying (Alice) Fang

Background

Most current models perform speaker diarization (segmenting speech by speaker) by analyzing audio features alone, without using the textual content of the speech. To improve the accuracy of speaker diarization, and thus of the speaker identification that follows, large language models are applied to detect speaker changes by modeling the relationships across text segments.

Contributions

The complete machine learning pipeline lets users pass audio or video directly as input and returns the final transcription with timestamps and speaker identification. An API or web-based application will be built to provide the transcription service, so that users can convert speech to text without setting up any infrastructure or software themselves. For future applications, we plan to provide fine-tuning tools and are seeking collaborations to apply the pipeline to specific areas of social science.

The project is supported by Data-Driven Social Science at Princeton.