Sessions


Session 1: Panel - DevOps & Cyberinfrastructure

This paper introduces CACAO, a research software platform that simplifies the use of cloud computing in scientific research and education. CACAO's cloud automation and continuous analysis features make it easy to deploy research workflows or laboratory experimental sessions in the cloud using templates. The platform has been expanding its support for different cloud service providers and improving scalability, making it more widely applicable for research and education. This paper provides an overview of CACAO's key features and highlights use cases.

Science gateways connect researchers to high-performance computing (HPC) resources by providing a graphical interface to manage data and submit jobs. Scientific research is a highly collaborative activity, and gateways can play an important role by providing shared data management tools that reduce the need for advanced expertise in file system administration. We describe a recently implemented architecture for collaborative file management in the Core science gateway architecture developed at the Texas Advanced Computing Center (TACC). Our implementation is built on the Tapis Systems API, which provides endpoints that enable users to securely manage access to their research data.

High-fidelity pattern of life (PoL) models require realistic origin points for predictive trip modeling. This paper develops and demonstrates a reproducible method using open data to match synthetic populations generated from census surveys to plausible residential locations (building footprints) based on housing attributes. This approach shows promise over existing methods based on housing density, particularly in small neighborhood areas with heterogeneous land use.

This paper examines the potential of containerization technology, specifically the CyVerse Discovery Environment (DE), as a solution to the reproducibility crisis in research. The DE is a platform service designed to facilitate data-driven discoveries through reproducible analyses. It offers features like data management, app integration, and app execution. The DE is built on a suite of microservices deployed within a Kubernetes cluster, handling different aspects of the system, from app management to data storage. Reproducibility is ensured by maintaining records of the software, its dependencies, and instructions for its execution. The DE also provides a RESTful web interface, the Terrain API, for creating custom user interfaces. The application of the DE is illustrated through a use case involving the University of Arizona Superfund Research Center, where the DE's data storage capabilities were utilized to manage data and processes. The paper concludes that the DE facilitates efficient and reproducible research, eliminating the need for substantial hardware investment and complex data management, thereby propelling scientific progress.

Single-page applications (SPAs) have become indispensable in modern frontend development, with widespread adoption in scientific applications. The process of creating a single-page web application development environment which accurately reflects the production environment isn't always straightforward. Most SPA build systems assume configuration at build time, while DevSecOps engineers prefer runtime configuration. This paper suggests a framework-agnostic approach to address issues that encompass both development and deployment, but are difficult to tackle without knowledge in both domains.


Session 2: Panel - Sponsors

  • Globus, Lee Liming
  • ThinLinc, Robert Henschel
  • University of Arizona, Nirav Merchant
  • Google, Thomas Leung
  • Exosphere, Chris Martin

Session 3A: Culture & Community 1

Research software once was a heroic and lonely activity, particularly in research computing and in HPC. But today, research software is a social activity, in the senses that most software depends on other software, and that most software that is intended to be used by more than one person is written by more than one person. These social factors have led to generally accepted practices for code development and maintenance and for interactions around code. This paper examines how these practices form, become accepted, and later change in different contexts. In addition, given that research software engineering (RSEng) and research software engineers (RSEs) are becoming accepted parts of the research software endeavor, it looks at the role of RSEs in creating, adapting, and infusing these practices. It does so by examining aspects around practices at three levels: in communities, projects, and groups. Because RSEs are often the point where new practices become accepted and then disseminated, this paper suggests that tool and practice developers should be working to get RSE champions to adopt their tools and practices, and that people who seek to understand research software practices should be studying RSEs. It also suggests areas for further research to test this idea.

This presentation describes training activities carried out under the auspices of the U.S. Department of Energy, in particular those facilitated by the Exascale Computing Project (ECP). While some of these activities are specific to members of ECP, others are beneficial to the HPC community at large. We report on training opportunities and resources that the broad Computer Science and Engineering (CS&E) community can access, the relevance of coordinated training efforts, and how we envision such efforts beyond ECP's scope and duration.

The growing recognition of research software as a fundamental component of the scientific process has led to the establishment of both Open Source Program Offices (OSPOs) and Research Software Engineering (RSE) groups. These groups aim to enhance software engineering practices within research projects, enabling robust and sustainable software solutions. The integration of an OSPO into an RSE group within a university environment provides an intriguing fusion of open source principles and research software engineering expertise. The utilization of students as developers within such a program highlights their unique contributions along with the benefits and challenges involved.

Engaging students as developers in an OSPO-RSE group brings numerous advantages. It provides students with valuable experience in real-world software development, enabling them to bridge the gap between academia and industry. By actively participating in open source projects, students can refine their technical skills, learn industry best practices, and gain exposure to collaborative software development workflows. Involving students in open source projects enhances their educational experience. They have the opportunity to work on meaningful research software projects alongside experienced professionals, tackling real-world challenges and making tangible contributions to the scientific community. This exposure to open source principles and practices fosters a culture of innovation, collaboration, and knowledge sharing.

This approach also raises questions. How can the objectives and metrics of success for an academic OSPO-RSE group be defined and evaluated? What governance models and collaboration mechanisms are required to balance the academic freedom of researchers with the community-driven nature of open source? How can the potential conflicts between traditional academic practices and the open source ethos be effectively addressed? How can teams balance academic commitments with project timelines? These questions highlight the need for careful consideration and exploration of the organizational, cultural, and ethical aspects associated with an OSPO acting as an RSE group within a university.

Leveraging student developers in an OSPO-RSE group also presents challenges that need careful consideration. Students may have limited experience in software engineering practices, requiring mentoring and guidance to ensure the quality and sustainability of the research software they contribute to. Balancing academic commitments with project timelines and expectations can also be a challenge, necessitating effective project management strategies and clear communication channels. Furthermore, the ethical considerations of involving students as developers in open source projects must be addressed, ensuring the protection of intellectual property, respecting licensing requirements, and maintaining data privacy.

The involvement of students as developers within an OSPO-RSE group offers valuable benefits. The effective integration of students in this context requires thoughtful planning, mentorship, and attention to ethical considerations. This talk will examine the experience of the Open Source with SLU program to explore the dynamic role of student developers in an OSPO-RSE program and engage in discussions on best practices, challenges, and the future potential of this distinctive approach to research software engineering within academia.

This talk presents research on the prevalence of research software as academic research output within international institutional repositories (IRs). This work expands on previous research, which examined 182 academic IRs from 157 universities in the UK. Very low quantities of records of research software were found in IRs, and the majority of IRs could not contain software as independent records of research output due to the underlying Research Information System (RIS) platform. This has implications for the quantities of software returned as part of the UK's Research Excellence Framework (REF), which seeks to assess the quality of research in UK universities and specifically recognises software as a legitimate research output. The levels of research software submitted as research output have declined sharply over the last 12 years, and this differs substantially from the records of software contained in other UK research output metadata (e.g. https://gtr.ukri.org). Expanding on this work, source data from OpenDOAR, a directory of global Open Access Repositories, were used to apply similar analyses to international IRs in what we believe is the first such census of its kind. 4,970 repositories from 125 countries were examined for the presence of software, along with repository-based metadata for potentially correlating factors. It appears that much more could be done to provide trivial technical updates to RIS platforms to recognise software as distinct and recordable research output in its own right. We will discuss the implications of these findings with a focus on the apparent lack of recognition of software as discrete output in the research process.


Session 3B: DevOps & Cyberinfrastructure

As social media platforms continue to shape modern communication, understanding and harnessing the wealth of information they offer has become increasingly crucial. The nature of the data that these platforms provide makes them an emerging resource for data collection to conduct research ranging from measuring public sentiment about a particular trend in society to drafting major policy by governing agencies. This paper presents PULSE, a powerful tool developed by Decision Theater™ (DT) at Arizona State University in the United States, designed to extract valuable insights from Twitter. PULSE provides researchers and organizations with access to a curated dataset of public opinions and discussions across diverse research areas. Further, the tool uses various machine learning and data analytical algorithms to derive valuable insights on the subject under research. These insights are displayed in an interactive dashboard that helps researchers draw appropriate conclusions. The paper also illustrates the technical functionalities and visualization capabilities of the tool with a case study on Hurricane Laura.

The Globus platform enables research applications developed by research teams to leverage data and compute services across many tiers of service—from personal computers and local storage to national supercomputing centers—with minimal deployment and maintenance burden. Globus is operated by the University of Chicago and is used by nearly all R1 universities, national labs, and supercomputing centers in the United States, as well as many smaller institutions.

In this talk, we’ll introduce the Globus Platform-as-a-Service, including how to register an application and how to access Globus APIs using our Python SDK. We will present examples of how the various Globus services, interfaces, and tools may be used to develop research applications. We will demonstrate authentication and access control with Globus’s Auth and Groups APIs; making data findable and accessible using Globus guest collections, data transfer API, and indexed Search API; and automating research with Globus Flows and Compute APIs.
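As an illustration of the kind of platform usage the talk covers, the sketch below uses the Globus Python SDK (globus_sdk) to log in with a native-app flow and submit a transfer between two collections. The client ID, collection UUIDs, and paths are placeholders, and this is a minimal example rather than the presenters' demo code.

```python
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"   # placeholder: register an app at developers.globus.org
SRC_ENDPOINT = "SOURCE-COLLECTION-UUID"   # placeholder collection UUIDs
DST_ENDPOINT = "DESTINATION-COLLECTION-UUID"

# Native-app login flow: print a URL, paste back the authorization code.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Please log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Auth code: "))

# Use the transfer token to drive the Transfer API.
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]
tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token))

# Build and submit a transfer task between the two collections.
task = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT, label="example transfer")
task.add_item("/shared/dataset/results.csv", "/project/incoming/results.csv")
print("Submitted task:", tc.submit_transfer(task)["task_id"])
```

The same pattern extends to the Groups, Search, Flows, and Compute clients exposed by the SDK.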

Managing massive volumes of data and effectively making it accessible to researchers poses significant challenges and is a barrier to scientific discovery. In many cases, critical data is locked up in unwieldy file formats or one-off databases and is too large to effectively process on a single machine. This talk explores the role of Kubernetes, an open-source container orchestration platform, in addressing research data management challenges. I will discuss how we are using a set of publicly available open-source and home-grown tools in the National Renewable Energy Lab (NREL) Data, Analysis, and Visualization (DAV) group to help researchers overcome data-related bottlenecks.

The talk will begin by providing an overview of the data challenges faced in research data management, including data storage, processing, and analysis. I will highlight Kubernetes' ability to handle large-scale data by leveraging containerization and distributed computing, including distributed storage. Kubernetes allows researchers to encapsulate data processing infrastructure and workflows into portable containers, enabling reproducibility and ease of deployment. Kubernetes can then schedule and manage the resource allocation of these containers to enable efficient utilization of limited computing resources, leading to more efficient data processing and analysis.
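To make the containerized-workflow idea concrete, here is a minimal sketch (not the NREL tooling itself) that uses the official Kubernetes Python client to run a one-off data-processing container as a Job; the namespace, image, and command are hypothetical.

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

# A single containerized processing step; image and command are placeholders.
container = client.V1Container(
    name="etl-step",
    image="registry.example.org/research/etl:1.0",
    command=["python", "process.py", "--input", "/data/raw", "--output", "/data/clean"],
    resources=client.V1ResourceRequirements(requests={"cpu": "2", "memory": "4Gi"}),
)

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="etl-example"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=2,  # retry failed pods twice before giving up
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="research-data", body=job)
```

Because the processing logic ships as an image, the same Job definition can be rerun later or on another cluster, which is where the reproducibility benefit comes from.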

I will discuss some limitations of traditional, siloed approaches to dealing with data and emphasize the need for solutions which foster collaboration. I will highlight how we are using Kubernetes at NREL to facilitate data sharing and cooperation among research teams. Kubernetes' flexible architecture enables the deployment of shared computing environments, such as Apache Superset, where researchers can seamlessly access and analyze shared datasets. Providing the ability to have one research team easily consume data generated by another, utilizing Kubernetes as a central data platform, is one of the major wins we've encountered by adopting the platform.

Finally, I will showcase real-world use cases from NREL where we have used Kubernetes to solve some persistent data challenges involving large volumes of sensor and monitoring data. I will discuss the challenges we encountered when creating our cluster and making it available as a production-ready resource. I will also discuss the specific suite of tools we have deployed in our infrastructure, including Postgres and Apache Druid for columnar and timeseries data and Redpanda (Kafka-compatible) for streaming data, and the process that went into selecting these tools.

Attendees of this talk will gain insights into how Kubernetes can address data challenges in research data management. This talk aims to provide researchers with a framework and a set of building blocks which have worked well for us, in order for them to unlock the full potential of their data in the era of software-enabled discovery.

Conda is a multi-platform and language-agnostic packaging and runtime environment management ecosystem. This talk will briefly introduce the conda ecosystem and the needs it meets, and then focus on work and enhancements from the past two years. These include speed improvements in creating packages and in deploying them; work to establish and document standards that broaden the ecosystem; new support for plugins to make conda widely extensible; and a new governance model that incorporates the broader conda community.

The conda ecosystem enables software developers to create and publish easy-to-install versions of their software. Conda software packages incorporate all software dependencies (written in any language) and enable users to install software and all dependencies with a single command. Conda is used by over 30 million users worldwide. Conda packages are available in both general and domain-specific repositories. conda-forge is the largest conda-compatible package repository, with over 20,000 available packages for Linux, Windows, and macOS. Bioconda is the largest domain-specific repository, with over 8,000 packages for the life sciences.

Conda also enables users to set up multiple parallel runtime environments, each running different combinations and versions of software, including different language versions. Conda is used to support multiple projects that require conflicting versions of the same software package.


Session 4: Panel - Research & Development Tooling 1

Community resilience assessment is critical for the anticipation, prevention and mitigation of natural and anthropogenic disaster impacts. In the digital age, this requires reliable and flexible cyberinfrastructure capable of supporting research and decision processes along multiple simultaneous, interconnected concerns. To address this need, the National Center for Supercomputing Applications (NCSA) developed the Interdependent Networked Community Resilience Modeling Environment (IN-CORE) as part of the NIST-funded Center of Excellence for Risk-Based Community Resilience Planning (CoE), headquartered at Colorado State University. The Community App is a web-based application that takes a community through the resilience planning process, using IN-CORE analyses as the measurement science for community resilience. Complex workflows are managed by DataWolf, a scientific workflow management system, running jobs on the IN-CORE platform utilizing the underlying Kubernetes cluster resources. Using the Community App, users can run realistic and complex scenarios and visualize the results to understand their resilience to different hazards and enhance their decision-making capabilities.

The INTERSECT Software framework project aims to create an open federated library that connects, coordinates, and controls systems in the scientific domain. It features the Adapter, a flexible and extensible interface inspired by the Adapter design pattern in object-oriented programming. By utilizing Adapters, the INTERSECT SDK enables effective communication and coordination within a diverse ecosystem of systems. This adaptability facilitates the execution of complex operations within the framework, promoting collaboration and efficient workflow management in scientific research. Additionally, the generalizability of Adapters and their patterns enhances their utility in other scientific software projects and challenges.

Full-stack research software projects typically include several components and have many dependencies. New projects benefit from co-development of these components within a well-structured monolith. While this is preferred, over time it can become a burden to deploy in different contexts and environments. What we would like is to independently deploy components to reduce size and complexity. Maintaining separate packages, however, allows for developmental drift and other problems. So-called 'monorepos' allow for the best of both approaches, but not without their own difficulties. However, there is almost no formal treatment of this particular dilemma in the literature. The technology industry has started using monorepos to solve similar challenges, but perhaps in the academic context we should be cautious not to simply replicate industry practices. This short paper invites the research software engineering (RSE) community into a discussion of the positives and negatives of structuring projects as monorepos of discrete packages.

Documentation is a crucial component of software development that helps users with installation and usage of the software. Documentation also helps onboard new developers to a software project with contributing guidelines and API information. The INTERSECT project is an open federated hardware/software library to facilitate the development of autonomous laboratories. A documentation strategy using Sphinx has been utilized to help developers contribute to source code and to help users understand the INTERSECT Python interface. Docstrings as well as reStructuredText files are used by Sphinx to automatically compile HTML and PDF files which can be hosted online as API documentation and user guides. The resulting documentation website is automatically built and deployed using GitLab runners to create Docker containers with NGINX servers. The approach discussed in this paper to automatically deploy documentation for a Python project can improve the user and developer experience for many scientific projects.
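As a small illustration of the docstring-driven approach described above (not taken from the INTERSECT codebase; the function and its parameters are hypothetical), a function documented with reStructuredText fields like this can be rendered into the HTML and PDF output by Sphinx's autodoc extension:

```python
def connect(host: str, port: int = 1883, timeout: float = 5.0) -> "Connection":
    """Open a connection to a message broker.

    :param host: Hostname or IP address of the broker.
    :param port: TCP port the broker listens on, defaults to 1883.
    :param timeout: Seconds to wait before giving up on the handshake.
    :raises ConnectionError: If the broker cannot be reached.
    :return: An open :class:`Connection` handle.
    """
    ...
```

With `sphinx.ext.autodoc` enabled in `conf.py`, a single `automodule` directive in a .rst page pulls such docstrings into the API reference, so the documentation stays next to the code it describes.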

Collaboration networks for university research communities can be readily rendered through the interrogation of coauthorships and coinvestigator data. Subsequent computation of network metrics such as degree or various centralities offer interpretations on collaborativeness and influence and can also compose distributions which can be used to contrast different cohorts. In prior work, this workflow provided quantitative evidence for ROI of centralized computing resources in contrasting researchers with and without cluster accounts, where significance was found across all metrics. In this work, two similar cohorts, those with RSE-type roles at the university and everyone else, are contrasted in a similar vein. While a significantly higher degree statistic for the RSE cohort suggests its collaborative value, a significantly lower betweenness centrality distribution indicates a target for potential impact through the implementation of a centralized RSE network.


Session 5: Panel - Sponsors

  • Princeton University, Ian Cosden
  • Los Alamos National Laboratory, Angela Herring
  • Jetstream2, Zachary Graber
  • Sandia National Laboratories, Miranda Mundt
  • National Center for Supercomputing Applications, Kenton McHenry

Session 6A: Software Engineering Practice

Evidence-based practice (EBP) in software engineering aims to improve decision-making in software development by complementing practitioners' professional judgment with high-quality evidence from research. We believe the use of EBP techniques may be helpful for research software engineers (RSEs) in their work to bring software engineering best practices to scientific software development. In this study, we present an experience report on the use of a particular EBP technique, rapid reviews, within an RSE team at Sandia National Laboratories, and present practical recommendations for how to address barriers to EBP adoption within the RSE community.

Sandia National Laboratories is a premier United States national security laboratory which develops science-based technologies in areas such as nuclear deterrence, energy production, and climate change. Computing plays a key role in its diverse missions, and within that environment, Research Software Engineers (RSEs) and other scientific software developers utilize testing automation to ensure quality and maintainability of their work. We conducted a Participatory Action Research study to explore the challenges and strategies for testing automation through the lens of academic literature. Through the experiences collected and comparison with open literature, we identify these challenges in testing automation and then present strategies for mitigation grounded in evidence-based practice and experience reports that other, similar institutions can assess for their automation needs.

We provide an overview of the software engineering efforts and their impact in QMCPACK, a production-level ab initio Quantum Monte Carlo open-source code targeting high-performance computing (HPC) systems. Aspects included are: (i) strategic expansion of continuous integration (CI) targeting CPUs, using GitHub Actions runners, and NVIDIA and AMD GPUs in pre-exascale systems, using self-hosted hardware; (ii) incremental reduction of memory leaks using sanitizers; (iii) incorporation of Docker containers for CI and reproducibility; and (iv) refactoring efforts to improve maintainability, testing coverage, and memory lifetime management. We quantify the value of these improvements by providing metrics to illustrate the shift towards a predictive, rather than reactive, sustainable maintenance approach. Our goal, in documenting the impact of these efforts on QMCPACK, is to contribute to the body of knowledge on the importance of research software engineering (RSE) for the sustainability of community HPC codes and scientific discovery at scale.

Research software plays a crucial role in advancing research across many domains. However, the complexity of research software often makes it challenging for developers to conduct comprehensive testing, which leads to reduced confidence in the accuracy of the results produced. To address this concern, developers have employed peer code review, a well-established software engineering practice, to improve the reliability of software. However, peer code review is less prevalent in research software than in open-source or traditional software domains. This presentation addresses this topic by describing a previous investigation of peer code review in research software. Then it concludes with a description of our current work and ways for interested people to get involved.

In our previous study, we interviewed and surveyed 84 developers of research software. The results show research software teams do perform code reviews, albeit without a formalized process, proper organization, or adequate human resources dedicated to conducting reviews effectively. In the talk, we will describe the results in more detail. The application of peer code review holds promise for improving the quality of research software, thereby increasing the reliability of research outcomes. Additionally, adopting peer code review practices enables research software developers to produce code that is more readable, understandable, and maintainable.

This talk will then briefly outline our current work to engage interested participants. Our current work focuses on peer code review processes as performed specifically by Research Software Engineers (RSEs). The research questions we aim to address in this ongoing study are: RQ1: What processes do RSEs follow when conducting peer code review?; RQ2: What challenges do RSEs encounter during peer code review?; and RQ3: What improvements are required to enhance the peer code review process for RSEs?

To answer these questions, we plan to conduct the following phases of the project: Phase 1: Surveying RSEs to Gain Insights into Peer Code Review Practices; Phase 2: Conducting Interviews and Focus Groups with RSEs to Explore Peer Code Review Experiences; Phase 3: Observational Study of RSE Peer Code Review.

There are numerous places for members of the US-RSE community to get involved in our research. We will highlight these opportunities in the talk.

Scientific software plays a crucial and ever-growing role in various fields by facilitating complex modeling, simulation, exploration, and data analysis. However, ensuring the correctness and reliability of these software systems presents significant challenges due to their computational complexity, their explorative nature, and the lack of explicit specifications or even documentation. Traditional testing methods fall short in validating scientific software comprehensively; in particular, explorative software and simulation tools suffer from the Oracle Problem. In fact, Segura et al. show that scientific and explorative software systems are inherently difficult to test. In this context, metamorphic testing is a promising approach that addresses these challenges effectively. By exploiting the inherent properties within scientific problems, metamorphic testing provides a systematic means to validate the accuracy and robustness of scientific software while avoiding the challenges posed by the Oracle Problem. The proposed talk will highlight the importance of metamorphic testing in scientific software, emphasizing its ability to uncover subtle bugs, enhance result consistency, and show approaches for a more rigorous and systematic software development process in the scientific domain.
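For readers new to the idea, a metamorphic relation sidesteps the Oracle Problem by checking how outputs must relate across transformed inputs rather than checking an exact expected value. A minimal pytest-style sketch for a hypothetical simulation kernel `total_energy` (not from the talk) might look like this:

```python
import numpy as np

def total_energy(positions: np.ndarray) -> float:
    """Stand-in for an expensive simulation kernel (hypothetical pairwise 1/r energy)."""
    diffs = positions[:, None, :] - positions[None, :, :]
    n = len(positions)
    r = np.sqrt((diffs ** 2).sum(-1) + np.eye(n))  # eye avoids division by zero on the diagonal
    return float((1.0 / r)[~np.eye(n, dtype=bool)].sum() / 2)

def test_energy_is_invariant_under_rigid_translation():
    # Metamorphic relation: translating every particle by the same vector must not
    # change the total energy, even though we cannot state the "correct" value.
    rng = np.random.default_rng(0)
    positions = rng.normal(size=(20, 3))
    shifted = positions + np.array([5.0, -3.0, 2.0])
    assert np.isclose(total_energy(positions), total_energy(shifted))

def test_energy_is_invariant_under_particle_reordering():
    # Second relation: permuting the particle order must leave the result unchanged.
    rng = np.random.default_rng(1)
    positions = rng.normal(size=(20, 3))
    permuted = positions[rng.permutation(len(positions))]
    assert np.isclose(total_energy(positions), total_energy(permuted))
```

A suite of such relations exercises the code without ever needing a ground-truth result.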

Numerical research data is often saved in file formats such as CSV for simplicity and getting started quickly, but challenges emerge as the amount of data grows. Here we describe the motivation and process for how we moved from initially saving data across many disparate files, to instead utilizing a centralized PostgreSQL relational database. We discuss our explorations into the TimescaleDB extension, and our eventual decision to use native PostgreSQL with a table-partitioning schema, to best support our data access patterns. Our approach has allowed for flexibility with various forms of timestamped data while scaling to billions of data points and hundreds of experiments. We also describe the benefits of using a relational database, such as the ability to use an open-source observability tool (Grafana) for live monitoring of experiments.
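As a hedged sketch of the kind of schema described above (table, column, and partition names are illustrative, not the production schema), PostgreSQL's native declarative partitioning by time range can be driven from Python with psycopg2:

```python
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS measurements (
    experiment_id  INTEGER          NOT NULL,
    recorded_at    TIMESTAMPTZ      NOT NULL,
    channel        TEXT             NOT NULL,
    value          DOUBLE PRECISION
) PARTITION BY RANGE (recorded_at);

-- One partition per month keeps indexes small and lets old data be detached cheaply.
CREATE TABLE IF NOT EXISTS measurements_2023_07
    PARTITION OF measurements
    FOR VALUES FROM ('2023-07-01') TO ('2023-08-01');

CREATE INDEX IF NOT EXISTS idx_meas_exp_time
    ON measurements_2023_07 (experiment_id, recorded_at);
"""

with psycopg2.connect("dbname=labdata user=labuser") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(DDL)
        cur.execute(
            "INSERT INTO measurements VALUES (%s, %s, %s, %s)",
            (42, "2023-07-15T12:00:00Z", "temperature", 21.7),
        )
```

Tools such as Grafana can then query the same tables directly for live monitoring.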

Software sustainability is critical for Computational Science and Engineering (CSE) software. Highly complex code makes software more difficult to maintain and less sustainable. Code reviews are a valuable part of the software development lifecycle and can be executed in a way to manage complexity and promote sustainability. To guide the code review process, we have developed a technique that considers cyclomatic complexity levels and changes during code reviews. Using real-world examples, this paper provides analysis of metrics gathered via GitHub Actions for several pull requests and demonstrates the application of this approach in support of software maintainability and sustainability.
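As one hedged example of how such a gate can be scripted (using the radon library; the threshold and invocation are illustrative and not the exact check described in the paper), a CI step could compute per-function cyclomatic complexity for the files changed in a pull request and fail above a chosen level:

```python
import sys
from pathlib import Path

from radon.complexity import cc_visit, cc_rank

THRESHOLD = 10  # flag anything above "moderate" complexity; tune per project

def check(paths):
    failures = []
    for path in paths:
        source = Path(path).read_text()
        for block in cc_visit(source):  # one entry per function, method, or class
            if block.complexity > THRESHOLD:
                failures.append(
                    f"{path}:{block.lineno} {block.name} "
                    f"CC={block.complexity} (rank {cc_rank(block.complexity)})"
                )
    return failures

if __name__ == "__main__":
    problems = check(sys.argv[1:])  # e.g. the Python files touched by the pull request
    print("\n".join(problems) or "All functions within the complexity budget.")
    sys.exit(1 if problems else 0)
```

Run from a GitHub Actions step, a nonzero exit code surfaces the complexity change directly in the review.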

Academic research collaborations involving members from multiple independent, often international institutions are inherently decentralized, and they encounter the same general challenges in sharing data and collaborating online as does the internet at large when there are no viable or desirable centralized solutions. At NCSA we have developed a free and open source full-stack cyberinfrastructure (CI) for research collaborations based on OpenStack and Kubernetes that is reproducible, flexible, portable, and sustainable. We are embracing the lessons and technology of the (re)decentralized web to curate a suite of open source tools and web services that prioritize data ownership to give researchers as much control as possible over their data, communications, and access controls. I will present the architecture of our framework as well as example use cases from existing projects ranging from astronomy to nuclear physics, showcasing the flexible deployment system and some collaborative tools and services for identity and access management, messaging, data sharing, code development, documentation, high-performance computing and more. By the end of the talk I hope you will have some appreciation for the value decentralized tech brings to the research enterprise and the potential for innovation that lies in the creative integration of existing federated and peer-to-peer applications and protocols.


Session 6B: Research & Development Tooling 2

Visual Studio Code (VSCode) has emerged as one of the most popular development tools among professional developers and programmers, offering a versatile and powerful coding environment. However, configuring and setting up VSCode to work effectively within the unique environment of a shared High-Performance Computing (HPC) cluster remains a challenge. This talk discusses the configuration and integration of VSCode with the diverse and demanding environments typically found on HPC clusters. We demonstrate how to configure and set up VSCode to take full advantage of its capabilities while ensuring seamless integration with HPC-specific resources and tools. Our objective is to enable developers to harness the power of VSCode for HPC applications, resulting in improved productivity, better code quality, and accelerated scientific discovery.

Machine learning is a groundbreaking tool for tackling high-dimensional datasets with complex correlations that humans struggle to comprehend. An important nuance of ML is the difference between using a model for interpolation or extrapolation, meaning either inference or prediction. This work will demonstrate visually what interpolation and extrapolation mean in the context of machine learning using astartes, a Python package that makes this distinction easy to tackle in ML modeling. Many different sampling approaches are made available with astartes, so using a very tangible dataset - a fast food menu - we can visualize how different approaches differ and then train and compare ML models.
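A minimal sketch of the kind of comparison the talk will show, assuming the scikit-learn-style `train_test_split` interface and `sampler` keyword described in the astartes documentation (the toy feature values are invented; check the package docs for exact signatures and available sampler names):

```python
import numpy as np
from astartes import train_test_split  # assumed import path per the astartes docs

# Toy "fast food menu" features: calories, fat (g), sugar (g), protein (g) per item.
X = np.array([
    [250,  9,  3, 12],
    [540, 27, 10, 25],
    [150,  2, 20,  2],
    [380, 18,  5, 15],
    [ 90,  0, 22,  1],
    [700, 40,  8, 30],
    [320, 12,  6, 14],
    [210,  7, 25,  4],
], dtype=float)

# Random split: held-out items are scattered through feature space, so model
# evaluation measures interpolation.
X_train_rand, X_test_rand = train_test_split(X, sampler="random", train_size=0.75)

# Kennard-Stone split: the most mutually distant items go to training, so the test
# items sit inside the covered region; other samplers in the package instead hold
# out whole regions of feature space to probe extrapolation.
X_train_ks, X_test_ks = train_test_split(X, sampler="kennard_stone", train_size=0.75)
```

Training the same model on each split and comparing test error is what makes the interpolation-versus-extrapolation gap visible.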

For researchers using R to do work in stylometry, the Stylo package in R is indispensable, but it also has some limitations. The Stylo2gg package addresses some of these limitations by extending the usefulness of Stylo. Among other things, Stylo2gg adds logging and replication of analyses, keeping necessary files and introducing a systematic way to reproduce past work. With visualization as its initial purpose, Stylo2gg also makes exploring stylometric data easy, providing options for labeling, highlighting subgroups, and double coding data for added legibility in black and white or in color. Finally, as hinted by the name, the conversion of graphics from base R into ggplot2 changes the style of the output and introduces more options to extend analyses with many other packages and add-ons. The reproducible notebook shown here, `01-stylo2gg.qmd`, or rendered as `01-stylo2gg.html`, walks through much of the package, including many features that were added in the past year.

Airtable is an increasingly popular format for entering and storing research data, especially in the digital humanities. It combines the simplicity of spreadsheet formats like CSV with a relational database's ability to model relationships; enterers or viewers of the data do not need to have programming knowledge. The Center for Digital Research in the Humanities at the University of Nebraska uses Airtable data for two projects on which I work as a developer. African Poetics has data focusing on African poets and newspaper coverage of them, and Petitioning for Freedom has data on habeas corpus petitions and involved parties. At the CDRH, our software can take data in many formats, including CSV, and ingest it for an API based on Elasticsearch. This data is then available for search and discovery through web applications built on Ruby on Rails. The first step in ingesting the Airtable data into our system is to download it. I will cover the command line tools that can do this, the formats that can be downloaded (JSON turned out to be the most convenient), and the requirements for authentication. Python (aided by Pandas) can transform this data into the CSV format that our back-end software requires. I will discuss how to rename and delete columns, change data back into JSON (which is sometimes more parsable by our scripts), and clean troublesome values like blanks and NaNs. One advancement of Airtable over CSV is join tables that have similar functionality to SQL databases. Incorporating them into other projects has particular challenges. When downloaded directly from Airtable, the join data is in a format that cannot be interpreted by humans or programs other than Airtable. But it can be converted (with the help of some processing within Airtable) into formats that can be parsed by external scripts so that it can be human-readable. With these transformations, our software can use the table to populate API fields and parse it into arrays and hashes to replicate relationships within the data. Finally, I will discuss the advantages and disadvantages of Airtable for managing data, from the perspective of a developer who uses the data on the front and back end of web applications.
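To make the download-and-flatten step concrete, here is a hedged sketch using the Airtable REST API directly with requests and pandas rather than the command-line tools discussed in the talk; the base ID, table name, token, and column names are placeholders, and production code should also respect the API's rate limits.

```python
import pandas as pd
import requests

BASE_ID = "appXXXXXXXXXXXXXX"        # placeholder Airtable base ID
TABLE = "Petitions"                  # placeholder table name
TOKEN = "YOUR_PERSONAL_ACCESS_TOKEN"

url = f"https://api.airtable.com/v0/{BASE_ID}/{TABLE}"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Page through all records (Airtable returns at most 100 per request).
records, params = [], {}
while True:
    page = requests.get(url, headers=headers, params=params).json()
    records.extend(page["records"])
    if "offset" not in page:
        break
    params["offset"] = page["offset"]

# Flatten the nested "fields" dicts into a tabular frame for downstream ingest.
df = pd.json_normalize([r["fields"] for r in records])
df = df.rename(columns={"Case ID": "case_id"})             # hypothetical column rename
df = df.drop(columns=["Internal Notes"], errors="ignore")  # drop a column the API does not need
df = df.fillna("")                                         # clean NaNs and blanks before export
df.to_csv("petitions.csv", index=False)
```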

Introduction: We are pleased to submit a proposal to US-RSE to include Anotemos, a media annotation software developed by GRIP (Grasping the Rationality of Instructional Practice), an education research lab at the University of Michigan. Anotemos is designed to enhance research analysis by providing an efficient media annotation solution. This proposal highlights the key features and benefits of Anotemos, showcasing its relevance to the research community.

Background and Objectives: Anotemos addresses the challenges faced by researchers in analyzing multimedia data, enabling them to extract valuable insights efficiently. Anotemos aims to streamline and automate the annotation workflow, offering researchers a user-friendly interface and advanced features for seamless annotation.

Key Features and Benefits: Using Anotemos, researchers can create Commentaries that are centered around a Multimedia Item. Anotemos offers a range of features that set it apart: (i) Comprehensive Annotation: Researchers can annotate various media types, including images, videos, and audio, with the support of diverse customizable annotation types such as text, icons, drawings, bounding boxes, and audio recordings both on-screen and off-screen; (ii) Real-time Collaboration: Anotemos enables multiple researchers to collaborate in real-time simultaneously, fostering knowledge exchange, reducing redundancy, and improving productivity; (iii) Share & Publish: Anotemos offers both private and public sharing, enabling the secure sharing of Commentaries with select individuals or the wider research community; (iv) Customizable Workflows: Anotemos supports the creation of customized workflows by making it easier to create identical Commentaries, and manage different sets of collaborators using Commentary sections, enabling researchers to tailor the platform to their projects; (v) Analysis & Reports: Using Anotemos, users can perform an in-depth analysis of annotated data and generate comprehensive reports, providing valuable insights and facilitating data-driven decision-making in research projects; (vi) Integrations: Anotemos offers seamless integration with Learning Tools Interoperability (LTI), allowing users to easily embed and access Anotemos Commentaries within LTI-compliant learning management systems. Furthermore, Anotemos supports LaTeX code, empowering users to annotate and display mathematical equations and formulas with precision. Additionally, Anotemos can be embedded into Qualtrics Surveys enabling the researchers to collect and analyze the survey data along with Anotemos annotations.

Conclusion: We believe that Anotemos has the potential to significantly enhance research analysis by providing an efficient, user-friendly, and customizable media annotation solution. It is developed using the Angular-Meteor framework, leveraging its robustness, scalability, and real-time capabilities. The software is currently in beta testing, with positive feedback from researchers in diverse fields. Its advanced features make it an ideal tool for researchers across various domains. We request the consideration of Anotemos for presentation at the US-RSE Conference, where research software engineers can gain insights into its features, benefits, and potential impact on research projects.

Assembly and analysis of metagenomics datasets along with protein sequence analysis are among the most computationally demanding tasks in bioinformatics. The ExaBiome project is developing GPU-accelerated solutions for exascale-era machines to tackle these problems at unprecedented scale. The algorithms involved in these software pipelines do not fit the typical portfolio of algorithms that are amenable to GPU porting; instead, they are irregular and sparse in nature, which makes GPU porting a significant challenge. Moreover, it is a challenge to integrate complex GPU kernels within a CPU-optimized software infrastructure that depends largely on dynamic data structures. This talk will give an overview of the development of sequence alignment and local assembly GPU kernels that have been successfully ported and optimized for GPU-based systems, and the integration of these kernels within the ExaBiome software stack, demonstrating unprecedented capability for solving scientific problems in bioinformatics.

In codes used to simulate multi-physics hydrodynamics, it is common for variables to reside on different parts of a mesh, or on different, but related, meshes. For example, in some codes all variables may reside on cell centers, while in others, scalars may reside on cell centers, vectors on cell faces and tensors on cell corners, etc. Further, different methods may be used for the calculation of derivatives or divergences of different variables, or for the interpolation or mapping of variables from one mesh to another. This poses a challenge for libraries of 3D physics models, where the physical processes have dependencies on space, and therefore, mesh dependency. For such libraries to be able to support a general set of multi-physics hydrodynamics host codes, they must be able to represent the physics in a way that is independent of mesh-related details. To solve this problem, we present a Multi-Mesh Operations (MMOPS) library for the mesh-agnostic representation of 3D physics.

MMOPS is a light-weight C++ abstraction providing an interface for the development of general purpose calculations between variables of different user-defined types, while deferring the specification of the details of these types to be provided by the host code. As an example, consider three variables, a, b, c representing a vector and two scalar quantities residing on the cell corners, cell centers and cell corners of a mesh, respectively. MMOPS provides a `class mapped_variable` to represent these variables, for which the host code provides an arbitrary compile-time tag indicating where the variable data resides on the mesh, and a mapping function method used to indicate how to map the variable from one part of the mesh to another, using tag dispatching under the hood. This way, we can perform operations using `a`, `b`, `c`, namely `mapped_variable` instantiations representing a, b, c, such as `c(i) = a(c.tag(), i, dir) + b(c.tag(), i)` where, `i` is an index representing the ith cell on `c`’s mesh, and `dir` is an index representing the desired directional component of vector `a`. In general, if either `a` or `b` have a different mesh representation than `c`, then they get mapped to `c`’s mesh using the mapping functions provided by the host code when constructing `a` and `b`, which can be different. In the above example, since `a` is on the same mesh as `c`, it doesn’t get mapped but instead is directly accessed, and since `b` lives on a different mesh than `c`, it gets mapped to `c`’s mesh, i.e. to cell corners.

We demonstrate that MMOPS provides a zero to nearly zero-cost abstraction of the described functionality, in the sense that it incurs little to no performance penalty (depending on compiler) compared to a manual benchmark implementation of the same functionality. A description of the library, usage examples, and benchmarking tests will also be presented.

The High Throughput Discovery Lab at the Rosalind Franklin Institute aims to iteratively expand the reaction toolkit used in drug discovery to enable new regions of biologically relevant chemical space to be explored. The current reaction toolkit, which underpins traditional drug discovery workflows, has to date been dominated by a small number (<10) of reaction classes that have remained largely unchanged for 30 years. It has been argued that this has contributed to attrition and stagnation in drug discovery. We are working to create a semi-automated approach to explore large regions of chemical space to identify novel bioactive molecules. The approach involves creating arrays of hundreds of reactions, in which different pairs of substrates are combined. The design of subsequent reaction arrays is informed by the biological activity of the products that are obtained. However, the multi-step laboratory process to create, purify and test the reaction products introduces high requirements for data linkage, making data management incredibly important.

This talk focuses on how we built the data infrastructure using a mix of open-source and licensed technology. We will discuss how the Franklin aims to increase automation in data processing and how the technology we have implemented will make this possible.


Session 7A: ML/AI

Increasingly, scientific research teams desire to incorporate machine learning into their existing computational workflows. Codebases must be augmented, and datasets must be prepared for domain-specific machine learning processes. Team members involved in software development and data maintenance, particularly research software engineers, can foster the design, implementation, and maintenance of infrastructures that allow for new methodologies in the pursuit of discovery. In this paper, we highlight some of the main challenges and offer assistance in planning and implementing machine learning projects for science.

Artificial intelligence (AI) and machine learning (ML) have been shown to be increasingly helpful tools in a growing number of use-cases relevant to scientific research, despite significant software-related obstacles. There exist large technical costs to setting up, using, and maintaining AI/ML models in production. This often prevents researchers from utilizing these models in their work. The growing field of machine learning operations (MLOps) aims to automate much of the AI/ML life cycle while increasing access to these models. This paper presents the initial work in creating a nuclear energy MLOps platform for use by researchers at Idaho National Laboratory (INL) and aims to reduce the barriers of using AI/ML in scientific research. Our goal is to promote the integration of the latest AI/ML technologies into researchers' workflows and create more opportunity for scientific innovation. In this paper we discuss how our MLOps efforts aim to increase usage and the impact of AI/ML models created by researchers. We also present several use-cases that are currently integrated. Finally, we evaluate the maturity of our project as well as our plans for future functionality.

Machine learning models, specifically neural networks, have garnered extensive recognition due to their remarkable performance across various domains. Nevertheless, concerns pertaining to their robustness and interpretability have necessitated the immediate requirement for comprehensive methodologies and tools. This scholarly article introduces the "Adversarial Observation" framework, which integrates adversarial and explainable techniques into the software development cycle to address these crucial aspects.

Industry practitioners have voiced an urgent need for tools and guidance to fortify their machine learning systems. Research studies have underscored the fact that a substantial number of organizations lack the necessary tools to tackle adversarial machine learning and ensure system security. Furthermore, the absence of consensus on interpretability in machine learning presents a significant challenge, with minimal agreement on evaluation benchmarks. These concerns highlight the pivotal role played by the Adversarial Observation framework.

The Adversarial Observation framework provides model-agnostic algorithms for adversarial attacks and interpretable techniques. Two notable methods, namely the Fast Gradient Sign Method (FGSM) and the Adversarial Particle Swarm Optimization (APSO) technique, have been implemented. These methods reliably generate adversarial noise, enabling the evaluation of model resilience against attacks and the training of less vulnerable models.
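The framework's own implementations are not reproduced here, but as a generic illustration of what FGSM does, the following PyTorch sketch perturbs an input in the direction of the sign of the loss gradient; model, data, and the epsilon value are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor,
                eps: float = 0.03) -> torch.Tensor:
    """Return an adversarially perturbed copy of x (generic FGSM, not the framework's API)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step each input feature by +/- eps along the sign of the loss gradient.
    perturbed = x_adv + eps * x_adv.grad.sign()
    return perturbed.clamp(0.0, 1.0).detach()  # keep inputs in the valid range

# Usage: compare predictions on clean versus perturbed inputs to gauge robustness.
# preds_clean = model(x).argmax(dim=1)
# preds_adv   = model(fgsm_attack(model, x, y)).argmax(dim=1)
```

The drop in accuracy between the two prediction sets is one simple robustness signal such a framework can report.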

In terms of explainable AI (XAI), the framework incorporates activation mapping to visually depict and analyze significant input regions driving model predictions. Additionally, a modified APSO algorithm fulfills a dual purpose by determining global feature importance and facilitating local interpretation. This systematic assessment of feature significance reveals underlying decision rules, enhancing transparency and comprehension of machine learning models.

By incorporating the Adversarial Observation framework, organizations can assess the resilience of their models, address biases, and make well-informed decisions. The framework plays a pivotal role in the software development cycle, ensuring the highest standards of transparency and reliability. It empowers a deeper understanding of models through visualization, feature analysis, and interpretation, thereby fostering trust and facilitating the responsible development and deployment of AI technologies.

In conclusion, the Adversarial Observation framework represents a crucial milestone in the development of trustworthy and dependable AI systems. Its integration of robustness, interpretability, and fairness into the software development cycle enhances transparency and reliability. By addressing vulnerabilities and biases, organizations can make well-informed decisions, improve fairness, and establish trust with stakeholders. The significance of the framework is further underscored by the pressing need for tools expressed by industry practitioners and the lack of consensus on interpretability in machine learning. Ultimately, the Adversarial Observation framework contributes to the responsible development and deployment of AI technologies, fostering public trust and promoting the adoption of AI systems in critical domains.

The expected long-term outcome of our research is to determine the ways in which a focus on scientific model code quality can improve both scientific reliability and model performance. In this work, we focus on climate models which are complex software implementations of the scientific and mathematical abstractions of systems that govern the climate.

Climate models integrate the physical understanding of climate and produce simulations of climate patterns. The model code is often hundreds of thousands of lines of highly sophisticated software, written in high-performance languages such as Fortran. These models are developed with teams of experts, including climate scientists and high-performance computing developers. The sheer complexity of the software behind the climate models (which have evolved over many years) makes them vulnerable to performance, usability, and maintainability bugs that hinder performance and weaken scientific confidence. The social structures behind the software and how the models function have been examined, but the complex interactions of domain experts and software experts are poorly understood.

The expected short-term outcomes of our research are to a) develop a set of software quality indicators for the climate models, providing model maintainers and users with specific areas in which to improve the model code; and b) develop new techniques for analyzing large bodies of unstructured text to explain how users perceive a research software project's capabilities and failings, aligned with international efforts to improve the quality of scientific software.

We follow two main approaches. (1) Analytics: analysis of climate models and their artifacts. These artifacts include the software code itself, including the way the model is deployed and run with end users; the bug reports and other feedback on the software quality; and the simulation testing used to validate the model outputs. Our analysis will be incorporated into a Fortran analysis tool to identify Fortran quality issues automatically. This analysis is incomplete, however, without a clear understanding of the social context in which it was produced. (2) Social Context: we examine the socio-technical artifacts created around climate models. These include a) outputs from workshops, interviews, and surveys with stakeholders, including climate scientists and model software developers; b) explicit issue and bug reports recorded on the model, such as the fact that the model is failing/crashing at a particular place; c) implicitly discussed software problems and feature requests from related discussion forums. We hypothesize that both approaches will help to identify technical debt in the climate models.

Manually eliciting software requirements from these diverse sources of feedback can be time-consuming and expensive. Hence, we will use state-of-the-art Natural Language Processing (NLP) approaches that require less human effort. The web artifacts, interviews, and scientific texts of climate models provide reflections on the Climate Community of Practice (CoP). We will use an unsupervised topic model, Latent Dirichlet Allocation (LDA) to study how different narratives (such as specific modules of climate models, research interests, and needs of the community members) of a climate CoP have evolved over a period.

LDA assumes that documents are mixtures of topics, and a topic is a probability distribution over words. The gist of a topic can be inferred from its most probable words. We focus on the discussion forum of the Community Earth System Model (CESM). CESM is a collaboratively developed fully coupled global climate model focused on computer simulations of Earth's climate states.
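For readers unfamiliar with topic modeling, the following sketch shows the basic fit-and-inspect loop on a handful of invented forum-style posts. It uses gensim rather than the MALLET toolchain used in the study, so it illustrates the technique, not the study's pipeline.

```python
from gensim import corpora, models

posts = [
    "error building cesm case on cluster with intel compiler",
    "how to change ocean model resolution in the coupled run",
    "mpi job crashes when running cesm with more nodes",
    "question about cmip6 ssp scenario forcing files",
]
tokens = [p.lower().split() for p in posts]

dictionary = corpora.Dictionary(tokens)           # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in tokens]  # bag-of-words per post

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=20, random_state=0)

# Each topic is a probability distribution over words; its top words hint at its theme.
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)

# Each post is a mixture of topics; these proportions are what gets tracked over time.
print(lda.get_document_topics(corpus[0]))
```

Aggregating the per-post topic proportions by year is what produces the trend lines discussed below.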

We infer 15 topics on around 7,000 posts from the years 2003 to 2023 on the discussion forum using the MALLET topic modeling library. We ask a domain expert to assign a label to each topic and use the labels to analyze the posts. We plot the proportions of words in a document assigned to a topic over time. We observe certain trends over roughly 20 years. The topic of version management of CESM models gained attention from 2018 onwards until 2021. There is a steady increase in discussion of the topic of the Coupled Model Intercomparison Project (CMIP) and Shared Socio-economic Pathways (SSPs). Discussions about issues related to installing and setting up CESM models and parallel computing have declined since 2016 and 2020, respectively. Topics of source-code-related changes, ocean modeling, and errors encountered while running CESM models are discussed throughout the period. We also observe that questions related to parallel computing received fewer responses, while questions related to compiling CESM models received more responses. These trends and insights can be used by the software engineering group of CESM to prioritize its actions to facilitate the use of the models.

We believe that this qualitative study scaffolded using topic models will complement climate science research by facilitating the generation of software requirements, onboarding of new users of the models, responding to new problems using solutions to similar problems, and preventing reinvention.


Session 7B: Research & Development Tooling 3

Alchemical free energy methods, which can estimate relative potency of potential drugs by computationally transmuting one molecule into another, have become key components of drug discovery pipelines. Despite many ongoing innovations in this area, with several methodological improvements published every year, it remains challenging to consistently run free energy campaigns using state-of-the-art tools and best practices. In many cases, doing so requires expert knowledge, and/or the use of expensive closed source software. The Open Free Energy project (https://openfree.energy/) was created to address these issues. The consortium, a joint effort between several academic and industry partners, aims to create and maintain reproducible and extensible open source tools for running large scale free energy campaigns.

This contribution will present the Open Free Energy project and its progress building an open source ecosystem for alchemical free energy calculations. It will describe the kind of scientific discovery enabled by the Open Free Energy ecosystem, approaches in the core packages to facilitate reproducibility, efforts to enhance scalability, and our work toward more community engagement, including interactions with both industry and with academic drug discovery work. Finally, it will discuss the long-term sustainability of the project as a hosted project of the Open Molecular Software Foundation, a 501(c)(3) nonprofit for the development of chemical research software.

Background: The NSAPH (National Studies on Air Pollution and Health) lab focuses on analyzing the impact of air pollution on public health. Studies in environmental health typically require merging diverse datasets from multiple domains, such as health, exposures, and population demographics. Exposure data come from various sources and in different formats, posing challenges for data management and integration. The manual process of extracting and aggregating these data is repetitive, inefficient, and prone to errors, and harmonizing formats and standards is difficult because of the mix of public and proprietary data sources. Reproducibility is crucial in scientific work, but the heavy computations involved in processing exposure data are sensitive to the computational environment, so different versions of components can affect the results. Additionally, while some exposure data are public and shareable for reproducible research, others are proprietary, such as ESRI shapefiles that cannot be publicly shared, further complicating reproducibility efforts at NSAPH.

Aim: Our main objective is to leverage our expertise in bioinformatics to create a robust data platform tailored for aggregating exposure and spatial data. We are building a data platform focused on exposure data such as pollution, temperature, humidity, and smoke. The platform incorporates a deployment system, package management capabilities, and a configuration toolkit to ensure compatibility and generalizability across various spatial aggregations, such as zip codes, ZCTAs (zip code tabulation areas), and counties. Through this platform, we aim to streamline exposure data processing, enable efficient transformation of exposure data, and facilitate reproducibility in working with exposure data.

Methods: The data pipeline is defined in the Common Workflow Language (CWL). Docker is used for deployment, PostgreSQL serves as the data warehouse, and Apache Superset is employed for data exploration and visualization. The study incorporates various types of data transformations, including isomorphic transformations for reversible conversions and union transformations for combining data from different sources. Rollups are performed to extract specific data elements, and approximations are used for imprecise conversions. For tabular data, simple aggregations are carried out with SQL functions. For spatial operations, the methodology includes adjustable rasterization methods such as downscaling and resampling (nearest-neighbor, majority). Furthermore, a data loader is developed to handle diverse spatial file types, including NetCDF, Parquet, CSV, and FST.
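
As an illustration of the grid-to-polygon aggregation step (not the DORIEH implementation itself, which is defined in CWL), a zonal-statistics sketch with geopandas and rasterstats might look like the following; all file and column names are hypothetical.

```python
# Illustrative zonal aggregation of a gridded exposure raster onto polygon
# units such as ZCTAs or counties. File and column names are hypothetical.
import geopandas as gpd
from rasterstats import zonal_stats

zones = gpd.read_file("zcta_boundaries.shp")      # aggregation polygons
stats = zonal_stats("zcta_boundaries.shp",        # same polygons, passed by path
                    "gridmet_tmax_2020.tif",      # gridded exposure raster
                    stats=["mean"])

# zonal_stats returns one dict per polygon, in file order.
zones["tmax_mean"] = [s["mean"] for s in stats]
zones[["zcta", "tmax_mean"]].to_csv("tmax_by_zcta.csv", index=False)
```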

Results: We developed DORIEH (Data for Observational Research in Environmental Health), a comprehensive system for the transformation and storage of exposure data. Within this system we implemented various utilities, notably the gridMET (Grid-based Meteorological Ensemble Tool) data pipeline, which encompasses several crucial steps: retrieving shapefiles from an API, importing spatial exposure data, aggregating exposure data from grids into selected spatial units (e.g., ZCTAs, counties, zip codes) using adaptable rasterization methods, and finally storing the transformed data in our dedicated database. We also constructed a versatile data loader capable of handling diverse file types, incorporating parallelization to improve processing efficiency. Additionally, DORIEH provides basic visualization capabilities that serve as quality checks for the data.
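
A toy sketch of the file-type dispatch idea behind such a loader is shown below; the function and file names are hypothetical, and FST support and parallelization are omitted.

```python
# Toy dispatcher that opens exposure files by extension. FST support would
# require an additional reader (not shown), as would database ingestion.
from pathlib import Path
import pandas as pd
import xarray as xr

def load_exposure(path: str):
    suffix = Path(path).suffix.lower()
    if suffix in {".nc", ".nc4"}:
        return xr.open_dataset(path)    # NetCDF grids
    if suffix == ".parquet":
        return pd.read_parquet(path)    # columnar tabular data
    if suffix == ".csv":
        return pd.read_csv(path)        # plain tabular data
    raise ValueError(f"unsupported file type: {suffix}")

# Example (hypothetical file): ds = load_exposure("gridmet_tmax_2020.nc")
```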

Conclusion: Our study addresses the challenges faced by the NSAPH group in analyzing the impact of environmental exposures on health. By developing the DORIEH data platform and implementing various utilities, including a configurable spatial aggregation pipeline and a flexible data loader, we have made significant progress in overcoming the limitations of manual data extraction and aggregation. The platform enables streamlined exposure data processing, efficient transformation, and storage, while ensuring compatibility and reproducibility in working with exposure and spatial data.

Accurately capturing the huge span of dynamical scales in astrophysical systems often requires vast computing resources such as those provided by exascale supercomputers. The majority of the computational throughput of the first exascale supercomputers is expected to come from hardware accelerators such as GPUs. These accelerators, however, will likely come from a variety of manufacturers, and each vendor uses its own low-level programming interface (such as CUDA or HIP), which can require moderate to significant code development. While performance portability frameworks such as Kokkos allow research software engineers to target multiple architectures with one code base, adoption at the application level lags behind. To address this issue in computational fluid dynamics, we developed Parthenon, a performance-portable framework for block-structured adaptive mesh refinement (AMR), a staple feature of many fluid dynamics methods. Parthenon drives a number of astrophysical and terrestrial plasma evolution codes, including AthenaPK, a performance-portable magnetohydrodynamics (MHD) astrophysics code based on the widely used Athena++ code. Running AthenaPK on Frontier, the world's first exascale supercomputer, we explore simulations of galaxy clusters with feedback from a central supermassive black hole. These simulations further our understanding of the thermalization of feedback in galaxy clusters. In this talk we present our efforts developing performance-portable astrophysics codes in the Parthenon collaboration and our experience running astrophysics simulations on the first exascale supercomputer. LA-UR-23-26938

LINCC (LSST Interdisciplinary Network for Collaboration and Computing) is an ambitious effort to support the astronomy community by developing cloud-based analysis frameworks for the science expected from the new Legacy Survey of Space and Time (LSST). The goal is to deliver critical computational infrastructure and code for petabyte-scale analyses, mechanisms to search for one-in-a-million events in continuous streams of data, and community organizations and communication channels that enable researchers to develop and share their algorithms and software.

We are particularly interested in supporting early science and commissioning efforts. The team develops software packages in different areas of interest to LSST, such as RAIL, a package for estimating redshifts from photometry, and KBMOD, a package for detecting slowly moving asteroids. I will concentrate on our efforts with the LSDB and TAPE packages, which focus on cross-matching and time-domain analysis. We are currently developing capabilities to: i) efficiently retrieve and cross-match large catalogs; ii) facilitate easy color correction and recalibration of time-domain data from different surveys to enable analysis of long light curves; iii) provide custom functions that can be efficiently executed on large amounts of data (such as structure function and periodogram calculations); and iv) enable large-scale calculations with custom, user-provided functions, e.g., continuous auto-regressive moving average models implemented in JAX.
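
As a generic illustration of the cross-matching capability (using astropy rather than the LSDB API, and made-up coordinates), a nearest-neighbor sky match can be sketched as follows.

```python
# Generic nearest-neighbor sky cross-match with astropy; LSDB aims to scale
# this kind of operation to full survey catalogs. Coordinates are made up.
import numpy as np
from astropy import units as u
from astropy.coordinates import SkyCoord

rng = np.random.default_rng(0)
cat1 = SkyCoord(ra=[10.10, 10.20] * u.deg, dec=[-5.00, -5.10] * u.deg)
cat2 = SkyCoord(ra=rng.uniform(9, 11, 1000) * u.deg,
                dec=rng.uniform(-6, -4, 1000) * u.deg)

idx, sep2d, _ = cat1.match_to_catalog_sky(cat2)   # index of nearest neighbor in cat2
good = sep2d < 1 * u.arcsec                       # keep matches within 1 arcsecond
print(idx, sep2d.to(u.arcsec), good)
```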

LINCC is already supporting the community through an incubator program and a series of technical talks. I will present these efforts, show our current status, results, and code, and discuss the lessons learned from collaboration between scientists and software engineers!


Session 8: Culture & Community 2

Princeton University's central Research Software Engineering (RSE) Group is a team of research software engineers who work directly with campus research groups to create the most efficient, scalable, and sustainable research codes possible in order to enable new scientific and scholarly advances. As the Group has grown, it has evolved in numerous ways, with new partners across academic units, new partnership models and operational procedures, and a reshuffled internal organization. In the summer of 2023, the RSE Group evolved further by incorporating, for the first time, two formal programs for RSE interns and fellows.

We present an experience report for the inaugural RSE summer internship and fellowship programs at Princeton University. These two programs, separate but concurrently held during the summer of 2023, represented our first formal attempt to introduce currently enrolled students to the RSE discipline in a structured setting with assigned mentors and well-defined projects. The projects varied widely in nature, spanning academic units in mathematics, social sciences, machine learning, molecular biology, and high energy physics. The interns and fellows were exposed to a diverse range of RSE programming languages, software packages, and software engineering best practices.

The two programs, with eight total student participants, further spanned in-person and remote work, undergraduate and graduate students, and multiple continents. We report on the experience of the interns, fellows, and mentors, including lessons learned and recommendations for improving future programs.

The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy high-performance computing facility used by over 9,000 scientists and researchers for novel scientific research. NERSC staff support and engage with users to improve the use of this resource. We have begun working toward creating a community of practice consisting of NERSC users, staff, and affiliates to pool resources and knowledge and to build networks across scientific fields. Our highly interdisciplinary users have expertise in research computing in their respective scientific fields, and access to this collective knowledge in the form of a community enables resource sharing, peer mentorship, peer teaching, and collaboration opportunities. We therefore believe such a community could improve scientific output through better peer support for technical and research-related issues.

In order to prepare a community creation strategy, gain insight into user needs, and understand how a community of practice could support those needs, NERSC staff conducted focus groups with users in the Spring of 2023. The findings from these focus groups provided significant insight into the challenges users face in interacting with one another and even with NERSC staff, and will inform our next steps and ongoing strategy. This presentation will outline the current state of the NERSC user community, the methodology for running user focus groups, qualitative and quantitative findings, and plans for building the NERSC user community of practice based on these findings.

To address the lack of software development and engineering training for intermediate and advanced developers of research software, we present INnovative Training Enabled by a Research Software Engineering Community of Trainers (INTERSECT). INTERSECT, funded by NSF, provides expert-led training courses designed to build a pipeline of researchers trained in best practices for research software development. This training will enable researchers to produce better software and, for some, assume the role of Research Software Engineer (RSE).

INTERSECT sponsors an annual RSE-trainer workshop focused on the curation and communal generation of research software engineering training material. These workshops connect RSE practitioner-instructors from across the country to leverage knowledge from multiple institutions and to strengthen the RSE community. The first workshop, held in 2022, laid the foundation for the format, structure, and content of the INTERSECT bootcamp, a multi-day, hands-on research software engineering training event.

INTERSECT brings together RSE instructors from U.S. institutions to exploit the expertise of a growing RSE community. In July 2023, INTERSECT will sponsor a week-long bootcamp for intermediate and advanced participants from U.S. institutions. We will use an existing open-source platform to make the INTERSECT-curated material from the bootcamp available. The public availability of the material allows the RSE community to continually engage with the training material and helps coordinate the effort across the RSE-trainer community.

In this talk we will introduce the INTERSECT project. We will discuss outcomes of the 2023 bootcamp, including curriculum specifics, lessons learned, and long term objectives. We will also describe how people can get involved as contributors or participants.


Session 9: Panel - Funding Agencies

How can funding agencies and foundations support successful career paths for RSEs in academia and at national labs?

Speakers

  • Josh Greenberg, Alfred P. Sloan Foundation
  • Bill Spotz, Department of Energy
  • Ishwar Chandramouliswaran, National Institutes of Health
  • Juan Jenny Li, National Science Foundation