Talks


The growing recognition of research software as a fundamental component of the scientific process has led to the establishment of both Open Source Program Offices (OSPOs) and Research Software Engineering (RSE) groups. These groups aim to enhance software engineering practices within research projects, enabling robust and sustainable software solutions. The integration of an OSPO into an RSE group within a university environment provides an intriguing fusion of open source principles and research software engineering expertise. The utilization of students as developers within such a program highlights their unique contributions, along with the benefits and challenges involved.

Engaging students as developers in an OSPO-RSE group brings numerous advantages. It provides students with valuable experience in real-world software development, enabling them to bridge the gap between academia and industry. By actively participating in open source projects, students can refine their technical skills, learn industry best practices, and gain exposure to collaborative software development workflows. Involving students in open source projects enhances their educational experience. They have the opportunity to work on meaningful research software projects alongside experienced professionals, tackling real-world challenges and making tangible contributions to the scientific community. This exposure to open source principles and practices fosters a culture of innovation, collaboration, and knowledge sharing.

This approach also raises questions. How can the objectives and metrics of success for an academic OSPO-RSE group be defined and evaluated? What governance models and collaboration mechanisms are required to balance the academic freedom of researchers with the community-driven nature of open source? How can the potential conflicts between traditional academic practices and the open source ethos be effectively addressed? How can teams balance academic commitments with project timelines? These questions highlight the need for careful consideration and exploration of the organizational, cultural, and ethical aspects associated with an OSPO acting as an RSE group within a university.

Leveraging student developers in an OSPO-RSE group also presents challenges that need careful consideration. Students may have limited experience in software engineering practices, requiring mentoring and guidance to ensure the quality and sustainability of the research software they contribute to. Balancing academic commitments with project timelines and expectations can also be a challenge, necessitating effective project management strategies and clear communication channels. Furthermore, the ethical considerations of involving students as developers in open source projects must be addressed, ensuring the protection of intellectual property, respecting licensing requirements, and maintaining data privacy.

The involvement of students as developers within an OSPO-RSE group offers valuable benefits. The effective integration of students in this context requires thoughtful planning, mentorship, and attention to ethical considerations. This talk will examine the experience of the Open Source with SLU program to explore the dynamic role of student developers in an OSPO-RSE program and engage in discussions on best practices, challenges, and the future potential of this distinctive approach to research software engineering within academia.

Conda is a multi-platform, language-agnostic packaging and runtime environment management ecosystem. This talk will briefly introduce the conda ecosystem and the needs it meets, and then focus on work and enhancements from the past two years. These include speed improvements in creating packages and in deploying them; work to establish and document standards that broaden the ecosystem; new support for plugins to make conda widely extensible; and a new governance model that incorporates the broader conda community.

The conda ecosystem enables software developers to create and publish easy-to-install versions of their software. Conda software packages incorporate all software dependencies (written in any language) and enable users to install software and all dependencies with a single command. Conda is used by over 30 million users worldwide. Conda packages are available in both general and domain-specific repositories. conda-forge is the largest conda-compatible package repository, with over 20,000 available packages for Linux, Windows, and macOS. Bioconda is the largest domain-specific repository, with over 8,000 packages for the life sciences.

Conda also enables users to set up multiple parallel runtime environments, each running different combinations and versions of software, including different language versions. Conda is used to support multiple projects that require conflicting versions of the same software package.
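As a hedged illustration of this capability (environment names, package names, and versions below are hypothetical; the same commands are usually typed directly in a shell), parallel environments with conflicting requirements can be created and used side by side:

```python
import subprocess

# Two side-by-side environments with conflicting Python/NumPy requirements.
subprocess.run(
    ["conda", "create", "-y", "-n", "proj-legacy", "python=3.8", "numpy=1.21"],
    check=True,
)
subprocess.run(
    ["conda", "create", "-y", "-n", "proj-modern", "python=3.11", "numpy"],
    check=True,
)

# Run a command inside a named environment without activating it first.
subprocess.run(
    ["conda", "run", "-n", "proj-legacy", "python", "-c",
     "import numpy; print(numpy.__version__)"],
    check=True,
)
```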

Alchemical free energy methods, which can estimate relative potency of potential drugs by computationally transmuting one molecule into another, have become key components of drug discovery pipelines. Despite many ongoing innovations in this area, with several methodological improvements published every year, it remains challenging to consistently run free energy campaigns using state-of-the-art tools and best practices. In many cases, doing so requires expert knowledge, and/or the use of expensive closed source software. The Open Free Energy project (https://openfree.energy/) was created to address these issues. The consortium, a joint effort between several academic and industry partners, aims to create and maintain reproducible and extensible open source tools for running large scale free energy campaigns.

This contribution will present the Open Free Energy project and its progress building an open source ecosystem for alchemical free energy calculations. It will describe the kind of scientific discovery enabled by the Open Free Energy ecosystem, approaches in the core packages to facilitate reproducibility, efforts to enhance scalability, and our work toward more community engagement, including interactions with both industry and with academic drug discovery work. Finally, it will discuss the long-term sustainability of the project as a hosted project of the Open Molecular Software Foundation, a 501(c)(3) nonprofit for the development of chemical research software.

This talk presents research on the prevalence of research software as academic research output within international institutional repositories (IRs). This work expands on previous research, which examined 182 academic IRs from 157 universities in the UK. Very low quantities of research software records were found in IRs, and the majority of IRs could not contain software as independent records of research output due to the underlying Research Information System (RIS) platform. This has implications for the quantities of software returned as part of the UK's Research Excellence Framework (REF), which seeks to assess the quality of research in UK universities and specifically recognises software as legitimate research output. The levels of research software submitted as research output have declined sharply over the last 12 years, and this differs substantially from the records of software contained in other UK research output metadata (e.g. https://gtr.ukri.org). Expanding on this work, source data from OpenDOAR, a directory of global Open Access Repositories, were used to apply similar analyses to international IRs in what we believe is the first such census of its kind. 4,970 repositories from 125 countries were examined for the presence of software, along with repository-based metadata for potentially correlating factors. It appears that much more could be done, even through trivial technical updates to RIS platforms, to recognise software as distinct and recordable research output in its own right. We will discuss the implications of these findings with a focus on the apparent lack of recognition of software as discrete output in the research process.

Numerical research data is often saved in file formats such as CSV for simplicity and to get started quickly, but challenges emerge as the amount of data grows. Here we describe the motivation and process for how we moved from initially saving data across many disparate files to instead utilizing a centralized PostgreSQL relational database. We discuss our explorations into the TimescaleDB extension, and our eventual decision to use native PostgreSQL with a table-partitioning schema, to best support our data access patterns. Our approach has allowed for flexibility with various forms of timestamped data while scaling to billions of data points and hundreds of experiments. We also describe the benefits of using a relational database, such as the ability to use an open-source observability tool (Grafana) for live monitoring of experiments.
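A minimal sketch of the kind of native PostgreSQL range partitioning described above, driven from Python (the table, column, and database names are hypothetical, not the authors' actual schema):

```python
import psycopg2  # assumes a reachable PostgreSQL server

DDL = """
CREATE TABLE IF NOT EXISTS measurements (
    experiment_id INTEGER          NOT NULL,
    ts            TIMESTAMPTZ      NOT NULL,
    value         DOUBLE PRECISION
) PARTITION BY RANGE (ts);

-- One partition per month; time-range queries touch only matching partitions.
CREATE TABLE IF NOT EXISTS measurements_2023_07
    PARTITION OF measurements
    FOR VALUES FROM ('2023-07-01') TO ('2023-08-01');
"""

with psycopg2.connect("dbname=lab user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```

Partition pruning then lets queries filtered on the timestamp column scan only the relevant child tables, which is what allows a layout like this to scale to billions of rows.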

Accurately capturing the huge span of dynamical scales in astrophysical systems often requires vast computing resources such as those provided by exascale supercomputers. The majority of computational throughput for the first exascale supercomputers is expected to come from hardware accelerators such as GPUs. These accelerators, however, will likely come from a variety of manufacturers. Each vendor uses its own low-level programming interface (such as CUDA and HIP), which can require moderate to significant code development. While performance portability frameworks such as Kokkos allow research software engineers to target multiple architectures with one code base, adoption at the application level lags behind. To address this issue in computational fluid dynamics, we developed Parthenon, a performance portable framework for block-structured adaptive mesh refinement (AMR), a staple feature of many fluid dynamics methods. Parthenon drives a number of astrophysical and terrestrial plasma evolution codes including AthenaPK, a performance portable magnetohydrodynamics (MHD) astrophysics code based on the widely used Athena++ code. Running AthenaPK on Frontier, the world’s first exascale supercomputer, we explore simulations of galaxy clusters with feedback from a central supermassive black hole. These simulations further our understanding of the thermalization of feedback in galaxy clusters. In this talk we present our efforts developing performance portable astrophysics codes in the Parthenon collaboration and our experience running astrophysics simulations on the first exascale supercomputer. LA-UR-23-26938

To address the lack of software development and engineering training for intermediate and advanced developers of research software, we present INnovative Training Enabled by a Research Software Engineering Community of Trainers (INTERSECT). INTERSECT, funded by NSF, provides expert-led training courses designed to build a pipeline of researchers trained in best practices for research software development. This training will enable researchers to produce better software and, for some, assume the role of Research Software Engineer (RSE).

INTERSECT sponsors an annual RSE-trainer workshop, focused on the curation and communal generation of research software engineering training material. These workshops connect RSE practitioner-instructors from across the country to leverage knowledge from multiple institutions and to strengthen the RSE community. The first workshop, held in 2022, laid the foundation for the format, structure, and content of the INTERSECT bootcamp, a multi-day, hands-on research software engineering training event.

INTERSECT brings together RSE instructors from U.S. institutions to exploit the expertise of a growing RSE community. In July 2023, INTERSECT will sponsor a week-long bootcamp for intermediate and advanced participants from U.S. institutions. We will use an existing open-source platform to make the INTERSECT-curated material from the bootcamp available. The public availability of the material allows the RSE community to continually engage with the training material and helps coordinate the effort across the RSE-trainer community.

In this talk we will introduce the INTERSECT project. We will discuss outcomes of the 2023 bootcamp, including curriculum specifics, lessons learned, and long term objectives. We will also describe how people can get involved as contributors or participants.

Machine learning models, specifically neural networks, have garnered extensive recognition due to their remarkable performance across various domains. Nevertheless, concerns about their robustness and interpretability have created an urgent need for comprehensive methodologies and tools. This talk introduces the "Adversarial Observation" framework, which integrates adversarial and explainable techniques into the software development cycle to address these crucial aspects.

Industry practitioners have voiced an urgent need for tools and guidance to fortify their machine learning systems. Research studies have underscored the fact that a substantial number of organizations lack the necessary tools to tackle adversarial machine learning and ensure system security. Furthermore, the absence of consensus on interpretability in machine learning presents a significant challenge, with minimal agreement on evaluation benchmarks. These concerns highlight the pivotal role played by the Adversarial Observation framework.

The Adversarial Observation framework provides model-agnostic algorithms for adversarial attacks and interpretability techniques. Two notable methods, namely the Fast Gradient Sign Method (FGSM) and the Adversarial Particle Swarm Optimization (APSO) technique, have been implemented. These methods reliably generate adversarial noise, enabling the evaluation of model resilience against attacks and the training of less vulnerable models.
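To make the attack side concrete, here is a minimal, generic FGSM sketch in PyTorch; this illustrates the standard method, not the framework's own API:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """One-step Fast Gradient Sign Method: nudge each input feature by
    +/- epsilon in the direction that increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

# Usage sketch: a model robust to this perturbation keeps its prediction.
# x_adv = fgsm_attack(model, images, labels, epsilon=0.03)
```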

In terms of explainable AI (XAI), the framework incorporates activation mapping to visually depict and analyze significant input regions driving model predictions. Additionally, a modified APSO algorithm fulfills a dual purpose by determining global feature importance and facilitating local interpretation. This systematic assessment of feature significance reveals underlying decision rules, enhancing transparency and comprehension of machine learning models.

By incorporating the Adversarial Observation framework, organizations can assess the resilience of their models, address biases, and make well-informed decisions. The framework plays a pivotal role in the software development cycle, ensuring the highest standards of transparency and reliability. It empowers a deeper understanding of models through visualization, feature analysis, and interpretation, thereby fostering trust and facilitating the responsible development and deployment of AI technologies.

In conclusion, the Adversarial Observation framework represents a crucial milestone in the development of trustworthy and dependable AI systems. Its integration of robustness, interpretability, and fairness into the software development cycle enhances transparency and reliability. By addressing vulnerabilities and biases, organizations can make well-informed decisions, improve fairness, and establish trust with stakeholders. The significance of the framework is further underscored by the pressing need for tools expressed by industry practitioners and the lack of consensus on interpretability in machine learning. Ultimately, the Adversarial Observation framework contributes to the responsible development and deployment of AI technologies, fostering public trust and promoting the adoption of AI systems in critical domains.

Research software plays a crucial role in advancing research across many domains. However, the complexity of research software often makes it challenging for developers to conduct comprehensive testing, which leads to reduced confidence in the accuracy of the results produced. To address this concern, developers have employed peer code review, a well-established software engineering practice, to improve the reliability of software. However, peer code review is less prevalent in research software than in open-source or traditional software domains. This presentation addresses this topic by describing a previous investigation of peer code review in research software. It then concludes with a description of our current work and ways for interested people to get involved.

In our previous study, we interviewed and surveyed 84 developers of research software. The results show research software teams do perform code reviews, albeit without a formalized process, proper organization, or adequate human resources dedicated to conducting reviews effectively. In the talk, we will describe the results in more detail. The application of peer code review holds promise for improving the quality of research software, thereby increasing the reliability of research outcomes. Additionally, adopting peer code review practices enables research software developers to produce code that is more readable, understandable, and maintainable.

This talk will then briefly outline our current work to engage interested participants. Our current work focuses on peer code review processes as performed specifically by Research Software Engineers (RSEs). The research questions we aim to address in this ongoing study are: RQ1: What processes do RSEs follow when conducting peer code review?; RQ2: What challenges do RSEs encounter during peer code review?; and RQ3: What improvements are required to enhance the peer code review process for RSEs?

To answer these questions, we plan to conduct the following phases of the project: Phase 1: Surveying RSEs to Gain Insights into Peer Code Review Practices; Phase 2: Conducting Interviews and Focus Groups with RSEs to Explore Peer Code Review Experiences; and Phase 3: Observational Study of RSE Peer Code Review.

There are numerous places for members of the US-RSE community to get involved in our research. We will highlight these opportunities in the talk.

Princeton University’s central Research Software Engineering (RSE) Group is a team of research software engineers who work directly with campus research groups to create the most efficient, scalable, and sustainable research codes possible in order to enable new scientific and scholarly advances. As the Group has grown, it has evolved in numerous ways, with new partners across academic units, new partnership models and operational procedures, and a reshuffled internal organization. In the summer of 2023, the RSE Group further evolved by incorporating, for the first time, two formal programs for RSE interns and fellows.

We present an experience report for the inaugural RSE summer internship and fellowship programs at Princeton University. These two programs, separate but concurrently held during the summer of 2023, represented our first formal attempt to introduce currently enrolled students to the RSE discipline in a structured setting with assigned mentors and well-defined projects. The projects varied widely in nature, spanning academic units in mathematics, social sciences, machine learning, molecular biology, and high energy physics. The interns and fellows were exposed to a diverse range of RSE programming languages, software packages, and software engineering best practices.

The two programs, with eight total student participants, further spanned in-person and remote work, undergraduate and graduate students, and multiple continents. We report on the experience of the interns, fellows, and mentors, including lessons learned and recommendations for improving future programs.

In codes used to simulate multi-physics hydrodynamics, it is common for variables to reside on different parts of a mesh, or on different, but related, meshes. For example, in some codes all variables may reside on cell centers, while in others, scalars may reside on cell centers, vectors on cell faces, and tensors on cell corners, etc. Further, different methods may be used for the calculation of derivatives or divergences of different variables, or for the interpolation or mapping of variables from one mesh to another. This poses a challenge for libraries of 3D physics models, where the physical processes depend on space and, therefore, on the mesh. For such libraries to be able to support a general set of multi-physics hydrodynamics host codes, they must be able to represent the physics in a way that is independent of mesh-related details. To solve this problem, we present the Multi-Mesh Operations (MMOPS) library for the mesh-agnostic representation of 3D physics.

MMOPS is a lightweight C++ abstraction providing an interface for the development of general-purpose calculations between variables of different user-defined types, while deferring the specification of the details of these types to the host code. As an example, consider three variables, a, b, and c, representing a vector and two scalar quantities residing on the cell corners, cell centers, and cell corners of a mesh, respectively. MMOPS provides a class, `mapped_variable`, to represent these variables, for which the host code provides an arbitrary compile-time tag indicating where the variable data resides on the mesh, and a mapping function method used to indicate how to map the variable from one part of the mesh to another, using tag dispatching under the hood. This way, we can perform operations using `a`, `b`, and `c`, namely the `mapped_variable` instantiations representing a, b, c, such as `c(i) = a(c.tag(), i, dir) + b(c.tag(), i)`, where `i` is an index representing the ith cell on `c`’s mesh, and `dir` is an index representing the desired directional component of vector `a`. In general, if either `a` or `b` has a different mesh representation than `c`, it gets mapped to `c`’s mesh using the mapping function provided by the host code when constructing `a` and `b`, which can differ between the two. In the above example, since `a` is on the same mesh as `c`, it doesn’t get mapped but instead is directly accessed, and since `b` lives on a different mesh than `c`, it gets mapped to `c`’s mesh, i.e. to cell corners.

We demonstrate that MMOPS provides a zero- to nearly zero-cost abstraction of the described functionality, in the sense that it incurs little to no performance penalty (depending on the compiler) compared to a manual benchmark implementation of the same functionality. A description of the library, usage examples, and benchmarking tests will also be presented.

Background: The NSAPH (National Studies on Air Pollution and Health) lab focuses on analyzing the impact of air pollution on public health. Typically, studies in environmental health require merging diverse datasets coming from multiple domains such as health, exposures, and population demographics. Exposure data is obtained from various sources and presented in different formats, posing challenges in data management and integration. The manual process of extracting and aggregating data is repetitive, inefficient, and prone to errors. Harmonizing formats and standards is difficult due to the mix of public and proprietary data sources. Reproducibility is crucial in scientific work, but the heavy computations involved in processing exposure data are sensitive to the computational environment, meaning that different versions of components can affect the results. Additionally, while some exposure data are public datasets that are shareable for reproducible research, there are exceptions, such as proprietary ESRI shapefiles, that cannot be publicly shared, further complicating reproducibility efforts at NSAPH.

Aim: Our main objective is to leverage our expertise in bioinformatics to create a robust data platform tailored for aggregating exposure and spatial data. We are building a data platform that is focused on exposure data, such as pollution, temperature, humidity, and smoke. This platform incorporates a deployment system, package management capabilities, and a configuration toolkit to ensure compatibility and generalizability across various spatial aggregations, such as zip code, ZCTA (zip code tabulation areas), and counties. Through the development of this data platform, our aim is to streamline exposure data processing, enable efficient transformation of exposure data, and facilitate reproducibility in working with exposure data.

Methods: The methodology employed in this study uses the Common Workflow Language (CWL) to define the data pipeline. Docker is utilized for deployment purposes, while PostgreSQL serves as the data warehouse. Apache Superset is employed for data exploration and visualization. The study incorporates various types of data transformations, including isomorphic transformations for reversible conversions and union transformations for combining data from different sources. Additionally, rollups are performed to extract specific data elements, and approximations are used for imprecise conversions. In the case of tabular data, simple aggregations are conducted using SQL functions. For spatial operations, the methodology includes adjustable rasterization methods such as downscaling and resampling (nearest, majority). Furthermore, a data loader is developed to handle diverse spatial file types, including NetCDF, Parquet, CSV, and FST.
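As a hedged, minimal sketch of the grid-to-polygon aggregation step (using the rasterstats package purely for illustration; file paths are placeholders, and this is not necessarily the pipeline's actual mechanism):

```python
from rasterstats import zonal_stats

# Aggregate a gridded exposure raster (e.g., a gridMET variable) over
# ZCTA polygons; each result holds summary statistics for one polygon.
stats = zonal_stats(
    "zcta_boundaries.shp",   # placeholder vector layer of spatial units
    "exposure_grid.tif",     # placeholder raster of gridded exposure values
    stats=["mean", "count"],
)
print(stats[0])  # e.g. {'mean': 21.4, 'count': 182}
```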

Results: We developed DORIEH (Data for Observational Research in Environmental Health), a comprehensive system designed for the transformation and storage of exposure data. Within this system, we implemented various utilities, notably the gridMET (Grid-based Meteorological Ensemble Tool) data pipeline, which encompasses multiple crucial steps. These steps include retrieving shapefiles from an API, importing exposure spatial data, aggregating exposure data from grids into selected spatial aggregations (e.g., ZCTA, county, zip code) utilizing adaptable rasterization methods, and ultimately storing the transformed data in our dedicated database. Furthermore, we constructed a versatile data loader capable of handling diverse file types, while also incorporating parallelization techniques to enhance processing efficiency. Additionally, DORIEH provides basic visualization capabilities that serve as quality checks for the data.

Conclusion: Our study addresses the challenges faced by the NSAPH group in analyzing the impact of environmental exposures on health. By developing the DORIEH data platform and implementing various utilities, including the flexible and configurable spatial aggregation data pipeline and a flexible data loader, we have made significant progress in overcoming the limitations of manual data extraction and aggregation. The platform enables streamlined exposure data processing, efficient transformation, and storage, while ensuring compatibility and reproducibility in working with exposure and spatial data.

The High Throughput Discovery Lab at the Rosalind Franklin Institute aims to iteratively expand the reaction toolkit used in drug discovery, enabling new regions of biologically relevant chemical space to be explored. The current reaction toolkit, which underpins traditional drug discovery workflows, has to date been dominated by a small number (<10) of reaction classes that have remained largely unchanged for 30 years. It has been argued that this has contributed to attrition and stagnation in drug discovery. We are working to create a semi-automated approach to explore large regions of chemical space to identify novel bioactive molecules. The approach involves creating arrays of hundreds of reactions, in which different pairs of substrates are combined. The design of subsequent reaction arrays is informed by the biological activity of the products that are obtained. However, the multi-step laboratory process to create, purify, and test the reaction products introduces stringent requirements for data linkage, making data management incredibly important.

This talk focuses on how we built the data infrastructure using a mix of open-source and licensed technology. We will discuss how the Franklin aims to increase automation in data processing and how the technology we have implemented will make this possible.

The National Energy Research Scientific Computing Center (NERSC) is a U.S. Department of Energy high-performance computing facility used by over 9,000 scientists and researchers for novel scientific research. NERSC staff support and engage with users to improve the use of this resource. We have now begun working towards creating a community of practice consisting of NERSC users, staff, and affiliates to pool resources and knowledge and to build networks across scientific fields. Our highly interdisciplinary users have expertise in research computing in their respective science fields, and as such, access to a collective knowledge base in the form of a community allows resource sharing, peer mentorship, peer teaching, and collaboration opportunities. Thus, we believe the benefits of such a community could lead to improved scientific output due to better peer support for technical and research-related issues.

In order to prepare a community creation strategy, gain insight into user needs, and understand how a community of practice could support those needs, NERSC staff conducted focus groups with users in the Spring of 2023. The findings from these focus groups provided significant insight into the challenges users face in interacting with one another and even with NERSC staff, and will inform our next steps and ongoing strategy. This presentation will outline the current state of the NERSC user community, the methodology for running user focus groups, qualitative and quantitative findings, and plans for building the NERSC user community of practice based on these findings.

The emergence of the Research Software Engineer (RSE) as a role correlates with the growing complexity of scientific challenges and the diversity of software team skills. At the same time, it is still a challenge for research funding agencies and institutions to directly fund activities that are explicitly engineering-focused.

In this presentation, we describe research software science (RSS), an idea related to RSE that is particularly suited to research software teams. RSS focuses on using the scientific method to understand and improve how software is developed and used in research. RSS promotes the use of scientific methodologies to explore and establish broadly applicable knowledge. Specifically, RSS incorporates scientific approaches from the cognitive and social sciences in addition to the scientific knowledge already present in software teams. By leveraging cognitive and social science methodologies and tools, research software teams can gain better insight into how software is developed and used for research, and can share that insight by virtue of the scientific approaches used to gain it.

Using RSS, we can pursue sustainable, repeatable, and reproducible software improvements that positively impact research software toward improved scientific discovery. Also, by introducing an explicit scientific focus to the enterprise of software development and use, we can more easily justify direct support and funding from agencies and institutions whose charter is to sponsor scientific research. Direct funding of RSS activities is within these charters, and RSE activities are needed, more easily justified, and improved by RSS investments.

Assembly and analysis of metagenomics datasets, along with protein sequence analysis, are among the most computationally demanding tasks in bioinformatics. The ExaBiome project is developing GPU-accelerated solutions for exascale-era machines to tackle these problems at unprecedented scale. The algorithms involved in these software pipelines do not fit the typical portfolio of algorithms that are amenable to GPU porting; instead, they are irregular and sparse in nature, which makes GPU porting a significant challenge. Moreover, it is a challenge to integrate complex GPU kernels within a CPU-optimized software infrastructure that depends largely on dynamic data structures. This talk will give an overview of the development of sequence alignment and local assembly GPU kernels that have been successfully ported and optimized for GPU-based systems, and of the integration of these kernels within the ExaBiome software stack, demonstrating unprecedented capability for solving scientific problems in bioinformatics.

LINCC (LSST Interdisciplinary Network for Collaboration and Computing) is an ambitious effort to support the astronomy community by developing cloud-based analysis frameworks for science expected from the new Legacy Survey of Space and Time (LSST). The goal is to enable the delivery of critical computational infrastructure and code for petabyte-scale analyses, mechanisms to search for one-in-a-million events in continuous streams of data, and community organizations and communication channels that enable researchers to develop and share their algorithms and software.

We are particularly interested in supporting early science and commissioning efforts. The team develops software packages in different areas of interest to LSST, such as RAIL, a package for estimating redshifts from photometry, and KBMOD, a package for detecting slowly moving asteroids. I will concentrate on our efforts with the LSDB and TAPE packages, which focus on cross-matching and time-domain analysis. We are currently developing capabilities to: i) efficiently retrieve and cross-match large catalogs; ii) facilitate easy color-correction and recalibration of time-domain data from different surveys to enable analysis of long lightcurves; iii) provide custom functions that can be efficiently executed on large amounts of data (such as structure function and periodogram calculations); iv) enable large-scale calculation with custom, user-provided functions, e.g., continuous auto-regressive moving average models implemented in JAX.

LINCC is already supporting the community through an incubator program and a series of technical talks. I will present these efforts and show our current status, results, and code, and discuss the lessons learned from collaboration between scientists and software engineers!

This presentation describes training activities carried out under the auspices of the U.S. Department of Energy, in particular those facilitated by the Exascale Computing Project (ECP). While some of these activities are specific to members of ECP, others are beneficial to the HPC community at large. We report on training opportunities and resources that the broader Computer Science and Engineering (CS&E) community can access, the relevance of coordinated training efforts, and how we envision such efforts beyond ECP’s scope and duration.

Scientific software plays a crucial and ever-growing role in various fields by facilitating complex modeling, simulation, exploration, and data analysis. However, ensuring the correctness and reliability of these software systems presents significant challenges due to their computational complexity, their explorative nature, and the lack of explicit specifications or even documentation. Traditional testing methods fall short in validating scientific software comprehensively; in particular, explorative software and simulation tools suffer from the Oracle Problem. In fact, Segura et al. show that scientific and explorative software systems are inherently difficult to test. In this context, metamorphic testing is a promising approach that addresses these challenges effectively. By exploiting the inherent properties within scientific problems, metamorphic testing provides a systematic means to validate the accuracy and robustness of scientific software while avoiding the challenges posed by the Oracle Problem. The proposed talk will highlight the importance of metamorphic testing in scientific software, emphasizing its ability to uncover subtle bugs, enhance result consistency, and show approaches for a more rigorous and systematic software development process in the scientific domain.
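A minimal, hypothetical sketch of the idea: instead of asserting an exact expected output (which may not exist), we assert relations that must hold between runs. The function and relations below are invented for illustration:

```python
import random

def simulate(xs):
    """Stand-in for a scientific routine with no exact oracle:
    here, the mean of squared inputs."""
    return sum(x * x for x in xs) / len(xs)

def test_metamorphic_relations():
    xs = [random.uniform(-10.0, 10.0) for _ in range(1000)]

    # MR1 (permutation invariance): shuffling the inputs must not change the result.
    shuffled = xs[:]
    random.shuffle(shuffled)
    assert abs(simulate(xs) - simulate(shuffled)) < 1e-9

    # MR2 (scaling): multiplying every input by k must scale this statistic by k**2.
    k = 3.0
    assert abs(simulate([k * x for x in xs]) - k**2 * simulate(xs)) < 1e-6

test_metamorphic_relations()
```

Neither relation requires knowing the "true" answer for any particular input, which is precisely what makes the approach useful when an oracle is unavailable.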

Introduction: We are pleased to submit a proposal to US-RSE to present Anotemos, a media annotation software developed by GRIP (Grasping the Rationality of Instructional Practice), an education research lab at the University of Michigan. Anotemos is designed to enhance research analysis by providing an efficient media annotation solution. This proposal highlights the key features and benefits of Anotemos, showcasing its relevance to the research community.

Background and Objectives: Anotemos addresses the challenges faced by researchers in analyzing multimedia data, enabling them to extract valuable insights efficiently. Anotemos aims to streamline and automate the annotation workflow, offering researchers a user-friendly interface and advanced features for seamless annotation.

Key Features and Benefits: Using Anotemos, researchers can create Commentaries that are centered around a Multimedia Item. Anotemos offers a range of features that set it apart: (i) Comprehensive Annotation: Researchers can annotate various media types, including images, videos, and audio, with support for diverse, customizable annotation types such as text, icons, drawings, bounding boxes, and audio recordings, both on-screen and off-screen; (ii) Real-time Collaboration: Anotemos enables multiple researchers to collaborate simultaneously in real time, fostering knowledge exchange, reducing redundancy, and improving productivity; (iii) Share & Publish: Anotemos offers both private and public sharing, enabling the secure sharing of Commentaries with select individuals or the wider research community; (iv) Customizable Workflows: Anotemos supports the creation of customized workflows by making it easier to create identical Commentaries and to manage different sets of collaborators using Commentary sections, enabling researchers to tailor the platform to their projects; (v) Analysis & Reports: Using Anotemos, users can perform in-depth analysis of annotated data and generate comprehensive reports, providing valuable insights and facilitating data-driven decision-making in research projects; (vi) Integrations: Anotemos offers seamless integration with Learning Tools Interoperability (LTI), allowing users to easily embed and access Anotemos Commentaries within LTI-compliant learning management systems. Furthermore, Anotemos supports LaTeX code, empowering users to annotate and display mathematical equations and formulas with precision. Additionally, Anotemos can be embedded into Qualtrics surveys, enabling researchers to collect and analyze survey data along with Anotemos annotations.

Conclusion: We believe that Anotemos has the potential to significantly enhance research analysis by providing an efficient, user-friendly, and customizable media annotation solution. It is developed using the Angular-Meteor framework, leveraging its robustness, scalability, and real-time capabilities. The software is currently in beta testing, with positive feedback from researchers in diverse fields. Its advanced features make it an ideal tool for researchers across various domains. We request the consideration of Anotemos for presentation at the US-RSE Conference, where research software engineers can gain insights into its features, benefits, and potential impact on research projects.

Managing massive volumes of data and effectively making it accessible to researchers poses significant challenges and is a barrier to scientific discovery. In many cases, critical data is locked up in unwieldy file formats or one-off databases and is too large to effectively process on a single machine. This talk explores the role of Kubernetes, an open-source container orchestration platform, in addressing research data management challenges. I will discuss how we are using a set of publicly available open-source and home-grown tools in the National Renewable Energy Lab (NREL) Data, Analysis, and Visualization (DAV) group to help researchers overcome data-related bottlenecks.

The talk will begin by providing an overview of the data challenges faced in research data management, including data storage, processing, and analysis. I will highlight Kubernetes' ability to handle large-scale data by leveraging containerization and distributed computing, including distributed storage. Kubernetes allows researchers to encapsulate data processing infrastructure and workflows into portable containers, enabling reproducibility and ease of deployment. Kubernetes can then schedule and manage the resource allocation of these containers to enable efficient utilization of limited computing resources, leading to more efficient data processing and analysis.
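As a hedged illustration of the encapsulation-and-scheduling pattern described above (the namespace, image, and job name are hypothetical), a batch data-processing step can be submitted with the official Kubernetes Python client:

```python
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl-style credentials for the cluster

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="etl-demo"),
    spec=client.V1JobSpec(
        backoff_limit=1,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="etl",
                        image="python:3.11-slim",  # placeholder processing image
                        command=["python", "-c", "print('processing chunk...')"],
                    )
                ],
            )
        ),
    ),
)

# Kubernetes schedules the container onto whatever node has free resources.
client.BatchV1Api().create_namespaced_job(namespace="research", body=job)
```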

I will discuss some limitations of traditional, siloed approaches to dealing with data and emphasize the need for solutions that foster collaboration. I will highlight how we are using Kubernetes at NREL to facilitate data sharing and cooperation among research teams. Kubernetes' flexible architecture enables the deployment of shared computing environments, such as Apache Superset, where researchers can seamlessly access and analyze shared datasets. Providing the ability to have one research team easily consume data generated by another, utilizing Kubernetes as a central data platform, is one of the major wins we’ve encountered by adopting the platform.

Finally, I will showcase real-world use cases from NREL where we have used Kubernetes to solve some persistent data challenges involving large volumes of sensor and monitoring data. I will discuss the challenges we encountered when creating our cluster and making it available as a production-ready resource. I will also discuss the specific suite of tools we have deployed in our infrastructure, including Postgres and Apache Druid for columnar and time-series data and Redpanda (a Kafka-compatible platform) for streaming data, and the process that went into selecting these tools.

Attendees of this talk will gain insights into how Kubernetes can address data challenges in research data management. This talk aims to provide researchers with a framework and a set of building blocks that have worked well for us, so that they can unlock the full potential of their data in the era of software-enabled discovery.

The Globus platform enables research applications developed by research teams to leverage data and compute services across many tiers of service—from personal computers and local storage to national supercomputing centers—with minimal deployment and maintenance burden. Globus is operated by the University of Chicago and is used by nearly all R1 universities, national labs, and supercomputing centers in the United States, as well as many smaller institutions.

In this talk, we’ll introduce the Globus Platform-as-a-Service, including how to register an application and how to access Globus APIs using our Python SDK. We will present examples of how the various Globus services, interfaces, and tools may be used to develop research applications. We will demonstrate authentication and access control with Globus’s Auth and Groups APIs; making data findable and accessible using Globus guest collections, data transfer API, and indexed Search API; and automating research with Globus Flows and Compute APIs.
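As a brief, hedged sketch of platform usage with the Python SDK (globus-sdk v3; the client ID is a placeholder obtained by registering an application), the native-app login flow followed by a Transfer API call looks roughly like this:

```python
import globus_sdk
from globus_sdk.scopes import TransferScopes

CLIENT_ID = "YOUR-APP-CLIENT-ID"  # placeholder from app registration

# Native-app (OAuth2) login flow: the user visits a URL and pastes back a code.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow(requested_scopes=TransferScopes.all)
print("Log in at:", auth_client.oauth2_get_authorize_url())
code = input("Paste authorization code: ").strip()
tokens = auth_client.oauth2_exchange_code_for_tokens(code)

# Use the resulting transfer token to call the Transfer API.
transfer_data = tokens.by_resource_server["transfer.api.globus.org"]
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_data["access_token"])
)
for ep in tc.endpoint_search(filter_fulltext="Globus Tutorial"):
    print(ep["display_name"])
```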

The expected long-term outcome of our research is to determine the ways in which a focus on scientific model code quality can improve both scientific reliability and model performance. In this work, we focus on climate models, which are complex software implementations of the scientific and mathematical abstractions of the systems that govern the climate.

Climate models integrate the physical understanding of climate and produce simulations of climate patterns. The model code is often hundreds of thousands of lines of highly sophisticated software, written in high-performance languages such as Fortran. These models are developed by teams of experts, including climate scientists and high-performance computing developers. The sheer complexity of the software behind the climate models (which have evolved over many years) makes them vulnerable to performance, usability, and maintainability bugs that hinder performance and weaken scientific confidence. The social structures behind the software and how the models function have been examined, but the complex interactions between domain experts and software experts remain poorly understood.

The expected short-term outcomes of our research are to a) develop a set of software quality indicators for the climate models, providing model maintainers and users with specific areas in which to improve the model code; and b) develop new techniques for analyzing large bodies of unstructured text to explain how users perceive a research software project’s capabilities and failings, aligned with international efforts to improve the quality of scientific software.

We follow two main approaches. (1) Analytics: analysis of climate models and their artifacts. These artifacts include the software code itself, including the way the model is deployed and run by end users; the bug reports and other feedback on the software quality; and the simulation testing used to validate the model outputs. Our analysis will be incorporated into a Fortran analysis tool to identify Fortran quality issues automatically. This analysis is incomplete, however, without a clear understanding of the social context in which the software was produced. (2) Social Context: we examine the socio-technical artifacts created around climate models. These include a) outputs from workshops, interviews, and surveys with stakeholders, including climate scientists and model software developers; b) explicit issue and bug reports recorded on the model, such as the fact that the model is failing or crashing at a particular place; c) implicitly discussed software problems and feature requests from related discussion forums. We hypothesize that both approaches will help to identify technical debt in the climate models.

Manually eliciting software requirements from these diverse sources of feedback can be time-consuming and expensive. Hence, we will use state-of-the-art Natural Language Processing (NLP) approaches that require less human effort. The web artifacts, interviews, and scientific texts of climate models provide reflections on the climate Community of Practice (CoP). We will use an unsupervised topic model, Latent Dirichlet Allocation (LDA), to study how different narratives (such as specific modules of climate models, research interests, and needs of the community members) of a climate CoP have evolved over time.

LDA assumes that documents are mixtures of topics, and a topic is a probability distribution over words. The gist of a topic can be inferred from its most probable words. We focus on the discussion forum of the Community Earth System Model (CESM, https://www.cesm.ucar.edu/). CESM is a collaboratively developed, fully coupled global climate model focused on computer simulations of Earth's climate states.
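For illustration, here is a minimal sketch of fitting such a topic model in Python, using gensim's LdaModel in place of the MALLET library actually used in the study (the toy posts below stand in for the ~7,000 forum posts):

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized forum posts; the real corpus is ~7,000 CESM forum posts.
posts = [
    ["cesm", "build", "error", "compiler", "gnu"],
    ["ocean", "model", "grid", "resolution", "pop"],
    ["mpi", "parallel", "tasks", "nodes", "layout"],
    ["cesm", "install", "setup", "machine", "port"],
]

dictionary = corpora.Dictionary(posts)
corpus = [dictionary.doc2bow(p) for p in posts]

# num_topics=15 mirrors the study; a corpus this small cannot really
# support 15 topics, so the value here is illustrative only.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=15, passes=10)

for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)  # each topic summarized by its most probable words
```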

We infer 15 topics from around 7,000 posts made between 2003 and 2023 on the discussion forum, using the MALLET topic modeling library (https://mimno.github.io/Mallet/). We ask a domain expert to assign a label to each topic and use the labels to analyze the posts. We plot the proportions of words in a document assigned to a topic over time, and we observe certain trends across the roughly 20-year period. The topic of version management of CESM models gained attention from 2018 onwards until 2021. There is a steady increase in discussion of the topic of the Coupled Model Intercomparison Project (CMIP) and Shared Socioeconomic Pathways (SSPs, https://en.wikipedia.org/wiki/Shared_Socioeconomic_Pathways). Discussions about issues related to installing and setting up CESM models and to parallel computing have declined since 2016 and 2020, respectively. Topics of source code changes, ocean modeling, and errors encountered while running CESM models are discussed throughout the period. We also observe that questions related to parallel computing received fewer responses, while questions related to compiling CESM models received more responses. These trends and insights can be used by the software engineering group of CESM to prioritize its actions to facilitate the use of the models.

We believe that this qualitative study scaffolded using topic models will complement climate science research by facilitating the generation of software requirements, onboarding of new users of the models, responding to new problems using solutions to similar problems, and preventing reinvention.

Academic research collaborations involving members from multiple independent, often international, institutions are inherently decentralized, and they encounter the same general challenges in sharing data and collaborating online as the internet at large does when no centralized solution is viable or desirable. At NCSA we have developed a free and open source full-stack cyberinfrastructure (CI) for research collaborations based on OpenStack and Kubernetes that is reproducible, flexible, portable, and sustainable. We are embracing the lessons and technology of the (re)decentralized web to curate a suite of open source tools and web services that prioritize data ownership, giving researchers as much control as possible over their data, communications, and access controls. I will present the architecture of our framework as well as example use cases from existing projects ranging from astronomy to nuclear physics, showcasing the flexible deployment system and some collaborative tools and services for identity and access management, messaging, data sharing, code development, documentation, high-performance computing, and more. By the end of the talk I hope you will have some appreciation for the value decentralized tech brings to the research enterprise and the potential for innovation that lies in the creative integration of existing federated and peer-to-peer applications and protocols.

Airtable is an increasingly popular format for entering and storing research data, especially in the digital humanities. It combines the simplicity of spreadsheet formats like CSV with a relational database’s ability to model relationships; enterers or viewers of the data do not need to have programming knowledge. The Center for Digital Research in the Humanities (CDRH) at the University of Nebraska uses Airtable data for two projects on which I work as a developer. African Poetics has data focusing on African poets and newspaper coverage of them, and Petitioning for Freedom has data on habeas corpus petitions and involved parties. At the CDRH, our software can take data in many formats, including CSV, and ingest it for an API based on Elasticsearch. This data is then available for search and discovery through web applications built on Ruby on Rails.

The first step in ingesting the Airtable data into our system is to download it. I will cover the command line tools that can do this, the formats that can be downloaded (JSON turned out to be the most convenient), and the requirements for authentication. Python (aided by Pandas) can transform this data into the CSV format that our back-end software expects. I will discuss how to rename and delete columns, change data back into JSON (which is sometimes more parsable by our scripts), and clean troublesome values like blanks and NaNs.

One advancement of Airtable over CSV is join tables, which offer functionality similar to that of SQL databases. Incorporating them into other projects has particular challenges. When downloaded directly from Airtable, the join data is in a format that cannot be interpreted by humans or by programs other than Airtable. But it can be converted (with the help of some processing within Airtable) into formats that can be parsed by external scripts and are human-readable. With these transformations, our software can use the table to populate API fields and parse it into arrays and hashes to replicate relationships within the data. Finally, I will discuss the advantages and disadvantages of Airtable for managing data, from the perspective of a developer who uses the data on the front and back end of web applications.
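As a hedged sketch of the download-and-transform step (the base ID, table name, token, and column names are all placeholders; the Airtable REST API paginates with an offset token):

```python
import requests
import pandas as pd

# Placeholder base/table identifiers and personal access token.
API = "https://api.airtable.com/v0/appXXXXXXXXXXXXXX/Poets"
HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

# Page through all records; each response may include an 'offset' cursor.
records, offset = [], None
while True:
    params = {"offset": offset} if offset else {}
    page = requests.get(API, headers=HEADERS, params=params).json()
    records.extend(page["records"])
    offset = page.get("offset")
    if offset is None:
        break

# Flatten the per-record 'fields' dicts, tidy columns, and clean blanks/NaNs.
df = pd.json_normalize([r["fields"] for r in records])
df = df.rename(columns={"Name": "name"})            # hypothetical column
df = df.drop(columns=["Notes"], errors="ignore")    # hypothetical column
df = df.fillna("")
df.to_csv("poets.csv", index=False)
```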