The aim of this project is to modify the Python library, snnTorch, to account for the adjacency of nearby neurons when firing off. Once adjacency is determined, the weights of neurons are adjusted accordingly. Our main objective was to test the theory that taking neuron adjacency into account will have a positive effect on overfitting, a common problem when designing Neural Networks. In order to complete this project, we did research into Spiking Neural Networks, and how we could utilize the concepts of fractal growth and Cayley trees to simulate natural neuron growth. To complete this project, we utilized common Python libraries such as NumPy, MatPlotLib, snnTorch, Math, Itertools, and Sys. Additionally, we utilized the CIFAR-10 and MNIST datasets to train our neural network. With these factors in place, tests were performed using the newly modified code to examine the accuracy and level of overfitting present in the new program. The results showed that our changes did not hurt the accuracy, and that it showed a minor performance increase in regard to overfitting. However, these results will require further testing in the future. This project presents a way to create neural networks that has a positive effect on overfitting, without sacrificing accuracy of results during training.
Geostreams is a comprehensive platform developed by the National Center for Supercomputing Applications, designed to facilitate open-source data management and visualizations for geospatial data. The framework enables the conversion of heterogeneous data, particularly time-series geospatial data, into a flexible schema with an online dashboard for interactive data visualizations, allowing end-users to effectively engage with the data.
The Geostreams framework comprises three primary components. Firstly, the underlying database is a robust database built on PostgreSQL. Leveraging the capabilities of PostGIS and the PostgreSQL JSON binary (JSONB) data type, this component efficiently handles the storage and management of geospatial and temporal data by structuring data sources, parameters and time series data streams into JSON documents that can be queried directly or in aggregate (such as weekly totals). The second component, the Geostreams API, a RESTful service, is written in Scala and supports automated ingest and aggregation of time-series geospatial data. For a wider reach, Pygeostreams, a Python wrapper library, is also available to interface with the API. The final component is the Geodashboard, a rich client web platform, developed in React.js and leveraging D3.js and Vega.js, dedicated to visualizing and downloading parsed data, allowing users to gain insights and facilitate further analysis.The dashboard is composed of many modular components that can be enabled and configured depending on the type and scale of data being displayed.
To enhance the framework’s capabilities, Geostreams seamlessly integrates with Clowder, a customizable and scalable data management framework designed for long-tail data. This allows Geostreams to provide long-term archiving and efficient raw file management.To ensure portability and ease of deployment, each component is containerized using Docker and can be quickly built with Docker Compose.
The effectiveness and versatility of the Geostreams framework have been validated through its successful implementation in projects such as the Great Lakes to Gulf, Great Lakes Monitoring, ARPA-E Smartfarm, and NSF Critical Interface Network. The framework provided a solid foundation that could be easily customized to meet the specific needs of each project, reducing manual effort and saving valuable time. Future plans for Geostreams, include migrating the API services to Django for seamless integration with advanced extensions like timescaleDB, enabling real-time data aggregation. Additionally, improvements to the data model are envisioned to enhance efficiency.
The Geostreams framework serves as an excellent example of a well-structured, modular, and extensible system that can be applied to other projects and domains apart from geospatial. The patterns and principles of containerization, scalability, and seamless integration with other frameworks foster an efficient and adaptable approach to software development enabling RSEs to minimize effort without compromising on quality.
Over the last 40 years, network analysis has emerged as a prominent approach to data-intensive research. Despite this steady growth and investment, network analysis remains something of a niche specialty that can exclude novice users who usually only receive standard statistical training. Moreover, since much of the growth in network science has tended to be field-specific, tools and formats have developed independently across disciplines. The multiplicity of formats and sensitivity of social data makes existing records difficult to share across scholars in the field, limiting the opportunity for new findings on the already accumulated body of network data. IDEANet - Integrating Data Exchange and Analysis for Networks - aims to maximize scientific discovery in human network science by significantly lowering the analytic and access barriers-to-entry for researchers. IDEANet is supported by the National Science Foundation as part of the Human Networks and Data Science - Infrastructure program (BCS-2024271 and BCS-2140024).
IDEAnet features three key components (1) a suite of analysis tools developed in R which automatically generate standardized network analytic measures (2) a GUI (Graphical User Interface) which gives access to the aforementioned measurements through an easy-to-use menu-based program and (3) a secure data repository that routinizes the capacity for archiving and accessing network data, including sensitive data.
The analysis tools are distributed as a package and built with real-world data constraints in mind to allow novice users the ability to gain substantive results as efficiently (but still accurately) as possible. Core metrics comprise 17 node-level measurement (e.g., degree, centralities, reachability) in addition to 27 system-level metrics (e.g., network size, dyad census, transitivity). Additional modules include multiple regression QAP, multi-relational blockmodeling and a community-detection routine that partitions the network based on 10 commonly used methods and evaluates their concordance using CHAMP. Further modules are in development including meta-population disease simulation and dynamic network diffusion simulation.
The secure data repository is hosted on Dataverse in collaboration with Duke University Library. Researchers are often interested in sharing their data but can be limited by strict institutional requirements. Our repository facilitates this transition by offering three levels of data security: Open access, Secure non-restricted, and Secure use-restricted. Secure non-restricted data include some level of confidentiality such that investigators require IRB approval for access. Secure use-restricted data requires both IRB approval for use and further substantive limitations required by the data owner. To accommodate the diversity in requirements, IDEAnet makes use of the imPACT architecture – a “notary service” that seamlessly matches user certification and data access requirements.
The difficult learning curve involved in learning network tools means that researchers with substantive interests in network processes but who are not specially trained in network methods must either invest heavily in training or risk serious analytic errors. The goal of IDEANet is to provide an integrated network data analysis framework within R that capitalizes on the best of current tools while building robust safeguards against common data and analytic errors.
Recently the US-DOE Office of Advanced Scientific Computing Research (ASCR) office funded several seedling projects to conceptualize software sustainability organizations (SSOs). One of them, the Collaboration for Better Software (for Science), or COLABS, has research software engineering as its centerpiece. It aims to provide a wide range of services to client software projects and the broader community in partnership with ASCR’s user facilities. These services include essential and advanced software engineering services, and place a strong emphasis on workforce development and retention by providing long-term stability, training, and support to enable and encourage RSEs and other staff to build their careers and excel in this role. With this poster, we will engage with the RSE community; to get their input, and refine our objectives for the SSO, and how these objectives can be met.
COLABS is envisioned as a multi-institutional distributed organization that will initially be anchored by three DOE national laboratories (ANL, LBNL, and ORNL), but can expand to include a wide variety of institutions including universities, industry, non-profit foundations, etc. In addition to providing services directly, COLABS RSEs will also become ambassadors for changing the perception of RSE roles in scientific computing.
Randomized controlled trials (RCTs) are considered the highest level of evidence to establish causal associations in clinical research. However, problems with design, execution or reporting of the trial process can lead to unreliable findings, excessive costs, and, potentially, harm for patients. Clinical trials often suffer from poor methodological and reporting quality (also known as rigor and transparency, respectively). Two reporting guidelines, CONSORT(Consolidated Standard of Reporting Trials) and SPIRIT(Standard Protocol Items: Recommendations for Interventional Trials) have been designed to promote complete and clear reporting in RCT publications (results publications and protocols, respectively). Using these guidelines, the validity and applicability of the RCT findings can be better assessed. Although endorsed by many high-impact medical journals, adherence to these guidelines remains suboptimal, possibly because journals lack methods for enforcement and verification, which involves a substantial amount of journal staff or editorial time.
RCTCheck uses natural language processing techniques and data management software (Clowder) to analyze RCT manuscripts and identify information related to rigor and transparency as defined in these guidelines. RCTCheck analyzes user uploaded manuscripts, identifies sections and sentences, and using a Transformer-based deep learning model (PubMedBERT), classifies sentences into individual items in CONSORT and SPIRIT checklists and generates a report on the transparency of the manuscript. This report can assist authors in checking the completeness of their reporting, and journals in maintaining high reporting standards, and other stakeholders of clinical research in critically assessing the quality of clinical trials, synthesizing evidence and promoting open science practices, leading to better clinical care and treatments.
The user management functionality in Clowder provides different types of access for researchers looking to use the labeled and processed clinical manuscript data and authors looking to publish their work by ensuring that the manuscripts align with the guidelines. An author will be able to upload a manuscript to RCTCheck and download a report of its critical appraisal. A researcher is able to process a larger number of published reports and manuscripts and conduct large-scale analyses on reporting quality. They will also be able to provide feedback on model predictions, which can potentially improve the deep learning models.
The system leverages the Clowder data management framework to provide a backend for the execution of PubMedBERT and pre- and post- processing steps, including PDF extraction and visualizations of results. A special purpose web client provides a specialized user interface for authors and publishers. We present the overall architecture of the system and the current implementation with emphasis on leveraging existing generic cyberinfrastructure and extending it for specific use cases.
Significant efforts have been placed on creating big data management architectures, but there is still a large middle ground between these projects of the largest scale and a small data project that requires little data management. This middle ground which, depending upon hardware utilized, may be some-where in the range of millions to tens of millions of records in scale. This scale is just big enough to create headaches if that data isn’t handled and managed efficiently. Additionally, projects of this scale may not have the resources to support the tools and people required to manage a system intended for data at a larger scale. The data lakehouse architecture presented here aims to fill this niche by delivering a solution that is performant at this scale and is built with open source technology. As a case study of this architecture, we showcase what we developed to manage data for MOSSAIC (Modeling Outcomes using Surveillance data and Scalable AI for Cancer), a natural language processing deep learning joint project between the Department of Energy (DOE) and the National Cancer Institute (NCI). The research element of the project is focused on (1) developing large-scale, state-of-the-art foundation models for clinical information extraction, (2) building new capabilities for biomarker and recurrence identification and detection, (3) pushing novel research in uncertainty quantification, explainability, and safe deployment, so that production AI models can be effectively and reliably deployed in real-time at the population level, (4) expanding the implementation of these tools in real world cancer registry settings as well as other clinical settings such as health care facilities, clinical laboratories etc., and (5) enabling large-scale foundation model training on DOE Leadership Computing Facility supercomputers. Project data is sourced from the National Cancer Institute’s Surveillance, Epidemiology, and End Results Data Management System (SEER*DMS) which acts as a central registry of cancer data consolidating information from individual state cancer registries. A research endeavor of this nature creates many requirements and constraints. For example, we are working with highly sensitive personally-identifiable cancer patient data that, according to our data use agreements (DUAs), required us to build a system that operates in a network disconnected, air-gapped environment. Additionally, we needed to provide the capability to limit access to segments of some data when needed. Finally, the solution needed to be easily shareable with external healthcare institutions where there may be budget, hardware, or data management expertise constraints. The solution presented is a data lake design that can be set up on any file system as well as any object storage system that utilizes the S3 API to store data from any type of source, and is loosely coupled with an in-process serverless analytical database management system that can both catalog the files in the data lake and query them through a no-code GUI SQL editor utilizing a JDBC driver or through code with a number of language-specific client APIs including Python. The open sourced nature of the project offers solutions for the science community, and shared practices for processing and storing data may lead to easier data harmonization and increased reproducibility. This architecture ultimately gives mid-sized data projects a direction for a system that will scale, offers excellent query performance, is resource efficient, and is flexible for future technology changes.
In the realm of advanced research, diverse data sources and complex computational workflows pose significant challenges in achieving centralized data management, efficient computation, and reproducibility. This poster proposes a sophisticated pipeline that combines the solutions to the existing problems in a centralized location.
The presented pipeline encompasses two indispensable components: the input preparation pipeline and the computation pipeline, both meticulously designed to address the aforementioned obstacles. The input preparation pipeline seamlessly harmonizes data collection from a multitude of platforms, including the likes of XNAT, One Drive, and other various cloud storage services. The computation pipeline, on the other hand, showcases the remarkable capabilities of SLURM and HPC clusters, empowered by the implementation of singularity containers, ensuring not only performance and scalability, but also reproducibility.
Furthermore, the poster uses BRILLIANCE (Brain tailoRed stImulation protocoL for acceLerated medIcal performance) as a use case to establish its utility, relevance and significance in the field of advanced research. The use case seamlessly integrates BIP (BRILLIANCE Input Preparer) for comprehensive data preparation and uses singularity containers and computational resources of ACCRE (Advanced Computing Center for Research and Education at Vanderbilt University) to obtain scalability, elevated performance and reproducibility.
By embracing this pipeline, researchers and practitioners can seamlessly integrate their data and compute. Furthermore, as the computational units are containerized, researchers can easily ensure reproducibility. This poster aspires to empower researchers to exercise their full potential and facilitate advancing strides in their respective fields.
This poster presents the findings of research on the prevalence of research software as academic research output within international institutional repositories (IRs), often termed Research Information Systems (RIS). These platforms contain permanent metadata records of research output from the university. While these exist mainly to meet funders' open access requirements, they also serve to replace the old homepage of individual academics and to provide metadata on their contents for services that aggregate harvested content, thus increasing the FAIRness of the artifacts. Expanding on work conducted on UK-only repositories by using source data from OpenDOAR, a directory of global Open Access Repositories, similar analyses were applied to international IRs in what we believe is the first such census of its kind. 4,970 repositories from 125 countries were examined for the presence of software, along with repository-based metadata for potentially correlating factors. It appears that much more could be done to provide trivial technical updates to RIS platforms to recognise software as distinct and recordable research output in its own right. This poster will present the main results and the software approach used to examine such a large quantity of IRs, allowing future work to pivot on the datasets found.
Jupyter Notebooks are open-source tools researchers commonly use to develop workflows and other software. Researchers and RSEs alike are most likely familiar with the Classic Notebook interface, the original web application for creating and sharing notebooks, but there are several other coding environments to choose from. An Integrated Development Environment (IDE) is a software application that provides helpful features beyond traditional source code editors, such as debuggers, for developing software. However, IDEs such as VSCode can present a barrier to entry for researchers familiar with other tools. JupyterLab, an alternative developed by Project Jupyter, is an extensible development environment for notebooks that comes with many IDE-like features, including a debugger and tab expansion. Additionally, the community maintains many other helpful extensions that do not ship with the default environment. Our JupyterIDE project collects and curates useful extensions and provides notebook-based tutorials for how to use them. Tutorial-style notebooks include notebooks on Vim keybindings, which make cell manipulation faster and easier, and language server processing, which provides code auto-completion and linting features. Tools like these can make JupyterLab an ideal environment for developing research workflows that can be used by seasoned RSEs who are accustomed to IDE features in collaboration with researchers who may not have interest in investing time into learning a new tool. JupyterIDE makes these tools more accessible for users and promotes software engineering best practices in a research environment.
Software plays a key role in the scientific discovery process. The user experience and sustainability of scientific software are critical to ensure the production of knowledge and scientific progress. Today, scientific software programs and projects often do not have the methods, processes, or best practices that are necessary to ensure high quality usable software. Knowledge from commercial software cannot be directly applied to scientific software due to differences in resource allocation, organizational structures, target audience, and scientific goals. To understand and bridge these gaps, our project, Scientific software Research for User experience, Design, Engagement and Learning (STRUDEL) is developing a typology of scientific software work and a design framework to understand and support the scientific software process including user interface design and development. We believe that this typology and design framework is necessary for research software engineering (RSE) practice to develop usable and sustainable software.
A sociologically informed typology helps break down diverse scientific projects for analysis and comparison by stakeholders. The STRUDEL typology is designed to guide scientific users who have questions about what types of user experience and software sustainability work to invest in, as well as when to do so. Our aim is for this typology to be a strategic thinking partner to guide project leaders, funding officers, domain experts, software developers, and so on when thinking about their product’s needs for UX, software sustainability, and overall strategy. It unpacks connections between organizational (project structures & funding), social (roles of people & structure of teams), and technical issues (technology stacks, target users, etc.) that shape scientific software to help users answer key strategic questions.
The STRUDEL design framework provides fundamentals and guidelines along with standard components and generalized UI flows for accomplishing specific tasks in scientific user interfaces that can be reused and customized. This science design framework will enable science teams to design and implement more usable and effective interfaces to address these unique challenges.
Overall, the STRUDEL project aims to bolster scientific software development efforts by improving the user experience, software quality, and software sustainability. In this poster, we will discuss our work and its broader applicability to RSE practice.
Heterogeneous data is all around us and is defined as one that comes with many variations, types, incomplete details, and sometimes inaccurate information. This data can take different forms in large academic communities like universities, including information about courses, health and wellness activities, research, events, user groups, transportation, food and dining, buildings and rooms, and facilities. Accessing and processing heterogeneous data comes with challenges, including finding data and relevant metadata, various data formats, systems, communication protocols for accessing data, authentication and authorization methods, and incomplete and sometimes incorrect or obsolete information. Understanding and processing heterogeneous data is challenging, but it presents numerous opportunities for gaining deeper insights about the community and enabling data-driven decision-making, potentially leading to a better experience for all university participants (students, faculty, staff, alums, and other community members). Rokwire is an open-source platform for developing mobile applications intending to empower smart, healthy communities. It envisions integrating and processing a wide range of information and providing access through a mobile application that is personalized and privacy aware. The Rokwire platform includes core functionalities or Building Blocks that communicate with different systems to process raw data and make it available in a format multiple client applications can consume and deliver to its users. We briefly discuss the Rokwire platform and its capabilities around heterogeneous data processing within a large academic community, focusing on two of its functionalities: events data processing and managing software contributions to the platform. For each of these functionalities, we discuss its key characteristics, how it aids in reducing the barrier to heterogeneous data processing and increasing data access within academic communities, the current implementation, and future work. Events data come from different sources, with similarities and variations in the data format and content. We architected and developed an Events Manager web application that lets users create and manage events and process event data from different sources. The backend uses an Events Building Block web service module that standardizes event data and stores it in a database. When working with heterogeneous data, one might encounter data that cannot be directly integrated or shared for different reasons, like legacy file formats or data sharing restrictions. For data that cannot be directly shared with the platform, we have developed a Contributions Catalog web application supported by a Contributions Building Block web service module. External collaborators willing to contribute and integrate software modules that process such data with the Rokwire platform can use this application to share details of those modules, including purpose, data needs, protocols of data use, and removal. After a thorough and successful review, such third-party applications can be integrated with the Rokwire platform and made available to users. In future work on event data processing, we will continue adding new capabilities like enhancing event data (e.g., finding more accurate event locations when the data is incomplete) and improving its usage across the platform. In managing software contributions to the platform, we plan to provide enhanced review and publishing capabilities, including support for deployment. We also discuss our ongoing work of migrating these modules to a newer technology stack. We conclude by briefly discussing the collaboration of Research Software Engineers (RSEs) in architecting and developing the Rokwire platform and discussing the value added by the platform for the university community.
Insights into the complex interactions and consequences of turbulent and stratified shear flows are critical for deciphering heat transfer in the ocean and its role in the global climate system. This work is part of an effort to classify the regime of a stratified inclined duct (SID) flow in real time, enabling researchers to prioritize relevant experimental data and minimize storage costs. Here, the authors have performed curvelet analysis to extract key features and textures from shared SID shadowgraphs (see  fig. 1 for the experiment’s setup). The Fast Discrete Curvelet Transform, provided by the CurveLab Matlab toolbox, has been proposed by Cand`es and Denoho , and is effective in its windowing approach, which separates a given signal into both different scales and orientations (angles) before performing ridgelet transforms. The spatial domain of the curvelet transform’s coarsest and finest scales, performed on a turbulent flow, are shown below; note the different textures that are extracted from the original image.
Researchers are increasingly using web applications to promote their work in an accessible and engaging format. By leveraging interactive visualizations and intuitive interfaces, researchers can effectively share their data and code within the scientific community. RSEs may be interested in working with researchers to build web applications that have the potential to improve code and data reuse. Despite the value of these communication tools, maintaining them eventually falls to the researchers who are not incentivized to learn new tools and technologies. We present a network analysis visualization tool that is demonstrative of how an existing research workflow in a Jupyter Notebook can be transformed into complex web applications without leaving the JupyterLab development environment. This application uses Jupyter widgets (ipywidgets) to add interactive components - such as sliders, dropdown menus - and a network visualization widget (ipycytoscape) to visually explore and analyze a large citation network. Voila strips away code cells, leaving behind only interactive browser components, resulting in a fully-fledged user interface. By adapting existing workflows, researchers working with RSEs can benefit from the familiarity of the codebase and the development environment. This helps them maintain the application beyond the period of collaboration. Based on our experience, we recommend that researchers and RSEs consider the adoption of Jupyter Notebook and Jupyter widgets to transform existing workflows into intuitive, interactive, and aesthetic web applications that can effectively communicate their research findings.
Releasing software is an important part of the work of many RSEs. Modern software stacks can often be complex, even within the scope of a single project or set of projects (a "software ecosystem"). One of the pivotal decisions in how to approach managing this complexity is the organization of the software into "repositories". The two ends of the spectrum are to consolidate everything into a single repository -- the "monorepo" approach -- or to use different repositories for each possible identifiable module -- the "multirepo" approach. We have been personally involved in managing both kinds of approaches, and have learned some of the pros and cons as they apply to the software release managers and various project developer roles.
Almost all software projects begin with a single repository to coordinate the work of the initial developer team (perhaps one person). The path to a monorepo or multirepo starts later as more modules and use-cases are added to the core functionality. Adding more repositories tends to add complexity for developers who must work in multiple ones, and for software release managers who must look across repositories for issues and milestones. If repositories span different project namespaces, e.g. GitHub "organizations", then additional steps are added to daily tasks of linking issues or pull-requests, authorizing users, and different types of communication. On the other hand, many tools associated with distributing software are more naturally scoped to a repository rather than a part of a directory tree, including most importantly packaging systems, but also cloud-based documentation and containerization. The packaging and release modularization of the multirepo approach greatly increases flexibility in dealing with the perennial problems of dependency management, which scale super-linearly with the number of different project components and developer sub-teams (in any organization scheme).
We have found that, largely due to dependency management and its effect on code reliability and reproducibilty, the advantages of monorepo approaches at smaller scales for developer teams and code complexity start to be outweighed by disadvantages in somewhat the same way horses pulling stagecoaches works for a single family but not a group of 50 -- there is no longer enough flexibility to changes in the landscape and the interactions among the people being conveyed.
The poster will summarize our key findings, along with a description of the technical and organizational context that our experience is based on, as we found this to be of primary importance to help the audience evaluate how much our conclusions are directly applicable to their use case.
Over the last two decades, large-scale supercomputers and data centers have evolved to have a more heterogenous set of general and special-purpose processing units, e.g., multi-core CPUs, GPUs, FPGAs, DPUs, and generally xPUs, on nodes. Heterogenous parallel programming models, e.g., OpenMP, CUDA/HIP/SYCL, OpenACC, provided building blocks – via compiler and runtime systems – to harness the power and capabilities this hardware and have demonstrated large speedups through research studies. Yet, when used in production-level scientific software, these programming models not only can inhibit performance portability, but they also are error-prone because of their complex behavior and interaction with the base language’s semantics. Following a natural path to maturity, heterogeneous parallel computing has recently facilitated for programming productivity via software technology for heterogeneous parallel programming abstractions to eventually be integrated in a base language, e.g., C++, Python, Fortran. A set of such libraries uses heterogeneous programming models in their backend. Kokkos-core, and the complementary kernel library, Kokkos-kernels, both a part of the Kokkos project, is a prevalent example; Kokkos is currently developed by the U.S. Department of Energy (DoE) and used widely across DoE applications, and its capabilities are already being considered as part of the C++26 standard’s C++ parallel STL. Kokkos training and tutorials give users an intuitive understanding of how to develop Kokkos programs, regardless of Kokkos backend, and there are different channels for training through, e.g., example programs and tutorial videos. The Klokkos X-Stack Project (Klokkos is a combination of KLEE and Kokkos) in the U.S. Department of Energy aims to provide automated testing and analysis for Kokkos users to run Kokkos program and understand common API usage mistakes without needing actual hardware or a platform, i.e., there is no need to have GPU or other accelerators and kernel libraries like BLAS.Algorithmic/automated testing of parallel programs is computationally intractable due to the problem of the number of different paths a program can take. In the context of research software engineering, we ask: can a tool for automated testing of parallel programs using concolic analysis, coupled with a complementing set of parallel programming examples which are classified as correct/incorrect (thereby offering ground truth), help reduce the computational intractability of automated testing of parallel programs and thereby reduce the human burden of debugging of parallel programs? We will answer this in two steps. First, we will showcase automated analysis tool for Kokkos parallel programs that uses guided symbolic execution through an LLVM-based Klee plugin; this automated analysis can be considered a first pass of the compiled program for bug detection and a step before the expensive dynamic analysis. Second, we will then identify the ways that a set of community-gathered parallel programming examples classified as correct or incorrect can further improving tractability concolic analysis of parallel programs.