Developing an Integrated, Interdisciplinary RSE Department at Sandia National Laboratories
Reed Milewicz, James Willenbring and Dena Vigil
The past decade has seen a dramatic growth in demand for scientific computing, and there is a well-recognized need for professionals who can advance software engineering practice within the scientific domain. To meet these workforce requirements, we need organizational structures that give recognition to software-focused personnel and provide pathways for career advancement. On the institutional front, many research organizations have made strides towards creating positions and groups for RSEs. Within the Center for Computing Research at Sandia National Laboratories, the recently formed Department of Software Engineering and Research fills this role. As an RSE team, our department provides flexible, on-demand staffing for development, consultation, and support to other departments within the Center for Computing Research. Key to our strategy is integrating RSEs into a cross-cutting R&D team alongside personnel with complementary skillsets, namely software engineering research, DevOps, and IT service management.
The conceptual model for our organization follows what we call a Research, Develop, and Deploy (RDD) workflow pattern, in which staff members play mutually reinforcing roles. We conduct fundamental and applied research in software engineering (Research), team with application and algorithm researchers to provide embedded development, maintenance, and support (Develop), and provide robust, scalable, and sustainable infrastructure and IT services to our center (Deploy). Having a workforce that spans the entire R&D lifecycle from early research to customer support makes us far more productive than if we solely specialized in embedded development work. Moreover, we believe this cross-cutting model helps amplify the impact of our RSEs and supports their career growth. In this talk, we describe our department’s strategy and our experiences as an RSE team within a scientific computing center. We hope to contribute to a broader discussion on cultivating and sustaining career pathways for RSEs.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of NTESS, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04 94AL85000. SAND2021-5264 A
Help! I’m a Research Software Manager!
Any path forward for research software development requires well-run, engaged, and thriving research software teams. We know that research software development teams are too important to be managed poorly. But no one teaches us to be good managers — especially in academia.
It doesn’t have to be this way, though. Managing well is not a personality trait; it is a set of practices and skills that can be learned. Excellent practices, and the reasons they work, have been known for decades; they’ve been recently re-“discovered” by big technology companies, such as the re:Work effort that came out of Google’s Project Oxygen, and have propagated through the industry. Learning from emerging best practices in the tech industry makes sense; the experimental, iterative work we do in research software is much more like that of startups than it is like that of, say, higher-ed IT.
And the thing is, the more advanced and challenging of those skills and practices are things that our experiences in research (like building and maintaining collegial, multi-institutional collaborations) have already helped us develop. If we take the time and discipline to learn and practice the basics, we can quickly become good-to-great managers: helping our teams be more effective, supporting more research better, and making us all less stressed in and happier with our jobs.
This talk is aimed at research software managers, or research software developers who think they might be interested in being a team lead or manager some day. Using the re:Work effort as a starting point, we’ll cover what good teams have, and four simple but key practices many research software managers need in their toolbox: embracing your new role; weekly one-on-ones; frequent and specific feedback; and delegation.
Towards a Culture of Continuous Improvement within RSE Teams
Derek Trumbo and Reed Milewicz
At Sandia National Laboratories, computation plays an essential role in every element of our national security mission. As such, producing high quality software is at the forefront of our thinking. Keeping pace with the demand for quality is challenging for many reasons, including the labs’ complex and evolving objectives and disruptive changes in computing architectures. The software systems our RSE team members will work on are large in scope and will contain many technologies, programming languages, patterns, and architectures. As research software engineers with diverse backgrounds and skillsets, we have to stay current with tools and best practices, and we must always be looking for better ways to design, develop, and maintain software. For this to happen, teams must adopt a culture of continuous improvement – a culture that values those activities – and all team members must feel responsible for and invested in that continuous improvement.
To that end, we have investigated strategies to promote long-term growth and learning and to build team consensus around software quality practices. In this talk we will present an experience report showcasing two RSE teams at Sandia where we have implemented these strategies: the Tech Talks of the Avondale team and the Best Practices meeting series of the Department of Software Engineering and Research. Tech Talks are technical deep dives into specific technologies, architectures, and/or libraries that team members could be faced with in their day-to-day duties. Meanwhile, Best Practices meetings are weekly round-table discussions where team members join together to deliberate and discuss the processes and principles that lead to high-quality software. We will explore the impact that these activities have had on our teams and make recommendations for how RSE teams at other institutions can implement similar continuous improvement activities.
A Voyage to Research Software Engineering: Ten Years Later
In September 2010, as a professional software developer, I gave a talk at the JavaOne conference in San Francisco recapitulating my experience of working with a researcher on a research software package. My main observation was a much lower impedance mismatch between the languages we speak compared to that with a user of industrial software. We could compare a researcher and a software engineer to an English and a Japanese speaker who, even if they do not know each other’s language, are able to communicate using their hands, body language and an old-fashioned printed dictionary. When a software engineer talks to a user in industry, it is more akin to communication between a cat and a dog, and thus requires an army of business developers, business consultants, product managers and software architects to transmit even a simple message.
In the last ten years I have become much more immersed in research software engineering, working with different PIs on different projects in different roles. Today I want to discuss some mentality mismatches between industry and academia. The first I would call a visionary-pragmatic mismatch: folks in industry prefer to talk about their vision for future work, while academics are only comfortable discussing what has already been done, validated and peer reviewed. Another one I would call the 90/10 phenomenon. Software developers know that the last 10% of the work usually takes 90% of our time. This is a hard-to-stomach message for many researchers. Finally, I see a much bigger fragmentation of skills and knowledge base in RSE compared to industrial software development. Most young and talented software engineers in industry are keen to learn and apply new tools and languages even if the latter come from very different realms. In RSE, people tend to be more conservative and rely on proven tools and technologies.
I would like to illustrate the fragmentation of skill sets with my experience using OLAP technologies and workflow management tools.
References
L. Y. Yampolsky and M. A. Bouzinier, “Evolutionary patterns of amino acid substitutions in 12 Drosophila genomes,” BMC Genomics, vol. 11, no. SUPPL. 4, p. S10, Dec. 2010, doi: 10.1186/1471-2164-11-S4-S10.
Research Software Engineering as an emergent profession: A sociological perspective
Research software engineering is a rapidly growing international movement with established professional organizations in several countries, but there are still many open questions about the scope and agenda of the field, and ultimately what it means to identify professionally as an RSE. This talk will review sociological and historical research on professional movements in science and computing to identify guideposts and pitfalls that may be relevant to the future of research software engineering. In particular, it will look at how the professional characteristics of research software engineers may connect with those of software engineers, research scientists, and skilled laboratory technicians. Based on a series of interviews with scientific computing practitioners, it will also consider how cultural differences between academic research institutions and national laboratories impact the relevance of the RSE identity in these organizations.
PresQT Services - A Path Forward for Connecting Platforms While Increasing Quality and FAIRness of Data and Software
Sandra Gesing, Natalie Meyers, Rick Johnson and John Wang
Researchers and educators applying computational methods mostly prefer a small set of computational environments and software for their research and teaching. Each additional software package or additional science gateway means a learning curve and time investment into research infrastructure instead of focusing on research questions and/or teaching. Thus, there’s value in saving time and effort on the users’ side through integration of new features with well-established platforms.
The PresQT (Preservation Quality Tool) framework has been designed with the researchers’ daily routine in mind. PresQT eases the use of repositories and serves as boilerplate between existing platforms, science gateways and preservation systems while adding beneficial metadata and FAIR tests (Findability, Accessibility, Interoperability, and Reuse). The main goal of the services is to support sharing, preserving and measuring FAIRness of data and software, a crucial topic for many academic projects and open science. One reason for the importance of this topic is that a variety of scientists are interested in assuring reproducibility of their results and providing long-term archival access to their data and software. Another reason lies in demands by funding bodies for researchers to report results and assure their data and software are preserved in such a way that they are accessible and reusable after a project ends. Typically, scientists reach out to digital librarians for support with the preservation process at the end of a project’s lifecycle. This timing not only creates a tight schedule but also risks the loss of important intermediate data. Additionally, preservation tasks are more labor intensive if they are considered only at the end of the project life cycle rather than at progressive stages.
PresQT and its standards-based design with RESTful web services are informed by user-centered design and a collaborative open-source implementation effort that enables seamless integration into the research ecosystem. PresQT services form the connection between science gateways, tools, workflows and databases to existing repositories. Current integration partners and implementations for select open APIs include Jupyter, OSF, CurateND, EaaSI (Emulation-as-a-Service Infrastructure), WholeTale, and HUBzero along with GitHub, GitLab, Zenodo, and FigShare APIs. The diversity of partners and integrations contributes to better understanding the needs of the stakeholders of PresQT services.
PresQT services are easy to integrate, and target systems can be added by extending JSON files and Python functions. Data is packaged as BagIt bags for uploads, downloads and transfers. The current services include transfers with fixity checks supporting diverse hash algorithms, keyword enhancement via SciGraph, upload, download and connection to EaaSI services. FAIR tests are available via the services provided by FAIRsharing and FAIRShake. PresQT provides indicators that report how FAIR the data in a target repository is when assessed and offers additional recommendations for improving FAIRness. To present the capabilities to interested developers of computational solutions, users of PresQT services and funding bodies, we have developed a demo user interface that allows for demoing and testing the different features of PresQT services.
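As a rough illustration of what a fixity check with a selectable hash algorithm involves, the sketch below compares the digest of a source file with the digest of its transferred copy. It uses only Python's standard `hashlib` module; the function names `file_digest` and `fixity_matches` are hypothetical and are not taken from the PresQT code base, whose actual implementation may differ.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Return the hex digest of a file, reading it in chunks to bound memory use.

    Hypothetical helper for illustration; not part of PresQT.
    """
    hasher = hashlib.new(algorithm)  # raises ValueError for an unknown algorithm
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def fixity_matches(source: Path, copy: Path, algorithm: str = "sha256") -> bool:
    """Report whether a transferred copy has the same digest as its source."""
    return file_digest(source, algorithm) == file_digest(copy, algorithm)
```

Passing the algorithm as a string is one simple way to support several hash functions, since different repositories may publish checksums computed with different algorithms.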
We will present the services of PresQT in the demo user interface and the API.
A Tiered Approach to Software Quality in National Laboratory Research Science
Miranda Mundt and Wade Burgess
Software quality rarely requires an introduction these days and is an undeniable necessity for our technology-rich world. In research science, however, what constitutes software quality is sometimes an open question. How can you assure quality on a cutting edge research topic or a proof-of-concept project?
During 2020 members of the Software Engineering and Research Department at Sandia National Laboratories developed a tiered approach to software quality which establishes a flexible software quality standard for software projects at various points in their development. The standard accounts both for the eventual intended use of the software (e.g., for use by a domain expert or for use by a general population) and the current stage of development (e.g., proof of concept, pre-deployment, etc.).
The ultimate goal of this effort is to promote a quality baseline for a wide range of scientific software projects, which in turn contributes to better reproducibility, replicability, and software stability. This talk will discuss the inspiration and challenges behind the tiered approach to software quality as well as the details of the four defined tiers.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of NTESS, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04 94AL85000. SAND2021-5276 A
Strategies for containerizing applications for use in HPC
Angel Pizarro and Christian Kniep
Researchers interested in computational reproducibility of their analysis, and workflow portability across different environments, are increasingly turning to containers. Containers allow for packaging up a researcher’s scripts, the underlying applications and their dependencies so the application runs quickly and reliably from one computing environment to another.
While there exists a nascent community focused on bringing the benefits of containers to HPC, there are still a lot of open questions as to what are the best practices and solutions to take advantage of all of what an HPC resource has to offer. This is especially true for applications that require access to specialized architectures, or run across nodes and have particular networking requirements.
In this workshop, we will review the benefits and challenges of container adoption for HPC administrators, RSEs, and end users. We will also cover some best practices and guidance on how to containerize applications for use across local and remote HPC resources and how to enable collaboration.
Sustainability as a Key Enabler for RSE Paths Forward
Many of us are confronted with the word “sustainability” in reference to our research software projects. Often this is a welcome concept but an unwelcome word in terms of its implications. Inevitably, sustainability comes down to money: even volunteer efforts at some point relate to money as those volunteers are not doing other activities with the time they volunteer. Yet, sustainability may be one of the paths forward for research software engineers (RSE). Consider many of the goals of any software professional: interesting work, job security, sufficient resources, collaboration, freedom of schedule, path for advancement, and pay. The fact that most research projects are temporary works against nearly all of these goals. On most project teams, everyone knows where the end of the runway is financially, and this forces risk-based employment decisions on the RSE. For projects that consider sustainability after the grant, nearly all of the above goals can be met. The HUBzero® project is one such project. HUBzero is a software platform for building and operating science gateways. The HUBzero team engages in consulting for platform customizations, core development of the platform, and standing up and operating gateways for its clients. Serving over 25 paying clients annually provides sufficient variety of activity and sufficient growth opportunities in many areas. RSEs have the opportunity to initially work as members of a matrix organization in several project roles, and can advance to project lead roles over time. With the “no grant end date” duration of the project, RSEs have the opportunity over time to move across technology areas and into management roles. In several instances, the project has taken fresh graduates and grown them into senior management roles over the course of 10 years. In other cases, team members who joined the team as generalists have grown into subject matter experts in areas like cybersecurity, user experience, middleware, and so forth.
Also needing consideration is the environment in which the project is housed. The HUBzero project recently moved from Purdue University to the San Diego Supercomputer Center (SDSC). SDSC has specialized in science gateways for over a decade. In this environment, other members of SDSC have had the ability to join in HUBzero activities, and HUBzero team members will have new opportunities to move across divisions within the organization while still keeping a focus on science gateways. All of these aspects feed the sustainability of the project, which in turn provides an environment where RSEs do have a path forward. As a community we should ask ourselves, “How can we create such possibilities for RSEs who are not working on such large projects?” An answer to this may be in the creation of larger multi-project, multi-institution virtual organizations of RSEs with opportunities for advancement across suites of virtually managed projects without having to change employers. The HUBzero team and the Texas Advanced Computing Center are engaged in a virtual team experiment to test this hypothesis. These activities will be described in greater detail during the presentation.
Collaborative Container Modules with Singularity Registry HPC
Singularity Registry HPC, “shpc” (https://singularity-hpc.readthedocs.io/), is a collaborative framework to easily provide containers as modules for high performance computing (HPC) centers. Unlike registry servers that expect you to pull and manage a container directly, shpc provides a simple command line client that manages this interaction for you and makes container commands easily available as command line aliases. Along with providing over 200 containers as modules, ranging from neuroimaging analysis to bioinformatics to Nvidia bases for machine learning, shpc captures an opportunity for research software engineers from different kinds of institutions to collaborate on a unified effort. In this talk, I discuss this alternative model for providing a shared, collaborative registry, along with reflections on how different institutions, such as academic centers and national labs, can better work together.
Refactoring Researcher-Developed Statistical R Packages: Technical Details and Challenges
Naeem Khoshnevis and Mahmood Mohammadi Shad
We present the refactoring process of an R package (CausalGPS), with its technical details and the challenges we faced. Many statistical codes and scripts are developed to prove a proposed method and are tested on relatively small data samples. Because of the lack of standard software engineering practices, these codes are not ready to be used by broader audiences. They are mostly archived, or operated only by the original authors (biostatisticians) for further development toward new publications. Through close collaboration with researchers from the Biostatistics department at Harvard T.H. Chan School of Public Health, we profiled the code and reviewed it for numerical algorithm implementation, data structures, unit testing, logging, and documentation for both users and developers. We focused on modularized implementation, converted multi-objective functions into single-objective functions that can be tested separately, redesigned the implementation to reduce memory consumption, and parallelized the code in several places. We held numerous meetings and workshops for the researchers to improve their engagement. The refactored package follows standard software engineering practices and is efficient in using available computational resources. We faced numerous challenges. The statistical packages depend heavily on third-party packages. As a result, we are constrained by the packages used internally, and modifying the approach or substituting a package may require a new numerical implementation. Keeping the balance between the best software engineering practices and commonly used R code development approaches is another topic of interest, one that can improve future development by both software engineers and statisticians.
Spack Configuration Manager: Automating Toolchain Installations
Joe Frye, Miranda Mundt, Henry Swantner and Jon Pellegrini
As any research scientist can attest, essential to the development of scientific software is the use of third-party libraries. Equally as critical is the availability of consistent third-party library toolchains across varied architectures.
Since July 2020, members of the Software Engineering and Research Department at Sandia National Laboratories have been developing a configuration manager (built on top of Spack) that enables the automation of installation of third-party library toolchains and creation of modules to easily load appropriate toolchains across multiple platforms and architectures.
Written in Python, the configuration manager is a wrapper around Spack that utilizes Spack’s Python scripting and abstracts away the more difficult implementations of Spack logic. Rather than spending precious time and money to learn how to properly use Spack, users of this tool are presented with a single manifest file that is ingested to generate the configuration files necessary for Spack to install and create modules for third-party libraries, thus lowering the barrier to entry for the tool’s usage. Additionally, the infrastructure allows the generated files to be saved and shared, acting as a simple means for collaboration across different projects. This also allows for platform-specific configuration options and the capability to automate installation workflows, which in turn contributes to its re-usability across platforms.
Though the tool is currently only available internally at Sandia, this talk will discuss the requirements and specifications that were built into the tool, along with general and specific use cases within Sandia National Laboratories and eventual open-sourcing goals.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of NTESS, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04 94AL85000. SAND2021-5275 A
LoadEnv: Consistently Loading Supported Environments Across Machines
Jason Gates, William Mclendon, Josh Braun and Evan Harvey
When working in the space of developing computational simulation codes on next generation architectures, a critical component of your workflow is establishing a reproducible environment before you do anything else. Not only do you need the various compilers and third-party libraries available, but you also need a means of consistently loading a given toolchain, and easily switching from one to another, to ensure your code functions correctly and your results are reproducible. Assuming the existence of a mechanism to provide the toolchains, e.g., Spack or Sandia’s Spack Configuration Manager, how do users consistently establish the environments you support across their machines of interest? How do you as a code team communicate to your user/developer community which environments are available on which machines? How do you empower that community to stand up and contribute new environments for consideration? These are some of the questions Sandia’s Software Engineering and Research department sought to answer in developing the LoadEnv tool.
Written in Python for its documentation, unit testing, and code style conventions, the LoadEnv tool is a package, made up of bite-sized modules, that conceptually does something very simple: load modules and set environment variables. The difficulty arises in creating a tool that is easy to understand, use, and modify by scientific developers who don’t have time or inclination to understand the intricacies of establishing reproducible environments made up of complex software stacks on various next-generation hardware. The tool’s strengths include:
- Allowing personalization of the tool via a small number of plain-text configuration files without touching the driver code itself.
- Facilitating inter-team communication on standing up new environments, and deprecating or deleting old ones, with those configuration files and pull requests. This addresses existing problems such as chasing down the same issues cropping up in different ways and difficulty reproducing issues.
- Making it simple to build both collaboration on and sharing of environments directly into team workflows. Environments and associated changes can be clearly communicated and version controlled.
- Seeking to prevent the user from accidentally doing something they didn’t intend, while allowing flexibility to bend the rules, if necessary.
- Focusing on long-term sustainability through modular design, thorough documentation, and complete unit test coverage.
Though the LoadEnv tool is currently only in alpha-testing, the talk will demonstrate its usage both at the command line and as a means of facilitating team communication over supported environments.
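To make the core idea concrete (load modules and set environment variables, driven by a plain-text configuration file), here is a minimal sketch. It is not LoadEnv's actual file format or API: the INI layout, the `modules` key, the environment name `gnu-openmpi`, the module versions, and the function `shell_commands` are all assumptions made for illustration.

```python
import configparser
import shlex

# Hypothetical configuration: one section per supported environment.
# The "modules" key lists modules to load; every other key is treated
# as an environment variable. Names and versions are illustrative only.
EXAMPLE_CONFIG = """
[gnu-openmpi]
modules = gcc/10.2.0 openmpi/4.0.5
CC = mpicc
CXX = mpicxx
"""


def shell_commands(config_text: str, environment: str) -> list:
    """Translate one environment section into shell commands to be sourced."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # preserve the case of environment variable names
    parser.read_string(config_text)
    if environment not in parser:
        raise KeyError(f"unknown environment: {environment}")
    section = parser[environment]
    commands = [
        f"module load {shlex.quote(name)}"
        for name in section.get("modules", "").split()
    ]
    for key, value in section.items():
        if key != "modules":
            commands.append(f"export {key}={shlex.quote(value)}")
    return commands


if __name__ == "__main__":
    print("\n".join(shell_commands(EXAMPLE_CONFIG, "gnu-openmpi")))
```

Because the environments live in a text file rather than in the driver code, adding or retiring an environment in this sketch is a small, reviewable change that fits naturally into a pull request.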
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of NTESS, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04 94AL85000. SAND2021-5328 A
Logger: A Tool for Keeping Track of Python’s Interactions with the Shell
David Collins, Josh Braun and Jason Gates
Python is an ideal scripting language due to its documentation, unit testing, and code style conventions, all of which, if adhered to, contribute to the long-term sustainability of tools written in it. However, if you’re in the business of loading environments, cloning repositories, and then configuring, building, testing, and installing code, you necessarily have to interact with the shell. Any time you do, it’s worthwhile to collect various metadata to ease debugging when things go wrong. The Logger tool was built to capture basic information with each shell interaction (what was executed, where, when, for how long, why, along with stdout, stderr, and return code), plus additional diagnostics if you so desire (s/ltrace, CPU/memory/disk statistics, etc.). These data are then used to generate an HTML log file containing (hopefully) all the information a user/developer might need to troubleshoot a problem. If you’re familiar with the Unix script command, this is similar in principle, but with substantially more functionality. If you’re familiar with Python’s logging module, the motivation is similar, but this intends to capture what’s happening in the shell rather than in Python itself.
If one were to create a build script, for example, there would be a top-level Logger object followed by several “child” Logger objects, one for each stage of the process (clone, build, test, etc.). These child Loggers would each appear in the main section of the HTML file in collapsed form, with a small label showing the duration of that child’s commands. This helps debugging by allowing developers to see at a glance where bottlenecks may be occurring. If you need to troubleshoot something with a colleague, or have them replicate work you’ve done, simply provide the HTML log file and they can see exactly what everything looked like on your system. In general, Logger:
- Helps developers sift through the swaths of script output to find the information they care about in a clean interface.
- Facilitates communication and collaboration between team members when debugging failures.
- Improves replicability by supplying developers with all the necessary information in one place.
This talk will introduce the basics of the tool and how to integrate it into existing Python scripts.
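The kind of per-command metadata described above (what was executed, where, when, for how long, why, plus stdout, stderr, and return code) can be captured with the standard library alone. The sketch below is not the Logger tool's API; `ShellRecord` and `run_and_record` are hypothetical names, and HTML generation and the optional diagnostics are left out.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


@dataclass
class ShellRecord:
    """Metadata for one shell interaction (illustrative, not Logger's schema)."""
    command: List[str]
    reason: str
    cwd: str
    started: str
    duration_s: float
    stdout: str
    stderr: str
    returncode: int


def run_and_record(command: List[str], reason: str, cwd: Optional[str] = None) -> ShellRecord:
    """Run a command and keep everything needed to troubleshoot it later."""
    started = datetime.now(timezone.utc).isoformat()
    tic = time.perf_counter()
    result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    duration = time.perf_counter() - tic
    return ShellRecord(
        command=list(command),
        reason=reason,
        cwd=str(Path(cwd or ".").resolve()),
        started=started,
        duration_s=duration,
        stdout=result.stdout,
        stderr=result.stderr,
        returncode=result.returncode,
    )


if __name__ == "__main__":
    record = run_and_record([sys.executable, "--version"], "check the interpreter")
    print(record.returncode, f"{record.duration_s:.3f}s", record.stdout.strip())
```

A list of such records per stage (clone, build, test) would be enough to render a collapsed, per-stage summary with durations, in the spirit of the parent and child structure described above.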
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of NTESS, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04 94AL85000. SAND2021-5329 A
Breakout Discussion Topics
Building Effective Research Software Engineering Teams
Mahmood Shad and Scott Yockel
Research Software Engineering (RSE) is a new initiative being adopted in several institutions and national labs worldwide. The formation of RSE teams comes with several challenges, including funding, lack of skills, and teamwork on different RSE topics. In this breakout session, we discuss the approaches in making effective RSE teams, utilizing skills on various projects, and avoiding single-developer issues. Additionally, we introduce areas of RSE that can be applied to most projects.
RSE Group Leader Breakout Discussion
The recent growth and organizational acknowledgement of Research Software Engineers along with RSE career paths has resulted in the emergence and proliferation of an additional new role: the RSE Group Leader. Individuals in this position are often tasked with establishing new RSE groups, hiring/managing RSEs, and interacting/communicating on RSE matters with researchers and administrators. This breakout session will focus on topics relating to this by targeting individuals currently in the RSE Group Leader role or anticipating being in the role in the near future.
Goals for this breakout session are to connect individuals, foster new relationships, and share knowledge and experiences. The breakout session will help identify key struggles of current RSE Group Leaders and opportunities for US-RSE to better support this important role. If participants are interested and see value, a long term outcome could be the formation of a regular RSE Group Leaders meeting.
Potential topics for discussion include:
- How did your group start, how large is it now?
- What did/do you struggle with?
- What would you consider a recent success?
- What questions do you have for your peers?
- What resources or information would help you as an RSE Group Leader?
Finishing the RSE White Paper
Julia Damerow, Chris Hill
In the March 2021 Community Call, we discussed the topic “Changing how Academia views RSEs.” We have written up the discussion in a white paper draft. However, before we can publish the white paper, there are a few more discussion points that should be added, and we need to decide how and where the white paper should be published. In this breakout discussion, we would like to discuss the missing points to be added to the draft and plan the next steps to get the white paper published.
What else can US-RSE do for the community?
The US-RSE Association is centered around three main goals: (1) supporting the community, (2) promoting RSEs’ impact on research and (3) providing useful resources to multiple demographics. Our current activities include:
- Outreach via Slack, email, website, Twitter, newsletters - the job board is an example of a successful implementation of content
- Working groups on Diversity, Equity and Inclusion, Training and Education, Website
- Events include community calls, workshops, DEI speaker series, DEI book club, Annual General Meetings
- RSE podcast guided by Vanessa Sochat
- Financial structure
- Presenting US-RSE at related events
- Writing of white papers
- Participation in RSE International with event and survey
- Steering committee election per year (staggered to secure continuity)
The community’s growth is an encouraging sign that these activities have an impact. There is always room for improvement, of course!
Which activities or topics are missing? What additional activities or website content would you like to see? Should there be additional events? Is there something we are missing entirely?