Tutorials
Tutorials will be held entirely online in the two weeks prior to the conference, Monday, October 2nd through Friday, October 13th.
GitHub Actions for Scientific Data Workflows
Valentina Staneva, eScience Institute, University of Washington
In this tutorial we will introduce GitHub Actions as a tool for lightweight automation of scientific data workflows. GitHub Actions have become a key tool in the software development lifecycle; however, many scientific programmers who are not involved in software deployment may not be familiar with their capabilities or with how they can be applied within their data pipelines. Through a sequence of examples, we will demonstrate some of GitHub Actions’ applications to automating data processing tasks, such as scheduled deployment of algorithms to streaming data, updating visualizations based on new data, model versioning, and performance benchmarking. For the demonstration we will access a public hydrophone stream and compute and visualize statistics of sound patterns. The goal is that participants will leave with their own ideas on how to integrate GitHub Actions into their own work.
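For a flavor of what such automation looks like, a minimal scheduled workflow might resemble the following sketch (the script name, schedule, and output directory are illustrative placeholders, not the tutorial’s actual materials):

```yaml
# .github/workflows/process-data.yml -- illustrative sketch only
name: Scheduled data processing

on:
  schedule:
    - cron: "0 6 * * *"   # run daily at 06:00 UTC
  workflow_dispatch:       # also allow manual runs from the Actions tab

jobs:
  process:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      # fetch new data, compute statistics, regenerate plots (hypothetical script)
      - run: python process_stream.py
      # commit any updated outputs back to the repository
      - run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add -A outputs/
          git diff --cached --quiet || (git commit -m "Update processed data" && git push)
```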
Prerequisites: a GitHub account; basic familiarity with git, GitHub, and version control; experience programming in a scripting language such as Python or R
Audience: scientific programmers interested in automating components of their workflows through existing tools for software continuous integration/deployment.
Introduction to Spatial Data Processing
Nick Santos, University of California, Merced
This tutorial provides an introduction to processing spatial data. The goal is that participants leave the workshop with an understanding of and ability to use spatial data types, coordinate systems, and basic data processing with spatial joins and zonal statistics. These processing methods allow for a wide variety of data manipulation and aggregation. Participants will also learn to visualize data and results in basic maps. Though introductory, the tutorial is designed to teach a skill that is important for many areas of research, but which may be new to some RSEs.
The tutorial will use Google Colab or Jupyter notebooks and will be primarily hands-on, with short lectures used only to illustrate core concepts with graphics. The workshop will use Python with the geopandas and rasterstats libraries but will emphasize concepts and topics that can be applied in any language or computing environment that supports spatial data. Participants will only need access to a web browser.
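For a flavor of the two core operations, a minimal sketch using geopandas and rasterstats might look like the following (file names and column names are hypothetical placeholders):

```python
import geopandas as gpd
from rasterstats import zonal_stats

# load two vector datasets (file names are placeholders)
counties = gpd.read_file("counties.shp")
stations = gpd.read_file("stations.shp")

# reproject so both layers share a coordinate reference system
stations = stations.to_crs(counties.crs)

# spatial join: attach to each station the attributes of the county containing it
stations_with_county = gpd.sjoin(stations, counties, how="left", predicate="within")

# zonal statistics: summarize a raster (e.g., elevation) within each polygon
stats = zonal_stats(counties, "elevation.tif", stats=["mean", "min", "max"])
counties["elev_mean"] = [s["mean"] for s in stats]

# a quick choropleth map of the result
counties.plot(column="elev_mean", legend=True)
```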
Prerequisites: Required background knowledge includes:
- Being comfortable with tabular data and the concepts of table joins (e.g. with SQL, data frames, etc.)
- The ability to read Python code. Participants will modify and run code snippets, but won’t need to write Python code without using an example as a base.
Audience: research software engineers who are already familiar with tabular and/or image data but do not yet have experience with the characteristics and requirements of spatial information.
Publish your software in conda-forge
Dave Clements and Valerio Maggio, Anaconda
Conda is a widely used (30M+ users), multi-platform (Linux, macOS, Windows, …), language-agnostic packaging and runtime environment management ecosystem. This hands-on tutorial works through how to publish your open source software packages in conda-forge.
conda-forge hosts over 20,000 packages and serves over 3 billion package downloads per year. It is the largest community-managed conda channel in existence, and it is an excellent platform for making your software easy for others to install and integrate with other open source tools.
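For example, once a package is published, anyone can install it with a single command (here for a hypothetical package named mypkg):

```
conda install -c conda-forge mypkg
```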
In this tutorial we will:
- 0:20 - Introduce software packaging concepts and challenges
- 0:20 - Introduce the conda ecosystem
- 1:30 - Walk through how to prepare a sample software package for publishing in conda-forge from scratch:
  - 0:10 - Introduce the example package and its dependencies
  - 0:15 - Add tests
  - 0:50 - Define your package recipe in a meta.yaml file (a sample recipe sketch follows this list)
  - 0:15 - Build your package with conda-build
- 0:20 - Submit your package to conda-forge and shepherd it through to publication
- 0:30 - Port packages that are already on PyPI/pip (Python) or CRAN (R) to conda using Grayskull
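For orientation, the heart of a conda-forge recipe is the meta.yaml file; for a hypothetical pure-Python package it might look roughly like this (the name, version, hash, and dependencies are placeholders, and the tutorial develops a complete recipe step by step):

```yaml
{% set name = "mypkg" %}
{% set version = "1.0.0" %}

package:
  name: {{ name|lower }}
  version: {{ version }}

source:
  url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/{{ name }}-{{ version }}.tar.gz
  sha256: 0000000000000000000000000000000000000000000000000000000000000000  # placeholder

build:
  noarch: python
  number: 0
  script: python -m pip install . -vv

requirements:
  host:
    - python >=3.8
    - pip
  run:
    - python >=3.8
    - numpy

test:
  imports:
    - mypkg

about:
  home: https://github.com/example/mypkg
  license: MIT
  summary: An example package
```

For packages already published on PyPI or CRAN, Grayskull can generate a first draft of such a recipe automatically, e.g. `grayskull pypi mypkg`.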
At the end of the tutorial, participants will have a basic understanding of software packaging, how conda implements it, and how to prepare and publish their own packages in the conda ecosystem.
Prerequisites: Participants will need either a Linux or macOS laptop, or a Windows laptop with WSL. Laptops will need a web browser, shell access, a text editor program, and git and/or a GitHub client already installed. Participants should have experience with the command line, a text editor, and GitHub. No prior package creation knowledge is assumed.
Audience: software engineers with some experience incorporating software dependencies in their work.
Dave Clements is an open source community manager at Anaconda, and has been involved in training and teaching throughout his career. Most recently, he led training efforts at the Galaxy Project for over ten years. Before that he had a similar role at the GMOD Project, was adjunct faculty at the University of Oregon, taught courses while in graduate school, and developed and presented training to programmers and end users at a Fortune 500 company.
Software Quality Practices for Reproducibility
Reed Milewicz and Miranda Mundt, Sandia National Laboratories
In this tutorial, participants will learn about evidence-based software engineering strategies for addressing reproducibility across the software lifecycle. The tutorial will center around three interrelated topics:
- Setting software quality priorities around reproducibility
- Tailored software development practices that facilitate reproducibility
- Software process improvement techniques for incrementally introducing those practices into teams’ workflows
Participants will receive instruction on managing software quality priorities with regard to reproducibility, hear insights from real-world teams on practices that facilitate reproducibility, and learn how to take concrete steps toward improving those practices within their respective projects based on the Productivity and Sustainability Improvement Planning (PSIP) toolkit, which the presenters previously helped develop and pioneer for use with teams in the Exascale Computing Project.
The course content represents a living curriculum based on the organizers’ ongoing research with real-world teams into software quality practices for reproducibility. Organizers will solicit feedback on how to improve upon or add to the tutorial. The outcome of this session will be concrete steps that teams can take to improve their development practices with respect to reproducibility, and participants will learn some of the skills needed to approach their teams and precipitate process improvement.
Audience: research software engineers and other professionals who support the development and use of scientific and engineering software systems and workflows. This includes students and researchers as well as core production practitioners.
The course is relevant both for people looking to learn more about best practices in engineering reproducible software and for those hoping to promote those best practices within their respective institutions.
Using Globus Platform Services in Research Software Applications
Lee Liming, Steve Turoscy, and Vas Vasiliadis, University of Chicago
Research applications increasingly need to leverage, and often orchestrate, diverse systems: campus data storage and computing systems, national data and computing centers, and data-generating instruments such as gene sequencers, Cryo-EM microscopes, CT scanners, and sensor networks. Unless these interactions are automated, managing them is time-consuming and wastes the scarce human resources of research teams.
The Globus platform enables research applications developed by research teams to leverage data and compute services across many tiers of service—from personal computers and local storage to national supercomputing centers—with minimal deployment and maintenance burden. Globus is operated by the University of Chicago and is used by nearly all R1 universities, national labs, and supercomputing centers in the United States, as well as many smaller institutions.
In this tutorial, we’ll begin by introducing the Globus platform-as-a-service, including how to register an application and how to access Globus APIs using our Python SDK. We will present examples of how the various Globus services, interfaces, and tools may be used to develop research applications. We will walk participants through authentication and access control with Globus’s Auth and Groups APIs; making data findable and accessible using Globus guest collections, data transfer API, and indexed Search API; and automating research with Globus’s Flows and Compute APIs. Participants will use Jupyter notebooks to experiment with these capabilities, and they will also become familiar with the Globus web application.
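As a small taste of the SDK, the sketch below runs the native-app login flow and submits a transfer via the Transfer API (the client ID, collection IDs, and paths are placeholders you would replace with your own):

```python
import globus_sdk

# client ID of an app registered at developers.globus.org (placeholder)
CLIENT_ID = "YOUR-CLIENT-ID"

# interactive native-app login flow
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow(
    requested_scopes="urn:globus:auth:scope:transfer.api.globus.org:all"
)
print("Log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Auth code: "))
transfer_tokens = tokens.by_resource_server["transfer.api.globus.org"]

# create a Transfer client from the resulting access token
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_tokens["access_token"])
)

# submit a recursive transfer between two collections (IDs are placeholders)
tdata = globus_sdk.TransferData(tc, "SRC-COLLECTION-ID", "DST-COLLECTION-ID")
tdata.add_item("/source/data/", "/dest/data/", recursive=True)
task = tc.submit_transfer(tdata)
print("Submitted transfer, task ID:", task["task_id"])
```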
This tutorial is hands-on, utilizing Jupyter notebooks on our cloud-hosted JupyterHub system, accessible via web browser.
The presenters lead projects—sponsored by NSF, NIH, and DOE—that are building research applications leveraging the Globus platform. Employed by the University of Chicago and Argonne National Laboratory, we have decades of combined experience with both building research applications and teaching collaborators how to build them.
Prerequisites: Tutorial participants should have beginner to intermediate familiarity with Python.
Audience: research software engineers supporting teams whose work needs to be scaled up (either to solve larger problems or to complete many smaller problems faster) using university research computing resources, national-scale systems (NSF, DOE, NASA), or cloud systems.