What Should a National Research Computing Platform Be?

Posted by Jonathan Dursi on January 19, 2019 · 17 mins read

This is a crosspost from Jonathan Dursi, R&D computing at scale. See the original post here.

What is a National Research Computing Platform For in 2019?

Computers are everywhere now, but computing is still hard. Canada should build on its competitive advantage by strengthening existing efforts to provide expertise, skills and training to researchers and scholars across the country, letting others provide the increasingly commodity hardware. The result will be a generation of trainees with deep research and cloud experience, and a critical mass of talent at centres focussed on building enabling technologies.

As R&D becomes increasingly intertwined with computational techniques, the need for advanced R&D computing support to power research and scholarship has grown enormously. What that support looks like, however, and the kinds of services that researchers most need, have changed radically over the past decades.

The history of providing computers for research

In the 1990s and 2000s, the overwhelming need was simply access to computers. With no other providers for computing or storage, it fell to individual research groups to supply their own. But a natural economy of scale plays out with computational resources. Purchasing and operating hardware becomes more cost-effective in bulk; and what was even then the scarcest and most valuable resource - the expertise to operate and make effective use of the hardware - grows, rather than diminishes, through involvement in different research problems. So individual researchers’ “clusters in a closet” quickly gave way to departmental, then institutional, and finally regional or national platforms for computational research and data science support. In Canada, the vast majority of such support is offered through Compute Canada.

As we enter 2019, this landscape looks quite different than it did in the 90s. Computing resources adequate for research are thick on the ground. Indeed, as the range of problems researchers tackle with computing and data broadens, many extremely active areas of compute- and data-powered research require nothing more than a powerful desktop.

And for larger needs, the unavoidable logic of economies of scale for computers and storage has now entered the marketplace. A competitive range of commercial vendors provides access to computing resources that can meet the vast majority of other researchers’ needs. While it’s true that those commercial cloud providers charge a premium (50%-100%, slowly declining over time) over what it costs to provide the resources in academic research environments, that premium pays for enormous benefits in improved uptime, flexibility, and currency of the hardware, all of which have real value for researchers. Increasingly, even niche technologies like FPGAs, RDMA-enabled networking, and ARM processors are readily available from commercial cloud providers, leaving fewer and fewer use cases where in-house provision of computing resources remains a necessity. Those use cases are important — they include multi-rack HPC users, and the stewardship and analysis of data with the strictest regulatory on-premises requirements — but they represent a minority of computational science needs.
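
To make concrete how low that barrier has become, here is a minimal sketch - using Python and the AWS boto3 SDK, and assuming an AWS account with credentials already configured - of provisioning an FPGA-equipped instance on demand. The image ID below is a placeholder, and swapping in a different instance type (an ARM-based a1 type, say) provisions other niche hardware the same way:

    import boto3  # AWS SDK for Python; other providers' SDKs are similar

    # A minimal sketch: provisioning niche hardware on demand.
    # Instance types are real AWS offerings as of early 2019
    # (f1.* carry FPGAs; a1.* are ARM-based); the image ID is a placeholder.
    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder image ID
        InstanceType="f1.2xlarge",        # an FPGA-equipped instance
        MinCount=1,
        MaxCount=1,
    )
    print("Launched:", response["Instances"][0]["InstanceId"])

A few lines of code and a few minutes of waiting stand in for a procurement cycle; that is the comparison in-house provision now has to beat.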

The need for higher-level support

We advance research more powerfully by providing clarity than clusters. But even while computers for research become ever more accessible, computing for cutting-edge research remains a barrier to too many. Scientists and scholars are trained to be experts in their field, not necessarily experts in computer science or the latest computer hardware. Even keeping track of the latest computational methods, which frequently come from neighbouring fields if not different disciplines entirely, can be a challenge. Researchers greatly need assistance from, and collaborations with, experts in research computation itself. It is the skills, not the infrastructure, that are scarcest.

The good news is that the Compute Canada federation has a network of roughly 200 computational experts, many at the Ph.D. level, available to directly enable science projects. The bad news is that the priorities of the organization, and thus most of its effort and energies, are focussed on procuring and operating on-premises commodity computing and storage hardware - to the extent that many of those experts spend most of their time answering basic help-desk questions or performing routine operational duties for those systems.

What should today’s R&D computing support focus on?

With academic institutions now being just one player amongst many for computing and storage resources, there are a few possible futures for Canada’s computing centres – centres that have grown up primarily focused on purchasing, operating, and providing access to hardware for researchers. They could downsize, shrinking to focus on those sorts of hardware not well covered by other providers. Alternatively, they could double down on the “discount provider” model, emphasizing low cost, ‘no frills’ access to compute and storage, competing on price.

Either of these approaches would represent a scandalous squandering of opportunity, wasting invaluable and nearly irreplaceable expertise and experience in applying computational techniques to open research problems. Instead, we should do something different. We should pursue our competitive advantage by taking the existing network of computational science advisors that we already have and making those higher-level expert services the primary offering, letting other providers focus on the lower-level procurement and operation of most computing and storage hardware.

Skills beat hardware

The goal of a research computing support platform is to enable research, and to help develop the next generation of research talent. Knowledge transfer and skills development are by far the most valuable work a computing team can do to meet those goals - because skills have the longest-lasting impact, because they address real needs in Canada’s R&D ecosystem, and simply because no one else can do it at scale.

First, deep training in research methods pays long-lasting dividends. Even in rapidly changing fields like data and computational science, skills and experience don’t depreciate the way computing hardware does. New methods come, but old methods don’t really go; and fluency in the previous generation of methods makes learning – or even creating – newer ones easier.

And it’s actually even better than that, because not only do the skills that come from that research experience and training remain useful in their field for long periods of time, they transfer to other disciplines extremely well. Methods for solving equations, or pulling information out of data, have strong relationships with each other and can often be applied with modest modifications to problems well outside the fields in which they were first developed. These broad areas of effort - Data Science, Informatics, Simulation Science, and the Data Engineering or cloud computing tools needed for them - are enabling research technologies which can empower research in many fields. And therein lies the second reason for the importance of skills development: these research-enabling technologies are areas in which Canada currently lags. A recent report on the State of Science and Technology and Industrial R&D specifically calls out “enabling technologies” as a current area of weakness for Canada, one which is holding back high-impact research in other areas. Focussing on such highly transferable skills and talent development in our research computing platform would help build a critical mass of such expertise both in the research computing centres themselves and in the community as a whole.

Finally, there just aren’t other options for providing high-level data and computational science collaboration and training to Canada’s scholars and researchers consistently and across disciplines. We in the research community know that the availability of a collaborator with complementary interests and skills can make the difference between a research project happening or not. Unlike access to commodity computing hardware, the skills involved in making sure researchers have access to the best methods for their research, and in training emerging research talent in the computational side of their discipline, are very much not commodity skills, and cannot be purchased or rented from somewhere else.

The cloud premium is a price worth paying

The benefits of further efforts in skills development and training are fairly clear, and this alone would justify redirecting some effort from hardware to research services, and using commercial cloud providers to fill the gap. But having substantial commercial cloud resources available for researchers is worthwhile on its own merits.

Firstly, cloud provides more flexibility for rapidly changing research. The resource mix can be much broader and change much more rapidly than traditional procurement cycles would allow; what’s more, those changes can be made in response to demonstrated researcher needs, rather than to predictions and assumptions about the next five years based on existing research users. As with owned systems, taking advantage of this flexibility dynamically requires top operational staff. And the uptime and hardware currency of these resources will generally be significantly better than what can be provided in-house.
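
As a minimal sketch of what that flexibility looks like in practice - again Python with boto3, with purely illustrative instance types, counts, and a placeholder image ID - the resource mix becomes a piece of data that can be revisited whenever demand shifts, rather than a five-year procurement decision:

    import boto3

    # A minimal sketch: the resource mix as adjustable data rather than
    # a fixed procurement. Types and counts below are illustrative only.
    ec2 = boto3.client("ec2", region_name="ca-central-1")

    desired_mix = {
        "c5.4xlarge": 20,  # general-purpose compute
        "p3.2xlarge": 4,   # GPU nodes, e.g. for machine learning workloads
        "r5.8xlarge": 6,   # high-memory nodes, e.g. for genomics pipelines
    }

    for instance_type, count in desired_mix.items():
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder image ID
            InstanceType=instance_type,
            MinCount=count,
            MaxCount=count,
        )

Adjusting the mix next quarter is an edit to that dictionary, not a new procurement.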

Secondly, trainees and staff benefit from gaining extremely relevant commercial cloud expertise. This goes back to skills development a bit, but in this case it’s the system tools – the experience working with commercial cloud services and building data systems solutions using them – that are valuable in and of themselves, and will be attractive skills to have in whatever career they move on to.

Finally, commercial engagement can proceed much more smoothly, and be more attractive from the point of view of the commercial partner, when the collaboration happens in the commercial cloud. The success of efforts like Uber Cloud provides some validation of this. Most companies that would participate in such engagement either already have or are planning commercial cloud projects, and are likely more comfortable with such offerings than with using academic systems.

How to proceed

Making significant changes to priorities, and indeed to how we provision basic services, can seem daunting. It may not be clear how to get there from here, but some basic approaches and guidelines can help.

No need to do it all at once

This is a change that can and should be made incrementally. A team at a new, small “national site” can quite straightforwardly be trained to provide access to a slowly growing range of cloud resources. This can start as a modestly scaled pilot, expanding in response to researcher needs.

Make the hardware you own really count by advancing the mission

Many hardware needs are readily outsourceable, whether to commercial entities or by “buying in” with other academic R&D computing partners. Some resources, however, will likely stay in-house. The way to choose is to ensure that every decision to own rapidly-depreciating, expensive-to-operate equipment directly supports the mission of excellent research support and research skills development. In-house equipment should be significantly better at that mission than what can be procured from elsewhere. That may mean building cutting-edge infrastructure that is in itself publication-worthy, or buying still-prototype experimental systems to evaluate, and to build and share expertise on.

Use the right tools for the job

Helpdesk requests and software bug fixes are both short-term tasks that benefit from a “ticket tracking” approach: an issue is identified, someone fixes it and “closes” the ticket, and the faster the ticket is closed, the better the service was. That isn’t the right way to think about higher-level services like collaborations and knowledge transfer, and using the tools for one to manage interactions like the other distorts both the tool and the interactions. Consulting firms use case management software, not ticket trackers, to manage engagements, and judge success by the effectiveness of the collaboration rather than the duration of the engagement. Since interactions with researchers are vitally important to the success of the mission, the best available case management software (and helpdesk software where appropriate) should be used.

Make the expertise really count by building a unified national team

Once the right tools are in place, other lessons can be learned from successful consultancies. The most successful collaborations will combine staff from across the country with the appropriate expertise, and staff who are local to the researcher. To achieve that, the computational experts across the country must be able to find each other, self-assemble into teams as needed, and collaborate seamlessly. While the technical infrastructure for this exists, the organizational incentives still push staff at a site to support primarily “their” researchers. Such siloing is completely counter to supporting national research.

Summary

The goal of a research computing support platform - any research support resource, really - is to enable research, and to help develop the next generation of research talent. With that primary mission in mind, the reasons for focussing the time and effort of computational science experts on collaboration and skills development rather than operating commodity hardware could not be clearer:

  • Collaboration across disciplines - domain science and computational/data expertise - enables better Canadian research;
  • Computational and data skills maintain their value, while hardware rapidly depreciates; and
  • Building a critical mass of expertise and talent focussed on emerging data science and computational methods will strengthen Canadian competitiveness not just in research but in innovation.

There are costs to this approach; it will cost somewhat more to have someone else run much of that hardware. But even those costs have upsides:

  • Cloud provides more flexibility for rapidly changing research; capability mixes and system configurations can be changed much faster than hardware procurement cycles;
  • Commercial cloud infrastructure provides much better uptime and currency for researchers;
  • Both the computational experts and the research trainees benefit from gaining extremely relevant cloud expertise that will benefit them in any future career; and
  • Industrial engagement will be much more straightforward around commercial cloud providers than academic infrastructure.

The prospect of moving to such a different service model may seem daunting, but it needn’t be:

  • Move one step at a time, starting with a new, small “national site” offering a growing collection of cloud resources;
  • Not all hardware can be outsourced; make what you do retain an ownership stake in count by ensuring it is best-in-class, enables experimentation and the development of new approaches, or otherwise advances the mission in ways that renting cannot;
  • Choose the best possible tools for staff/researcher interactions; and
  • Build the best possible computational science team by having them collaborate internally, as well, and ensuring researchers and trainees get the most relevant help and collaboration possible.

These changes will not be easy; they will require participation from funders, staff, researchers, and all stakeholders. But the research computing world of today is not that of the 1990s, and how we support computational research should take advantage of that.
