Objections, Continued

Posted by Jonathan Dursi on April 09, 2015 · 19 mins read

This is a crosspost from Jonathan Dursi, R&D computing at scale. See the original post here.

Thanks for all of the comments about my HPC and MPI post, on the post itself, or on twitter, or via email. While many of the comments and discussions were positive, it won’t surprise you to learn that there were objections, too; so I thought I’d keep updating the Objections section in a new post. I’ve also posted one (hopefully last) followup.

But do keep sending in your objections!

Further Objections

You’re saying we’d have to rewrite all our code!

If someone had suggested I add this objection to the original list before publishing, I would have rejected it as too straw-man to use; I’d be transparently putting this objection up just to demolish it. Clearly, no one would actually claim that “the HPC community should urgently start engaging with and using new technical computing technologies” means “you have to burn all your old stuff to the ground”.

But sure enough, it came up frequently, in private email, and most dramatically, on twitter.

Even though this is by far the most common reaction I got, I hope it’s clear to most readers that these aren’t the same things. Learning (say) C++ and using it in the development of new codes doesn’t mean your old C and Fortran stuff stops working. Nor does it mean you’re under an obligation to take working code in other languages and re-write it all in the new language, before ever using it again, just to maintain some kind of computational moral consistency.

Your MPI code won’t stop working for you in a fit of rage because you’re seeing other frameworks. MPI will continue to work and be maintained, exactly because there is 20+ years’ worth of stuff using it.

But new software projects are being started every day, in every field, in every region. This argument is about what we should use for those codes. “Because we’ve always done it that way” isn’t a great reason for a community that’s supposed to be on the cutting edge of computing to keep doing things in one particular framework.

Big data and HPC are completely different, and it’s ridiculous to compare them

This was a close second in popularity. And this one worries me quite a bit, because it means that there are a lot of people in our community who are disturbingly unaware of what’s going on in computing and data analysis outside the confines of their own offices.

It’s absolutely true that there are Big-Data-y things that are mainly just I/O with a little bit of processing. But by and large people want to analyze that large amount of data, and then you end up with absolutely classic big numerical computing problems. To take an early example, PageRank is, after all, an eigenvalue problem. The drive behind next-generation big data platforms like Spark is in no small part about making machine learning algorithms that would be very familiar to us run as efficiently as possible. Let’s take some example machine learning approaches:

  • Spectral clustering requires solving an eigenvalue problem for the graph Laplacian - a sparse-matrix problem that looks exactly like what you’d get from a PDE discretized on an unstructured mesh. (Thanks to Lorena Barba for pointing out an embarrassing mistake in an earlier version of that point.)
  • Support Vector Machines are kernel-based methods, which involve Green’s functions and first-order integral equations.
  • Much of machine learning involves fitting a model, which means that there are entire books written about large-scale efficient optimization solvers for machine learning, including physical science chestnuts like gradient descent.
  • A common first step in data analysis is dimensionality reduction involving (say) PCA, requiring the SVD (or similar factorizations) of huge matrices.
  • In fact, linear algebra is omnipresent in machine learning (as it has to be, with so much of it boiling down to model fitting), to the point that there are entire conferences on large-scale linear algebra for machine learning.
  • A lot of the data analyses involve statistical Bayesian inference, requiring MCMC calculations.
  • k-Nearest-Neighbour problems in clustering, kernel density methods, and many other techniques that rely on something like a distance or similarity metric require classic N-body machinery like k-d trees; and if the positions are being updated, they essentially become N-body problems. And of course, an entire class of high-dimensional optimization methods often used in machine learning essentially amounts to tracer-particle methods.
  • As a result of all this mathematical intensity, machine learning is, of course, becoming a rapidly growing user of GPUs for its numerical algorithms.

So let’s see: PDEs on unstructured meshes, optimization, gradient descent, large-scale linear algebra, particle methods, GPUs. And of course, time-series data of any sort means FFTs. So sure, I don’t know what’s running on your HPC cluster, but is it really that different from the above?
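To make just the PageRank example concrete: in its standard formulation (the notation below is mine, not anything from the original discussion), the ranking vector is the dominant eigenvector of a huge sparse stochastic matrix - exactly the kind of problem HPC eigensolvers and power/Krylov iterations have chewed on for decades:

```latex
% PageRank as a sparse eigenvalue problem (standard textbook formulation).
% S      : column-stochastic link matrix of the web graph
% \alpha : damping factor, typically around 0.85
% \pi    : the PageRank vector, found in practice by power iteration
G = \alpha S + (1-\alpha)\,\tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\mathsf{T}},
\qquad G\,\pi = \pi, \qquad \pi \ge 0, \quad \lVert \pi \rVert_1 = 1 .
```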

MPI is great for physics, even if less great for the other stuff

I got this by email and on twitter several times.

Great compared to what? And based on what evidence?

Say a physics grad student who’s going to develop a small bespoke particle code for their dissertation walks into your office. Pointing them to MPI, rather than to other technologies with unimpeachable HPC bona fides like UPC, Chapel, Co-array Fortran, or (for a particle simulation especially) Charm++, seems like the lazy, easy way for us, and less about what’s actually best for them.

In what sense is it “great for physics” to have the student increase the amount of code they have to write and debug by a factor of three? In what sense is it great for them to have to re-invent all of the low-level communications algorithms that have already been implemented, better, in other packages? Maybe you could make an argument about stability or performance against UPC/Chapel (although I’d counter-argue that you’d get immediate and helpful support from the developers) - but what’s the argument against pointing the student to Charm++? Or to Intel’s CAF?

And this doesn’t even begin to cover things like Spark, Flink, or Ignite - for simulation, or experimental physics work (which is physics too, right?), which is necessarily heavy on data analysis.

You’re just saying MPI is too hard

I’m really not. As a community, we don’t mind hard. Solving complex equations is hard, that’s just how it is. We eat hard for breakfast. (And the genomics and big-data communities are the same way, because they’re also filled with top-notch people with big computational problems).

I’m saying something different: MPI is, needlessly and pointlessly, a huge sink of researcher and toolbuilder effort for little if any reward.

How many grad students have had to tediously decompose a 2d or 3d grid by hand, write the halo-exchange code, get it debugged and running, run in that crude fashion for a while, then tried moving to overlapped communication and computation and spent days or weeks getting that to work efficiently - and then had to re-write chunks when they needed a new variable laid out differently (or just implemented a really bad transposition) - and still gotten performance that an expert would consider poor?
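For anyone who hasn’t lived through it: even the most naive version of that first step - a 1-d decomposition with blocking MPI_Sendrecv calls, before any overlapping or re-layouts - already looks something like the sketch below (grid sizes and data layout here are invented for illustration):

```c
/* Minimal sketch: 1-d decomposition of a 2-d grid with a blocking halo
 * exchange via MPI_Sendrecv.  Sizes and layout are made up; a real code
 * would also need boundary conditions, error checking, and eventually
 * the non-blocking, overlapped version alluded to above. */
#include <mpi.h>
#include <stdlib.h>

#define NX 128   /* local interior rows per rank */
#define NY 128   /* columns */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* (NX+2) x NY local array: one ghost row above and one below */
    double *u = calloc((NX + 2) * NY, sizeof(double));

    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* send my top interior row up; receive my bottom ghost row from below */
    MPI_Sendrecv(&u[1 * NY],        NY, MPI_DOUBLE, up,   0,
                 &u[(NX + 1) * NY], NY, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send my bottom interior row down; receive my top ghost row from above */
    MPI_Sendrecv(&u[NX * NY],       NY, MPI_DOUBLE, down, 1,
                 &u[0],             NY, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... update interior points, repeat every timestep ... */

    free(u);
    MPI_Finalize();
    return 0;
}
```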

And regular grid codes are the easy stuff; how many scientist-decades’ worth of effort have gone into implementing and re-implementing tree codes or unstructured meshes, by and large resulting in efficiencies ranging from “meh” to “ugh”?

Wouldn’t it be better to have experts working on the common lower level stuff, tuning it and optimizing it, so that the scientists can actually focus on the numerics and not the communications?

The stuff about levels of abstraction isn’t some aesthetic philosophical preference. And I’m not complaining about MPI because it’s hard; I’m complaining about it because it has resulted in an enormous waste of researcher time and compute resources. Let the scientists focus on the hard stuff that matters to their research, not the stuff that can be effectively outsourced to toolbuilders.

Now, those of us at computing centres could at least improve this dreadful state of affairs, even with MPI, just by doing a better job of pointing researchers embarking on a code project to libraries and packages like Trilinos or what have you, and by not counseling them to write raw MPI code themselves. But of course, we normally don’t, because we keep telling ourselves and the incoming grad students that “MPI is great for physics”…

It’s important for students to know what’s going on under the hood, even if they’re using other frameworks

I do have some sympathy for this point, I will admit.

But anyone who thinks teaching generation after generation of grad students how to manually decompose a 2d mesh and do halo exchange on it using MPI_Sendrecv() is a productive and rewarding use of time, is someone who doesn’t spend enough time doing it.

As with other pro-low-level arguments: why is MPI automatically the right level to stop at? If we want to teach students how things really work under the covers, why aren’t we going all the way down to InfiniBand or TCP/IP, user mode and kernel mode, and the network stack? Or why don’t we stop a level or two above, draw some diagrams on a whiteboard, and move on to actually solving equations? Why is MPI in particular the right “under the hood” thing to teach, as opposed to GASNet, Charm++, or just pseudo-network-code?

If the answer to the questions above is “because MPI is what we know and have slides for”, then we need to think about what that implies, and how well it’s serving the research community.

But my new code will need libraries based on MPI that aren’t supported by Chapel/UPC/Spark/other stuff yet!

Fair enough. When you choose what you’re going to use to write a program, library and tool support really matter. It’s absolutely true that there are great packages that use MPI, and if your project is going to rely on them, then it isn’t a good project to start experimenting with a new platform on. This is why such a large fraction of numerical code stayed in FORTRAN77 for so long.

Co-array Fortran, Chapel, and others do have various degrees of MPI interoperability, so do check that out; but yes, you need what you need.

But people are starting to build things based on MPI-3 RMA!

This comment by Jeff on the original post is, by some measure, the most interesting objection I’ve heard so far.

People are legitimately starting to use MPI-3 RMA in the underlying implementations of higher level tools. If that really took off, then my arguments about MPI not being the right level of abstraction for toolbuilders would clearly be wrong, and a huge part of my post would be rendered irrelevant.

In that case, I would be completely wrong – and it would be awesome! A higher-level toolset for researchers could finally flourish, the lower level stuff could be handled by a completely separate group of experts, and MPI would have found its place.

I want to be clear that I think it would be fantastic - really, the best of all possible worlds - to be wrong in this way.

I’m going to describe why I really don’t think I am, and what the stumbling blocks are. Then I’ll discuss an alternate future which sidesteps the worst of those problems, and how it really could be a path to a very productive and growing HPC future - but it will of course never, ever, happen.

So MPI-3 - useful RMA, being used. Perfect! To see the problem that concerns me here, consider two questions: (1) what are the benefits of using MPI for this, and (2) what are the downsides?
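For concreteness, the kind of one-sided operation at issue looks roughly like the sketch below - generic MPI-3 RMA usage (a window allocation plus a passive-target put), not code taken from any particular project:

```c
/* Minimal sketch of MPI-3 one-sided (RMA) communication: each rank exposes
 * a window of memory, and another rank writes into it directly with a
 * passive-target MPI_Put - no matching receive on the target side. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *buf;
    MPI_Win win;
    /* every rank allocates and exposes a 1024-entry window */
    MPI_Win_allocate(1024 * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);

    if (rank == 0 && size > 1) {
        double payload = 42.0;
        /* write one double into slot 0 of rank 1's window */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Put(&payload, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(1, win);
    }

    /* in a real code, synchronizing with the target before it reads the
     * data needs more care than this crude barrier */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```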

On the upside, it’s great that MPI is sufficient to implement these tools. But is it necessary? What is the advantage of using something like MPI over something else, and in particular something lower level? Maybe it would be a little easier or a little harder, but would it make a big difference? Particularly to the end-user of the tool being built?

I doubt it makes much difference either way; the reason I ask is the downside.

MPI-3 RMA doesn’t come on its own; it’s part of MPI. And in this context, I’m concerned about two real downsides to using even the great parts of MPI for low-level toolbuilding. They’re related: the heavyweight Forum process, and the enormous baggage of backwards compatibility.

Let’s take the Forum process first. Say there are two competing tools you could use to build your next lower-layer tool: MPI-3 RMA, and some other low-level network abstraction layer. (What I’m picturing is something like OFWG Libfabric, which you can probably tell I’m quite taken with, but the exact choice isn’t really the point here - just something at roughly that level, or a little higher.)

You’re starting to build your new tool, which contains a number of really innovative ideas; but now you’ve discovered you need one additional feature in either package.

Which will get you there first?

The MPI Forum was really able to innovate with MPI-3 RMA because it was nearly starting afresh - or at least building something complementary to what had gone before. But now that MPI-3 is out, and a number of projects have used it, the spec is essentially encased in carbonite; the API, in its every last detail, will outlive us all. None of the existing APIs will change.

That’s ok, because the Forum has shown its willingness to add new functions to the spec when justified. Your case sounds interesting; you should get your answer in a couple of years or so.

And that’s kind of crazy for a low-level network abstraction layer. The other package - whatever it is - won’t have that sort of friction.

There’s another issue with new features: the backwards-compatibility legacy.

Let’s take something like fault tolerance, which is important at extreme scale - and will eventually become important at more moderate scales as well.

For a really low-level network abstraction, dealing with fault tolerance isn’t an enormous difficulty. For something higher-level like MPI-3 RMA, it’s more challenging, but it’s still something where one could imagine how it might go.

But for MPI-3+ to develop a feature like fault tolerance, it will have to be created in such a way that it integrates seamlessly with every single MPI feature that has ever existed, without altering the semantics of a single one of those calls. The backwards-compatibility requirements are crushing.

So this is sort of the tragedy of MPI-3 RMA. It’s a great thing that may have just come too late in the lifecycle of the project to have its full impact.

Let’s imagine a world where we could just shrug this stuff off. Let’s imagine a new framework – MPING, MPI++, whatever – which is a substantially pared-down version of MPI. It’s an MPI that has decided what it wants to be: a low-level layer for toolbuilders, never to be taught to grad students who are planning to write application software.

It contains only pared-to-the-bone versions of MPI-3 RMA, which is demonstrably being found useful; MPI collectives, which are fantastic; MPI-IO, which is also fantastic; and auxiliary stuff like the datatype-creation routines. The communications semantics for everything are greatly relaxed, which would confuse the heck out of newbie end users, but toolbuilders can deal with that. And there are no decades of backwards compatibility to fight with.
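The surviving pieces would all look familiar; something like the plain-MPI sketch below - a collective reduction plus a collective parallel write - is roughly the core such a layer would keep. (To be clear: this is ordinary MPI code shown for flavour; “MPING” itself is entirely hypothetical.)

```c
/* Illustration of the pieces a pared-down toolbuilders' layer would keep:
 * a collective reduction and a collective MPI-IO write.  Plain MPI code,
 * for flavour only. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* collectives: global sum of a per-rank value */
    double local = (double)rank, total = 0.0;
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* MPI-IO: every rank writes its value to its own slot in one shared file */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(double), &local, 1,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```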

This vision actually discourages me a bit, because it would be terrific: there’d be an active, vendor-supported, high-performance, productive network abstraction layer for toolbuilders, and no confusion about who it was for. We could build high-productivity tools for scientific application writing atop a stable, high-performance foundation.

And of course, it will never, ever, happen.