Hadoop For HPCers

Posted by Jonathan Dursi on September 04, 2014 · 1 min read

This is a crosspost from Jonathan Dursi, R&D computing at scale. See the original post here.

My colleague Mike Nolta and I have put together a half-day tutorial on Hadoop for an HPC audience, briefly covering HDFS, MapReduce, Pig, and Spark, and have put the materials on GitHub.

The Hadoop ecosystem continues to grow rapidly, and now includes tools like Spark and Flink that are very good for iterative numerical computation, whether simulation or data analysis. These tools, and the underlying technologies, are (or should be) of real interest to the HPC community. However, most materials are written for audiences with web application or perhaps machine-learning backgrounds, which makes it harder for HPC users to see how the tools could be useful to them and how they might be applied.

Most of the source code is Python. The GitHub repository includes all sources for the examples, a Vagrantfile for a VM to run the software on your laptop, and the presentation in Markdown and PDF. Feel free to fork, send pull requests, or use the materials as you see fit.