This is a crosspost from Jonathan Dursi, R&D computing at scale. See the original post here.
At the Simpson Lab blog, I’ve written a post on streaming vs. random-access I/O performance, an important topic in bioinformatics. Using a very simple problem (randomly choosing lines in a non-indexed text file), I give a quick overview of the file system stack and what it means for streaming performance, and of reservoir sampling for uniform random online sampling.
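To give a flavour of the reservoir sampling idea, here is a minimal Python sketch. It is not the code from the original post; the function name `reservoir_sample` and its parameters are made up for illustration, and it assumes the classic single-pass variant (often called Algorithm R), which keeps `k` lines in memory and reads the input strictly in streaming order.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(lines: Iterable[str], k: int,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Pick k items uniformly at random from a stream of unknown length.

    Illustrative sketch only (hypothetical helper, not from the original post).
    The stream is read once, front to back, so no index or seeking is needed.
    """
    rng = rng or random.Random()
    reservoir: List[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Item i (0-based) is the (i+1)-th seen; keep it with
            # probability k / (i + 1) by drawing a slot in [0, i].
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = line
    return reservoir
```

Because the function only iterates over its input, an open file handle can be passed directly, for example `reservoir_sample(fh, 10)` inside a `with open(...) as fh:` block, and the file is consumed as one sequential read. If the stream has fewer than `k` lines, all of them are returned.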