Comments on: DNA Sequencing: Not Quite HPC Yet https://www.nextplatform.com/2015/03/03/dna-sequencing-not-quite-hpc-yet/

By: biostructure https://www.nextplatform.com/2015/03/03/dna-sequencing-not-quite-hpc-yet/#comment-88003 Tue, 14 Feb 2017 07:51:44 +0000 http://www.nextplatform.com/?p=240#comment-88003 Quite agree. Not only sequencing: biotechnology has brought so many changes and improvements to our lives.

By: Glenn K. Lockwood https://www.nextplatform.com/2015/03/03/dna-sequencing-not-quite-hpc-yet/#comment-873 Wed, 08 Apr 2015 06:30:57 +0000 http://www.nextplatform.com/?p=240#comment-873 In reply to Jonathan.

Thank you for taking the time to craft such a thoughtful response, Jonathan.

I was not trying to make a case that genomics needs to conform to traditional HPC; technology exists to serve the needs of science, not vice versa. Rather, the point I was trying to elaborate is that the current computational demands coming from DNA sequencing are fundamentally different from traditional HPC problems. Cyberinfrastructure optimized to address the needs of bioinformatics, that is, optimized for data-intensive problems, looks very different from architectures optimized for compute-intensive problems, and while wedging these data-intensive bioinformatics problems into compute-optimized architectures is possible, the result is an exercise in underutilizing resources[1].

“There’s no big data analysis pipeline anywhere in any field that works as one monolithic process – including some pipelines we’re happy to claim as being part of the HPC community when it suits us, like that for CERN LHC data, or for upcoming astronomical experiments like SKA or ALMA.”

You raise a very interesting point here: bioinformatics operates around data analysis pipelines. I admittedly was thinking more along the lines of how traditional HPC codes have integrated a multitude of features to create data generation pipelines. Consider my LAMMPS example: it is a single application that allows you to orchestrate complex computational experiments by creating an atomic lattice, relaxing it, deforming it, heating it, ablating it, and whatever else you could imagine. All of this can be done by writing a simple input recipe that describes every operation to be performed.

These codes are not modularity or maintainability nightmares, and they only have to run the parts of the recipe that are relevant. There is no unnecessary recomputation, and moreover, there is no compulsory flushing of data to disk between every single step of the recipe as in bioinformatics pipelines. By integrating a number of tools into a parallel framework that handles domain decomposition and parallel bookkeeping for the researcher, both performance and productivity benefit.
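To make the contrast concrete, here is a deliberately toy Python sketch (the stage functions and file names are placeholders, not real bioinformatics tools): the first function mimics a pipeline of discrete applications, where every intermediate result is flushed to disk and re-read by the next tool; the second applies the same steps to in-memory state, the way a recipe-driven code like LAMMPS does.

```python
import json
from pathlib import Path

# Toy stand-ins for pipeline stages (think align, sort, call, ...).
def stage_a(records):
    return sorted(records)

def stage_b(records):
    return [r for r in records if r % 2 == 0]

def stage_c(records):
    return {"count": len(records), "max": max(records)}

def staged_with_flushes(records):
    """Pipeline of discrete tools: every stage round-trips through disk."""
    Path("stage_a.json").write_text(json.dumps(stage_a(records)))
    a = json.loads(Path("stage_a.json").read_text())
    Path("stage_b.json").write_text(json.dumps(stage_b(a)))
    b = json.loads(Path("stage_b.json").read_text())
    return stage_c(b)

def integrated_recipe(records):
    """Integrated recipe: the same steps applied to in-memory state."""
    state = records
    for step in (stage_a, stage_b, stage_c):
        state = step(state)
    return state

print(staged_with_flushes([5, 2, 8, 3, 6]))  # identical results,
print(integrated_recipe([5, 2, 8, 3, 6]))    # very different I/O behavior
```

The arithmetic is beside the point, of course; at genomic scale those intermediate files can run to hundreds of gigabytes, and the round-trips through the filesystem dominate the cost.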

Of course, the sort of interactive analysis you describe, probing the same set of data from different angles, benefits less from this level of application integration. In addition, the standard approaches to processing genomic data haven’t stabilized to the degree that the methods baked into LAMMPS have, and this is a perfectly reasonable explanation for why highly integrated bioinformatics codes haven’t become the norm. This all points to what I was trying to convey in the article: DNA sequencing isn’t HPC…yet.

“There is much more actively developed, well-architected bioinformatics code on GitHub where anyone can file issues and contribute than there is, say, for fluid dynamics or astrophysics or particle physics.”

My position on the quality of bioinformatics codes is undoubtedly colored by my personal experiences. While there are a number of high-quality codes out there, there are a larger number of widely used codes that speak to my point. A few anecdotes:

1. Perhaps the most popular structural variant (SV) caller today ships with a Makefile that hard-codes its author’s home directory as the location of a dependent library.
2. The de facto universal way to represent aligned short reads, the BAM format, is not a supported output format for the most popular short read aligners, BWA and bowtie2. These aligners still emit alignments only as ASCII text (SAM), leaving users to pipe the output through a separate tool to get compressed, binary BAM; the usual workaround is sketched after this list.
3. Try to sight-read the core computational kernel of BWA to get an idea of the clarity of leading codes in the field. Its naming conventions remind me of a time when variable names could not be longer than six characters.
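For completeness, the workaround for point 2 looks roughly like the following Python sketch: stream the aligner’s SAM text output straight into samtools to get a sorted BAM without an intermediate SAM file on disk. The file names are illustrative, the flags are quoted from memory, and both tools are assumed to be on PATH with the reference already indexed.

```python
import subprocess

# Pipe bwa's SAM output directly into samtools to produce a sorted BAM.
bwa = subprocess.Popen(
    ["bwa", "mem", "ref.fa", "reads.fq"],             # illustrative inputs
    stdout=subprocess.PIPE,
)
subprocess.run(
    ["samtools", "sort", "-o", "aligned.sorted.bam", "-"],
    stdin=bwa.stdout,                                  # read SAM from the pipe
    check=True,
)
bwa.stdout.close()
if bwa.wait() != 0:
    raise RuntimeError("bwa mem exited with an error")
```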

“Yes, running thousands of 1-core jobs is different from running a few 1000-core jobs, and that changes the requirements for the cluster a bit; but again, almost all LHC work consists of single-core jobs and I don’t see those projects being chided for it.”

Chiding a community for using single-threaded jobs is silly if that approach works well. However, the LHC project’s needs are self-identified as grid, not high-performance, computing. In this sense, the computing needs of DNA sequencing are quite similar to those of the LHC in that they represent high-throughput, not high-performance, workloads.

However, the LHC (and similar international physics experiments) benefits from having a small number of massive datasets that can be replicated across national data hubs and cached at the campus level to address the needs of all the researchers participating in the effort. By comparison, a given genome is more often examined by a very limited set of researchers, so grid-scale resources are rarely at the disposal of (or effective for) sequencing workloads.

“I get that it would be easier for the sysadmins if all that innovation would just stop, so they could install one package and be done with it, but again, it’s HPC’s job to support researchers’ needs, not vice versa.”

I’m not sure where this sentiment is coming from, as installing all manner of wacky software to support a variety of domains is a fundamental part of operational HPC. Rather, bioinformaticians are the ones tripping over their shoelaces and complaining loudest about this lack of standardization. And the larger problem is that the libraries that support these file formats, if they even exist, all come with their own nuanced interpretations of the “standard.” For example, there are as many different flavors of VCF files as there are VCF parsing libraries, and there’s no guarantee that the parser du jour for a given language can grok the VCF being output by a caller that, by the letter of the standard, is fully compliant.
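A minimal, hand-rolled sketch (not any particular library’s API) of parsing a VCF INFO column shows how quickly interpretation choices creep in, even for records that are compliant by the letter of the spec:

```python
def parse_info(info_field):
    """Parse a VCF INFO column into a dict.

    Even this trivial parser embeds choices that differ between real
    libraries: whether '.' means the field is missing, how to treat flag
    keys with no '=', and whether multi-valued entries get split on ','.
    Two parsers that choose differently will disagree on the same file.
    """
    info = {}
    if info_field == ".":              # INFO column entirely absent
        return info
    for entry in info_field.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            info[key] = value.split(",") if "," in value else value
        else:
            info[entry] = True         # a bare flag such as DB or SOMATIC
    return info

print(parse_info("DP=14;AF=0.5,0.25;SOMATIC"))
# {'DP': '14', 'AF': ['0.5', '0.25'], 'SOMATIC': True}
```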

“This research community isn’t dumb. If they’re reluctant to use common HPC tools, like say MPI, there’s probably a reason for that.”

I have to disagree on this point. I see only technical and productivity disadvantages in implementing multithreaded loops with sloppily written pthreads instead of OpenMP. For example, pthreads offers no portable way to control performance-critical details such as thread affinity (only non-standard extensions like pthread_setaffinity_np), whereas OpenMP has now standardized it. In fact, OpenMP is a perfect example of an API that moves HPC in the right direction in a way not dissimilar from high-productivity languages like Chapel. The fact that bioinformatics codes continue to eschew these advancements is a testament to the progress that remains to be made in the sequencing industry relative to the rest of computational science.

“And that, for HPC, is the danger. If we keep telling genomics researchers that what they’re doing isn’t HPC, eventually they’ll have no choice but to believe us – and to move to communities and technology stacks which do meet their needs.”

This is a curious statement: HPC does not need to be sold to any scientific community; it either enables scientific advancement or it doesn’t. The purpose of this article was simply to describe the state of the world, and that state is that DNA sequencing workloads do not currently require HPC. If extra-HPC technologies best suit the needs of the sequencing industry, then there’s no point in arguing otherwise. For the same reason, though, the day a major core facility adopts ADAM whole-hog is the day I’ll eat my hat; as you said, there’s probably a reason it hasn’t taken off. But that is a discussion for another day.

By: Jonathan https://www.nextplatform.com/2015/03/03/dna-sequencing-not-quite-hpc-yet/#comment-834 Tue, 07 Apr 2015 16:24:16 +0000 http://www.nextplatform.com/?p=240#comment-834 Glenn:

I’ve read many of your pieces in a number of venues; they’re always well thought out and researched. But this one manages to be carefully researched and yet still kind of off-target.

It’s not genomics researchers’ responsibility to support existing HPC setups – which are supposed to be in the service of researchers, not vice versa. Fitting nicely into a particular technology stack is something that may be to their benefit, or not, and the growing consensus in the field is that in this case, it’s not. The issue here is less about genomics not being ready for HPC and more about the HPC community not being ready for genomics. And I worry that articles like these make that second point more strongly than the first.

Let’s work backwards:

“In reality, the end-to-end process of turning raw data into high-quality aligned mappings and called variants should be recognized as a single logical process and optimized as such. The current method of running a dataset through pipelines of discrete applications (each with their own idiosyncrasies) is fundamentally inefficient.”

I’m an ex-simulation guy myself, coming from astrophysics, through HPC, and now in genomics. So I understand where the above sentiment comes from – but it’s madness. There’s no big data analysis pipeline anywhere in any field that works as one monolithic process – including some pipelines we’re happy to claim as being part of the HPC community when it suits us, like that for CERN LHC data, or for upcoming astronomical experiments like SKA or ALMA. Besides it being a modularity and maintainability nightmare, and resulting in unnecessary re-running of parts of analyses all the time, it just would not help. Different stages of the data analysis process are just fundamentally different, and have different parallelism, memory, and data access requirements; and it often makes sense to run the same stage of an analysis in a number of different ways, if only to demonstrate robustness of novel results (which are much more common in a new field like genomics and bioinformatics than in a much more mature one like physics), or to tackle different sorts of data (like one where a reference genome is available vs where one isn’t).

The “software quality remains low” bit is just scurrilous, and it conflates two completely different issues. First, I disagree with the basic premise. Bioinformatics and genomics seem to have professionalized their software/tool-building community very quickly, and as far as I can tell they are certainly well ahead of where much of physics is. There is much more actively developed, well-architected bioinformatics code on GitHub where anyone can file issues and contribute than there is, say, for fluid dynamics or astrophysics or particle physics.

And this is a completely separate issue from the fact that there are a bunch of end users using single-core Python or R scripts to do some final-stage analysis. This is great, and again isn’t that different from LHC data, or (say) re-analyses of the Millennium Simulation, a huge astrophysics “hero calculation” particle simulation which made its results available, generating hundreds of novel (and much less compute-intensive) analyses. This happens, and it’s a feature, not a bug – a big experiment or simulation produces a big data set which can then be interrogated by lots of end users using lots of approaches which are experimental (“Hmm, let’s see what happens if we look at this”) and so are mostly one-offs unless they prove interesting enough to run again and again. Sure, a lot of the one-offs are crummy, in whatever language, but it’s a sign of a healthy ecosystem where a lot of great questions are being asked for the first time – and that’s how science gets done. Yes, running thousands of 1-core jobs is different from running a few 1000-core jobs, and that changes the requirements for the cluster a bit; but again, almost all LHC work consists of single-core jobs and I don’t see those projects being chided for it.

A lot of the other issues are really just different aspects of the same issue. Lack of data standards – the fact that new technologies are coming online which give new data through different methods – is the sign of an exciting field with lots of growth. I get that it would be easier for the sysadmins if all that innovation would just stop, so they could install one package and be done with it, but again, it’s HPC’s job to support researchers’ needs, not vice versa. We don’t see this so much in other fields, as they’re not evolving so rapidly; but “we’ll help you out once you stop discovering stuff so quickly” is not an answer we should be giving these researchers. Didn’t we get into this field to support exciting new research?

It’s absolutely true that this large number of small jobs analyzing data puts pressure on typical HPC filesystems, which weren’t designed for this use case. Fine. But it’s not at all clear to me that this mismatch is best addressed on the genomics researchers’ side by completely changing how they do research. There are filesystems which support this sort of use every day. And that brings me back to the bit about “sidestepping common HPC tools.”

This research community isn’t dumb. If they’re reluctant to use common HPC tools, like say MPI, there’s probably a reason for that. MPI is a 25-year-old technology which is holding the HPC community back. It was architected in a different era, for very specific use cases. For things like distributed hash tables, which are very handy for bioinformatics, it’s just an incredibly poor tool for the job. The same goes for typical HPC parallel filesystems; there’s a reason that many bioinformatics centres pay good money for Isilon instead of starting up Lustre. It’s not naiveté; it’s a clear-eyed assessment of what works for their problems.

There are other parallel filesystems – and parallel compute frameworks – which can potentially work well for bioinformatics. But they’re not from within the HPC community. You can see this when more and more bioinformatics is being done on Amazon, largely in serial-farming mode; but projects like ADAM, on Spark and HDFS, could make real inroads into truly parallel computations on parallel filesystems that work well for the purpose – completely outside of, and without the help of, the traditional HPC community.
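To be clear about what that alternative stack looks like: the sketch below is not ADAM’s actual API, just a toy illustration of the kind of Spark-on-HDFS computation being alluded to. It bins mapped reads from a plain-text SAM file into 10 kb windows; the paths are hypothetical, and a real system would use a compressed, splittable format rather than raw SAM.

```python
from pyspark import SparkContext

sc = SparkContext(appName="toy-coverage")

# One tab-separated SAM record per line; header lines start with '@'.
sam = sc.textFile("hdfs:///data/sample.sam")             # hypothetical path
records = (sam.filter(lambda line: not line.startswith("@"))
              .map(lambda line: line.split("\t")))

def to_bin(fields):
    # RNAME is the 3rd SAM column, the 1-based POS is the 4th.
    rname, pos = fields[2], int(fields[3])
    return ((rname, pos // 10000), 1)

coverage = (records.filter(lambda f: f[2] != "*")        # drop unmapped reads
                   .map(to_bin)
                   .reduceByKey(lambda a, b: a + b))

coverage.saveAsTextFile("hdfs:///results/coverage_bins")
```

Here reduceByKey does the shuffle-and-aggregate step that would otherwise have to be hand-rolled over MPI, which is exactly the productivity argument being made.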

And that, for HPC, is the danger. If we keep telling genomics researchers that what they’re doing isn’t HPC, eventually they’ll have no choice but to believe us – and to move to communities and technology stacks which do meet their needs.

By: Patrik D'haeseleer https://www.nextplatform.com/2015/03/03/dna-sequencing-not-quite-hpc-yet/#comment-52 Thu, 12 Mar 2015 00:41:59 +0000 http://www.nextplatform.com/?p=240#comment-52 There are definitely a few applications that are far more compute-intensive than simply mapping sequence variants to a known reference. De novo assembly of large metagenomes is a good example, especially if you have dozens of related metagenomes (such as a metagenomic time series, or different enrichment cultures starting from the same inoculum) that you would like to co-assemble. Ideally, you’d use a cluster with a very large shared memory for a job like that, but that’s not an architecture HPC has focused much on.

But yeah, most of the time we’re doing things that might have qualified as HPC ten years ago, but fit on a hefty desktop machine or small cluster today.

Most problems in biology are not as embarrassingly parallelizable as the physics codes that HPC systems are usually designed for, though. They tend to involve a lot more heterogeneous data and complex interactions. So the bottleneck rarely lies in how much data we have, or how many CPU cycles we can throw at a problem.
