As you might know, before I got back into the ed biz and working with small data, I used to do big data with a high performance computing (HPC) project that was a joint effort between the University of Tennessee and the Department of Energy’s Oak Ridge National Laboratory (ORNL). That was long enough ago that my “been there, done that, got the t-shirt” t-shirt features Jaguar, which, at a peak performance of under 2 petaflops, is not only retro but also remarkably quaint. Soon we will have petascale performance in our phones! (OK, probably not.)

Aside: When Titan (the system that preceded Summit) was being provisioned, some people in HPC were super-enthusiastic about GPUs and other accelerator technologies (such as the Intel Xeon Phi). Others referred to any capabilities provided by technologies other than traditional x86 cores as “crap flops.” This reminded me of discussions I heard in the 1990s, when I worked for a different government department with an interest in HPC, and people were freaking out about having to take the work they had been doing on vector processor machines (like the Cray C90) and figure out how to get it done on a parallel machine (at that time, the Cray T3D).

But now we are on the road to the exascale. If you read anything about the exascale 5–10 years ago, you would have learned that this was an insurmountable challenge. Exascale systems would use more energy than a thousand suns. Exascale systems would need a constant tsunami of ice water to keep the energy of these thousand suns from causing them to burst into flames. There would be so many bits traveling through an exascale system that the mysterious subatomic particles emitted by the one regular sun would randomly and mischievously flip some of them, requiring sophisticated on-chip error-correcting hardware that itself requires the power of still a few more suns.
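
Those bit flips are a real concern, and the basic idea behind error correction is simple even if the hardware is not: store a few extra parity bits alongside every word so that a single flipped bit can be located and repaired. Below is a toy Python sketch of a Hamming(7,4) code, purely for intuition; the function names are made up for this post, and real on-chip ECC looks nothing like a couple of Python loops.

```python
# Toy illustration only: a Hamming(7,4) code stores 4 data bits plus 3 parity
# bits, which is enough to locate and fix any single flipped bit. Real ECC
# hardware is far more elaborate; this is just to show the idea.

def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4              # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4              # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4              # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(code):
    """Locate and flip a single corrupted bit, then return the 4 data bits."""
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]   # recompute the parity checks
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s3 = code[3] ^ code[4] ^ code[5] ^ code[6]
    syndrome = s1 + 2 * s2 + 4 * s3              # 0 means no error; otherwise
    if syndrome:                                 # it names the bad position
        code[syndrome - 1] ^= 1
    return [code[2], code[4], code[5], code[6]]

codeword = hamming74_encode([1, 0, 1, 1])
codeword[5] ^= 1                        # a cosmic ray flips one bit
print(hamming74_correct(codeword))      # -> [1, 0, 1, 1], good as new
```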

Apparently we have solved many of these problems, as Summit has been deployed to production, and from what I can tell, it lives in the same machine room where Titan, and Jaguar before it, stood. This is a room about the size of a grocery store, and you can comfortably fit over 300 cabinets in it. I haven’t checked a Google Maps satellite view of ORNL lately (nor do I know if there is a recent picture), but you can get a decent sense of how much cooling is being used from the size and number of air-conditioning compressors outside Building 5400.

My Facebook feed has lit up with articles about Summit. My relatives are emailing me articles about Summit. ORNL PR people have been putting a spin on things, and regular science journalists do not understand HPC as well as I do, so a lot of things about the story were confusing and did not make sense to me.

But here is what I have discovered:

  1. Summit is an IBM system and not a Cray system! This is mostly interesting for people who are watching the industry, as ORNL (and what I will subtly and euphemistically refer to as its collaborators) have been Cray shops for a really long time.

  2. The current version of Summit has a peak performance of 200 petaflops. That is, one-fifth of an exaflop.

  3. One of the press releases talks about how some genomics code achieved performance of 1.88 exaops. You will note that these are exaops, not exaflops. This calculation was not done exclusively with floating point numbers! From what I remember about GPUs and my limited experience with CUDA, this now makes perfect sense. So they haven’t yet been able to get LINPACK to achieve \(10^{18}\) floating point operations per second, but the genomics code doesn’t need that many floating point operations. Genomic data is fairly discrete. There are only four DNA bases, 64 possible codons, and 20 relevant amino acids. The most likely place you would need floating point numbers is when calculating scores or probabilities after you’ve compared various DNA sequences (there is a little Python sketch of this after the list).

  4. This also explains why the first application is in the biological sciences and not a simulation of an exploding star. The Department of Energy is really the Department of Nuclear Energy and Nuclear Bombs. A large fraction of its leadership computing power is devoted to simulating things blowing up. In public, they talk about this as astrophysics research, but a lot of the math and computing that goes into exploding stars also applies to other nuclear explosions. But simulating exploding stars requires genuine floating point precision, so those codes were never going to produce the most press-release-worthy operation counts. I think that the “energy” angle here is that we are pretending that bio-energy can help protect us from having to buy oil from people who we think are icky.

  5. I was also wondering why Rick Perry thought that it was OK to spend a very large number of millions of dollars on a computer whose debut calculation was in the biological sciences. This seems counter to the current trend of cabinet-level officials destroying their departments from within. Also, Rick Perry doesn’t seem like the kind of guy who is interested in understanding hidden Markov models and multiple sequence alignments. But then I remembered that Chinese machines have been dominating the Top 500 for a while. It looks like the top Chinese system from November 2017 peaked at about 125 petaflops, a bit more than half of Summit’s 200-petaflop peak. Odds are that Summit will take the top spot on the Top 500 list coming out later this month. And “USA NUMBER ONE! GO USA!” is certainly one of the current administration’s priorities.
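
To make the exaops point a little more concrete, here is a toy Python sketch of the sort of integer-heavy work that sequence comparison involves. This is emphatically not the code that ran on Summit, and the names (pack, mismatches, similarity) are mine; the point is just that the inner loop is bit twiddling, and floating point only shows up when the mismatch count is turned into a score.

```python
# Toy illustration only: two-bit encoding means four bases fit neatly into
# integers, and comparing sequences is mostly XOR and bit counting.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    """Pack a DNA string into a single integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODE[base]
    return value

def mismatches(seq_a, seq_b):
    """Count mismatching positions using only integer operations."""
    diff = pack(seq_a) ^ pack(seq_b)  # nonzero two-bit slots mark mismatches
    count = 0
    while diff:
        if diff & 0b11:
            count += 1
        diff >>= 2
    return count

def similarity(seq_a, seq_b):
    """The only floating point in sight: turn the mismatch count into a score."""
    return 1.0 - mismatches(seq_a, seq_b) / len(seq_a)

print(similarity("GATTACA", "GATTTCA"))  # -> 0.857...
```

Count every one of those integer operations across a machine the size of Summit and you can see how an exaops figure can run well ahead of an exaflops figure.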