Sunday, December 22, 2024

Sustainable High Performance Computing

Frontier, the world’s top-ranked supercomputer, operated by the US Department of Energy (DOE) at Oak Ridge National Laboratory, draws approximately 20 megawatts (MW) in continuous operation. At a representative electricity cost of $0.20 per kilowatt-hour (kWh), or $200 per megawatt-hour, powering Frontier costs about $4,000 per hour. Keeping the numbers “round” at 10 kilohours per year (8,760 hours, to be more precise), the annual electricity bill is about $40 million.
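
For readers who like to check the arithmetic, a few lines of Python reproduce these figures (illustrative only; the power, price, and hours are simply the round numbers quoted above):

```python
# Re-deriving the electricity cost figures from the round numbers above:
# 20 MW continuous draw, $0.20 per kWh, and 8,760 hours in a year.
power_kw = 20_000                 # 20 MW expressed in kW
dollars_per_kwh = 0.20
hours_per_year = 8_760

hourly_cost = power_kw * dollars_per_kwh      # about $4,000 per hour
annual_cost = hourly_cost * hours_per_year    # about $35M; ~$40M with "10 kilohours"

print(f"${hourly_cost:,.0f} per hour, ${annual_cost / 1e6:.0f}M per year")
```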

The carbon footprint to deliver a kilowatt-hour of electricity is about 0.5 kg CO2-equivalent (CO2e), which implies a 10,000 kg CO2e hourly carbon footprint for Frontier, or about 100,000 metric tons CO2e annually.  This is equivalent, in turn, to about 20,000 typical passenger cars driven at average fuel efficiency for an average distance over the course of a year in the USA.
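
The same back-of-envelope style gives the carbon figures. The grid intensity is the 0.5 kg CO2e per kWh quoted above; the per-car figure of roughly 4.6 metric tons CO2e per year is an assumed value for a typical US passenger car, used here only to reproduce the comparison:

```python
# Carbon back-of-envelope for a Frontier-class system.
power_kw = 20_000
kg_co2e_per_kwh = 0.5                 # grid carbon intensity from the text
hours_per_year = 8_760
tons_co2e_per_car_year = 4.6          # assumed per-car annual footprint (illustrative)

hourly_kg = power_kw * kg_co2e_per_kwh             # 10,000 kg CO2e per hour
annual_tons = hourly_kg * hours_per_year / 1_000   # ~88,000 t; ~100,000 in round numbers
cars = annual_tons / tons_co2e_per_car_year        # roughly 20,000 cars

print(f"{hourly_kg:,.0f} kg CO2e/hr, {annual_tons:,.0f} t CO2e/yr, ~{cars:,.0f} cars")
```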

For all of its consumption, Frontier is an exceptionally efficient machine.  Its performance of more than 1 exaflop/s (1,000,000,000,000,000,000 floating point operations per second) at 20 MW puts it at the top of the “Green500” ranking of supercomputers, at 50 gigaflop/s per watt.  This performance efficiency is not exceeded by any other of the top 500 registered supercomputers in the world.
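
The Green500 figure of merit is simply sustained flop/s divided by power draw; with the round numbers above:

```python
# Performance efficiency implied by the figures above.
flops_per_second = 1.0e18      # 1 exaflop/s
watts = 20.0e6                 # 20 MW
print(flops_per_second / watts / 1e9, "gigaflop/s per watt")   # 50.0
```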

A 10% further improvement would save $4M/year and take 2,000 cars off the road.  (In practice, it would instead allow 10% more science to be done at the same cost. The goal is not to employ these expensive machines at less than their peak capability, but to get more scientific service per kW-hr.)  A 10X improvement would save $36M/year and take 18,000 cars off the road.  The good news is that such additional efficiencies are often readily available without changing the hardware – from better software.  Indeed, between the November 2022 and May 2023 Top500 rankings, Frontier’s performance on a dense matrix algebra benchmark was improved by 10% through software alone, while delivering the same result – exact to the high arithmetic precision of the hardware.
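
One way to reproduce these savings is to read a “10% improvement” as 10% less energy for the same science, and a “10X improvement” as a tenfold reduction, i.e. 90% less energy, applied to the round annual baseline above:

```python
annual_cost_musd = 40.0       # round annual electricity bill, in $M
car_equivalents = 20_000      # round car equivalence from above

for label, fraction_saved in [("10% improvement", 0.10),
                              ("10X improvement", 1.0 - 1.0 / 10.0)]:
    print(f"{label}: ${annual_cost_musd * fraction_saved:.0f}M/yr saved, "
          f"{car_equivalents * fraction_saved:,.0f} cars off the road")
```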

Meanwhile, 10X improvement in another dense matrix algebra task was realized by my group by sacrificing some of that precision – so little that the statisticians applying our results were led to identical conclusions. The resulting geospatial statistics climate application earned us a finalist position for the 2022 Gordon Bell Prize, awarded every November for the past 35 years at the international Supercomputing Conference for breakthroughs in performance on real applications, as opposed to benchmarks. Now that “science per KW-hr” is a matter of planetary stewardship, many algorithmic solutions of this nature are coming into prominence – and none too soon, since Aurora, which is expected to bypass Frontier as the world’s most powerful computer later this year, currently being installed at Argonne National Laboratory, is expected to draw approximately 60 MW! While these US DOE systems are impressive in the computing capability that their many millions of coordinated cores can deliver to a single application, some cloud service centers operated by hyperscalers such as Amazon, Apple, Google, Meta, and Microsoft now surpass 100 MW.
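
To make the precision trade-off just described concrete, here is a toy sketch (entirely synthetic, and not the geospatial application itself): for a well-conditioned covariance matrix, the log-determinant that enters a Gaussian log-likelihood computed from a single-precision Cholesky factorization agrees with the double-precision value to many more digits than typical statistical conclusions depend on.

```python
import numpy as np

rng = np.random.default_rng(0)

# A synthetic, well-conditioned covariance matrix standing in for the
# (much larger and more structured) geospatial covariances of the application.
n = 2000
A = rng.standard_normal((n, n)) / np.sqrt(n)
cov = A @ A.T + np.eye(n)              # symmetric positive definite

def logdet_via_cholesky(mat):
    """log det of an SPD matrix computed from its Cholesky factor."""
    L = np.linalg.cholesky(mat)
    return 2.0 * np.sum(np.log(np.diag(L)))

ld64 = logdet_via_cholesky(cov.astype(np.float64))
ld32 = logdet_via_cholesky(cov.astype(np.float32))

print(f"double precision: {ld64:.6f}")
print(f"single precision: {ld32:.6f}")
print(f"relative difference: {abs(ld64 - ld32) / abs(ld64):.1e}")
```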

The information and communication technologies (ICT) economic sector has emerged as a rapidly growing contributor to humanity’s carbon footprint, with estimates that by the end of the decade it could consume up to 20% of the world’s electricity.  This fraction depends on how fast other electricity-hungry sectors, such as electric vehicles, grow, but the growth in demand for ICT appears inexorable, which puts a premium on computing more sustainably.

As evidenced by Frontier, computer architects have been doing their share. Over the past decade, the computational efficiency of the most powerful system on the Green500 list has improved by 15X. It remains for algorithm designers and users to do theirs.  Our introduction to improving efficiency at the expense of tunable sacrifices in accuracy began in earnest in 2018, when a matrix algorithm variant called “tile low rank” (TLR) replaced the standard algorithm in one of our codes.  The TLR variant hunts for “data sparsity” throughout the matrix and replaces blocks that are originally “fat” and dense with the product of two “skinny” matrices.  The compressibility of individual blocks, up to a specified sacrifice of accuracy in reproducing the original, varies from none at all for some blocks to nearly complete for others.  After this compressive transformation, TLR looks bad by three traditional algorithmic figures of merit: (1) it achieves a lower percentage of peak flop/s for given hardware, (2) it has poorer balance of load between cores handling different parts of the matrix in parallel, and (3) it scales less efficiently because the amount of computation is reduced relative to data motion.  However, it can achieve an acceptable result for our examples in up to 10X less time and at 65% of the average power over that reduced elapsed time – a nearly 15X improvement in “science per kW-hr.”  This matches the ten years of architectural improvements that currently culminate in Frontier – all from a short exercise of reprogramming.
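
By way of illustration (a truncated SVD stands in here for whatever rank-revealing machinery a production TLR library would actually use, and the kernel and point clusters are invented for the example), the sketch below compresses a single data-sparse off-diagonal tile into the product of two skinny factors at a requested accuracy and reports the storage savings:

```python
import numpy as np

def compress_block(block, tol):
    """Replace a dense tile with skinny factors U, V such that block is
    approximated by U @ V.T to a relative accuracy of tol, in the spirit
    of tile low rank (TLR) compression."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    k = max(int(np.sum(s > tol * s[0])), 1)    # rank needed for the accuracy target
    return U[:, :k] * s[:k], Vt[:k, :].T       # two m-by-k ("skinny") factors

# A smooth kernel evaluated between two well-separated point clusters yields
# a tile that compresses well (an illustrative setup, not the application).
rng = np.random.default_rng(1)
m = 512
x = rng.uniform(0.0, 1.0, m)                   # source cluster
y = rng.uniform(5.0, 6.0, m)                   # well-separated target cluster
tile = 1.0 / (1.0 + np.abs(x[:, None] - y[None, :]))

U, V = compress_block(tile, tol=1e-8)
err = np.linalg.norm(tile - U @ V.T) / np.linalg.norm(tile)
print(f"rank kept: {U.shape[1]} of {m}")
print(f"storage:   {2 * m * U.shape[1]:,} vs {m * m:,} entries")
print(f"rel. error: {err:.1e}")
```

When most tiles of a matrix compress this way, the factorization that follows operates on the skinny factors rather than the dense tiles, which is where the savings in time and energy originate.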

In 2003, I edited a two-volume report for the US DOE, titled “The Science-based Case for Large-Scale Simulation” (SCaLeS), to which 315 of its leading computational scientists contributed.  Scientists from disciplines ranging from nuclear fusion energy to combustion to astrophysics to materials science and more were asked to plot the improving capability of their codes over the same period that the exponential improvement in transistors per unit area known as “Moore’s Law” had been in effect – since 1965.  Across all fields, we found that gains in capability per unit of computation, punctuated by sharp bursts from algorithmic breakthroughs, rode above the steady increase coming from architecture.  This phenomenon continues today, even as the rate of improvement from Moore’s Law has begun to fade.  We gave it the provocative name “Algorithmic Moore’s Law.”

And what of one of the fastest growing consumers of computational cycles today – the training of neural networks, especially the spectacularly large networks known as generative pretrained transformers (GPTs), which can contain trillions of tunable parameters and require comparably large datasets – ingesting the likes of Wikipedia and the digitized corpus of English literature?  We believe that many algorithmic breakthroughs await that can deliver networks of equal predictive power while consuming orders of magnitude less electrical power along the way.

As computational infrastructure demands a growing share of research budgets and global energy expenditure, addressing the need for greater efficiency is paramount.  The high performance computing community has historically excelled at this in three respects: better hardware, better algorithms, and redefining the actual outputs of interest in applications to avoid the “oversolving” beyond application requirements that is characteristic of many traditional methods.  Indeed, the growing gap between ambitious applications and austere architectures must increasingly be spanned by adaptive algorithms.  Sustainable high performance computing is a transdisciplinary effort requiring all hands on deck!
