Exascale Computing and the National Strategic Computing Initiative (2020 DOE transition)

Book 2 - Issue Papers

As of October 2020

It is critical to national security and economic competitiveness to maintain the Department of Energy’s Exascale Computing Initiative. The July 2015 Executive Order 13702 established the National Strategic Computing Initiative (NSCI) and identified DOE as one of the lead agencies. The NSCI called upon the DOE Office of Science (SC) and DOE National Nuclear Security Administration (NNSA) to “execute a joint program focused on advanced simulation through a capable exascale computing program emphasizing sustained performance on relevant applications and analytic computing to support their missions.”

  • Over the past six decades, U.S. computing capabilities have been maintained through continuous research and the development and deployment of new computing systems with rapidly increasing performance on applications of major significance to government, industry, and academia. Maximizing the benefits of High Performance Computing (HPC) in the coming decades will require an effective national response to increasing demands for computing power; emerging technological challenges and opportunities; and growing economic dependency on and competition with other nations. This national response will require a cohesive, strategic effort within the Federal Government and a close collaboration between the public and private sectors.
  • In 2016, DOE initiated research and development activities to deliver at least one exascale (10¹⁸ operations per second) computing capability in calendar year 2021, with two other DOE exascale systems delivered in the 2022-2023 timeframe. This activity, referred to as the Exascale Computing Initiative (ECI), is a partnership between SC and NNSA that addresses DOE’s science and national security mission requirements.

Issue(s)

In early summer 2020, Japan overtook the U.S. on the TOP500 list, which ranks the world’s most powerful high performance computers, with the deployment of its 415-petaflop Fugaku system. “Flops” (floating-point operations per second) are the elementary unit of computational power: one flop corresponds to one calculation. One petaflop is one quadrillion (one thousand trillion, or 10¹⁵) flops, and one exaflop is one thousand petaflops (10¹⁸). Recognizing the importance of HPC to economic competitiveness, nations in Europe and Asia, particularly China, continue to invest in HPC. The Chinese strategy is increasingly to base their HPC systems on domestic technology, and China continues to lead the U.S. in the number of systems on the TOP500 list. On the June 2020 TOP500 list, China has 226 systems versus 114 for the U.S. By all significant measures – top-ranked system, total number of supercomputers in the TOP500, aggregate total computing power, and software capable of sustained performance – China now dominates the U.S. in supercomputing. In addition, China is investing heavily in its domestic production capabilities and future computing technologies, such as quantum computing, neuromorphic computing, and artificial intelligence (see definitions below). China also has three exascale machines in the pipeline: a Sunway system in Jiangnan targeted for 2020, a NUDT system in Tianjin targeted for 2021, and a Sugon system in Shenzhen targeted for 2022. The Chinese have an advantage in that they are not held back by an installed base that requires backward compatibility; with no need to “play it safe,” their design space is open-ended, ranging from the conventional to the exotic. However, in the past two years there have been no announcements of new Chinese systems in the TOP500.

Currently, within DOE SC and DOE NNSA, the total leadership computing capability (the combined capability of existing DOE high-performance computers) is over 400 petaflops. In FY 2017, the SC R&D portion of the ECI was segregated into the Office of Science Exascale Computing Project (SC-ECP) in SC’s Advanced Scientific Computing Research (ASCR) program. ECP provides the R&D necessary to effectively use exascale-capable systems, while ECI is focused on the actual delivery of the exascale hardware. ASCR provides funds in ECI to support site preparations, non-recurring engineering investments, and acceptance activities at the Argonne Leadership Computing Facility (ALCF) and the Oak Ridge Leadership Computing Facility (OLCF). There are significant challenges associated with achieving this level of capability due to the physical limits of existing computing technology and concomitant limitations in software design. Naive scaling of current high performance computing technologies would result in systems that are untenable in their energy consumption, data storage requirements, latency, and other factors. Unlike previous upgrades to DOE’s Leadership Computing Capabilities, an exascale system capable of meeting critical national needs cannot be developed through incremental improvement of existing systems.

For NNSA, the execution of ECI resides with the Advanced Simulation and Computing (ASC) program, mostly in the Advanced Technology Development and Mitigation (ATDM) subprogram. Starting in FY 2021, the NNSA ECI activities will transition to the other ASC subprograms (Integrated Codes, Physics and Engineering Models, and Verification & Validation) to transfer the next-generation exascale application technologies into production service. The Computational Systems and Software Environment (CSSE) subprogram is responsible for procuring the El Capitan system and investing in production-ready exascale computing technologies. General Plant Project (GPP) funding in the Facility Operations and User Support (FOUS) subprogram will “extend” power from the walls of Lawrence Livermore National Laboratory (LLNL) Building 453 to the El Capitan system.

In addition to its importance for U.S. competitiveness, HPC is also a critical component of the national security, energy, and science missions of the Department of Energy.

National Security Needs

Stockpile stewardship, which underpins confidence in the U.S. nuclear deterrent, has been successful over the last two decades, largely as a result of modeling and simulation tools used in the NNSA Annual Assessment process, as well as solving issues arising from Significant Finding Investigations (SFIs). In the coming decade, the importance and role of HPC at the exascale computing performance level in this area will intensify, and exascale-based modeling and simulation tools will be increasingly called upon to provide required confidence, using robust uncertainty quantification techniques, in lifetime extensions of warheads in the U.S. nuclear weapons stockpile. These tools also will have an increasing role in understanding evolving nuclear threats posed by adversaries, both state and non-state, and in developing national policies to mitigate these threats.

Science

For nearly two decades, the department’s Science programs have utilized HPC to accelerate progress in a wide array of disciplines. Recent requirements-gathering efforts across the SC program offices indicate an increasing need for advanced computing at the exascale. Examples include: discovery and characterization of next-generation materials; development of reliable earthquake warnings and risk assessment; development of accurate regional impact assessments of climate; systematic understanding and improvement of chemical processes; analysis of the extremely large datasets resulting from the next generation of particle physics experiments; and extraction of knowledge from systems-biology studies of the microbiome. Dramatic improvements in public health may result from the application of exascale capabilities to cancer research, precision medicine, and understanding the human brain.

Energy

For the past six years, the Energy programs have formulated strategic plans that rely on advanced computing capabilities at the exascale. Examples include: design of high-efficiency, low-emission combustion engines and gas turbines; improving the reliability and adaptability of the Nation’s power grid; increased efficiency and reduced costs of wind turbine plants sited in complex terrain; and acceleration of the design and commercialization of next-generation small modular reactors. Advances in applied energy technologies also are dependent on next-generation simulations, notably whole-device modeling in plasma-based fusion systems.

In 2015, the interagency National Strategic Computing Initiative (NSCI)[1] was established by Executive Order to maximize the benefits of HPC for U.S. economic competitiveness, scientific discovery, and national security, and to ensure a cohesive, strategic effort within the Federal Government. DOE is one of three lead Federal agencies for the NSCI, charged with delivering capable exascale computing.

DOE established the ECI in the President’s FY 2016 Budget Request. The DOE ECI will accelerate the development and deployment of DOE exascale computing systems and is DOE’s contribution to the interagency NSCI. Within DOE, the NNSA Office of Advanced Simulation and Computing (ASC) and the SC Office of Advanced Scientific Computing Research (ASCR) are the lead organizations and are partners in the ECI. In addition to the NNSA/ASC and SC/ASCR investments, the Department’s ECI also includes targeted scientific application development in SC’s Office of Basic Energy Sciences and Office of Biological and Environmental Research.

In FY 2016, the ECI was split into the Exascale Computing Project (ECP) and other exascale-related activities. The ECP, a multi-lab project with its project office at DOE’s Oak Ridge National Laboratory, has as its sole focus the delivery of an ecosystem supporting DOE science, energy, and national security applications to run on at least two exascale machines. The ECP will follow the project management approach developed by DOE SC for large multi-lab projects such as the Linac Coherent Light Source and the Spallation Neutron Source.[2] As such, the ECP will be executed within a tailored framework that follows DOE Order (O) 413.3B, Program and Project Management for the Acquisition of Capital Assets, and defines critical decision points, overall project management, and requirements for control of a baselined schedule and cost. The first four years of the ECP (FY 2016-2020) have focused on R&D directed at achieving system performance targets for parallelism, resilience, energy consumption, memory, and storage. The second phase, approximately the last four years of the ECP, will support production readiness of application and system software and the start of ECP operations. The other DOE ECI activities include procurement of exascale computer systems and domain-specific software development in the Biological and Environmental Research and Basic Energy Sciences programs.

Milestone(s)

  • The DOE Acquisition Executive (Deputy Secretary) formally approved the Mission Need (Critical Decision 0) for the Exascale Computing Project (ECP) on July 28, 2016. Project milestones were established when the project was baselined at Critical Decision 2 in February 2020.
  • In 2018, two DOE SC National Laboratories, Oak Ridge National Laboratory and Lawrence Berkeley National Laboratory, were awarded the prestigious Gordon Bell Prize for work done on the Oak Ridge Leadership Computing Facility’s (OLCF’s) Summit supercomputer.[3]
  • In March 2019, DOE announced a contract between Argonne National Laboratory and Intel to build an exascale system, called Aurora, in partnership with Cray (now HPE); it is expected to be delivered in the 2021-2022 timeframe. Aurora will be based on a future generation of the Intel Xeon Scalable processor, Intel’s Xe compute architecture, a future generation of Intel Optane Datacenter Persistent Memory, and Intel’s oneAPI software, all connected by Cray’s Slingshot interconnect and the Shasta software stack.
  • In May 2019, DOE announced a contract between Oak Ridge National Laboratory and Cray (now HPE) to build an exascale system, called Frontier, in partnership with AMD; it is expected to be delivered in calendar year 2021. Frontier is based on Cray’s Shasta architecture and Slingshot interconnect and on AMD EPYC CPU (central processing unit) and AMD Radeon Instinct GPU (graphics processing unit) technology.
  • In August 2019, DOE announced the award for the NNSA exascale system, named El Capitan, which will be delivered to LLNL starting in early 2023. HPE will be the system integrator in partnership with AMD. Similar to Frontier, El Capitan will be powered by next-generation AMD EPYC “Genoa” CPUs and AMD Radeon Instinct GPUs, interconnected by Cray’s Slingshot fabric, and will use the AMD Radeon Open Compute platform (ROCm) and Cray Shasta software stacks.
  • In 2019, a team from ETH Zürich was awarded the prestigious Gordon Bell Prize for their work simulating quantum transport (the movement of electric charge carriers through nanoscale materials) using the Oak Ridge Leadership Computing Facility’s (OLCF’s) Summit supercomputer.[4]
  • When the Deputy Secretary approved the Alternatives Analysis (Critical Decision 1) and the issuance of research and development contracts with competitively selected vendors (Critical Decision 3a) in January 2017, approval for Establishing the Project Baseline (Critical Decision 2) was delegated to the Under Secretary for Science. An independent review of ECP in December 2019 recommended that the project was ready for approval of its project baseline. An Energy Systems Acquisition Advisory Board (ESAAB), convened in February 2020, approved ECP’s project baseline.

Major Decisions/Events

Application and exascale software testing and scaling will be initiated on exascale testbeds during the first three months of 2021.

The first exascale system is to be delivered during calendar year 2021.

Background

Over the past decade, DOE has become aware that future-generation systems will require significant changes in how high performance computers are designed, developed, and programmed. Although focused on overcoming the same challenges, industry responses will be aimed at near-term solutions, which are inadequate to advance DOE’s scientific, engineering, and national defense missions. Addressing this national challenge requires a significant investment by the Federal government, involving strong leadership from DOE headquarters and close coordination among government, national laboratories, academia, and U.S. industry, including medium and small businesses.

Concurrent R&D investments in applications that will optimally exploit emerging exascale computing architectures are a critical component of the Department’s effort in exascale computing. These “extreme-scale” applications, i.e., applications designed to exploit exascale computing, must also be representative of application requirements across the full spectrum of computing, from terascale to exascale. They include applications that support nuclear weapons stockpile stewardship; scientific discovery; energy technology innovation; renewable electrical generation and distribution; nuclear reactor design and longevity; data assimilation and analysis; and climate modeling. SC and NNSA have already initiated R&D efforts in key extreme-scale mission applications.

Four key challenges, identified in previous reports, must be addressed to realize productive, efficient, and economical exascale systems:[5][6][7]

Parallelism

Parallelism (also termed “concurrency”) is a computer architecture approach in which multiple processors simultaneously execute multiple smaller calculations broken down from an overall larger, complex problem. Since around 2004, increases in computing performance have resulted primarily from increasing the number of core processors (cores) on a chip. The number of cores, and hence the parallelism, has been increasing exponentially ever since. The Fugaku computer (415 petaflops) has over 7 million cores. Exascale computers will have parallelism a thousand-fold greater than petascale systems. Design and development of the hardware and software for exascale systems to effectively exploit this level of parallelism will require R&D followed by focused deployment. System management software and science applications software for petascale systems, already difficult to develop, are not designed to work at such extreme parallelism. Increasing concurrency a thousand-fold will make software development much more difficult. To mitigate this complexity, a portion of the R&D investments will create tools that improve the programmability of exascale computers.
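
As an illustration only (not DOE or ECP code), the short Python sketch below shows the basic idea of parallelism: a large calculation is split into independent chunks that separate worker processes compute at the same time, and the partial results are then combined. All names and sizes in the sketch are hypothetical.

    # Hypothetical illustration: decompose one large sum into chunks that
    # several cores compute concurrently, then combine the partial results.
    # Exascale systems apply the same principle across millions of cores.
    from multiprocessing import Pool

    def partial_sum(bounds):
        lo, hi = bounds                      # this worker's share of the range
        return sum(i * i for i in range(lo, hi))

    if __name__ == "__main__":
        n, workers = 10_000_000, 8
        step = n // workers
        chunks = [(w * step, n if w == workers - 1 else (w + 1) * step)
                  for w in range(workers)]
        with Pool(workers) as pool:          # one process per worker
            total = sum(pool.map(partial_sum, chunks))
        print(total)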

Memory and Storage

In past generations of computers, basic arithmetic operations (addition, multiplication, etc.) consumed the greatest amount of computer time required for a simulation. However, in the past decade, as central-processing-unit (CPU) microcircuits have increased in speed, moving data from the computer memory into the CPU now consumes the greatest amount of time. This issue has already surfaced in petascale systems, and it will become critical in exascale systems. R&D is required to develop memory and storage architectures that provide timely access to and storage of information at anticipated computational rates.
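
The following back-of-the-envelope Python sketch, using assumed (not vendor-specific) numbers for per-node arithmetic speed and memory bandwidth, illustrates why data movement dominates: for a simple vector update, the time to move the operands through memory far exceeds the time to do the arithmetic.

    # Assumed, illustrative hardware numbers -- not any specific DOE system.
    PEAK_FLOPS = 1.0e12      # 1 teraflop/s of arithmetic per node (assumed)
    MEM_BW     = 1.0e11      # 100 GB/s of memory bandwidth (assumed)

    n = 100_000_000                  # elements in the vector update y = a*x + y
    flops = 2 * n                    # one multiply and one add per element
    bytes_moved = 3 * 8 * n          # read x, read y, write y (8-byte doubles)

    compute_time = flops / PEAK_FLOPS
    memory_time = bytes_moved / MEM_BW
    print(f"arithmetic-limited time: {compute_time:.4f} s")
    print(f"memory-limited time:     {memory_time:.4f} s")  # ~100x larger here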

Reliability

Exascale computers will contain significantly more electronic components than today’s petascale systems. Furthermore, the individual circuit components are expected to have feature sizes of about 7 nanometers, approaching the physical limit of how small circuits can be made. The resilience of circuits becomes a serious issue at this scale because quantum effects and cosmic rays can randomly flip data bits. Achieving system-level reliability will require R&D to enable the exascale ecosystem to adapt dynamically to a constant stream of transient and permanent component failures. Applications must be designed to be resilient, in spite of system and device failures, to produce accurate results.
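
One common resilience technique is checkpoint/restart: the application periodically saves its state so that, when a fault occurs, it can roll back and continue rather than start over. The Python sketch below is a purely hypothetical illustration of that pattern, not an ASC or ECP code; the fault is injected at random to stand in for a transient hardware error.

    # Hypothetical checkpoint/restart loop: save state periodically and roll
    # back to the last checkpoint when a (simulated) transient fault occurs.
    import pickle
    import random

    CHECKPOINT = "state.pkl"

    def save(step, state):
        with open(CHECKPOINT, "wb") as f:
            pickle.dump((step, state), f)

    def load():
        try:
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return 0, 0.0                    # no checkpoint yet: fresh start

    step, state = load()
    while step < 1000:
        try:
            if random.random() < 0.001:      # injected stand-in for a bit flip
                raise RuntimeError("simulated transient fault")
            state += 0.1                     # one unit of simulated work
            step += 1
            if step % 100 == 0:
                save(step, state)            # periodic checkpoint
        except RuntimeError:
            step, state = load()             # roll back and retry
    print(step, round(state, 1))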

Energy Consumption

Current 10-20 petaflop computers consume approximately 10 megawatts (MW) of electrical power. Simple extrapolation to the exascale level yields power requirements of 500–1,000 MW; at a cost of $1 million per MW-year, the operating cost of an exascale machine built on current technology would be prohibitive. Continuing discussions and partnerships with computer vendors have resulted in engineering improvements that have reduced the required power significantly.
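
The extrapolation above can be checked with a few lines of arithmetic. The Python sketch below simply scales the quoted 10 MW figure linearly from 10-20 petaflops up to one exaflop (1,000 petaflops) and applies the quoted $1 million per MW-year cost; it is a worked illustration of the paragraph’s numbers, not a power model.

    # Linear scaling of the figures quoted above (illustrative arithmetic only).
    current_mw = 10                      # ~10 MW for a 10-20 petaflop system
    exaflop_in_pf = 1000                 # one exaflop = 1,000 petaflops
    cost_per_mw_year = 1_000_000         # $1 million per MW-year

    for pf in (20, 10):                  # current system size in petaflops
        naive_mw = current_mw * exaflop_in_pf / pf
        annual_cost = naive_mw * cost_per_mw_year
        print(f"scaling from {pf} PF: ~{naive_mw:,.0f} MW, "
              f"~${annual_cost:,.0f} per year in power")
    # Result: roughly 500-1,000 MW, i.e. about $0.5-1 billion per year.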

Definitions

Artificial intelligence

Intelligence exhibited by machines, for example a system that perceives its environment and takes actions that maximize its chance of success at some goal.

Capable exascale computing

A supercomputer that can solve science problems 50 times faster (or more complex) than today’s 20-petaflop systems (e.g., Titan, Sequoia); is sufficiently resilient that user intervention due to hardware or system faults is needed only about once a week on average; and has a software stack that meets the needs of a broad spectrum of scientific applications and workloads.

Gordon Bell Prize

Awarded each year by the Association for Computing Machinery (ACM) to recognize outstanding achievement in high-performance computing.

High Performance Computing (HPC)

Most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical workstation or server in order to solve large problems in science, engineering, or business using applications that require high bandwidth, enhanced networking, and very high compute capabilities.

Megawatt

A unit for measuring power that is equivalent to one million watts. One megawatt is roughly the power produced by 10 automobile engines.

Nanometer

A unit of measurement equal to 10⁻⁹ meter, or one billionth of a meter.

Neuromorphic computing

The study of theoretical computing systems that attempt to mimic the computing abilities of the human brain to achieve faster, more energy-efficient computation.

Petaflops

A measure of a computer’s processing speed expressed as one thousand trillion (10¹⁵) floating-point operations per second.

Quantum computing

The study of theoretical computing systems that use quantum-mechanical phenomena to perform operations on data. Large-scale quantum computers would theoretically be able to solve certain classes of problems much more quickly than classical computers.

Scientific application

An application that simulates real-world phenomena using mathematics. The best-known scientific applications are weather prediction models.

Uncertainty Quantification

The science of quantifying, characterizing, tracing, and managing uncertainties in experimental, computational, and real-world systems.

References