Exascale Computing and the National Strategic Computing Initiative (2020 DOE transition)

{{Notice|DOE’s [[Office of Science]] (SC) and [[National Nuclear Security Administration]] (NNSA) have partnered to establish the [[Exascale Computing Initiative]] (ECI) to deliver capable exascale computing for DOE science, technology, and national security mission needs. DOE is one of the Federal leads in the interagency [[National Strategic Computing Initiative]] (NSCI) focused on delivering exascale computing to advance U.S. economic competitiveness and national security.}}{{TB 2020 Book 2}}

It is critical to national security and economic competitiveness to maintain the Department of Energy’s [[Exascale Computing Project|Exascale Computing Initiative]]. The [https://www.govinfo.gov/app/details/CFR-2016-title3-vol1/CFR-2016-title3-vol1-eo13702 July 2015 Executive Order 13702] established the [[National Strategic Computing Initiative]] (NSCI) and identified DOE as one of the lead [[Federal government of the United States|agencies]]. The NSCI called upon the DOE [[Office of Science]] (SC) and DOE [[National Nuclear Security Administration]] (NNSA) to “execute a joint program focused on advanced simulation through a capable [[exascale computing program]] emphasizing sustained performance on relevant applications and analytic computing to support their missions.”


In early summer 2020, Japan overtook the U.S. on the Top500 list, which identifies the world’s most powerful high-performance computers, with the deployment of its 415-petaflop Fugaku system. “Flops” (floating-point operations per second) are the elementary unit of computational power: one flop corresponds to one calculation. One petaflop is one quadrillion (one thousand trillion, or 10<sup>15</sup>) flops, and one exaflop is one thousand petaflops (10<sup>18</sup>). Recognizing the importance of HPC to economic competitiveness, nations in Europe and Asia, particularly China, continue to invest in HPC. The Chinese strategy is increasingly to base their HPC systems on domestic technology, and China continues to lead the U.S. in the number of systems on the Top500 list. On the June 2020 TOP500 list, China has 226 systems compared with 114 for the U.S. By all significant measures – top-ranked system, total number of supercomputers in the TOP500, aggregate total computing power, and software capable of sustained performance – China now dominates the U.S. in supercomputing. In addition, China is investing heavily in its domestic production capabilities and future computing technologies, such as quantum computing, neuromorphic computing, and artificial intelligence (see definitions below). China also has three exascale machines in the pipeline: a Sunway system in Jinan targeted for 2020, a NUDT system in Tianjin targeted for 2021, and a Sugon system in Shenzhen targeted for 2022. The Chinese have an advantage in that they are not held back by an installed base that needs backward compatibility; with no need to “play it safe,” they have an open-ended design space ranging from the conventional to the exotic. However, in the past two years there have been no announcements of new Chinese systems on the Top500.
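
To make the unit arithmetic concrete, the following is a minimal sketch in Python. The only inputs are figures already cited in this article (Fugaku’s 415 petaflops and the roughly 400 petaflops of combined DOE leadership capability noted below):

<syntaxhighlight lang="python">
# Flop-rate unit conversions; the 415-petaflop figure for Fugaku and the
# ~400-petaflop DOE total are the numbers cited in the surrounding text.
PETAFLOP = 10**15   # one quadrillion flops
EXAFLOP = 10**18    # one thousand petaflops

fugaku = 415 * PETAFLOP
print(fugaku / EXAFLOP)     # 0.415 -> Fugaku is ~41.5% of one exaflop

doe_total = 400 * PETAFLOP  # combined DOE leadership computing capability
print(EXAFLOP / doe_total)  # 2.5 -> a single exascale system exceeds 2.5x
                            #        today's combined DOE capability
</syntaxhighlight>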


Currently, within DOE SC and DOE NNSA, the total leadership computing capability (the combined capability of existing DOE high-performance computers) is over 400 petaflops. In FY 2017, the SC R&D portion of the ECI was segregated into the Office of Science [[Exascale Computing Project]] (SC-ECP) in SC’s [[Advanced Scientific Computing Research]] (ASCR) program. ECP provides the R&D necessary to effectively use exascale-capable systems, while ECI is focused on the actual delivery of the exascale hardware. ASCR provides funds in ECI to support site preparations, non-recurring engineering investments, and acceptance activities at the [[Argonne Leadership Computing Facility]] (ALCF) and the [[Oak Ridge Leadership Computing Facility]] (OLCF). There are significant challenges associated with achieving this level of capability due to the physical limits of existing computing technology and concomitant limitations in software design. Naive scaling of current high-performance computing technologies would result in systems that are untenable in their energy consumption, data storage requirements, latency, and other factors. Unlike previous upgrades to DOE’s Leadership Computing Capabilities, an exascale system capable of meeting critical national needs cannot be developed through incremental improvement of existing systems.
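
A back-of-envelope calculation makes the naive-scaling point concrete. The sketch below uses illustrative reference figures (roughly those of a ~200-petaflop, ~13 MW leadership-class system); these are assumptions for illustration, not numbers from this document:

<syntaxhighlight lang="python">
# Illustrative sketch of the naive-scaling energy argument. The input
# figures are assumptions for illustration, not official DOE numbers.
reference_flops = 200e15   # assume a ~200-petaflop leadership system
reference_power = 13e6     # assume it draws ~13 megawatts

scale = 1e18 / reference_flops          # 5x more compute for one exaflop
naive_power = reference_power * scale   # ~65 MW if efficiency stays flat

print(f"naive exascale power: {naive_power / 1e6:.0f} MW")
# Facility power budgets commonly targeted for exascale are ~20-30 MW, so
# the gap must be closed by new architectures, not simple replication.
</syntaxhighlight>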


For NNSA, the execution of ECI resides with the [[Advanced Simulation and Computing]] (ASC) program, mostly in the Advanced Technology Development and Mitigation (ATDM) subprogram. Starting in FY 2021, the NNSA ECI activities will be transitioned to the other ASC subprograms (Integrated Codes, Physics and Engineering Models, and Verification & Validation) to transfer the next-generation exascale application technologies to production service. The Computational Systems and Software Environment (CSSE) subprogram is responsible for procuring the El Capitan system and investing in production-ready exascale computing technologies. General Plant Project (GPP) funding in the Facility Operation and User Support (FOUS) subprogram will “extend” the power from the walls of [[Lawrence Livermore National Laboratory]] (LLNL) Building 453 to the El Capitan system.


In addition to its importance for U.S. competitiveness, HPC is also a critical component of the national security, energy, and science missions of the Department of Energy.

For the past six years, the Energy programs have formulated strategic plans that rely on advanced computing capabilities at the exascale. Examples include: design of high-efficiency, low-emission combustion engines and gas turbines; improving the reliability and adaptability of the Nation’s power grid; increased efficiency and reduced costs of wind turbine plants in complex terrains; and acceleration of the design and commercialization of next-generation small modular reactors. Advances in applied energy technologies also are dependent on next-generation simulations, notably whole-device modeling in plasma-based fusion systems.


In 2015, the interagency [[National Strategic Computing Initiative]] (NSCI)<ref><nowiki>https://www.whitehouse.gov/the-press-office/2015/07/29/executive-order-creating-national-strategic-computing-initiative</nowiki></ref> was established by Executive Order to maximize the benefits of HPC for U.S. economic competitiveness, scientific discovery, and national security, and to ensure a cohesive, strategic effort within the Federal Government. DOE is one of three lead Federal agencies for the NSCI to deliver capable exascale computing.


DOE established the ECI in the President’s FY 2016 Budget Request. The DOE ECI will accelerate the development and deployment of DOE exascale computing systems and is DOE’s contribution to the interagency NSCI. Within DOE, the NNSA Office of Advanced Simulation and Computing (ASC) and SC [[Office of Advanced Scientific Computing Research]] (ASCR) are the lead organizations and are partners in the ECI. In addition to the NNSA/ASC and SC/ASCR investments, the Department’s ECI also includes targeted scientific application development in SC’s [[Office of Basic Energy Sciences]] and [[Office of Biological and Environmental Research]].


In FY 2016, the ECI was split into the [[Exascale Computing Project]] (ECP) and other exascale-related activities. The ECP, a multi-lab project with its project office at DOE’s Oak Ridge National Laboratory, has as its sole focus the delivery of an ecosystem supporting DOE science, energy, and national security applications to run on at least two exascale machines. The ECP will follow the project management approach developed by DOE SC for large multi-lab projects such as the [[Linac Coherent Light Source]] and the [[Spallation Neutron Source]].<ref>http://science.energy.gov/user-facilities/</ref> As such, the ECP will be executed within a tailored framework that follows DOE Order (O) 413.3B, Program and Project Management for the Acquisition of Capital Assets, and defines critical decision points, overall project management, and requirements for control of a baselined schedule and cost. The first four years of the ECP (FY 2016-2020) have focused on R&D directed at achieving system performance targets for parallelism, resilience, energy consumption, memory, and storage. The second phase, approximately the last four years of the ECP, will support production readiness of application and system software and the start of ECP operations. The other DOE ECI activities include procurement of exascale computer systems and domain-specific software development in the Biological and Environmental Research and Basic Energy Sciences programs.


== Milestone(s) ==
* The DOE Acquisition Executive (Deputy Secretary) formally approved the Mission Need (Critical Decision 0) for the Exascale Computing Project (ECP) on July 28, 2016. Project milestones were established when the project was baselined at Critical Decision 2 in February 2020.
* In 2018, two DOE SC National Laboratories, Oak Ridge National Laboratory and Lawrence Berkeley National Laboratory, were awarded the prestigious Gordon Bell Prize for work done on the [[Oak Ridge Leadership Computing Facility]]’s (OLCF’s) Summit supercomputer.<ref>https://www.olcf.ornl.gov/2018/11/20/2018-acm-gordon-bell-prize/</ref>
* In March 2019, DOE announced a contract between [[Argonne National Laboratory]] and Intel to build an exascale system, called Aurora, in partnership with Cray (now HPE); the system is expected to be delivered in the 2021-2022 timeframe. Aurora will be based on a future generation of the Intel Xeon Scalable processor, Intel’s Xe compute architecture, a future generation of Intel Optane Datacenter Persistent Memory, and Intel’s oneAPI software, all connected by Cray’s Slingshot interconnect and the Shasta software stack.
* In May 2019, DOE announced a contract between Oak Ridge National Laboratory and Cray (now HPE) to build an exascale system, called Frontier, in partnership with AMD, expected to be delivered in calendar year 2021. Frontier is based on Cray’s Shasta architecture and Slingshot interconnect and on AMD EPYC CPU (central processing unit) and AMD Radeon Instinct GPU (graphics processing unit) technology.
* In August 2019, DOE announced the award for the NNSA exascale system, named El Capitan, which will be delivered to LLNL starting in early 2023. HPE will be the system integrator in partnership with AMD. Similar to Frontier, El Capitan will be powered by next-generation AMD EPYC Genoa CPUs and AMD Radeon Instinct GPUs, interconnected by Cray’s Slingshot fabric, and will use the AMD Radeon Open Compute platform (ROCm) and Cray Shasta software stacks.
* In 2019, a team from ETH Zürich was awarded the prestigious Gordon Bell Prize for their work simulating quantum transport (the movement of electric charge carriers through nanoscale materials) on the Oak Ridge Leadership Computing Facility’s (OLCF’s) Summit supercomputer.<ref>https://www.olcf.ornl.gov/2019/11/21/tiny-transistor-leads-to-big-win-for-eth-zurich-2019-acm-gordon-bell-prize-winner/</ref>
* When the Deputy Secretary approved Alternatives Analysis (Critical Decision 1) and the issuance of research and development contracts with competitively selected vendors (Critical Decision 3a) in January 2017, approval for Establishing the Project Baseline (Critical Decision 2) was delegated to the Under Secretary for Science. An independent review of ECP, in December 2019, recommended that the project was ready for approval of its project baseline. An Energy Systems Acquisition Advisory Board (ESAAB), convened in February 2020, [https://www.energy.gov/management/downloads/energy-systems-acquisition-advisory-board-esaab-members-july-2014 approved ECP’s project baseline].


== Major Decisions/Events ==

=== Uncertainty Quantification ===
The science of quantifying, characterizing, tracing, and managing uncertainties in experimental, computational, and real-world systems.
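
As a concrete illustration (a minimal sketch, not any specific DOE code), the simplest form of this discipline is forward uncertainty propagation by Monte Carlo sampling: sample the uncertain inputs of a model, run the model on each sample, and quantify the spread of the outputs. The model() function below is a hypothetical stand-in for a simulation:

<syntaxhighlight lang="python">
# Minimal Monte Carlo uncertainty propagation; model() is a hypothetical
# stand-in for a simulation with uncertain inputs.
import random
import statistics

def model(x, y):
    return x**2 + 3 * y   # any function of uncertain inputs

# Characterize input uncertainty as probability distributions and sample.
outputs = [model(random.gauss(1.0, 0.1), random.gauss(2.0, 0.05))
           for _ in range(100_000)]

# Quantify the uncertainty induced in the output.
print(f"mean  = {statistics.fmean(outputs):.3f}")
print(f"stdev = {statistics.stdev(outputs):.3f}")
</syntaxhighlight>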
== References ==