It is becoming increasingly apparent that the future energy ecosystem will rely heavily on digitization to drive essential innovations in production and storage technologies, mitigate power source variability and manage its distribution via a complex hierarchy of micro- and macro-networks.
The EoCoE consortium is composed of world-leading research teams from four strategically important low-carbon energy domains in meteorology, materials, hydrology and fusion, linked together through a multi-disciplinary platform of high performance computing (HPC) and numerical mathematics to create a network of experts in computational energy science.
The EoCoE consortium developed a comprehensive, structured support pathway for enhancing the HPC capability of energy-oriented numerical models, from simple entry-level parallelism to fully-fledged exascale readiness. At the top end of this scale, promising applications from each energy domain have been selected to form the basis of 5 new Energy Science Challenges in the present successor project EoCoE-II, supported by 4 Technical Challenges as shown in the following figure.
The Science Challenges cover the following areas:
The Technical Challenges cover:
- Scalable Solvers
- Programming Models
- IO & Data Flow
- Ensemble Runs
The project is organized as a matrix structure in which the 4 Technical Challenges (shown in the rows) cut across the 5 Scientific Challenges, so that the HPC experts work in close integration with the domain scientists.
To define the work packages (WP), we chose to group tasks by technical challenge rather than by scientific discipline, as shown in the following figure. WP2 to WP5 represent the Technical Challenges. The main advantage of this choice is that the work package leaders share a common HPC and mathematical background, which greatly eases communication between WPs and enhances the project’s receptiveness to exascale hardware developments. Nevertheless, special care has been taken to ensure that the domain scientists retain leading roles in the project in order to keep it user-driven. The Exascale Co-design Group (ECG), a key component of the management structure in WP7, ensures that technological and design decisions are made by the code owners in close consultation with HPC experts. WP1 is responsible for the scientific challenges and applications. WP6 is dedicated to dissemination and networking.
In each Scientific Challenge, we have decided to focus refactoring and optimization on a limited number of flagship applications (see Scientific Challenge pages for details).
Flagship applications have different needs and will therefore not be equally represented in the Work Packages. The work to be done in each of them is schematically represented in the following figure.
A detailed description of each Work Package and its contribution to the different codes is given in the corresponding tabs.
In addition to the flagship applications, so-called satellite codes will be used to provide data, to be coupled with the flagship codes, or to compare results. Satellite codes will also be improved during the project or will benefit from WP expertise. External codes will be used but not modified within the project.
WP1 description: This work package gathers all the scientific expertise required to meet the scientific challenges and enable transformational Energy Science breakthroughs.
Scientists and principal investigators of the strongly supported flagship codes contribute here to:
- The coordination of the integration of HPC contributions from WP2-5 into flagship codes
- The development of new algorithms and new physics to meet the scientific challenge
- The production of scientific results with the developed numerical tools to illustrate their production readiness and meet the scientific objectives.
The five following graphics describe in detail the task breakdown for each scientific challenge and involved applications.
Work Package 2 (WP2) is dedicated to Programming Model-related tasks.
The general objective of WP2 is to address limitations in HPC performance, scalability, code architecture and parallelism technology in order to prepare selected applications for the Exascale ecosystem. It can be divided into 6 detailed objectives, in order of importance:
- To evaluate application performance to guide optimization efforts
- To prepare applications to run more efficiently on targeted computer architectures and pre-Exascale platforms
- To make code flexible and adaptable for developments within the project and beyond, including emerging and still unknown Exascale architectures
- To keep codes readable despite the use of complex modern components, and to ensure that new code versions will be understandable and therefore adopted by the main developer teams, users and external contributing communities.
- To improve knowledge and expertise on HPC tools and libraries and to spread them to satellite codes and beyond the scope of EoCoE.
- To contribute to the development of HPC libraries from application experiences.
Solving Linear Algebra problems is a core task in four out of five EoCoE II Scientific Challenges, and thus the availability of exascale-enabled Linear Algebra solvers is fundamental in preparing the SC applications for the new exascale ecosystem. The goal of WP3 is to design and implement exascale-enabled Linear Algebra solvers for the selected applications and to integrate them into the application codes. A co-design approach between Linear Algebra and Scientific Challenge experts will be used; nevertheless, the solvers will also be developed from a more general perspective, to obtain Linear Algebra tools useful for a wider range of applications. The Linear Algebra experts involved in WP3 have long-standing experience in developing solvers for HPC platforms, and the planned activities will build upon software and methodologies developed by them and tested during EoCoE I. New algorithms and disruptive technologies will also be considered, to tackle the new challenges posed by the envisioned exascale systems.
To achieve this goal, the following steps will be performed:
- Analysis of the Linear Algebra kernels of the applications, to clearly identify the needs of the applications in terms of Linear Algebra solvers, and to select the best Linear Algebra methodologies and software to work with. This work was already started during EoCoE I, providing a sound basis for EoCoE II.
- Extension, modification, and refactoring of the selected Linear Algebra solvers, based on a co-design approach between Linear Algebra experts and application experts, to ensure that solvers and applications evolve accordingly in their route toward exascale.
- Design and implementation of novel solvers, for applications where modifying and refactoring the available Linear Algebra solvers does not appear satisfactory.
- Integration of the Linear Algebra solvers into the applications, in close collaboration between Linear Algebra and application experts, followed by testing and tuning on problems of interest.
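As a purely illustrative sketch of the kind of kernel these steps target, the snippet below implements an unpreconditioned conjugate gradient method in plain Python. It is not one of the EoCoE solvers: the names `conjugate_gradient` and `laplacian_1d` are hypothetical, and the matrix-free `apply_a` callback is only one assumed way an application could hand its operator to a solver.

```python
from typing import Callable, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(apply_a: Callable[[Vector], Vector], b: Vector,
                       tol: float = 1e-10, max_iter: int = 1000) -> Vector:
    """Solve A x = b for a symmetric positive definite operator A.

    The operator is passed as a function (matrix-free), so the application
    keeps ownership of its own data structures.
    """
    x = [0.0] * len(b)          # initial guess x0 = 0
    r = list(b)                 # residual r0 = b - A x0 = b
    p = list(r)                 # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:  # stop once the residual norm is small
            break
        ap = apply_a(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def laplacian_1d(v: Vector) -> Vector:
    """Apply the 1D finite-difference Laplacian (stencil -1, 2, -1)."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]
```

A call such as `conjugate_gradient(laplacian_1d, rhs)` returns an approximate solution for a right-hand side `rhs`. Production solvers add preconditioning, distributed data structures and accelerator support, all of which this sketch leaves out.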
The figures below describe, for each scientific challenge, how WP3 will contribute to the development and optimization of the selected applications.
The I/O & Data Flow work package of EoCoE II targets four main objectives:
- Improvement of I/O accessibility: Different I/O libraries support a variety of configuration options. Depending on the situation, these options must be continually updated, or an entirely new library must be adopted. A generic interface will be introduced with the help of the Portable Data Interface (PDI), which decouples the I/O API from the application to allow easier switching between different I/O subsystems.
- I/O performance: The time spent writing and reading data can consume a significant part of the overall application runtime and should be minimized. To this end, we want to leverage the optimization options of the I/O libraries in use, as well as intermediate storage elements such as flash storage.
- Resiliency: Running an application at large scale increases the chance of hardware or software problems, as more and more computing elements are involved in the calculation. Additional I/O techniques can reduce the effort needed to restart a broken run, or even avoid an overall crash, by storing intermediate snapshots on the storage elements. Here the focus is on resiliency for the ensemble calculations planned within WP5, which involve very large checkpoint files.
- Data size reduction: Running an application on a larger scale often implies an increasing data size, which can become unmanageable and consume resources. Within this task the overall data size will be reduced without losing necessary information via in-situ and in-transit processing, moving postprocessing elements directly into the frame of the running application. This objective is directly coupled to the work in WP5, which provides additional workflow functionalities.
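The decoupling idea behind these objectives can be sketched as follows. This is a minimal illustration under stated assumptions, not the PDI API: the `IOBackend` protocol, the `expose` method, the two backend classes and `run_simulation` are hypothetical names chosen for this example. The application only announces its data, and the selected backend decides whether to write it in full or to reduce it in situ.

```python
import json
import os
from typing import Dict, List, Protocol


class IOBackend(Protocol):
    """Hypothetical interface: the application only 'exposes' named data."""

    def expose(self, name: str, data: List[float]) -> None: ...


class JsonFileBackend:
    """Writes every exposed array in full to <directory>/<name>.json."""

    def __init__(self, directory: str) -> None:
        self.directory = directory

    def expose(self, name: str, data: List[float]) -> None:
        path = os.path.join(self.directory, f"{name}.json")
        with open(path, "w") as handle:
            json.dump(data, handle)


class InSituSummaryBackend:
    """Keeps only min/max/mean of each array instead of the raw data."""

    def __init__(self) -> None:
        self.summaries: Dict[str, Dict[str, float]] = {}

    def expose(self, name: str, data: List[float]) -> None:
        self.summaries[name] = {
            "min": min(data),
            "max": max(data),
            "mean": sum(data) / len(data),
        }


def run_simulation(io: IOBackend, steps: int) -> None:
    """Toy time loop; it never needs to know which backend is in use."""
    field = [0.0, 1.0, 2.0, 3.0]
    for step in range(steps):
        field = [value + 1.0 for value in field]  # stand-in for a solver update
        io.expose(f"field_{step}", field)
```

Switching from `run_simulation(JsonFileBackend(path), steps)` to `run_simulation(InSituSummaryBackend(), steps)` changes what reaches storage without touching the simulation loop, which is the property the generic interface is meant to provide.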
The following figures describe, for each scientific challenge, how WP4 will contribute to the development and optimization of the selected applications.
Two out of five of the EoCoE-II Scientific Challenges (Weather and Hydrology) integrate some support for ensemble runs for data assimilation or sensitivity analysis, usually relying on one big monolithic MPI job, like ESIAS developed during EoCoE I. The goal of WP5 is to develop an elastic, exascale-ready framework for ensemble runs that extends the Melissa approach developed at INRIA, and to enable the Weather and Hydrology EoCoE-II applications to benefit from next-generation exascale machines.
This novel approach enables simulations or groups of simulations to be submitted to the machine's batch scheduler independently. Once these jobs start, they dynamically connect to a parallel data-processing server. This server receives data from the running simulations and processes it in parallel, on-line, as soon as it becomes available, thus avoiding intermediate files. The computed partial results can be fed back to the simulations (as needed for data assimilation). They can also support an adaptive sampling process in which the parameters for the next simulation runs are chosen according to these partial results. This approach takes full advantage of the loose synchronization between simulation runs: simulations are submitted and allocated independently, which makes better use of machine resources and supports the efficient fault tolerance mechanisms needed to exploit exascale machines.
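To illustrate how results can be processed on-line without intermediate files, the sketch below accumulates a per-cell ensemble mean and variance with Welford's one-pass update. It is an assumption-laden toy, not the Melissa implementation or API: `OnlineFieldStats` and its methods are hypothetical names, and a real server would receive fields over the network and update them in parallel.

```python
from typing import List


class OnlineFieldStats:
    """Running per-cell mean and variance over ensemble members.

    Each member's field is folded in as soon as it arrives and can then be
    discarded, so no member needs to be stored on disk.
    """

    def __init__(self, size: int) -> None:
        self.count = 0
        self.mean = [0.0] * size
        self.m2 = [0.0] * size   # running sum of squared deviations

    def update(self, field: List[float]) -> None:
        """Fold one ensemble member into the statistics (Welford update)."""
        self.count += 1
        for i, value in enumerate(field):
            delta = value - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (value - self.mean[i])

    def variance(self) -> List[float]:
        """Unbiased sample variance per cell (zeros until two members)."""
        if self.count < 2:
            return [0.0] * len(self.mean)
        return [m / (self.count - 1) for m in self.m2]
```

Because each update only needs the current member, members can finish in any order and at any time, which matches the loose synchronization described above. The partial `mean` and `variance()` are available after every update and could, for instance, drive an adaptive choice of the next parameters.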
WP5 builds on top of WP4 (I/O) work re-using the PDI interface for data access, the FTI library for fault tolerance and the in-situ/in-transit extensions.
The following figures describe, for each scientific challenge, how WP5 will contribute to the development and optimization of the selected applications.
Work Package 6: Dissemination and Networking
The general objective of WP6 is to disseminate the CoE achievements by providing high-end exascale codes for selected, high-impact application areas and a Software-as-a-Service (SaaS) portal to attract SMEs and industry and foster new collaborations; by organizing conferences, workshops, education and training; and by providing consultancy and expertise to laboratories, industry and SMEs. The work package also helps communicate information about the project, its results and its impacts to targeted end-users, regulators, other stakeholders and the general public. Networking activities are particularly important, as they enable the consortium to reach early adopters and other key stakeholders and facilitate EoCoE’s impact and large-scale uptake. These activities follow best practices for engaging and guiding stakeholders. Partners attend and present EoCoE at key events throughout the project to broaden the visibility of the solutions developed in the project and to establish links with external stakeholders (including seed capital, business angels, early-adopter industries and SMEs, innovation accelerators, regional clusters on HPC exploitation, etc.). Among other things, WP6 is also active in establishing a stable collaboration with EERA, seen as a key stakeholder in the topics of interest of the CoE.
WP6 is in charge of the public website, press releases and advertising materials, e-newsletter, social networks and promotional videos. It organizes face-to-face meetings, workshops and conferences.
Moreover, in collaboration with PRACE/PATC and other organizations and projects, WP6 addresses the skills gap in computational science through specialised training and capacity-building measures that develop the human capital needed for increased adoption of advanced HPC in academia and industry (including SMEs).