The goal is to run large simulation ensembles, commonly called “ensemble runs”, efficiently on upcoming pre-exascale and exascale systems. This approach is used when a sample of simulation runs is required to obtain a statistical estimate of the application's behavior over given parameter ranges. Ensemble runs are used, for instance, for uncertainty quantification and data assimilation. EoCoE-II aims to develop an innovative ensemble management infrastructure that provides elasticity, resilience and modularity, enabling the execution of very large ensembles of advanced parallel applications. The standard approach to data assimilation relies on a single large monolithic MPI execution, which makes it sensitive to faults and load-balancing issues and requires reserving large fractions of a supercomputer up front.
Two of the five EoCoE-II Scientific Challenges (Weather and Hydrology) integrate support for ensemble runs for data assimilation or sensitivity analysis. The goal is to develop this novel framework and demonstrate that it can efficiently run ensembles of more than 1024 members for the Weather and Hydrology Scientific Challenges.
Melissa (https://melissa-sa.github.io/)
Melissa is a file-avoiding, adaptive, fault-tolerant and elastic framework for running large-scale sensitivity analyses on supercomputers. The largest runs so far used up to 30k cores, executed 80000 parallel simulations, and generated 288 TB of intermediate data that did not need to be stored on the file system. In the context of EoCoE-II, Melissa will evolve to support statistical data assimilation methods such as the Ensemble Kalman filter (EnKF).
Early results with the framework, which is still under development, already show efficient runs of more than 1024 members with the ParFlow hydrology code.
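To give a rough idea of how ensemble statistics can be obtained without writing intermediate data to the file system, here is a minimal sketch. It assumes each ensemble member delivers its output field as a list of numbers, one member at a time, and that only running statistics are kept. It uses Welford's online algorithm for the mean and variance. The class name `RunningStats` and its methods are hypothetical and are not part of Melissa's API; this is an illustration of the general idea, not a description of how Melissa is implemented.

```python
from typing import List


class RunningStats:
    """Per-cell running mean and variance over ensemble members.

    Hypothetical helper for illustration only (Welford's online algorithm).
    """

    def __init__(self, size: int) -> None:
        self.size = size            # number of cells in one member's output
        self.count = 0              # number of members seen so far
        self.mean = [0.0] * size    # running mean per cell
        self.m2 = [0.0] * size      # running sum of squared deviations per cell

    def update(self, member_output: List[float]) -> None:
        """Fold one member's output into the statistics, then discard it."""
        if len(member_output) != self.size:
            raise ValueError("unexpected output size")
        self.count += 1
        for i, x in enumerate(member_output):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def variance(self) -> List[float]:
        """Unbiased sample variance per cell (needs at least two members)."""
        if self.count < 2:
            raise ValueError("need at least two members")
        return [m / (self.count - 1) for m in self.m2]


# Usage: three toy "members", each producing a two-cell output.
stats = RunningStats(size=2)
for output in ([1.0, 10.0], [2.0, 20.0], [3.0, 30.0]):
    stats.update(output)
```

Because each member's output is folded into the running values and then dropped, memory use depends on the size of one output field rather than on the number of members.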
All the EoCoE-I and EoCoE-II publications are available here (openAIRE).