Choosing the right programming model(s) is an important decision that can strongly affect the development, maintenance, readability, portability and optimization of a code. Programming model challenges are tackled in Work Package 2 of EoCoE. Here, the term programming models mainly refers to programming, optimization and parallelism languages, frameworks or libraries; it could equally be rephrased as High Performance Computing (HPC) challenges. Our goal is to help, advise and guide scientific code developers who are not HPC experts in refactoring and optimizing their applications to reach the best performance and scalability on massively parallel supercomputers. Targeted supercomputers can combine CPUs with accelerators such as GPUs. We currently test our improvements on Tier-0 European machines, but our work is not limited to present technologies: HPC experts are preparing applications to run on future pre-exascale machines with disruptive technologies such as ARM-based processors and accelerators. We aim to prepare selected codes in the energy domain for the exascale ecosystem.
To achieve this goal, the programming model technical challenge relies on experts in performance evaluation. Systematic and continuous evaluation of code performance makes it possible to guide optimization work throughout the project and to monitor improvements. It also helps to determine the best options for solving performance bottlenecks.
This section presents the different tools used to address parallel programming issues, listed below.
Performance analysis and monitoring
LIKWID is an easy-to-install and easy-to-use suite of command-line tools for performance-oriented programmers. It supports Intel, AMD and ARMv8 processors on the Linux operating system.
See more: https://github.com/RRZE-HPC/likwid
Scalasca is a software tool that supports the performance optimization of parallel programs by measuring and analyzing their runtime behavior. The analysis identifies potential performance bottlenecks – in particular those concerning communication and synchronization – and offers guidance in exploring their causes.
See more: https://www.scalasca.org/
MPI – Message Passing Interface. MPI is the de facto standard interface for message passing between processes on distributed-memory parallel systems.
Performance portability programming models
Kokkos Core implements a programming model in C++ for writing performance portable applications targeting all major HPC platforms. For that purpose it provides abstractions for both parallel execution of code and data management. Kokkos is designed to target complex node architectures with N-level memory hierarchies and multiple types of execution resources. It currently can use OpenMP, Pthreads and CUDA as backend programming models.
See more: https://github.com/kokkos/kokkos
Adaptive Mesh Refinement
The p4est software library enables the dynamic management of a collection of adaptive octrees, conveniently called a forest of octrees. p4est is designed to work in parallel and scales to hundreds of thousands of processor cores.
See more: http://p4est.org/
PDI (Portable Data Interface) supports loose coupling of simulation codes with libraries:
- the simulation code is annotated in a library-agnostic way,
- libraries are used from the specification tree.
This approach works well for a number of concerns, including: reading parameters, data initialization, post-processing, storing results to disk, visualization, fault tolerance, logging, inclusion as part of a code coupling or of an ensemble run, etc.
Within EoCoE-II, PDI is used as the main data exchange interface, whether for classical I/O, visualisation or ensemble data handling.
See more: https://pdi.julien-bigot.fr/master
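As a rough sketch, a PDI specification tree is a YAML file that declares the data the annotated code exposes and configures plugins to act on it. The data names and exact plugin layout below are illustrative assumptions, not taken from a real EoCoE configuration:

```yaml
pdi:
  data:
    # data exposed by the annotated simulation code (hypothetical names)
    iteration: int
    pressure: { type: array, subtype: double, size: 1024 }
  plugins:
    decl_hdf5:
      # write the exposed field to an HDF5 file, one file per iteration
      - file: diag_${iteration}.h5
        write: [pressure]
```

Because the simulation code only exposes named data, switching from HDF5 output to, say, in-situ visualization is a change in this tree, not in the simulation source.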
SENSEI provides simulations with a generic data interface through which they expose their state. SENSEI then passes this data to zero or more analysis and visualization tasks each time the simulation provides new data.
See more: https://sensei-insitu.org/
This tab presents the results obtained by the Work Package on programming models, grouped by application.
When talking about hybrid parallelism on both CPU and GPU, it is in fact rare that the two architectures work together asynchronously. In most codes, the CPU acts as a handler: it takes care of initialization, I/O, communications and the tasks that are better suited to its architecture, while the GPU handles the computationally intensive tasks. However, when one is running, the other is often idle: what was formerly done by the CPU is simply shifted to the GPU, so part of the allocated computing power is systematically lost. Today, task-based methods make it possible to use heterogeneous architectures asynchronously, but these methods are still a research topic and are not integrated into the most common software frameworks. Alya instead uses a load-balancing method to distribute some of its calculations equitably between the CPU and the GPU, which then compute at the same time. This makes it possible to get the most power out of the compute nodes.
The Alya code has been adapted to work not only on CPUs but also on GPUs for Computational Fluid Dynamics problems, particularly Large Eddy Simulation cases. For such problems, a semi-implicit approach is used where the momentum equation is solved explicitly while the continuity equation is solved implicitly. Since most of Alya can run either on CPUs or on GPUs, Alya's developers have decided to develop a co-execution approach that makes better use of current pre-exascale supercomputers, which typically blend GPUs and CPUs. A fast and scalable geometric mesh partitioning based on a Space-Filling Curve (SFC) has been key to enabling co-execution with a correct load balance between the GPUs and CPUs. At the beginning of the simulation, the SFC partitioning is called several times iteratively until an optimum partitioning of the mesh is obtained. In the first iteration, each MPI task (be it CPU or GPU) receives a specific portion of the mesh according to some initial weights. With this partition, it computes a couple of time steps. Based on the computational time taken by each MPI task, the weights are adapted and the mesh is repartitioned. After a few iterations, each processor receives the right amount of work so that they all take nearly the same time per step. GPUs obviously receive a more significant chunk of the mesh than CPUs. In this way, the computing power of the CPUs, which would sit idle in a pure GPU calculation, is not wasted.
Diagnostics can now also be written through the PDI API.
A first example is the work done in ParFlow to port the code to GPU. The developers adopted a strategy of abstracting memory allocation and loop management by designing an in-house DSL, for several reasons. First, it hides from scientific developers the complexity of the code related to parallelism and HPC: physicists can implement algorithms without worrying about which platform they will run on. Conversely, it allows HPC developers to optimize parallelism and to support new architectures seamlessly. This goes in the direction of software engineering oriented towards separation of expertise and portability. In addition, the developers may test the Kokkos programming model to do this work with a C++ formalism. The method used here should serve as an example for porting many applications.
All the EoCoE-I and EoCoE-II publications are available here (OpenAIRE).