GPU Optimisation of a Finite Element Code for Incompressible Flow presented at ECCOMAS Congress 2024

Herbert Owen participated in the 9th European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS Congress 2024, 3-7 June 2024, Lisbon, Portugal) with a paper titled "GPU Optimisation of a Finite Element Code for Incompressible Flow", work done together with FAU.

Abstract

GPU OPTIMISATION OF A FINITE ELEMENT CODE FOR INCOMPRESSIBLE FLOW

Herbert Owen¹, Oriol Lehmkuhl¹, Georg Hager², Gerhard Wellein² and Dominik Ernst²
¹ Barcelona Supercomputing Center (BSC), Barcelona, Spain, herbert.owen@bsc.es
² NHR@FAU, Erlangen, Germany, dominik.ernst@fau.de

Keywords: finite element method, GPU optimisation, incompressible flow

We present a detailed description of the optimisation of the momentum assembly for
the incompressible flow module of the Alya low-order finite element code. Alya is a
high-performance computational mechanics code to solve complex coupled multi-physics
problems developed by engineers, physicists and computational experts at the Barcelona
Supercomputing Center. It is one of the two CFD codes of the Unified European Applications
Benchmark Suite (UEABS) and the Accelerator benchmark suite of PRACE. In this
work, we focus on scale-resolving simulations solved using a fractional step scheme to
uncouple the momentum and continuity equations and an explicit treatment of the momentum
equation [1]. For such problems, the two main computational kernels are the momentum
assembly, analysed in this work, and the solution of the Poisson system for the pressure,
for which we found that the optimal approach is to rely on external Algebraic Multigrid
libraries such as PSCToolkit. The optimisation targets GPU architectures using
OpenACC, but we have found that most of the work also benefits CPUs. The analysis shows
that the large number of intermediate values, combined with the semantics of globally
allocated temporary arrays, is the root of all performance problems on the GPU. The
enhancements fall into three categories: restructuring, to determine which values are
computed at what time and in which order; specialisation, giving up some generality that is
rarely used or can be recovered at compile time; and privatisation of the intermediate
result arrays instead of allocating large global vectors. A roofline model is used to show
how the different modifications enhance the performance of the GPU. The combination of the previously
mentioned improvements leads to a speedup of more than 50x on an NVIDIA A100 GPU
and a 5x speedup on the CPU. The final version is much more energy efficient on the
GPU than on the CPU, as one would expect. We believe the observed anti-patterns and
solutions can be transferred to other code bases with a similar development history.
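
The privatisation and specialisation steps named in the abstract can be illustrated with a toy example. The sketch below is not Alya code and does not reproduce the authors' kernels; it is a minimal C++ illustration on an assumed 1D mesh of two-node elements, and every name in it (kNodesPerElem, assemble_global_temp, assemble_private, variants_agree) is hypothetical. The first function stores an intermediate value for every element in one large temporary array that a second loop reads back; the second function fuses the loops and keeps the intermediates in small fixed-size local arrays whose size is known at compile time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy setting: a 1D mesh where element e connects nodes e and e+1.
// The element size is fixed at compile time (the "specialise" idea).
constexpr int kNodesPerElem = 2;

// Anti-pattern sketch: the per-element intermediate (a nodal difference) is
// stored for ALL elements in one large temporary array, then read back later.
std::vector<double> assemble_global_temp(const std::vector<double>& u) {
    std::vector<double> rhs(u.size(), 0.0);
    if (u.size() < 2) return rhs;
    const std::size_t n_elem = u.size() - 1;

    std::vector<double> diff(n_elem);  // large, globally allocated temporary
    for (std::size_t e = 0; e < n_elem; ++e) {
        diff[e] = u[e + 1] - u[e];
    }
    for (std::size_t e = 0; e < n_elem; ++e) {
        rhs[e]     -= diff[e];
        rhs[e + 1] += diff[e];
    }
    return rhs;
}

// Privatised sketch: one fused loop; intermediates live in small fixed-size
// local arrays that a compiler can keep in registers.
std::vector<double> assemble_private(const std::vector<double>& u) {
    std::vector<double> rhs(u.size(), 0.0);
    if (u.size() < 2) return rhs;
    const std::size_t n_elem = u.size() - 1;

    // In an accelerated code this element loop would be the parallel region,
    // and the scatter into rhs would need atomic updates; both are omitted here.
    for (std::size_t e = 0; e < n_elem; ++e) {
        const std::array<double, kNodesPerElem> u_loc = {u[e], u[e + 1]};
        const double diff = u_loc[1] - u_loc[0];
        const std::array<double, kNodesPerElem> el_rhs = {-diff, diff};
        for (int a = 0; a < kNodesPerElem; ++a) {
            rhs[e + a] += el_rhs[a];
        }
    }
    return rhs;
}

// Small helper: checks that both variants produce the same assembled vector.
bool variants_agree(const std::vector<double>& u) {
    return assemble_global_temp(u) == assemble_private(u);
}
```

Both functions compute the same vector; the only difference is where the intermediate values live. In the first variant the temporary grows with the mesh and has to travel through main memory, while in the second its size is independent of the mesh, which is the property the abstract associates with privatisation.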

REFERENCES
[1] Oriol Lehmkuhl, Guillaume Houzeaux, Herbert Owen, Georgios Chrysokentis and
Ivette Rodríguez, A low-dissipation finite element scheme for scale resolving simulations
of turbulent flows, Journal of Computational Physics, 390, 51-65. Submitted to
The Journal of Supercomputing.
