Scalable Solvers

Solving Linear Algebra (LA) problems is a core task in four out of five EoCoE II Scientific Challenges (SC), and thus the availability of exascale-enabled LA solvers is fundamental in preparing the SC applications for the new exascale ecosystem. The goal of WP3 is to design and implement exascale-enabled LA solvers for the selected applications and to integrate them into the application codes. A co-design approach between LA and SC experts will be used; nevertheless, the solvers will also be developed from a more general perspective, to obtain LA tools useful for a wider range of applications. The LA experts involved in WP3 have long-standing experience in developing solvers for HPC platforms, and the planned activities will build upon software and methodologies that they developed and tested during EoCoE I. New algorithms and disruptive technologies will also be considered, to tackle the new challenges posed by the envisioned exascale systems.

To achieve this goal, the following steps will be performed:

  1. analysis of the LA kernels of the applications, to clearly identify the applications' needs in terms of LA solvers and to select the best LA methodologies and software to work with; this work was initiated during EoCoE I, providing a sound basis for EoCoE II
  2. extension, modification, and re-factoring of the selected LA solvers, based on a co-design approach between LA experts and application experts, to ensure that solvers and applications evolve consistently on their route toward exascale
  3. design and implementation of novel solvers for applications where modifying and re-factoring the available LA solvers is not satisfactory
  4. integration of the LA solvers into the applications, in close collaboration between LA and application experts, and testing and tuning on problems of interest

The work will address the Materials, Water, Fusion and Wind Scientific Challenges, where strong needs for exascale-enabled LA solvers emerged during EoCoE I. A separate task has been planned for each of these SCs, plus a task devoted to transversal activities. The HPC packages developed by the LA experts participating in EoCoE II will provide a sound basis for this work.

PSBLAS

The PSBLAS library, developed to facilitate the parallelization of computationally intensive scientific applications, addresses the parallel implementation of iterative solvers for sparse linear systems on distributed-memory machines. It includes routines for multiplying sparse matrices by dense vectors, solving block-diagonal systems with triangular diagonal blocks, preprocessing sparse matrices, exchanging data in parallel, handling mesh indices, and performing auxiliary operations on dense vectors.

It is available at https://github.com/sfilippone/psblas3
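
To make the usage pattern concrete, the fragment below mirrors the typical PSBLAS flow in C-like form: initialize a parallel context, build the index-space descriptor, assemble the distributed sparse matrix, and perform a sparse matrix-vector product. PSBLAS itself is a Fortran library, so the psblas_* names here are hypothetical stand-ins; the corresponding Fortran routines are psb_init, psb_cdall/psb_cdasb, psb_spall/psb_spins/psb_spasb and psb_spmm.

    /* Hypothetical C-style sketch of the PSBLAS workflow (the real API is
     * Fortran: psb_init, psb_cdall/psb_cdasb, psb_spall/psb_spins/psb_spasb,
     * psb_spmm). Fragment only: declarations of the local coefficient arrays
     * ia/ja/val and of the vectors x, y are omitted, as is error handling. */
    psblas_ctxt_t ctxt = psblas_init();                /* parallel context over MPI        */
    psblas_desc_t desc = psblas_cdall(ctxt, n_global); /* allocate index-space descriptor  */
    psblas_mat_t  A    = psblas_spall(desc);           /* allocate distributed sparse matrix */

    /* each process inserts the coefficients of the rows it owns */
    psblas_spins(A, nz_local, ia, ja, val);

    psblas_cdasb(desc);   /* assemble descriptor (builds halo-exchange lists) */
    psblas_spasb(A);      /* assemble the matrix                              */

    /* distributed sparse matrix-vector product: y = alpha*A*x + beta*y */
    psblas_spmm(1.0, A, x, 0.0, y, desc);

    psblas_exit(ctxt);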

MLD2P4

MLD2P4 (MultiLevel Domain Decomposition Parallel Preconditioners Package based on PSBLAS) provides parallel Algebraic MultiGrid (AMG) and Domain Decomposition preconditioners, to be used in the iterative solution of linear systems. It has been designed to provide scalable and easy-to-use algebraic multilevel preconditioners in the context of the PSBLAS computational framework and is used in conjunction with the Krylov solvers available from PSBLAS.

It is available at https://github.com/sfilippone/mld2p4-2
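
The intended usage pattern is short: build an algebraic multilevel preconditioner from the assembled PSBLAS matrix, then hand it to a PSBLAS Krylov solver. As above, this is a hypothetical C-style rendering; in the actual Fortran API the relevant calls are mld_precinit, mld_precbld and psb_krylov.

    /* Hypothetical C-style sketch (real Fortran API: mld_precinit,
     * mld_precbld, psb_krylov); continues from the PSBLAS fragment above. */
    mld_prec_t P = mld_precinit(ctxt, "ML");  /* "ML": multilevel (AMG) preconditioner      */
    mld_precbld(P, A, desc);                  /* algebraic setup from the assembled matrix  */

    /* preconditioned CG iteration on the PSBLAS side: solve A*x = b */
    int    iters;
    double err;
    psblas_krylov("CG", A, P, b, x, 1.0e-6, desc, &iters, &err);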

MUMPS

MUltifrontal Massively Parallel sparse direct Solver (MUMPS) is a parallel solver for sparse systems of linear equations. Based on the multifrontal method, MUMPS can solve square symmetric positive definite, symmetric indefinite and unsymmetric problems, using the LDL^T or the LU factorization of the system matrix. MUMPS provides a very wide range of features. Among these, the most significant is certainly the ability to use numerical pivoting in a parallel distributed-memory setting, which makes the solution method stable in practice and thus reliable on a very wide set of problems. Other widely used features include computation of the Schur complement of a matrix, out-of-core execution, computation of selected entries of the inverse of a matrix, pre- and post-processing facilities for the system and, more recently, low-rank approximations. MUMPS uses MPI and OpenMP to achieve parallelism on distributed- and shared-memory parallel systems.

It is available at http://mumps-solver.org/
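
As a concrete illustration of the call sequence (initialize, analyse, factorize, solve, finalize), here is a minimal sketch in C modelled on the c_example program distributed with MUMPS; a few details are version-dependent (e.g. the nnz field replaced the older nz in MUMPS 5).

    /* Minimal MUMPS driver in C, modelled on the c_example shipped with
     * MUMPS 5.x; solves the 2x2 diagonal system diag(2,3) x = (4,9),
     * so the expected solution is (2,3). Error checking omitted. */
    #include <stdio.h>
    #include <mpi.h>
    #include "dmumps_c.h"
    #define JOB_INIT  -1
    #define JOB_END   -2
    #define USE_COMM_WORLD -987654

    int main(int argc, char **argv) {
        DMUMPS_STRUC_C id;
        MUMPS_INT n = 2;
        MUMPS_INT irn[] = {1, 2}, jcn[] = {1, 2};   /* coordinate format, 1-based */
        double a[]   = {2.0, 3.0};
        double rhs[] = {4.0, 9.0};                  /* overwritten by the solution */
        int myid;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        /* initialize an unsymmetric (sym=0) instance; host participates (par=1) */
        id.comm_fortran = USE_COMM_WORLD;
        id.par = 1; id.sym = 0;
        id.job = JOB_INIT;
        dmumps_c(&id);

        if (myid == 0) {            /* matrix and RHS are defined on the host */
            id.n = n; id.nnz = 2;   /* nnz was called nz before MUMPS 5       */
            id.irn = irn; id.jcn = jcn;
            id.a = a; id.rhs = rhs;
        }

        id.job = 6;                 /* 6 = analysis + factorization + solve */
        dmumps_c(&id);

        id.job = JOB_END;           /* release internal data */
        dmumps_c(&id);

        if (myid == 0) printf("solution: %f %f\n", rhs[0], rhs[1]);
        MPI_Finalize();
        return 0;
    }

Solver options (ordering, out-of-core execution, low-rank approximation, etc.) are selected through the id.icntl and id.cntl control arrays before the analysis and factorization phases.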

AGMG

AGMG (AGgregation-based algebraic MultiGrid) solves systems of linear equations with an aggregation-based algebraic multigrid method. It is expected to be efficient for large systems arising from the discretization of scalar second-order elliptic PDEs. The method is, however, purely algebraic and may be tested on any problem.

AGMG has been designed to be easy to use by non-experts (in a black-box fashion). It is available both as a software library for Fortran or C/C++ programs, and as an Octave/MATLAB function.

For more information and download, see http://www.agmg.eu/
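
Because AGMG is a black box, a solve is essentially a single call. The sketch below invokes the double-precision Fortran routine dagmg from C; the argument list (CSR arrays with 1-based indices, ijob=0 for a combined setup/solve/cleanup, nrest=1 to select flexible CG for SPD problems) follows the AGMG documentation, but the trailing-underscore binding shown is compiler-dependent and should be treated as an assumption.

    /* Hedged sketch: solve A x = f with AGMG from C. The dagmg_ name mangling
     * is compiler-dependent (gfortran-style shown); link against AGMG and a
     * BLAS/LAPACK implementation. */
    extern void dagmg_(int *n, double *a, int *ja, int *ia, double *f,
                       double *x, int *ijob, int *iprint, int *nrest,
                       int *iter, double *tol);

    int main(void) {
        /* 1D Laplacian of order 3 in CSR format, 1-based indices as in Fortran */
        int    n    = 3;
        int    ia[] = {1, 3, 6, 8};             /* row pointers     */
        int    ja[] = {1, 2, 1, 2, 3, 2, 3};    /* column indices   */
        double a[]  = {2,-1,-1, 2,-1,-1, 2};    /* matrix values    */
        double f[]  = {1, 0, 1};                /* right-hand side  */
        double x[3];                            /* computed solution */
        int ijob   = 0;      /* 0: setup + solve + memory release in one call   */
        int iprint = 6;      /* Fortran unit number for convergence output      */
        int nrest  = 1;      /* 1: flexible CG (recommended for SPD systems)    */
        int iter   = 100;    /* in: max iterations; out: iterations performed   */
        double tol = 1e-8;   /* relative residual tolerance                     */

        dagmg_(&n, a, ja, ia, f, x, &ijob, &iprint, &nrest, &iter, &tol);
        return 0;
    }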

Chameleon

Keywords: Runtime system – Task-based algorithm – Dense linear algebra – HPC – Task scheduling

Scientific Description: Chameleon is part of the MORSE (Matrices Over Runtime Systems @ Exascale) project. The overall objective is to develop robust linear algebra libraries relying on innovative runtime systems that can fully benefit from the potential of future large-scale complex machines.

We advance in three directions, based first on strong and close interactions between the runtime and numerical linear algebra communities. This initial activity will then naturally expand to more focused, but still joint, research in both fields.

Functional Description: Chameleon is a dense linear algebra software package relying on sequential task-based algorithms, where sub-tasks of the overall algorithm are submitted to a runtime system. A runtime system such as StarPU is able to automatically manage data transfers between memory areas that are not shared (CPUs-GPUs, distributed nodes). This implementation paradigm makes it possible to design high-performing linear algebra algorithms for very different types of architectures: laptops, many-core nodes, CPU-GPU systems, multiple nodes. For example, Chameleon is able to perform a double-precision Cholesky factorization at 80 TFlop/s on a dense matrix of order 400,000 (i.e., in about 4 minutes).
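
The sketch below shows what such a Cholesky factorization looks like from the user's side: a LAPACK-like call, with the task generation and scheduling hidden behind the runtime. Naming has changed across releases (older versions used the MORSE_ prefix instead of CHAMELEON_), so the exact identifiers are assumptions to check against the installed headers.

    /* Hedged sketch of a LAPACK-like Chameleon call; exact identifiers are
     * version-dependent (older releases used MORSE_* instead of CHAMELEON_*). */
    #include <chameleon.h>
    #include <stdlib.h>

    int main(void) {
        int n = 1000, ncpu = 4, ngpu = 0;
        double *A = malloc((size_t)n * n * sizeof(double));

        /* build a symmetric positive definite matrix (diagonally dominant) */
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                A[(size_t)j * n + i] = (i == j) ? n : 1.0;

        CHAMELEON_Init(ncpu, ngpu);            /* start the runtime workers        */
        /* the tile algorithm generates tasks that the runtime (e.g. StarPU)
           schedules on CPUs/GPUs, moving data between memory areas as needed */
        CHAMELEON_dpotrf(ChamLower, n, A, n);  /* Cholesky factorization A = L L^T */
        CHAMELEON_Finalize();

        free(A);
        return 0;
    }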

Release Functional Description: Chameleon includes the following features:

– BLAS 3, LAPACK one-sided and LAPACK norm tile algorithms
– Support for the QUARK and StarPU runtime systems, and for PaRSEC since 2018
– Exploitation of homogeneous and heterogeneous platforms through the use of BLAS/LAPACK CPU kernels and cuBLAS/MAGMA CUDA kernels
– Exploitation of clusters of interconnected nodes with distributed memory (using OpenMPI)

MAPHYS

Massively Parallel Hybrid Solver

Keyword: Parallel hybrid direct/iterative solution of large linear systems

Functional Description: MaPHyS is a software package that implements a parallel linear solver coupling direct and iterative approaches. The underlying idea is to apply to general unstructured linear systems the domain decomposition techniques developed for the solution of linear systems arising from PDEs. The interface problem, associated with the so-called Schur complement system, is solved using a block preconditioner with overlap between the blocks, referred to as Algebraic Additive Schwarz. A fully algebraic coarse space, which ensures the numerical scalability of the preconditioner, is available for symmetric positive definite problems.
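
Concretely, partitioning the unknowns into interior variables (subscript I) and interface variables (subscript Γ) gives, in LaTeX notation, the block system and its Schur complement:

    \begin{pmatrix} A_{II} & A_{I\Gamma} \\ A_{\Gamma I} & A_{\Gamma\Gamma} \end{pmatrix}
    \begin{pmatrix} x_I \\ x_\Gamma \end{pmatrix}
    =
    \begin{pmatrix} b_I \\ b_\Gamma \end{pmatrix},
    \qquad
    S = A_{\Gamma\Gamma} - A_{\Gamma I} A_{II}^{-1} A_{I\Gamma},
    \qquad
    S \, x_\Gamma = b_\Gamma - A_{\Gamma I} A_{II}^{-1} b_I .

The direct part of the solver eliminates the interior unknowns x_I (in parallel, one subdomain per block), while the iterative part solves the Schur complement system for x_Γ with the Algebraic Additive Schwarz preconditioner described above.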

The parallel implementation is based on MPI+threads. MaPHyS relies on state-of-the-art sparse and dense direct solvers.

MaPHyS is essentially a preconditioner that can be used to speed up the convergence of any Krylov subspace method; it is coupled with the methods implemented in the Fabulous package.

PaStiX

Parallel Sparse matriX package

Keywords: Linear algebra – High-performance calculation – Factorisation – Sparse Matrices – Linear Systems Solver

Scientific Description: PaStiX is based on efficient static scheduling and memory management, in order to solve 3D problems with more than 50 million unknowns. The mapping and scheduling algorithm handles a combination of 1D and 2D block distributions. Dynamic scheduling can also be applied to deal with NUMA architectures, while taking into account very precisely the computational costs of the BLAS 3 primitives, the communication costs and the cost of local aggregations.

Functional Description: PaStiX is a scientific library that provides a high-performance parallel solver for very large sparse linear systems, based on block direct and block ILU(k) methods. It can apply low-rank compression techniques to reduce the computational and memory complexity. Numerical algorithms are implemented in single or double precision (real or complex) for LL^T, LDL^T and LU factorization, with static pivoting for non-symmetric matrices having a symmetric pattern. The PaStiX library uses the graph partitioning and sparse matrix block ordering packages Scotch or Metis.

The PaStiX solver is suitable for any heterogeneous parallel/distributed architecture whose performance is predictable, such as clusters of multicore nodes with GPU accelerators or KNL processors. In particular, we provide a high-performance version with a low memory overhead for multicore node architectures, which fully exploits the advantages of shared memory by using a hybrid MPI-thread implementation.

The solver also provides some low-rank compression methods to reduce the memory footprint and/or the time-to-solution.
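
The task-based interface of PaStiX 6 splits a solve into explicit phases (analysis, numerical factorization, solve, refinement). The sketch below follows the function names used in the examples distributed with PaStiX 6.x; exact signatures vary between releases and should be checked against the installed headers, so treat the details as assumptions.

    /* Hedged sketch of the PaStiX 6 task interface; names follow the PaStiX 6.x
     * examples, but signatures vary across releases. Error checking omitted. */
    #include <mpi.h>
    #include <pastix.h>
    #include <spm.h>

    int main(int argc, char **argv) {
        pastix_data_t *pastix_data = NULL;
        pastix_int_t   iparm[IPARM_SIZE];   /* integer control parameters        */
        double         dparm[DPARM_SIZE];   /* floating-point control parameters */
        spmatrix_t     spm;

        MPI_Init(&argc, &argv);
        pastixInitParam(iparm, dparm);      /* load default parameters */
        pastixInit(&pastix_data, MPI_COMM_WORLD, iparm, dparm);

        /* read a sparse matrix, e.g. in MatrixMarket format */
        spmReadDriver(SpmDriverMM, "matrix.mtx", &spm);

        pastix_task_analyze(pastix_data, &spm);  /* ordering (Scotch/Metis) + symbolic factorization */
        pastix_task_numfact(pastix_data, &spm);  /* numerical factorization (LU / LL^T / LDL^T)      */

        /* right-hand side setup and the pastix_task_solve / pastix_task_refine
           calls are omitted here: their argument lists differ across releases */

        pastixFinalize(&pastix_data);
        MPI_Finalize();
        return 0;
    }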

Work in progress

All EoCoE-I and EoCoE-II publications are available on openAIRE.