Four different I/O & Data Flow releated technical challenges are targeted within EoCoE-II:
- Improvement of I/O accessibility: Different I/O libraries support a variety of different configuration options. Depending on the situation these options must be continually updated, or a complete new library must be adapted. We want to introduce a generic interface with the help of the Portable Data Interface (PDI), which decouples the I/O API and the application to allow easier switching between different I/O subsystems. These interface should either serve standard I/O operation but should also be useful in context of ensemble or in-situ visualisation data movement.
- I/O performance: The data writing and reading time can consume a significant part of the overall application runtime and should be minimized. For this we want to leverage the optimization options of different I/O libraries in use as well by adapting intermediate storage elements such as flash storage devices.
- Resiliency: Running an application on a large scale increases the chance of hard- or software problems if more and more computing elements are involved in the calculation. Additional I/O techniques can be used to reduce the effort needed to restart a broken run or even avoid an overall crash, by storing intermediate snapshots to the storage elements. In particular we want to focus on resiliency for ensemble calculations.
- Data size reduction: Running an application on a larger scale often implies an increasing data size, which can become unmanageable and consume too much resources. Within this task we want to reduce the overall data size without losing necessary information via in-situ and in-transit processing, moving post-processing elements directly into the frame of the running application.
PDI (Portable Data Interface) supports loose coupling of simulation codes with libraries:
- the simulation code is annotated in a library-agnostic way,
- libraries are used from the specification tree.
This approach works well for a number of concerns including: parameters reading, data initialization, post-processing, result storage to disk, visualization, fault tolerance, logging, inclusion as part of code-coupling, inclusion as part of an ensemble run, etc.
Within EoCoE II PDI is used as the main data exchange interface either for classical I/O, visualisation or ensemble data handling.
See more: https://pdi.julien-bigot.fr/master
SIONlib is a library for writing and reading data from several thousands of parallel tasks into/from one or a small number of physical files. Only the open and close functions are collective while file access can be performed independently.
SIONlib can be used as a replacement for standard I/O APIs (e.g. POSIX) that are used to access distinct files from every parallel process. SIONlib will bundle the data into one or few files in a coordinated fashion in order to sidestep sequentialising mechanism in the file system. At the same time, the task-per-file picture is maintained for the application, every process has access to its logical file only.
See more: http://www.fz-juelich.de/jsc/sionlib
FTI stands for Fault Tolerance Interface and is a library that aims to give computational scientists the means to perform fast and efficient multilevel checkpointing in large scale supercomputers. FTI leverages local storage plus data replication and erasure codes to provide several levels of reliability and performance. FTI is application-level checkpointing and allows users to select which datasets needs to be protected, in order to improve efficiency and avoid wasting space, time and energy. In addition, it offers a direct data interface so that users do not need to deal with files and/or directory names. All metadata is managed by FTI in a transparent fashion for the user. If desired, users can dedicate one process per node to overlap fault tolerance workload and scientific computation, so that post-checkpoint tasks are executed asynchronously.
See more: https://github.com/leobago/fti
The Infinite Memory Engine (IME), as designed by partner DDN, one of the main storage technologies used successfully in HPC setups today. IME is a layered hybrid approach, where the capacity file system is extended by a layer of low-latency devices to cope with response time requirements. DDN IME originates from HPC applications and architectures, where
I/O has been carefully co- designed over many years, yielding to an eco-system where the hardware is enhanced with complex middleware to meet application requirements.
Applications have two possible data path to access to the IME client, either the Fuse interface (expose as a traditional mount point) or directly using the IME Native interface. The former offering compliance and ease of access and the later the extreme level of performance. Once of the technical milestone of EoCoE II is to interface the SIONlib middleware directly with the IME Native Interface.
Work in progress
All the EoCoE-I and EoCoE-II publications are available here (openAIRE).