
In-situ data processing with DASK: Public Webinar

27 March 2025, 10 am

Webinar organized by CASTIEL2

Abstract: In situ data processing refers to the online processing of data produced by large-scale parallel simulations, reducing the amount of data saved to disk and avoiding the I/O bottleneck that affects large supercomputers. Various solutions based on the classical MPI model exist and can be very efficient. However, today's data analytics landscape is dominated by Python packages that offer highly productive environments familiar to many users (NumPy, Pandas, scikit-learn, PyTorch, JAX). While these packages often support only limited parallelization, frameworks like Dask (https://www.dask.org) and Ray (https://ray.io) provide seamless parallelization for processing large datasets through distributed task-based programming. The task programming paradigm reduces parallelization effort: users express potential parallelism as a graph of functions, where functions without data dependencies can be executed in parallel. At execution time, a scheduler assigns tasks to available cluster nodes, taking dependencies and data locality into account. In this talk we will show how MPI-based solvers can be combined with Dask-based in situ data processing. We will present work done at INRIA and Maison de la Simulation on leveraging Dask to enable in situ processing of data produced by MPI codes. We will discuss several technical points, including data and control transfer between MPI and Dask, efficient data selection to minimize unnecessary transfers, a metadata system for describing data structure and encoding, adaptation between simulation and analytics data structures, and performance at scale. This work is part of the EoCoE-III project, which will deploy in situ analytics at large scale for the Gysela and ParFlow solvers.
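To illustrate the task programming paradigm the abstract describes, here is a minimal sketch using Dask's `delayed` interface. The `load`, `analyze`, and `combine` functions are hypothetical stand-ins, not code from the presented work: each decorated call records a node in a task graph rather than executing immediately, and the scheduler is free to run the independent branches in parallel.

```python
import dask


# Each @dask.delayed call builds a node in a task graph instead of
# running immediately; execution is deferred until .compute().
@dask.delayed
def load(i):
    # Hypothetical stand-in for reading one chunk of simulation output.
    return list(range(i * 4, i * 4 + 4))


@dask.delayed
def analyze(chunk):
    # Hypothetical per-chunk reduction.
    return sum(chunk)


@dask.delayed
def combine(parts):
    # Final aggregation; depends on all analyze tasks.
    return sum(parts)


# The four load->analyze branches share no data dependencies, so the
# scheduler may execute them in parallel; combine waits on all of them.
total = combine([analyze(load(i)) for i in range(4)])
print(total.compute())  # → 120
```

With a distributed scheduler, the same graph is executed across cluster nodes, with placement decisions that account for data locality, which is the mechanism the talk builds on for in situ analytics.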

Registration link