BSC works on proactive solutions for timing and reliability in HPC systems in the RECIPE project

30 June 2020

The HPC market is developing quickly, driven by new application domains (e.g. computationally intensive data analytics), new applications of massively parallel computation, and the increased ability of new customers to enter the market. This growth makes HPC systems more complex to manage. BSC researchers are working to respond to this challenge in the European project RECIPE, which aims to provide solutions for managing that complexity and making the systems reliable.

RECIPE’s multilevel resource manager with novel reliability models for better workload allocation and optimization of computing resources

Barcelona Supercomputing Center (BSC) has developed a solution to predict the highest execution time of HPC applications on supercomputers and data centers, with either homogeneous or heterogeneous architectures, leveraging its experience in time predictability for critical real-time embedded systems. This solution has crystallized into a flexible and portable tool, presented in an article in the Special Issue on Supercomputing and Mathematics of the prestigious open-access journal Mathematics, published by MDPI.

Process to predict high execution time distribution for HPC applications

BSC has also developed a framework to predict the aging (and thus the reliability) of heterogeneous HPC platforms based on their physical characteristics and their utilization. The framework is conceptually applicable to any computing or storage element, such as CPUs, GPUs, FPGAs and any kind of memory, and has been implemented for specific high-performance CPUs and FPGAs with promising results.

“Proactive rather than reactive solutions for timing and reliability in HPC systems are the key towards effectively managing their resources during their whole lifetime”, said Ramon Canal, BSC Technical Leader for RECIPE, Associate Researcher of the Computer Architecture - Operating Systems (CAOS) Department, and Assistant Professor at the UPC.

BSC foresees integrating both technologies, timing prediction and reliability prediction, in a runtime manager that optimizes the execution of HPC applications for different parameters (such as timing, reliability and temperature) on heterogeneous HPC platforms including CPUs, GPUs and FPGAs. Moreover, the BSC technologies will be widely assessed against end-user applications, extending the already promising results obtained on excerpts of HPC applications.

Article: On the Use of Probabilistic Worst-Case Execution Time Estimation for Parallel Applications in High Performance Systems

DOI: https://doi.org/10.3390/math8030314

Link: https://www.mdpi.com/2227-7390/8/3/314

 

About RECIPE

RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a European-funded project with a budget of €3.2 million, which started on 1 May 2018 and ends on 30 April 2021. Coordinated by Politecnico di Milano (Italy), the project brings together a multidisciplinary consortium composed of Universitat Politècnica de València (Spain), Centro Regionale Information Communication Technology (Italy), Barcelona Supercomputing Center (Spain), Poznań Supercomputing and Networking Center (Poland), École polytechnique fédérale de Lausanne (Switzerland), IBT Solutions (Italy) and Centre Hospitalier Universitaire Vaudois (Switzerland).

Further information can be found on the project’s website: http://www.recipe-project.eu/

 

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 801137.