SORS: Increasing Productivity with MPI Malleability

Date: 04/Apr/2023 Time: 16:00

Place: BSC Repsol Building Auditorium and Zoom.

Objectives

Abstract: Scientific applications run on supercomputers where thousands of nodes are shared among users. Once those applications start, their resources remain allocated until the job ends. We have identified two potential approaches to resource management that increase global throughput and make better use of the underlying resources.

The Dynamic Management of Resources (DMR) framework is designed to make malleable applications easier to program by automating resource reallocation, process handling, and data distribution. DMR is based on the Message Passing Interface (MPI) programming model, the de facto standard for developing distributed HPC applications. DMR adjusts the number of processes of each job according to the cluster status, namely resource availability and the number of pending jobs.
Performance analyses have reported a 4x makespan reduction, when combined with moldability, compared to traditional rigid workloads. DMR has also been used in GPU-capable workloads, improving their energy efficiency by up to 2.5x.
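
As a rough illustration of the idea of resizing a job according to resource availability and pending jobs, the following is a minimal sketch of a reconfiguration policy. It is not the DMR API: the function name `decide_resize`, its parameters, the factor-of-two resizing, and the one-process-per-node assumption are all hypothetical and chosen only for this example.

```python
def decide_resize(current_procs, min_procs, max_procs,
                  idle_nodes, pending_jobs, factor=2):
    """Return the process count a malleable job should run with next.

    Hypothetical policy (not taken from DMR):
      - shrink by `factor` when jobs are waiting in the queue,
      - expand by `factor` when the queue is empty and enough nodes are idle,
      - otherwise keep the current size.
    Assumes one process per node.
    """
    shrunk = current_procs // factor
    expanded = current_procs * factor

    # Jobs are pending: release resources if the job may still shrink.
    if pending_jobs > 0 and shrunk >= min_procs:
        return shrunk

    # Queue is empty: grow only if the extra nodes are actually idle.
    extra_needed = expanded - current_procs
    if pending_jobs == 0 and idle_nodes >= extra_needed and expanded <= max_procs:
        return expanded

    return current_procs


if __name__ == "__main__":
    # A job on 8 processes, allowed to run on 2 to 32 processes.
    print(decide_resize(8, 2, 32, idle_nodes=0, pending_jobs=3))
    print(decide_resize(8, 2, 32, idle_nodes=8, pending_jobs=0))
```

In a real malleable application, a decision like this would be followed by spawning or terminating MPI processes and redistributing the application data among them.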

The relevance of the DMR malleability solution is such that it has been incorporated into the European projects: “The European Pilot” EuroHPC-JU, DEEP-SEA, and TimeX.

Short bio: Sergio Iserte holds a BS in Computer Engineering (2011), an MS in Intelligent Systems (2014), and a Ph.D. in Computer Science (2018) from Universitat Jaume I (UJI), Spain. He is a senior researcher in the Computer Science Department at Barcelona Supercomputing Center (BSC) and course instructor of the HPC subject at Universitat Oberta de Catalunya (UOC). He is currently involved in HPC projects related to parallel distributed computing, resource management, workload modeling, deep learning for industrial applications, and in-network accelerators.

Speakers

Speaker: Sergio Iserte, Accelerators and Communications for HPC Research Group, CS
Host: Xavier Martorell, Parallel Programming Models Group Manager, CS