Memory systems for HPC

Next-generation HPC memory systems face many challenges: memory has to deliver higher performance within a limited power budget while approaching its reliability limits. To address these challenges, we work at the confluence of advanced memory technologies and HPC memory requirements.

Summary

In state-of-the-art high-performance computing (HPC), the design of the memory system is one of the most important decisions, and it significantly affects system performance. Memory systems have also become important contributors to the overall power requirements, energy consumption, and operational cost of large HPC systems. In addition, DRAM technology scaling and growing main memory capacity increase the probability of DRAM errors, which have already become a common source of system failures in the field.

The Barcelona Supercomputing Center has recognized the importance of research on memory systems for HPC. The goal of this research is to understand and overcome the challenges in the design of next-generation memory systems for large-scale HPC clusters. Currently, we are exploring the following research areas:

  • Detection and analysis of DRAM field errors

    We detect and analyze DRAM errors in production HPC workloads running on the MareNostrum supercomputer, one of six Tier-0 systems in Europe. Our analysis provides a better understanding of DRAM errors in the field, and leads to re-optimization of the DRAM design and test process and to higher stability of large-scale HPC systems.

  • Analysis of memory system requirements of HPC applications

    We characterize the memory requirements of applications running on high-end HPC clusters. We also estimate the requirements of next-generation HPC systems in terms of memory capacity, bandwidth, and latency. The findings of this analysis are used to enhance the memory system design of next-generation HPC platforms.

  • Simulation and evaluation of novel memory systems

    We simulate next-generation HPC memory systems based on non-volatile STT-MRAM technology and on DRAM with novel interfaces and packaging. We quantify the improvements of these systems over the conventional DRAM DIMM-based memory organization. Moreover, we analyze which parts of the memory architecture and organization should be enhanced to properly exploit novel memory solutions.

Objectives

  • Building more reliable HPC memory systems and reducing the impact of memory errors on system stability.
  • Understanding the memory requirements of HPC applications in terms of capacity, bandwidth, and latency, and designing memory systems that fit these requirements.
  • Evaluating novel memory systems.