MareNostrum 5 NG-GRACE

System overview

MareNostrum 5 NG-GRACE is a high-performance computing (HPC) cluster that consists of 408 nodes, each powered by NVIDIA's Grace CPU Superchip. This high-efficiency, ARM-based processor is specifically designed to meet the demanding memory and processing needs of HPC and AI tasks. The system, developed by GIGABYTE, is housed in H263-V60-AAW1 servers, which have a 2U form factor. Each server chassis contains four air-cooled, half-width nodes that are optimized for scientific and data-intensive applications.

This setup provides a total of 58,752 processor cores and approximately 97.9TB of main memory (408 nodes × 240GB), as detailed in the table below:

Node type	Node count	Cores per node	Main memory per node
Grace - General Purpose (GP)	408	144	240GB
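
As a quick sanity check on the table, the aggregate core count and the average memory per core follow directly from the per-node values. The sketch below is illustrative only: the `NodeType` class and its field names are hypothetical and are not part of any BSC tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """One row of the node table above (hypothetical helper type)."""
    name: str
    node_count: int
    cores_per_node: int
    mem_gb_per_node: int


# Values copied from the table above.
grace_gp = NodeType("Grace - General Purpose (GP)", 408, 144, 240)

# Aggregate core count across all nodes of this type.
total_cores = grace_gp.node_count * grace_gp.cores_per_node

# Average memory per core, useful when sizing per-task memory requests.
mem_per_core_gb = grace_gp.mem_gb_per_node / grace_gp.cores_per_node

print(f"total cores: {total_cores}")
print(f"memory per core: {mem_per_core_gb:.2f} GB")
```

The per-core figure is only an average; a job that uses fewer cores than a full node may still be able to address more memory per core, depending on how the scheduler is configured.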

Each MareNostrum 5 NG-GRACE node has the following specifications:

2x NVIDIA Grace 72C 3.1GHz
2x Die 120GB 4266MHz LPDDR5
800GB NVMe local storage
NVIDIA/Mellanox ConnectX-7 NDR200 InfiniBand (200Gb/s total bandwidth per node, OSFP, PCIe 5.0 x16)
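
Network rates are quoted in gigabits per second, while memory and storage sizes are in gigabytes, so a unit conversion is needed before comparing them. The following sketch converts the 200Gb/s line rate to GB/s and derives a lower bound on the time needed to move one node's full memory over the link. It assumes the theoretical line rate and ignores protocol overhead, so real transfers will be slower; the constant and function names are made up for illustration.

```python
LINK_GBIT_PER_S = 200  # NDR200 line rate per node, from the list above
NODE_MEM_GB = 240      # main memory per node, from the table above


def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a rate in gigabits per second to gigabytes per second."""
    return gbit_per_s / 8


# Theoretical peak rate in GB/s (no protocol overhead considered).
peak_gbyte_per_s = gbit_to_gbyte_per_s(LINK_GBIT_PER_S)

# Lower bound on the time to send the whole node memory over the link.
min_seconds = NODE_MEM_GB / peak_gbyte_per_s

print(f"peak: {peak_gbyte_per_s} GB/s, full-memory transfer: >= {min_seconds} s")
```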

Topology of a compute node

This is a high-level analysis of the lstopo command output from a compute node with two NVIDIA Grace processors.

Figure: MareNostrum 5 Grace-GPP nodes topology
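
To reproduce a text version of this topology on a node, `lstopo` from the hwloc package can be called from a script. The helper below is a minimal sketch: the function name `text_topology` is hypothetical, and it assumes that hwloc is installed and on the `PATH` (it returns `None` otherwise).

```python
import shutil
import subprocess
from typing import Optional


def text_topology() -> Optional[str]:
    """Return lstopo's text rendering of this machine, or None if unavailable."""
    # Prefer the variant built without graphical back ends, if present.
    exe = shutil.which("lstopo-no-graphics") or shutil.which("lstopo")
    if exe is None:
        return None
    # "--of console" forces plain-text output; "--no-io" hides I/O devices
    # so that only packages, NUMA nodes, caches, cores and PUs are listed.
    result = subprocess.run(
        [exe, "--of", "console", "--no-io"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    print(text_topology() or "lstopo not available on this machine")
```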

Node-Level Overview

Memory Configuration: Each node has 240GB of main memory in total, split evenly across two NUMA nodes.
Processor Packages: The node comprises two sockets (Socket L#0 and Socket L#1). Each socket is associated with its own NUMA node and L3 cache.

NUMA Nodes and Memory Distribution

NUMA Node L#0: 120GB of memory.
NUMA Node L#1: 120GB of memory.
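
On Linux, the mapping between NUMA nodes and CPUs can also be read from sysfs, which is handy when a script needs to decide where to place threads or memory. The sketch below parses the kernel's `cpulist` files; the helper names are hypothetical. On a node like the one described here, one would expect two entries of 72 CPUs each, although the exact CPU numbering is an assumption that should be checked on the machine.

```python
from pathlib import Path
from typing import Dict, List


def parse_cpulist(text: str) -> List[int]:
    """Expand a Linux cpulist string such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist (memory-only node) yields no CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_cpus(sysfs: Path = Path("/sys/devices/system/node")) -> Dict[int, List[int]]:
    """Map each NUMA node id to the CPU ids the kernel reports for it."""
    layout: Dict[int, List[int]] = {}
    for node_dir in sorted(sysfs.glob("node[0-9]*")):
        node_id = int(node_dir.name[len("node"):])
        layout[node_id] = parse_cpulist((node_dir / "cpulist").read_text())
    return layout


if __name__ == "__main__":
    for node_id, cpus in numa_cpus().items():
        print(f"NUMA node {node_id}: {len(cpus)} CPUs")
```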

Cache Hierarchy

Each socket has a similar cache architecture:

L3 Cache: 114MB per socket.
L2 Cache: 1MB (1024KB) per core.
L1 Cache: Each core has two types of L1 cache:
- L1d Cache (Data): 64KB
- L1i Cache (Instructions): 64KB
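
The per-socket and per-core cache sizes above can be combined into node-level totals, which gives a rough idea of how large a working set can stay in cache. This is plain arithmetic on the figures listed above, assuming 72 cores per socket and two sockets per node; the variable names are illustrative.

```python
SOCKETS = 2
CORES_PER_SOCKET = 72
L3_MB_PER_SOCKET = 114
L2_KB_PER_CORE = 1024
L1D_KB_PER_CORE = 64
L1I_KB_PER_CORE = 64

cores = SOCKETS * CORES_PER_SOCKET

# Node-level totals, converted to MB where the source figure is in KB.
l3_total_mb = SOCKETS * L3_MB_PER_SOCKET
l2_total_mb = cores * L2_KB_PER_CORE / 1024
l1d_total_mb = cores * L1D_KB_PER_CORE / 1024
l1i_total_mb = cores * L1I_KB_PER_CORE / 1024

# Average share of the socket-wide L3 per core if all 72 cores are busy.
l3_share_per_core_mb = L3_MB_PER_SOCKET / CORES_PER_SOCKET

print(f"L3 total: {l3_total_mb} MB, L2 total: {l2_total_mb} MB")
print(f"L1d total: {l1d_total_mb} MB, L1i total: {l1i_total_mb} MB")
print(f"L3 share per core: {l3_share_per_core_mb:.2f} MB")
```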

Cores and Processing Units (PUs)

Each socket contains 72 cores, totaling 144 cores across the two sockets. The cores are structured as follows:

Each core has a dedicated L1 and L2 cache.
Each core exposes a single Processing Unit (PU), labelled PU L#<n> in the lstopo output.
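
Because each socket has its own NUMA node and L3 cache, it is usually preferable for a process to stay within one socket. The following sketch splits the 144 cores into equal contiguous blocks, one per process, and rejects layouts in which a block would cross the socket boundary. It assumes that cores 0-71 belong to the first socket and cores 72-143 to the second, which should be verified on the node; the function `core_blocks` is hypothetical and not a scheduler feature.

```python
from typing import List

CORES_PER_SOCKET = 72
SOCKETS = 2


def core_blocks(ranks_per_node: int) -> List[range]:
    """Split the node's cores into equal contiguous blocks, one per rank.

    Raises ValueError if the blocks would be uneven or would span two sockets.
    """
    total = CORES_PER_SOCKET * SOCKETS
    if ranks_per_node < 1 or total % ranks_per_node != 0:
        raise ValueError(f"ranks_per_node must be a positive divisor of {total}")
    width = total // ranks_per_node
    # A block stays inside one socket only if the block width divides 72.
    if width > CORES_PER_SOCKET or CORES_PER_SOCKET % width != 0:
        raise ValueError("a block would straddle the socket boundary")
    return [range(r * width, (r + 1) * width) for r in range(ranks_per_node)]


if __name__ == "__main__":
    for rank, block in enumerate(core_blocks(4)):
        print(f"rank {rank}: cores {block.start}-{block.stop - 1}")
```

On Linux, a block computed this way could be applied to the current process with `os.sched_setaffinity(0, block)`, although in practice the batch scheduler or MPI launcher normally handles binding.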

Summary

Each node pairs two NVIDIA Grace processors (144 cores in total) with 228MB of L3 cache and 240GB of LPDDR5 memory split evenly across two NUMA nodes. This makes the nodes well suited to highly parallel, multi-threaded workloads that require high memory bandwidth.
