On MareNostrum 4, the CPU frequency is limited to 2.1 GHz following a vendor recommendation. In some tests with this limit removed, we found that performance increases or decreases depending on the application, and in some cases we observed higher power consumption.
Therefore, we ran several tests measuring performance and power consumption on nodes with and without the frequency limit.
Configuration
The tests were run with two configurations:
- NL (No Limit): frequency unlimited. The frequency can reach up to 3.7 GHz depending on the instruction set and the number of cores used (https://en.wikichip.org/wiki/intel/xeon\_platinum/8160\#Frequencies).
- DEF (Default): frequency limited to 2.1 GHz.
In both configurations, TurboBoost is enabled and HyperThreading is disabled.
The tests were run on two different reservations, one with 64 nodes and another with 8 nodes.
The nodes used during the tests are:
- Nodes first reservation: s02r2b[25-48],s03r1b[49-72]
- Nodes second reservation: s22r1b[33-40]
Benchmarks
Different applications were tested on each reservation:
First reservation:
- Alya (Input: sphere UEABS test case B)
- CP2K
- NAMD
- WRF
- CPMD
- GROMACS
Second reservation:
- HPL
- IO
- PYTHON (Torch)
- GREASY (Python and R tasks)
- VASP
Results
The results are the average of multiple simulations run on nodes with the DEF and NL configurations. For each application, we specify how many simulations were run and how many nodes each simulation used.
For each result, we add a column comparing the DEF results with the NL results, obtained by dividing the DEF value by the NL value.
Finally, we add three rows that estimate the behaviour of the application during one hour using all 3456 nodes of MareNostrum 4. To calculate the number of simulations in one hour on the full machine, we divide 3600 seconds by the average time of one simulation, and multiply by the number of simulations that run concurrently, i.e. 3456 divided by the nodes used per simulation (which depends on the application):
Number of simulations = (3600 / Time) * (3456 / nodes-per-simulation)
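As a sketch, the full-machine extrapolation above can be written in a few lines of Python; the function name and the example numbers are ours, not from the report:

```python
def full_machine_metrics(time_s, energy_j, nodes_per_sim, total_nodes=3456):
    """Extrapolate one job's measurement to one hour on the full machine.

    time_s / energy_j: average runtime and energy of a single simulation.
    """
    # Jobs completed per hour times concurrent jobs that fit in the machine
    sims_per_hour = (3600.0 / time_s) * (total_nodes / nodes_per_sim)
    joules_per_hour = sims_per_hour * energy_j
    watts_hour = joules_per_hour / 3600.0  # 1 Wh = 3600 J
    return sims_per_hour, joules_per_hour, watts_hour

# Illustrative numbers: a 3600 s job on 64 nodes consuming 1 MJ.
# 54 concurrent simulations fit in 3456 nodes, each finishing once per hour.
sims, joules, wh = full_machine_metrics(3600.0, 1.0e6, 64)
```

The same formula, applied to the per-application averages, produces the three full-machine rows shown in each table below.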
Alya
100 simulations
64 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 451.00 | 450.00 | 1.00 |
| Energy (J) | 11759807.71 | 11787738.12 | 1.00 |
| Power (W) | 26118.22 | 26120.76 | 1.00 |
| Power per node (W) | 408.10 | 408.14 | 1.00 |
| Watts-h per job | 3266.61 | 3274.37 | 1.00 |
| Number of simulations in one hour (full-machine) | 430.92 | 432.00 | 1.00 |
| Joules in one hour (full-machine) | 5067536338.39 | 5092302867.84 | 1.00 |
| Watts-h (full-machine) | 1407648.98 | 1414528.57 | 1.00 |
Performance and energy consumption are very similar in both configurations.
CP2K
10 simulations
64 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 2236.96 | 2083.35 | 1.07 |
| Energy (J) | 46470584.50 | 42964343.75 | 1.08 |
| Power (W) | 20767.47 | 20612.57 | 1.01 |
| Power per node (W) | 324.49 | 322.07 | 1.01 |
| Watts-h per job | 12908.50 | 11934.54 | 1.08 |
| Number of simulations in one hour (full-machine) | 86.40 | 93.42 | 0.92 |
| Joules in one hour (full-machine) | 4015058500.80 | 4013728993.13 | 1.00 |
| Watts-h (full-machine) | 1115294.03 | 1114924.72 | 1.00 |
CP2K is faster and has slightly lower power consumption per node when the frequency is not limited to 2.1 GHz. Therefore, the total energy needed is lower without the limit.
NAMD
10 simulations
64 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 990.37 | 788.08 | 1.26 |
| Energy (J) | 17595866.70 | 15314703.16 | 1.15 |
| Power (W) | 17766.89 | 19432.69 | 0.91 |
| Power per node (W) | 277.60 | 303.63 | 0.91 |
| Watts-h per job | 4887.74 | 4254.08 | 1.15 |
| Number of simulations in one hour (full-machine) | 196.02 | 246.24 | 0.80 |
| Joules in one hour (full-machine) | 3449141790.53 | 3771092506.12 | 0.91 |
| Watts-h (full-machine) | 958094.94 | 1047525.70 | 0.91 |
With NAMD we observe higher power consumption when the frequency limit is removed, but since the simulation is faster without it, the energy needed for one simulation is lower than when running with the limit.
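This can be sanity-checked with the identity Energy ≈ Power × Time, which also makes explicit why NL needs less energy per job despite the higher power draw. The values below are copied from the NAMD table; small deviations from the tabulated energies come from rounding the averages:

```python
# NAMD averages from the table above
def_time, def_power = 990.37, 17766.89    # DEF: 2.1 GHz limit
nl_time, nl_power = 788.08, 19432.69      # NL: no frequency limit

# Energy = average power x runtime
def_energy = def_power * def_time
nl_energy = nl_power * nl_time

# NL draws ~9% more power but runs ~20% faster, so it uses less energy
assert nl_energy < def_energy
```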
WRF
40 simulations
64 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 218.36 | 216.65 | 1.01 |
| Energy (J) | 5584114.75 | 5177890.65 | 1.08 |
| Power (W) | 25572.49 | 23899.36 | 1.07 |
| Power per node (W) | 399.57 | 373.42 | 1.07 |
| Watts-h per job | 1551.14 | 1438.30 | 1.08 |
| Number of simulations in one hour (full-machine) | 890.27 | 897.30 | 0.99 |
| Joules in one hour (full-machine) | 4971386276.79 | 4646120204.75 | 1.07 |
| Watts-h (full-machine) | 1380940.63 | 1290588.95 | 1.07 |
WRF behaves similarly to CP2K: the simulation is faster, with lower power consumption and lower energy consumption.
CPMD
5 simulations
8 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 488.25 | 458.50 | 1.06 |
| Energy (J) | 1338703.00 | 1372779.25 | 0.98 |
| Power (W) | 2741.84 | 2994.07 | 0.92 |
| Power per node (W) | 342.73 | 374.26 | 0.92 |
| Watts-h per job | 371.86 | 381.33 | 0.98 |
| Number of simulations in one hour (full-machine) | 3185.25 | 3391.93 | 0.94 |
| Joules in one hour (full-machine) | 4264108357.60 | 4656371405.89 | 0.92 |
| Watts-h (full-machine) | 1184474.54 | 1293436.50 | 0.92 |
With CPMD we observe that power consumption is higher without the limit. The simulation is also faster, but not fast enough to achieve a lower energy consumption than with the 2.1 GHz limit.
GROMACS
14 simulations
64 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 260.75 | 232.09 | 1.12 |
| Energy (J) | 4912196.92 | 4721792.79 | 1.04 |
| Power (W) | 18838.94 | 20345.01 | 0.93 |
| Power per node (W) | 294.36 | 317.89 | 0.93 |
| Watts-h per job | 1364.50 | 1311.61 | 1.04 |
| Number of simulations in one hour (full-machine) | 745.55 | 837.62 | 0.89 |
| Joules in one hour (full-machine) | 3662289812.91 | 3955070609.79 | 0.93 |
| Watts-h (full-machine) | 1017302.73 | 1098630.72 | 0.93 |
GROMACS behaves similarly to NAMD: the application speeds up enough to compensate for the higher power consumption without the frequency limit, and therefore runs a single simulation with less energy.
HPL
10 simulations
4 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 25840.90 | 26471.80 | 0.98 |
| Energy (J) | 29052054.00 | 36467064.60 | 0.80 |
| Power (W) | 1124.27 | 1377.58 | 0.82 |
| Power per node (W) | 281.07 | 344.40 | 0.82 |
| Watts-h per job | 8070.02 | 10129.74 | 0.80 |
| Number of simulations in one hour (full-machine) | 120.37 | 117.50 | 1.02 |
| Joules in one hour (full-machine) | 3496918016.08 | 4284829808.77 | 0.82 |
| Watts-h (full-machine) | 971366.12 | 1190230.50 | 0.82 |
HPL is a clear case where the frequency limit is useful: without it, the benchmark is slower and draws more power, so we observe no benefit from removing the limit.
IO
25 simulations
1 node
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 57.64 | 61.04 | 0.94 |
| Energy (J) | 6112.40 | 6842.48 | 0.89 |
| Power (W) | 106.08 | 111.56 | 0.95 |
| Power per node (W) | 106.08 | 111.56 | 0.95 |
| Watts-h per job | 1.70 | 1.90 | 0.89 |
| Read (MB/s) | 136337.67 | 229237.61 | 0.59 |
| Write (MB/s) | 592.15 | 561.44 | 1.05 |
| Number of simulations in one hour (full-machine) | 215850.10 | 203827.00 | 1.06 |
| Joules in one hour (full-machine) | 1319362176.27 | 1394682161.99 | 0.95 |
| Watts-h (full-machine) | 366489.49 | 387411.71 | 0.95 |
The I/O benchmark is a special case: the code runs for about one minute performing I/O operations. We observe a small increase in power and energy consumption without the limit, but a large improvement in read bandwidth when the frequency limit is removed.
PYTHON (Torch)
25 simulations
1 node
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 1121.04 | 871.68 | 1.29 |
| Energy (J) | 337730.20 | 222738.16 | 1.52 |
| Power (W) | 301.27 | 255.53 | 1.18 |
| Power per node (W) | 301.27 | 255.53 | 1.18 |
| Watts-h per job | 93.81 | 61.87 | 1.52 |
| Number of simulations in one hour (full-machine) | 11098.27 | 14273.13 | 0.78 |
| Joules in one hour (full-machine) | 3748219560.69 | 3179170213.22 | 1.18 |
| Watts-h (full-machine) | 1041172.10 | 883102.84 | 1.18 |
The Python code running a PyTorch benchmark is similar to WRF: a clear case where removing the frequency limit is beneficial. Without the limit we observe faster simulations, lower power and lower energy.
GREASY (Python and R tasks)
25 simulations
8 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 874.20 | 727.04 | 1.20 |
| Energy (J) | 2168997.60 | 2084698.16 | 1.04 |
| Power (W) | 2481.07 | 2867.41 | 0.87 |
| Power per node (W) | 310.13 | 358.43 | 0.87 |
| Watts-h per job | 602.50 | 579.08 | 1.04 |
| Number of simulations in one hour (full-machine) | 1779.00 | 2139.08 | 0.83 |
| Joules in one hour (full-machine) | 3858642264.38 | 4459345535.92 | 0.87 |
| Watts-h (full-machine) | 1071845.07 | 1238707.09 | 0.87 |
In this benchmark, we run 400 GREASY tasks consisting of different R and Python codes. The 400 tasks finish faster without the frequency limit, but the power consumption is higher. Despite this, the total energy consumption is lower without the limit.
VASP
10 simulations
8 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 3124.11 | 3709.56 | 0.84 |
| Energy (J) | 8998321.22 | 11204534.89 | 0.80 |
| Power (W) | 2880.25 | 3020.48 | 0.95 |
| Power per node (W) | 360.03 | 377.56 | 0.95 |
| Watts-h per job | 2499.53 | 3112.37 | 0.80 |
| Number of simulations in one hour (full-machine) | 497.81 | 419.24 | 1.19 |
| Joules in one hour (full-machine) | 4479416269.38 | 4697401487.22 | 0.95 |
| Watts-h (full-machine) | 1244282.30 | 1304833.75 | 0.95 |
As with HPL, the VASP benchmark shows worse performance and higher energy and power consumption when the frequency limit is removed.
Conclusions
We can classify the results into four categories depending on the performance and energy consumption behaviour. We did not include the I/O benchmark in this classification, as it can be treated as a particular case.
First of all, we have Alya, where we observe almost the same results in terms of performance and energy with or without the frequency limit. As we have seen in other tests, for example with HyperThreading or when using fewer tasks per socket, Alya is a particular case.
Then, we observe applications such as NAMD, GROMACS and GREASY, where the power of each node is higher without the frequency limit, but the simulation time shortens enough that one simulation consumes less energy. CPMD shows the same higher power and shorter runtime, but its speed-up does not fully compensate, so its energy per simulation remains slightly higher than with the limit.
Like the previous applications, we observe better performance without the limit for applications such as CP2K, WRF and Python (Torch), but in these cases the power is also lower. These are therefore the cases where the energy savings are highest.
Finally, we observe worse performance on applications such as VASP and HPL when we remove the 2.1 GHz limit, and in consequence higher power and energy consumption. Both codes make heavy use of AVX-512 instructions.
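The categories above can be expressed as a small decision rule over the DEF and NL averages. The function and the 2% significance tolerance below are our own illustration, not part of the study:

```python
def classify(def_time, nl_time, def_power, nl_power, tol=0.02):
    """Classify an application by comparing DEF (2.1 GHz limit) vs NL runs."""
    faster = nl_time < def_time * (1 - tol)        # NL clearly faster
    slower = nl_time > def_time * (1 + tol)        # NL clearly slower
    more_power = nl_power > def_power * (1 + tol)  # NL clearly hungrier
    if slower:
        return "worse without the limit"   # e.g. HPL, VASP
    if not faster:
        return "no significant change"     # e.g. Alya
    if more_power:
        return "faster, but higher power"  # e.g. NAMD, GROMACS
    return "faster and lower power"        # e.g. CP2K

# Using the NAMD averages from the tables above:
print(classify(990.37, 788.08, 17766.89, 19432.69))  # faster, but higher power
```

Note that with this tolerance, borderline cases such as WRF (whose runtimes differ by less than 1%) would fall into the "no significant change" bucket even though its power and energy clearly drop.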