On MareNostrum 4, the CPU frequency is limited to 2.1 GHz following a vendor recommendation. In some tests with this limit removed, we found that performance increases or decreases depending on the application, and in some cases we observed higher power consumption.
Therefore, we ran several tests measuring performance and power consumption on nodes with and without the frequency limit.
Configuration
The tests were run with two configurations:
- NL (No Limit): frequency unlimited. The frequency can reach up to 3.7 GHz depending on the instruction set and the number of cores used (https://en.wikichip.org/wiki/intel/xeon\_platinum/8160\#Frequencies).
- DEF (Default): frequency limited to 2.1 GHz.
In both configurations, TurboBoost is enabled and HyperThreading is disabled.
The tests were run on two different reservations, one with 64 nodes and another with 8 nodes.
The nodes used during the tests are:
- Nodes first reservation: s02r2b[25-48],s03r1b[49-72]
- Nodes second reservation: s22r1b[33-40]
Benchmarks
Different applications were tested on each reservation:
First reservation:
- Alya (Input: sphere UEABS test case B)
- CP2K
- NAMD
- WRF
- CPMD
- GROMACS
Second reservation:
- HPL
- IO
- PYTHON (Torch)
- GREASY (Python and R tasks)
- VASP
Results
The results are the average of multiple simulations run on nodes with the DEF and NL configurations. For each application, we specify how many simulations were run and how many nodes each simulation used.
For each result, we add a column comparing the DEF results with the NL results, obtained by dividing the DEF value by the NL value.
Finally, we add three rows that estimate the behaviour of the application during one hour using all 3456 nodes of MareNostrum 4. To calculate the number of simulations in one hour on the full machine, we divide 3600 seconds by the average time of one simulation, and multiply by the number of simulations that run concurrently, i.e. 3456 divided by the nodes used per simulation (which depends on the application):
Number of simulations = (3600 / Time) * (3456 / nodes-per-simulation)
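As a sketch, the full-machine extrapolation above can be written in a few lines of Python; the function name and the example numbers are ours, not from the report:

```python
def full_machine_metrics(time_s, energy_j, nodes_per_sim, total_nodes=3456):
    """Extrapolate one job's measurement to one hour on the full machine.

    time_s / energy_j: average runtime and energy of a single simulation.
    """
    # Jobs completed per hour times concurrent jobs that fit in the machine
    sims_per_hour = (3600.0 / time_s) * (total_nodes / nodes_per_sim)
    joules_per_hour = sims_per_hour * energy_j
    watts_hour = joules_per_hour / 3600.0  # 1 Wh = 3600 J
    return sims_per_hour, joules_per_hour, watts_hour

# Illustrative numbers: a 3600 s job on 64 nodes consuming 1 MJ.
# 54 concurrent simulations fit in 3456 nodes, each finishing once per hour.
sims, joules, wh = full_machine_metrics(3600.0, 1.0e6, 64)
```

The same formula, applied to the per-application averages, produces the three full-machine rows shown in each table below.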
Alya
100 simulations
64 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 451.00 | 450.00 | 1.00 |
| Energy (J) | 11759807.71 | 11787738.12 | 1.00 |
| Power (W) | 26118.22 | 26120.76 | 1.00 |
| Power per node (W) | 408.10 | 408.14 | 1.00 |
| Watts-h per job | 3266.61 | 3274.37 | 1.00 |
| Number of simulations in one hour (full-machine) | 430.92 | 432.00 | 1.00 |
| Joules in one hour (full-machine) | 5067536338.39 | 5092302867.84 | 1.00 |
| Watts-h (full-machine) | 1407648.98 | 1414528.57 | 1.00 |
Performance and energy consumption are very similar in both configurations.
CP2K
10 simulations
64 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 2236.96 | 2083.35 | 1.07 |
| Energy (J) | 46470584.50 | 42964343.75 | 1.08 |
| Power (W) | 20767.47 | 20612.57 | 1.01 |
| Power per node (W) | 324.49 | 322.07 | 1.01 |
| Watts-h per job | 12908.50 | 11934.54 | 1.08 |
| Number of simulations in one hour (full-machine) | 86.40 | 93.42 | 0.92 |
| Joules in one hour (full-machine) | 4015058500.80 | 4013728993.13 | 1.00 |
| Watts-h (full-machine) | 1115294.03 | 1114924.72 | 1.00 |
CP2K is faster and has slightly lower power consumption per node when the frequency is not limited to 2.1 GHz. Therefore, the total energy needed is lower without the limit.
NAMD
10 simulations
64 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 990.37 | 788.08 | 1.26 |
| Energy (J) | 17595866.70 | 15314703.16 | 1.15 |
| Power (W) | 17766.89 | 19432.69 | 0.91 |
| Power per node (W) | 277.60 | 303.63 | 0.91 |
| Watts-h per job | 4887.74 | 4254.08 | 1.15 |
| Number of simulations in one hour (full-machine) | 196.02 | 246.24 | 0.80 |
| Joules in one hour (full-machine) | 3449141790.53 | 3771092506.12 | 0.91 |
| Watts-h (full-machine) | 958094.94 | 1047525.70 | 0.91 |
With NAMD we observe higher power consumption when the frequency limit is removed, but since the simulation is faster without it, the energy needed for one simulation is lower than when running with the limit.
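This can be sanity-checked with the identity Energy ≈ Power × Time, which also makes explicit why NL needs less energy per job despite the higher power draw. The values below are copied from the NAMD table; small deviations from the tabulated energies come from rounding the averages:

```python
# NAMD averages from the table above
def_time, def_power = 990.37, 17766.89    # DEF: 2.1 GHz limit
nl_time, nl_power = 788.08, 19432.69      # NL: no frequency limit

# Energy = average power x runtime
def_energy = def_power * def_time
nl_energy = nl_power * nl_time

# NL draws ~9% more power but runs ~20% faster, so it uses less energy
assert nl_energy < def_energy
```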
WRF
40 simulations
64 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 218.36 | 216.65 | 1.01 |
| Energy (J) | 5584114.75 | 5177890.65 | 1.08 |
| Power (W) | 25572.49 | 23899.36 | 1.07 |
| Power per node (W) | 399.57 | 373.42 | 1.07 |
| Watts-h per job | 1551.14 | 1438.30 | 1.08 |
| Number of simulations in one hour (full-machine) | 890.27 | 897.30 | 0.99 |
| Joules in one hour (full-machine) | 4971386276.79 | 4646120204.75 | 1.07 |
| Watts-h (full-machine) | 1380940.63 | 1290588.95 | 1.07 |
WRF behaves similarly to CP2K: the simulation is faster, with lower power consumption and lower energy consumption.
CPMD
5 simulations
8 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 488.25 | 458.50 | 1.06 |
| Energy (J) | 1338703.00 | 1372779.25 | 0.98 |
| Power (W) | 2741.84 | 2994.07 | 0.92 |
| Power per node (W) | 342.73 | 374.26 | 0.92 |
| Watts-h per job | 371.86 | 381.33 | 0.98 |
| Number of simulations in one hour (full-machine) | 3185.25 | 3391.93 | 0.94 |
| Joules in one hour (full-machine) | 4264108357.60 | 4656371405.89 | 0.92 |
| Watts-h (full-machine) | 1184474.54 | 1293436.50 | 0.92 |
With CPMD we observe that power consumption is higher without the limit. The simulation is also faster, but not fast enough to achieve a lower energy consumption than with the 2.1 GHz limit.
GROMACS
14 simulations
64 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 260.75 | 232.09 | 1.12 |
| Energy (J) | 4912196.92 | 4721792.79 | 1.04 |
| Power (W) | 18838.94 | 20345.01 | 0.93 |
| Power per node (W) | 294.36 | 317.89 | 0.93 |
| Watts-h per job | 1364.50 | 1311.61 | 1.04 |
| Number of simulations in one hour (full-machine) | 745.55 | 837.62 | 0.89 |
| Joules in one hour (full-machine) | 3662289812.91 | 3955070609.79 | 0.93 |
| Watts-h (full-machine) | 1017302.73 | 1098630.72 | 0.93 |
GROMACS behaves similarly to NAMD: the application speeds up enough to compensate for the higher power consumption without the frequency limit, and therefore runs a single simulation with less energy.
HPL
10 simulations
4 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 25840.90 | 26471.80 | 0.98 |
| Energy (J) | 29052054.00 | 36467064.60 | 0.80 |
| Power (W) | 1124.27 | 1377.58 | 0.82 |
| Power per node (W) | 281.07 | 344.40 | 0.82 |
| Watts-h per job | 8070.02 | 10129.74 | 0.80 |
| Number of simulations in one hour (full-machine) | 120.37 | 117.50 | 1.02 |
| Joules in one hour (full-machine) | 3496918016.08 | 4284829808.77 | 0.82 |
| Watts-h (full-machine) | 971366.12 | 1190230.50 | 0.82 |
HPL is a clear case where the frequency limit is useful: without it, the benchmark is slower and draws more power, so we observe no benefit from removing the limit.
IO
25 simulations
1 node
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 57.64 | 61.04 | 0.94 |
| Energy (J) | 6112.40 | 6842.48 | 0.89 |
| Power (W) | 106.08 | 111.56 | 0.95 |
| Power per node (W) | 106.08 | 111.56 | 0.95 |
| Watts-h per job | 1.70 | 1.90 | 0.89 |
| Read (MB/s) | 136337.67 | 229237.61 | 0.59 |
| Write (MB/s) | 592.15 | 561.44 | 1.05 |
| Number of simulations in one hour (full-machine) | 215850.10 | 203827.00 | 1.06 |
| Joules in one hour (full-machine) | 1319362176.27 | 1394682161.99 | 0.95 |
| Watts-h (full-machine) | 366489.49 | 387411.71 | 0.95 |
The I/O benchmark is a special case: the code runs for about one minute performing I/O operations. We observe a small increase in power and energy consumption without the limit, but a large improvement in read bandwidth when the frequency limit is removed.
PYTHON (Torch)
25 simulations
1 node
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 1121.04 | 871.68 | 1.29 |
| Energy (J) | 337730.20 | 222738.16 | 1.52 |
| Power (W) | 301.27 | 255.53 | 1.18 |
| Power per node (W) | 301.27 | 255.53 | 1.18 |
| Watts-h per job | 93.81 | 61.87 | 1.52 |
| Number of simulations in one hour (full-machine) | 11098.27 | 14273.13 | 0.78 |
| Joules in one hour (full-machine) | 3748219560.69 | 3179170213.22 | 1.18 |
| Watts-h (full-machine) | 1041172.10 | 883102.84 | 1.18 |
The Python code running a PyTorch benchmark is similar to WRF: a clear case where removing the frequency limit is beneficial. Without the limit we observe faster simulations, lower power and lower energy.
GREASY (Python and R tasks)
25 simulations
8 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 874.20 | 727.04 | 1.20 |
| Energy (J) | 2168997.60 | 2084698.16 | 1.04 |
| Power (W) | 2481.07 | 2867.41 | 0.87 |
| Power per node (W) | 310.13 | 358.43 | 0.87 |
| Watts-h per job | 602.50 | 579.08 | 1.04 |
| Number of simulations in one hour (full-machine) | 1779.00 | 2139.08 | 0.83 |
| Joules in one hour (full-machine) | 3858642264.38 | 4459345535.92 | 0.87 |
| Watts-h (full-machine) | 1071845.07 | 1238707.09 | 0.87 |
In this benchmark, we run 400 GREASY tasks consisting of different R and Python codes. The 400 tasks finish faster without the frequency limit, but the power consumption is higher. Despite this, the total energy consumption is lower without the limit.
VASP
10 simulations
8 nodes
| Metric | DEF | NL | DEF/NL |
|---|---|---|---|
| Time (s) | 3124.11 | 3709.56 | 0.84 |
| Energy (J) | 8998321.22 | 11204534.89 | 0.80 |
| Power (W) | 2880.25 | 3020.48 | 0.95 |
| Power per node (W) | 360.03 | 377.56 | 0.95 |
| Watts-h per job | 2499.53 | 3112.37 | 0.80 |
| Number of simulations in one hour (full-machine) | 497.81 | 419.24 | 1.19 |
| Joules in one hour (full-machine) | 4479416269.38 | 4697401487.22 | 0.95 |
| Watts-h (full-machine) | 1244282.30 | 1304833.75 | 0.95 |
As with HPL, the VASP benchmark shows worse performance and higher energy and power consumption when the frequency limit is removed.
Conclusions
We can classify the results into four categories depending on the performance and energy consumption behaviour. We did not include the I/O benchmark in this classification, as it can be treated as a particular case.
First of all, we have Alya, where we observe almost the same results in terms of performance and energy with or without the frequency limit. As we have seen in other tests, for example with HyperThreading or when using fewer tasks per socket, Alya is a particular case.
Then, we observe applications such as NAMD, GROMACS and GREASY, where the power of each node is higher without the frequency limit, but the simulation time shortens enough that one simulation consumes less energy. CPMD shows the same higher power and shorter runtime, but its speed-up does not fully compensate, so its energy per simulation remains slightly higher than with the limit.
Like the previous applications, we observe better performance without the limit for applications such as CP2K, WRF and Python (Torch), but in these cases the power is also lower. These are therefore the cases where the energy savings are highest.
Finally, we observe worse performance on applications such as VASP and HPL when we remove the 2.1 GHz limit, and in consequence higher power and energy consumption. Both codes make heavy use of AVX-512 instructions.
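The categories above can be expressed as a small decision rule over the DEF and NL averages. The function and the 2% significance tolerance below are our own illustration, not part of the study:

```python
def classify(def_time, nl_time, def_power, nl_power, tol=0.02):
    """Classify an application by comparing DEF (2.1 GHz limit) vs NL runs."""
    faster = nl_time < def_time * (1 - tol)        # NL clearly faster
    slower = nl_time > def_time * (1 + tol)        # NL clearly slower
    more_power = nl_power > def_power * (1 + tol)  # NL clearly hungrier
    if slower:
        return "worse without the limit"   # e.g. HPL, VASP
    if not faster:
        return "no significant change"     # e.g. Alya
    if more_power:
        return "faster, but higher power"  # e.g. NAMD, GROMACS
    return "faster and lower power"        # e.g. CP2K

# Using the NAMD averages from the tables above:
print(classify(990.37, 788.08, 17766.89, 19432.69))  # faster, but higher power
```

Note that with this tolerance, borderline cases such as WRF (whose runtimes differ by less than 1%) would fall into the "no significant change" bucket even though its power and energy clearly drop.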