Analysis of scalability problems

Introduction

The objective of this study was to find out why a certain F90 application did not scale as well as expected. The user suspected that the cause was the calls to the intrinsic matrix multiplication routines and wanted to use Paraver to prove it.

In this study we compared some basic metrics of the 1-process and 4-process tracefiles to see whether any differences explain the scalability problems.


Visual analysis

The first recommended step is to take a visual look at the traces. In our case we obtained the following views for the 4 threads trace:

We can see some imbalance between processes in the 3 main loops. This imbalance is one of the reasons for the scalability problems, but we can continue the analysis by looking at some metrics that characterize the program and the performance of the run.

Comparing the program characteristics

The first objective is to check whether the observed imbalance is confirmed at the level of instructions. We can see in the next figure that there is a significant imbalance of work between threads. This imbalance in the number of instructions is one of the obstacles to good scalability.
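One common way to put a number on this kind of imbalance is the ratio between the average and the maximum instruction count per thread. The sketch below is only an illustration: the function name `load_balance` and the per-thread counts are hypothetical and are not read from the trace discussed here.

```python
def load_balance(instructions_per_thread):
    """Return average / maximum instruction count (1.0 means perfectly balanced)."""
    if not instructions_per_thread:
        raise ValueError("need at least one thread")
    peak = max(instructions_per_thread)
    if peak <= 0:
        raise ValueError("instruction counts must be positive")
    average = sum(instructions_per_thread) / len(instructions_per_thread)
    return average / peak


# Hypothetical per-thread instruction counts, NOT taken from the tracefile.
example_counts = [30_000_000, 40_000_000, 50_000_000, 60_000_000]

balance = load_balance(example_counts)
print(f"load balance: {balance:.2f}")
print(f"speedup bound from imbalance alone: {len(example_counts) * balance:.2f}")
```

With these made-up counts the slowest thread executes a third more instructions than the average, so even if every other factor were ideal the four threads could not deliver more than three quarters of the ideal speedup.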

Now we can compare the number of instructions in the two tracefiles:

There is a significant increase in the total instructions in the 3 main loops (from approx. 115,000,000 to 189,000,000), while the other loops keep the same number of instructions. This increase in instructions is another reason for the poor scalability.

It would be interesting to instrument inside the big loops to identify where the instructions increase. To validate the user's theory about the intrinsic calls, we can check whether these loops are the ones that call the matrix multiplication routines.
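To see how much this growth costs, the two approximate totals can be turned into a growth factor. The sketch below is a back-of-the-envelope calculation, assuming the extra instructions are spread evenly over the 4 threads and run at the same speed as in the 1-process case; the variable names are illustrative.

```python
# Approximate totals for the 3 main loops, as reported in the text.
SERIAL_INSTRUCTIONS = 115_000_000    # 1-process trace
PARALLEL_INSTRUCTIONS = 189_000_000  # 4-process trace
THREADS = 4

# How many times more instructions the parallel run executes.
growth = PARALLEL_INSTRUCTIONS / SERIAL_INSTRUCTIONS

# Assumption: perfect balance and unchanged instruction rate, so the
# best possible speedup is reduced by exactly the growth factor.
max_speedup = THREADS / growth

print(f"instruction growth: {growth:.2f}x")
print(f"best-case speedup on {THREADS} threads: {max_speedup:.2f}")
```

Under these assumptions the parallel run executes roughly 1.64 times more instructions, which by itself caps the speedup on 4 threads at about 2.43.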

 


Comparing the performance behaviour

We can also compare the performance of both runs. The suggested metrics to check are:

  • The number of instructions per microsecond. As we can see in the next two figures there is a significant reduction (from approx. 400 to 270). This is the third reason we found for the poor scalability results.




  • The number of L2 misses. In this case there is a very large increase, which may explain the reduction in the MIPS rate.
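The instruction growth and the lower instruction rate can be combined into a rough speedup estimate. The sketch below assumes that the instruction rates are per-thread values and that the work is perfectly balanced across threads, neither of which is stated in the study; `estimated_speedup` is a hypothetical helper, not a Paraver feature.

```python
def estimated_speedup(threads, instr_serial, instr_parallel, rate_serial, rate_parallel):
    """Estimate speedup from total instruction counts and per-thread rates.

    Assumes perfect load balance: the parallel instructions are split
    evenly and every thread runs at rate_parallel instructions per microsecond.
    """
    if min(threads, instr_serial, instr_parallel, rate_serial, rate_parallel) <= 0:
        raise ValueError("all arguments must be positive")
    time_serial = instr_serial / rate_serial
    time_parallel = instr_parallel / (threads * rate_parallel)
    return time_serial / time_parallel


# Approximate values quoted in the text for the 3 main loops.
speedup = estimated_speedup(
    threads=4,
    instr_serial=115e6,
    instr_parallel=189e6,
    rate_serial=400.0,    # instructions per microsecond, 1-process run
    rate_parallel=270.0,  # instructions per microsecond, 4-process run
)
print(f"estimated speedup on 4 threads: {speedup:.2f}")
print(f"parallel efficiency: {speedup / 4:.2f}")
```

Under these assumptions the estimate is a speedup of about 1.64, or 41% efficiency, before the load imbalance is taken into account, so the measured scalability would be expected to be lower still.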






Conclusions

On this page we show the kind of simple study that can be performed to compare two tracefiles. In the example described, the objective is to study the scalability problems of a program run with different numbers of processors.

The comparative analysis should be based on applying the same configuration files to the different traces. The study starts with simple views of parallel process activity that can reveal potential load imbalance. It is very important not to stop there, even if some cause of poor scalability is easily identified in the graphical views. The analyst should evaluate further potential causes. 2D profiles of metrics derived from hardware counter information can easily be computed for each parallel function and may point out additional performance problems.

In this example, we have identified that a combination of several factors leads to the scalability problem of the application. In our case those factors are real load imbalance in terms of instructions, an increase in the number of instructions executed by the algorithm, and an increase in the number of L2 misses.