Trace generation: just some examples

Paraver specifies a trace format and the mechanisms by which the records and their encoded values are processed in the visualization. Every record specifies the object to which it refers (indicating application, task and thread) and the absolute time at which it happens. For each type of record, some additional fields can be encoded as desired by the user. These fields are:

  • State records include an integer value that is usually referred to as the state.
  • Event records include a user event type and a user event value.
  • Relation/Communication records include a communication tag and a communication size. The names come from the usual practice of encoding the tag and size of an MPI communication, but Paraver does not rely on these semantics.
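
As a rough illustration, the three record kinds can be modelled in memory as shown here. This is a minimal sketch in Python, not the Paraver trace file format itself: all class and field names are hypothetical, and giving the state record a begin/end pair and the communication record separate send and receive times are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectId:
    """Object a record refers to (hypothetical field names)."""
    application: int
    task: int
    thread: int


@dataclass(frozen=True)
class StateRecord:
    obj: ObjectId
    begin: int        # absolute time; the unit is an assumption of this sketch
    end: int
    state: int        # user-defined integer, e.g. 1 could mean Running


@dataclass(frozen=True)
class EventRecord:
    obj: ObjectId
    time: int
    event_type: int   # user event type
    event_value: int  # user event value


@dataclass(frozen=True)
class CommunicationRecord:
    sender: ObjectId
    receiver: ObjectId
    send_time: int
    recv_time: int
    tag: int          # free-form: no MPI meaning is attached to it
    size: int         # free-form as well


thread0 = ObjectId(application=1, task=1, thread=1)
running = StateRecord(thread0, begin=0, end=1500, state=1)
enter_loop = EventRecord(thread0, time=200, event_type=1000, event_value=1)
```

Because the tag and size carry no fixed meaning, an instrumentation tool is free to reuse the communication record for other relations between objects.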

The flexibility of this approach makes it possible to use Paraver for many types of analyses. It is quite easy to implement instrumentation tools for many systems and purposes. The main issue in such instrumentation is how to encode the information in the fields available in the record formats. Special emphasis should be placed on a proper selection of what to encode as state and what (and how) to encode as events. It is our experience that a clean design of these encodings makes it possible to carry out studies with Paraver later that were not foreseen when the analysis was planned.

In this section we briefly describe the encoding criteria of several instrumentation/trace generation tools that we distribute (see the Software Distribution section). For more detailed information about these tools refer to the Tool Documentation section.

All the tools described on this page also generate a Paraver configuration file with symbolic information (for instance, the function names) that makes it easier to relate the trace information to the application source code. Most of the tools allow explicit instrumentation (selected points, variables, hardware counters) and let the user stop and resume tracing through calls to the tool library.

The programs must be executed on dedicated resources to avoid the large perturbations that OS scheduling may cause in the presence of multiple concurrent applications.

OpenMP

The OMPtrace tool instruments parallel codes that use the OpenMP programming model. This tool generates a Paraver trace file where the basic activity in an OpenMP program is recorded. Paraver state and flag records are emitted to reflect the evolution of the application behaviour. With Paraver the user can visualize the execution at thread, task or application level.

The major encoding choices are:

  • States: will record whether the thread is Idle (waiting for work), Scheduling (generating work/notifying termination), Running (executing application code) or doing I/O.
  • Events:
    • Entry to and exit from a parallel region, tracing both the parallel loop/sections and the parallel directives.
    • Entry to and exit from a work sharing construct (with a different value for do/sections and single section).
    • For each lock, an event type is reserved, and different values are emitted when trying to get, owning or releasing the lock.
    • Value of the hardware counters selected.
    • On IBM platforms, the entry to and exit from the user routines that include OpenMP directives. Additional routines can be traced using an environment variable.

Besides giving a qualitative graphical perception of program behavior, this encoding makes it possible to visualize and measure the load balance, the profile of parallelism achieved, the percentage of time inside a mutual exclusion, the conflicts in getting locks and the percentage of sequential parts, among others.
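
Two of these measurements can be sketched from state and lock event records. The following Python example is only an illustration under stated assumptions: the numeric state value, the lock event type and the three lock event values are hypothetical (the real values are chosen by the tool), the records are plain tuples rather than a parsed trace, and load balance is taken here as mean running time over maximum running time, which is one common definition among several.

```python
from collections import defaultdict

RUNNING = 1                    # hypothetical state value
LOCK_0 = 6000                  # hypothetical event type reserved for one lock
WANT, OWN, RELEASE = 1, 2, 3   # hypothetical event values

# (thread, begin, end, state) tuples standing in for state records
states = [
    (1, 0, 80, RUNNING), (1, 80, 100, 0),
    (2, 0, 40, RUNNING), (2, 40, 100, 0),
]

# (thread, time, event_type, event_value) tuples standing in for event records
events = [
    (1, 10, LOCK_0, WANT), (1, 10, LOCK_0, OWN), (1, 30, LOCK_0, RELEASE),
    (2, 15, LOCK_0, WANT), (2, 30, LOCK_0, OWN), (2, 35, LOCK_0, RELEASE),
]


def running_time(records):
    """Total time each thread spends in the RUNNING state."""
    total = defaultdict(int)
    for thread, begin, end, state in records:
        if state == RUNNING:
            total[thread] += end - begin
    return dict(total)


def load_balance(per_thread):
    """Mean running time over the maximum; 1.0 means perfectly balanced."""
    values = list(per_thread.values())
    return sum(values) / len(values) / max(values)


def lock_wait_time(records, lock_type):
    """Time each thread spends between asking for a lock and owning it."""
    wanted_at, waited = {}, defaultdict(int)
    for thread, time, event_type, value in records:
        if event_type != lock_type:
            continue
        if value == WANT:
            wanted_at[thread] = time
        elif value == OWN and thread in wanted_at:
            waited[thread] += time - wanted_at.pop(thread)
    return dict(waited)


per_thread = running_time(states)       # {1: 80, 2: 40}
balance = load_balance(per_thread)      # 0.75
waits = lock_wait_time(events, LOCK_0)  # {1: 0, 2: 15}
```

In the sample data, thread 2 asks for the lock while thread 1 owns it, so its wait time is the conflict that would show up in the trace.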

OMPtrace is currently available on SGI-IRIX and IBM SP machines.



MPI


The MPItrace tool instruments parallel codes that use the message passing (MPI) programming model. This tool generates a Paraver trace file where the basic activity in an MPI program is recorded. The MPItrace tool assumes each MPI process is single threaded. A tracefile represents a single MPI program run, thus it includes only one application with several tasks (as stated in the mpirun command) and one thread per task.

The major encoding choices are:

  • States: will record whether the thread is Running, Waiting for Messages or doing I/O.
  • Communication: The tag and size are set according to those in the calls. Physical communication is assumed to be identical to logical communication as it is not possible through the MPI instrumentation to find out when the actual data transfer takes place.
  • Events: are used to tag the beginning and ending of MPI operations, such as Barriers, Broadcast, AlltoAll, and all kinds of Send/Receive calls.
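
Since events mark the beginning and ending of each MPI operation, the time spent per operation can be recovered by pairing them. The sketch below assumes a hypothetical encoding in which a single event type carries a nonzero value identifying the operation at its beginning and the value 0 at its end; the event type number and the operation values are made up for the example and are not MPItrace's actual ones.

```python
MPI_EVENT_TYPE = 40000   # hypothetical event type for MPI operations
END = 0                  # hypothetical value marking the end of an operation
OP_NAMES = {1: "Send", 2: "Recv", 3: "Barrier"}  # hypothetical values

# (time, event_type, event_value) for one thread, already ordered by time
events = [
    (100, MPI_EVENT_TYPE, 3),
    (180, MPI_EVENT_TYPE, END),
    (200, MPI_EVENT_TYPE, 1),
    (230, MPI_EVENT_TYPE, END),
    (300, MPI_EVENT_TYPE, 3),
    (350, MPI_EVENT_TYPE, END),
]


def time_per_operation(records):
    """Accumulate the duration of each MPI operation from begin/end events."""
    totals = {}
    open_op = None  # (operation value, begin time) of the call in progress
    for time, event_type, value in records:
        if event_type != MPI_EVENT_TYPE:
            continue
        if value != END:
            open_op = (value, time)
        elif open_op is not None:
            op, begin = open_op
            name = OP_NAMES.get(op, f"op{op}")
            totals[name] = totals.get(name, 0) + (time - begin)
            open_op = None
    return totals


totals = time_per_operation(events)  # {"Barrier": 130, "Send": 30}
```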

This instrumentation module provides the typical message passing visualization functionalities.

MPItrace is currently available on SGI-IRIX, IBM-SP and Linux platforms.



OpenMP+MPI


The OMPItrace tool instruments parallel codes based on the OpenMP programming model and/or applications using the message passing (MPI) programming model. This tool generates a Paraver trace file where the basic activity of the program is recorded.

 

The major encoding choices are:

  • States: will record whether the thread is Idle (waiting for work), Running (application code), Scheduling (generating work/notifying termination), Waiting for Messages or doing I/O.
  • Communication: The tag and size are set according to those in the calls. Physical communication is assumed to be identical to logical communication as it is not possible through the MPI instrumentation interface to find out when the actual data transfer takes place.
  • Events are used to tag the basic program activity. For example:
    • to mark the entry to a parallel region.
    • to mark the entry to a work sharing construct.
    • to read the value of the hardware counters.
    • to tag the beginning and ending of MPI operations, such as Barriers, Broadcast, AlltoAll, and all kinds of Send/Receive calls.

OMPItrace is currently available on SGI-IRIX and IBM-SP platforms.



Java and Application Servers

The analysis and visualization of Java applications is based on two specific tools: JIS (Java Instrumentation Suite) and JACIT (Java Automatic Code Interposition Tool). They are complementary and can be used to get very detailed traces of the execution of Java bytecodes without recompilation. The whole environment is especially intended for performance analysis of J2EE application servers, and has been successfully tested on WebSphere 4.x and JBoss 3.x.

JIS is available for Linux 2.4 and 2.5/2.6 platforms and JACIT is a cross-platform Java tool. The Java Instrumentation Suite (JIS) gets detailed information from all the levels involved in the execution of J2EE applications: System, JVM process, Middleware (i.e. J2EE appserver) and User Application. This information is automatically generated as a Paraver tracefile. All the levels are correlated to offer a global view of the system execution. The information collected at each level is summarized below:

  • System level: Thread scheduling information (extracted from inside the kernel scheduler) and detailed information on the system calls performed by the JVM process.
  • JVM level: Information on the Java threads (such as their names) is provided and related to the system threads. JVM monitors and raw monitors are also instrumented at this level. All information is extracted through the JVMPI (Java Virtual Machine Profiler Interface).
  • Middleware level: Information on the status of the middleware architecture components is provided at this level, and is shown in the generated tracefile as Paraver events on the boundaries of software components.
  • Application level: User generated events can be produced from the Java application bytecode and are later displayed as Paraver events. A native C library is provided with JIS to allow Java applications to generate user level events in the Paraver trace produced by JIS, using the Java Native Interface (JNI).

The Java Automatic Code Interposition Tool (JACIT) is a cross-platform Java tool designed to ease the task of inserting probes into Java code. Through a user-friendly graphical interface, JACIT allows pieces of Java code (including JNI calls to C or C++ libraries) to be inserted so that they execute before or after any method of existing Java bytecode, without recompilation. As a possible use, the interposed code can consist of calls to a native library interface to JIS.



Performance counters

The infoPerfex tool relies on the SGI perfex tool and the hardware performance counters interface to generate a trace containing the values of the performance counters sampled at periodic intervals. infoPerfex can instrument running applications without having the source code.

The trace only contains events for a single thread in a single application. Several types of events may appear in a trace: system calls, context switches, bytes read, bytes written... and the two selected performance counters (cache misses, floating point operations, TLB misses...). For all of them the value field represents the actual count in the previous sampling interval.

The profile of the above types of events can be displayed with Paraver. This profile can provide useful information about periodic patterns, phases in the program... This is considerably more useful than having only the global totals.
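
The difference between a profile and a global total can be sketched as follows. Each sample carries the count accumulated during the previous sampling interval, so dividing by the interval length gives a rate per interval. The sample data, the time unit and the function name are hypothetical; this is not infoPerfex output.

```python
# (sample_time, count_since_previous_sample): hypothetical data for one
# counter, with times in microseconds
samples = [(1000, 500), (2000, 1500), (3000, 250)]


def rate_profile(samples, start_time=0):
    """Return (interval_begin, interval_end, events_per_time_unit) tuples."""
    profile = []
    previous = start_time
    for time, count in samples:
        profile.append((previous, time, count / (time - previous)))
        previous = time
    return profile


profile = rate_profile(samples)
total = sum(count for _, count in samples)
```

Here the total of 2250 hides the fact that the middle interval is three times as active as the first one, which the per-interval rates make visible.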

 

System activity

The SCPUs tool instruments the operating system scheduling. It uses the /proc interface to obtain information about the existing processes. It can generate a Paraver trace file where the execution and scheduling of the processes is recorded.

SCPUs uses all the levels of the Paraver process model (thread, task and application) and it also records information about the activity of the different CPUs.

The trace contains two types of records:

 

  • States: encode the application. The CPU view shows the application that is active on each processor. Parallel applications use the same state for all their threads/processes, so the whole application could be painted using the same color.
  • Communication: represents the migration of a process between two processors. The tag of the message encodes the (application, task) pair to which the process belongs. The size field encodes the thread/process number within the application.

With this encoding it is possible to measure the total number of process migrations, to visualize the migrations suffered by one application, to compute the total system utilization or to display the profile of processors allocated to one application.
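
Counting migrations from such communication records can be sketched as follows. The text above does not say how the (application, task) pair is packed into the tag, so the packing used here (application times a constant, plus task) is purely an assumption for the example, as are the tuple layout and all names.

```python
from collections import Counter

# Hypothetical packing: tag = application * TASKS_PER_APP + task
TASKS_PER_APP = 10000


def decode_tag(tag):
    """Split a tag into (application, task) under the assumed packing."""
    return divmod(tag, TASKS_PER_APP)


# (time, source_cpu, destination_cpu, tag, size) standing in for
# communication records; size holds the thread/process number
migrations = [
    (10, 0, 1, 1 * TASKS_PER_APP + 1, 1),
    (25, 1, 3, 1 * TASKS_PER_APP + 1, 1),
    (30, 2, 0, 2 * TASKS_PER_APP + 1, 4),
]

total_migrations = len(migrations)
per_application = Counter(decode_tag(tag)[0] for _, _, _, tag, _ in migrations)
```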

 

SCPUs is currently available on SGI-IRIX machines.



NanosCompiler

The NanosCompiler allows the instrumentation of parallel applications. The instrumentation is based on the generation of calls to an instrumentation library that gathers information from the hardware counters of the machine, records the execution status of each thread and inserts events related to the OpenMP directives.

The major encoding choices are:


  • States: are used to indicate the current status of each thread: idle (light blue), running (dark blue), blocked (red), creating work (yellow), or library (green).
  • Events: are used to signal events during the execution; they have associated types and values related to the original program and OpenMP parallelisation, and to display performance statistics (cache misses, invalidations, ...) gathered from hardware counters.

Dimemas



The Dimemas simulator generates message passing traces with an encoding similar to that of MPItrace. The major difference is in the specification of the communication. Dimemas can differentiate between startup and transfer in a communication, so the traces it generates have an additional state that encodes the startup part of a communication. Also quite interesting is that these traces really differentiate between physical communication (actual data transfer through the network) and logical communication (from the send request until the return of the receive request).

The Dimemas simulator reconstructs the time behavior of a parallel application on a machine modelled by a set of performance parameters. Thus, performance experiments can be done easily. The supported target architecture classes include networks of workstations, single and clustered SMPs, distributed memory parallel computers, and even heterogeneous systems.
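
The split between startup and transfer can be illustrated with a simple linear cost model. This is a sketch under assumptions, not a description of Dimemas internals: the function name, the two parameters (a fixed startup latency and a bandwidth) and the units are hypothetical stand-ins for the performance parameters of a modelled machine.

```python
def simulate_send(send_time, size_bytes, latency, bandwidth):
    """Split one message into a startup phase and a physical transfer.

    latency is in microseconds and bandwidth in bytes per microsecond
    (both are hypothetical machine parameters).
    """
    startup_end = send_time + latency
    transfer_end = startup_end + size_bytes / bandwidth
    return {
        "startup": (send_time, startup_end),      # extra state in the trace
        "physical": (startup_end, transfer_end),  # data on the network
    }


comm = simulate_send(send_time=100.0, size_bytes=8000, latency=25.0, bandwidth=80.0)
```

Changing only the latency or the bandwidth and rerunning shows how the same message would behave on a different target machine.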

For more information, see the Dimemas documentation.



UTE translator

ute2paraver is a filter that translates UTE traces to the Paraver format. UTE is a tracing tool for IBM SP systems that obtains a fair amount of information about the activity of SP systems running MPI applications (or MPI+OpenMP). In addition to process activity, UTE records scheduling information.

The traces obtained with ute2paraver can be examined from both the Process model and the Resource model perspectives. With the first, one can visualize the activity of each thread, how many active threads each MPI process has, or the instantaneous parallelism profile.
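
An instantaneous parallelism profile can be derived from state intervals with a sweep over their boundaries. The sketch below uses hypothetical tuples and a hypothetical value for the running state; it shows the idea only and is not how Paraver computes the profile.

```python
RUNNING = 1  # hypothetical state value

# (thread, begin, end, state) tuples standing in for state records
states = [
    (1, 0, 100, RUNNING),
    (2, 20, 60, RUNNING),
    (3, 40, 80, RUNNING),
    (2, 60, 100, 0),
]


def parallelism_profile(records):
    """Return (time, active_threads) pairs at every point where the count changes."""
    deltas = {}
    for _, begin, end, state in records:
        if state == RUNNING:
            deltas[begin] = deltas.get(begin, 0) + 1
            deltas[end] = deltas.get(end, 0) - 1
    profile, active = [], 0
    for time in sorted(deltas):
        active += deltas[time]
        profile.append((time, active))
    return profile


profile = parallelism_profile(states)
```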

With the Resource model, the scheduling of threads to processors or specific activity of each processor can be analyzed.



AIX Trace translator

 

aix2prv is a filter that translates traces obtained with the IBM AIX trace facility to the Paraver format. The AIX trace facility collects very low-level information on process scheduling, system calls... for all the processes running on an SP node.

With this translator we can now use all the flexibility and potential of Paraver to analyze the low-level detail captured by the AIX trace facility.

With this new module we can study:

  • the impact of the system processes on the computing applications
  • the migrations of all the processes between CPUs and the distribution of resources
  • some internals of the library implementations
  • ...

MLP instrumentation


MLP is a programming model developed at NASA Ames in which shared memory regions are allocated to perform the communication between processes. To support the instrumentation of MLP applications we created a special version of OMPItrace that intercepts the fork call. Gabriele Jost from NASA Ames developed the instrumented version of the library. This functionality allows them to compare the efficiency of MLP with that of other programming models such as MPI+OpenMP.

We have developed our own version of the MLP library and modified OMPItrace to intercept the MLP library calls, including information from the hardware counters related to memory accesses. We are currently studying the kind of information provided by these counters and how it can be used to analyze the effect of memory placement on the performance of MLP programs.


Tracedrive preprocessing



Tracedrive is a new module we have developed during the last year to help with the instrumentation process. The first issue faced when tracing a large and unknown code is to identify the structure of the application. To avoid having to look at the source code of applications with hundreds of files, we developed a module that dynamically instruments an already running binary to collect information on the dynamic call tree. This information can later be analyzed with a graphical interface to select the set of significant user routines to instrument with OMPItrace.