LU Example
The algorithm
The LU factorization is more general than the Cholesky factorization. It handles non-symmetric matrices, so it computes one lower triangular matrix (L) and one upper triangular matrix (U) whose product equals a row permutation of the original matrix.
Perm(A)=L*U
For a more detailed explanation of the algorithm, follow this link.
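To make the relation Perm(A)=L*U concrete, here is a minimal unblocked sketch in C. It is not the SMPSs source: the function names, the row-major layout and the in-place storage of L and U are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value helper, kept local so the sketch needs no math library. */
static float abs_f(float x) { return x < 0.0f ? -x : x; }

/* Hypothetical helper: in-place LU with partial (row) pivoting on a
 * row-major n x n matrix. On return the strict lower part of a holds L
 * (unit diagonal implied), the upper part holds U, and perm[i] is the
 * index of the original row that ended up in row i.
 * Returns 0 on success, -1 if no nonzero pivot exists in some column. */
int lu_partial_pivot(int n, float *a, int *perm)
{
    assert(n > 0 && a != NULL && perm != NULL);
    for (int i = 0; i < n; i++)
        perm[i] = i;

    for (int k = 0; k < n; k++) {
        /* Choose the row with the largest magnitude in column k. */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (abs_f(a[i * n + k]) > abs_f(a[p * n + k]))
                p = i;
        if (a[p * n + k] == 0.0f)
            return -1;
        if (p != k) {
            /* Swap full rows k and p and record the permutation. */
            for (int j = 0; j < n; j++) {
                float t = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = t;
            }
            int t = perm[k];
            perm[k] = perm[p];
            perm[p] = t;
        }
        /* Eliminate below the pivot, storing the multipliers as L. */
        for (int i = k + 1; i < n; i++) {
            a[i * n + k] /= a[k * n + k];
            for (int j = k + 1; j < n; j++)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
    return 0;
}

/* Hypothetical check: largest entry of |Perm(A) - L*U|. */
float lu_residual(int n, const float *orig, const float *lu, const int *perm)
{
    float worst = 0.0f;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float s = 0.0f;
            int kmax = i < j ? i : j;
            for (int k = 0; k <= kmax; k++) {
                float l = (k == i) ? 1.0f : lu[i * n + k];
                s += l * lu[k * n + j];
            }
            float d = abs_f(orig[perm[i] * n + j] - s);
            if (d > worst)
                worst = d;
        }
    return worst;
}
```

Calling lu_partial_pivot on a copy of A and then lu_residual with the original A should give a value close to zero, which is one way to check that the product L*U matches the permuted rows of A.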
The source code for SMPSs
The following figure shows how the different primitives operate on the matrix blocks. Note that the panel factorization (the sgetf2 operation) limits the parallelism: it cannot be divided into blocks, so it has to wait until the tasks that compute those blocks have finished.

The following code is the main algorithm of a blocked LU factorization. The matrix A is organized in blocks of NB x NB floats, with a total of DIM x DIM blocks. The annotated application primitives (tasks) operate on these blocks. Both NB and DIM can be set by the user at run time.
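As a rough illustration of that structure, the sketch below shows a blocked main loop in plain C. It is not the SMPSs source: it covers only the variant without permutations, all function names are hypothetical, and the block array layout (blocks[i * dim + j] pointing to block (i, j)) is an assumption. In an SMPSs program each of the four block kernels would be a candidate for a task annotation.

```c
#include <assert.h>
#include <stddef.h>

/* Factor a diagonal block in place: d = L*U with unit-diagonal L. */
static void block_lu(int nb, float *d)
{
    for (int k = 0; k < nb; k++)
        for (int i = k + 1; i < nb; i++) {
            d[i * nb + k] /= d[k * nb + k];
            for (int j = k + 1; j < nb; j++)
                d[i * nb + j] -= d[i * nb + k] * d[k * nb + j];
        }
}

/* Row-panel update: b <- L^{-1} * b, L = unit lower part of d. */
static void block_solve_lower(int nb, const float *d, float *b)
{
    for (int k = 0; k < nb; k++)
        for (int i = k + 1; i < nb; i++)
            for (int j = 0; j < nb; j++)
                b[i * nb + j] -= d[i * nb + k] * b[k * nb + j];
}

/* Column-panel update: b <- b * U^{-1}, U = upper part of d. */
static void block_solve_upper(int nb, const float *d, float *b)
{
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++) {
            b[i * nb + k] /= d[k * nb + k];
            for (int j = k + 1; j < nb; j++)
                b[i * nb + j] -= b[i * nb + k] * d[k * nb + j];
        }
}

/* Trailing update: c <- c - a * b. */
static void block_gemm_sub(int nb, const float *a, const float *b, float *c)
{
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++)
            for (int j = 0; j < nb; j++)
                c[i * nb + j] -= a[i * nb + k] * b[k * nb + j];
}

/* Blocked LU without pivoting over dim x dim blocks of nb x nb floats.
 * Each block is stored row-major; L and U overwrite the input blocks. */
void blocked_lu_nopiv(int dim, int nb, float **blocks)
{
    assert(dim > 0 && nb > 0 && blocks != NULL);
    for (int k = 0; k < dim; k++) {
        float *diag = blocks[k * dim + k];
        block_lu(nb, diag);
        for (int j = k + 1; j < dim; j++)
            block_solve_lower(nb, diag, blocks[k * dim + j]);
        for (int i = k + 1; i < dim; i++)
            block_solve_upper(nb, diag, blocks[i * dim + k]);
        for (int i = k + 1; i < dim; i++)
            for (int j = k + 1; j < dim; j++)
                block_gemm_sub(nb, blocks[i * dim + k],
                               blocks[k * dim + j], blocks[i * dim + j]);
    }
}
```

Within one step k, the panel solves depend only on the diagonal block and the trailing updates depend only on their two panel blocks, so those calls are independent of each other. That independence is what a task-based runtime can exploit. Without pivoting the sketch is only safe for matrices that need no row exchanges, for example diagonally dominant ones.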

Each function performs a block operation, which can be annotated so that it is executed as a task on an SPU.

Some results
- Scalability
[Figures: LU factorization with permutations; LU factorization without permutations]
- Performance
[Figures: LU factorization with permutations; LU factorization without permutations]
Test Machine
The machine used for these tests has two POWER5 processors at 1.5 GHz. Each POWER5 has two cores and each core has two FPUs (floating-point units), so the theoretical peak of each core is 6 GFlops and the peak of the whole machine is 24 GFlops. In addition, each core supports SMT, so the best performance is reached when running two threads per core.
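The peak figures can be reproduced with a short calculation. The sketch below assumes that each FPU completes one fused multiply-add per cycle, counted as 2 floating-point operations. That factor is not stated above and is an assumption, as is the hypothetical function name.

```c
#include <assert.h>

/* Hypothetical helper: theoretical peak in GFlops.
 * flops_per_fpu_cycle = 2 assumes one fused multiply-add per FPU per
 * cycle; this value is an assumption, not taken from the text. */
double peak_gflops(double ghz, int fpus_per_core, int flops_per_fpu_cycle,
                   int cores_per_chip, int chips)
{
    assert(ghz > 0.0 && fpus_per_core > 0 && flops_per_fpu_cycle > 0);
    assert(cores_per_chip > 0 && chips > 0);
    return ghz * fpus_per_core * flops_per_fpu_cycle
               * cores_per_chip * chips;
}
```

With these assumptions, peak_gflops(1.5, 2, 2, 1, 1) gives the 6 GFlops per core, and peak_gflops(1.5, 2, 2, 2, 2) gives the 24 GFlops for the whole machine.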
Downloads
LU example source files








