CS 6404 ScaLAPACK Project Overview APR. 29, 1996

What is ScaLAPACK?

ScaLAPACK is a library of high-performance linear algebra routines for distributed-memory MIMD computers. It is a continuation of the LAPACK project. Both libraries are used for solving systems of linear equations, least squares problems, and eigenvalue problems. Development of ScaLAPACK began in 1991 and is still under way. The motivation for the work on ScaLAPACK is to increase the availability of numerical linear algebra software on advanced-architecture computers.

The history of linear algebra libraries begins with EISPACK, a collection of Fortran subroutines stressing efficiency coupled with elegance. Later came LINPACK, a collection of Fortran subroutines that increased efficiency through column-oriented algorithms. LAPACK was then created to increase performance on newer computers with memory hierarchies; it uses block-oriented algorithms to reduce data movement between the different levels of memory as much as possible. ScaLAPACK was designed to extend the LAPACK routines so that they could run efficiently on distributed-memory concurrent computers.

The Organization of ScaLAPACK:

ScaLAPACK consists mainly of the PBLAS (Parallel Basic Linear Algebra Subprograms), LAPACK, the BLACS (Basic Linear Algebra Communication Subprograms), and the BLAS. The underlying message-passing primitives include MPI, PVM, MPL, GAM, and so on.

The Functions of ScaLAPACK:

ScaLAPACK currently provides the following computational functions:
a. Direct factorizations, such as the LU and QR decompositions and the LL^T Cholesky decomposition.
b. QR factorization with column pivoting (QRP), and the RQ, LQ, and QL factorizations.
c. Triangular inversion (TRI), upper Hessenberg reduction (HRD), tridiagonal reduction (TRD), bidiagonal reduction (BRD), and the symmetric eigenproblem (SEP).

The Design Goals of ScaLAPACK:

ScaLAPACK is designed for high performance on a variety of machines, by a number of users, for a wide range of problems.
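The direct factorizations listed above can be illustrated in serial form. This is a minimal sketch using SciPy rather than ScaLAPACK itself (the corresponding distributed ScaLAPACK routines are Fortran subroutines such as PDGETRF for LU, PDGEQRF for QR, and PDPOTRF for Cholesky):

```python
# Serial illustration (SciPy, not ScaLAPACK) of the three direct
# factorizations named in the text: LU, QR, and Cholesky (LL^T).
import numpy as np
from scipy.linalg import lu, qr, cholesky

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

# LU decomposition with partial pivoting: A = P @ L @ U
P, L, U = lu(A)
assert np.allclose(P @ L @ U, A)

# QR decomposition: A = Q @ R with Q orthogonal, R upper triangular
Q, R = qr(A)
assert np.allclose(Q @ R, A)

# Cholesky (L @ L.T) requires a symmetric positive-definite matrix,
# so build one from A before factoring it.
S = A @ A.T + 4 * np.eye(4)
C = cholesky(S, lower=True)
assert np.allclose(C @ C.T, S)
```

In ScaLAPACK the same operations are applied to a matrix that is spread across a processor grid, but the mathematical factorizations are identical.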
High performance means that the library is designed to be as fast and efficient as possible on a large number of different architectures. ScaLAPACK also attempts to be easy to use: the goal is for the calling interfaces of ScaLAPACK to eventually mirror those of LAPACK, so that everyone familiar with LAPACK is comfortable with ScaLAPACK. ScaLAPACK is also designed to be numerically stable and rigorous for a variety of problems, even ill-conditioned ones.

The Data Distribution in ScaLAPACK:

The goals in designing the data distribution are good load balance, good computational efficiency, and equal memory usage across processors. The ScaLAPACK library therefore uses a block-cyclic data distribution over a virtual grid of processors. In particular, the distribution of a matrix is defined by the block width (r), the block height (s), the number of processors in a row (P), and the number of processors in a column (Q).

The BLAS, PBLAS, and BLACS:

The BLAS are a kernel of highly streamlined subroutines designed to perform matrix and vector computations as efficiently as possible; they have been tuned for high performance on a variety of architectures. The PBLAS are essentially the BLAS extended for use in parallel programming. The BLACS were designed to be the BLAS of message passing: a kernel of highly optimized communication subroutines designed specifically for linear algebra computations. They provide great efficiency while hiding the lower-level details inside the PBLAS, where the average user does not have to deal with them.

Factorization Routines:

Cholesky Decomposition: This routine is used to factor Hermitian (symmetric, if the matrix is real), positive-definite matrices, which arise in a number of fields (statistics, algebra, engineering, etc.). It is more efficient than the LU factorization and is considered numerically stable.
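The block-cyclic mapping described above can be sketched in a few lines. This is an illustrative sketch, not ScaLAPACK code; it assumes 0-based indices and follows the text's parameter names (block width r, block height s, P processor rows, Q processor columns):

```python
# Sketch of the two-dimensional block-cyclic distribution: matrix
# element (i, j) is assigned to grid coordinate (p, q) on a P-by-Q
# processor grid, using block height s and block width r.
def owner(i, j, r, s, P, Q):
    """Return the (row, col) grid coordinates of the processor
    that stores matrix element (i, j), 0-based."""
    p = (i // s) % P   # block row i//s is dealt cyclically over P rows
    q = (j // r) % Q   # block column j//r is dealt cyclically over Q cols
    return p, q

# Example: 2x2 blocks on a 2x2 processor grid.
print(owner(0, 0, 2, 2, 2, 2))   # (0, 0)
print(owner(2, 2, 2, 2, 2, 2))   # (1, 1)
print(owner(4, 0, 2, 2, 2, 2))   # (0, 0) -- the cycle wraps around
```

Dealing whole blocks cyclically is what gives the distribution its load balance: every processor receives roughly the same number of blocks from every region of the matrix.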
QR Decomposition: This routine is used to factor a matrix A into an upper triangular matrix R and a unitary matrix Q (Q*Q = I). If the matrix A meets certain requirements, the QR decomposition can be used to compute a Schur decomposition, which yields the eigenvalues of A. While the QR algorithm is not highly efficient, it is one of the oldest and most stable ways of computing the Schur decomposition of A.

Results:

We ran the QR and Cholesky factorizations on a network of four workstations using the message-passing primitive PVM. We varied the block size in each of the computations, as well as the size of the process grid, and measured the performance of the algorithms for different matrix sizes.