- •Matrix computations on systems equipped with GPUs
- •Introduction
- •The evolution of hardware for High Performance Computing
- •The programmability issue on novel graphics architectures
- •About this document. Motivation and structure
- •Motivation and goals
- •Structure of the document
- •Description of the systems used in the experimental study
- •Performance metrics
- •Hardware description
- •Software description
- •The FLAME algorithmic notation
- •The architecture of modern graphics processors
- •The graphics pipeline
- •Programmable pipeline stages
- •The Nvidia G80 as an example of the CUDA architecture
- •The architecture of modern graphics processors
- •General architecture overview. Nvidia Tesla
- •Memory subsystem
- •The GPU as a part of a hybrid system
- •Arithmetic precision. Accuracy and performance
- •Present and future of GPU architectures
- •Conclusions and implications on GPU computing
- •BLAS on single-GPU architectures
- •BLAS: Basic Linear Algebra Subprograms
- •BLAS levels
- •Naming conventions
- •Storage schemes
- •BLAS on Graphics Processors: NVIDIA CUBLAS
- •Evaluation of the performance of NVIDIA CUBLAS
- •Improvements in the performance of Level-3 NVIDIA CUBLAS
- •gemm-based programming for the Level-3 BLAS
- •Systematic development and evaluation of algorithmic variants
- •Experimental results
- •Impact of the block size
- •Performance results for rectangular matrices
- •Performance results for double precision data
- •Padding
- •Conclusions
- •LAPACK-level routines on single-GPU architectures
- •LAPACK: Linear Algebra PACKage
- •LAPACK and BLAS
- •Naming conventions
- •Storage schemes and arguments
- •LAPACK routines and organization
- •Cholesky factorization
- •Scalar algorithm for the Cholesky factorization
- •Blocked algorithm for the Cholesky factorization
- •Computing the Cholesky factorization on the GPU
- •Basic implementations. Unblocked and blocked versions
- •Padding
- •Hybrid implementation
- •LU factorization
- •Scalar algorithm for the LU factorization
- •Blocked algorithm for the LU factorization
- •LU factorization with partial pivoting
- •Computing the LU factorization with partial pivoting on the GPU
- •Basic implementations. Unblocked and blocked versions
- •Padding and hybrid algorithm
- •Reduction to tridiagonal form on the graphics processor
- •The symmetric eigenvalue problem
- •Reduction to tridiagonal form. The LAPACK approach
- •Reduction to tridiagonal form. The SBR approach
- •Experimental Results
- •Conclusions
- •Matrix computations on multi-GPU systems
- •Linear algebra computation on multi-GPU systems
- •Programming model and runtime. Performance considerations
- •Programming model
- •Transfer management and spatial assignation
- •Experimental results
- •Impact of the block size
- •Number of data transfers
- •Performance and scalability
- •Impact of data distribution
- •Conclusions
- •Matrix computations on clusters of GPUs
- •Parallel computing memory architectures
- •Shared memory architectures
- •Distributed memory and hybrid architectures
- •Accelerated hybrid architectures
- •Parallel programming models. Message-passing and MPI
- •ScaLAPACK
- •PLAPACK
- •Elemental
- •Description of the PLAPACK infrastructure
- •Layered approach of PLAPACK
- •Usage of the PLAPACK infrastructure. Practical cases
- •Porting PLAPACK to clusters of GPUs
- •Experimental results
- •Conclusions
- •Conclusions
- •Conclusions and main contributions
- •Contributions for systems with one GPU
- •Contributions for clusters of GPUs
- •Related publications
- •Publications directly related with the thesis topics
- •Publications indirectly related with the thesis topics
- •Other publications
- •Open research lines
- •FLAME algorithms for the BLAS-3 routines
CHAPTER 4. LAPACK-LEVEL ROUTINES ON SINGLE-GPU ARCHITECTURES
4.4.2.Blocked algorithm for the LU factorization
Consider a partitioning of matrices A, L, and U:
A = |
A21 |
A22 |
, L = |
L21 |
L22 |
, U = |
0 |
U22 |
, |
||||
|
|
A11 |
A12 |
|
|
L11 |
0 |
|
|
U11 |
U12 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
where A11, L11 and U11 are sub-matrices of size b × b (with b much smaller than the dimension of A); and A22, L22 and U22 are matrices with (n − b) × (n − b) elements. From A = LU:
A = |
A21 |
A22 |
= |
L21 |
L22 |
|
0 |
U22 |
= |
L21U11 |
L21U12 + L22U22 |
, |
|||||
|
|
A11 |
A12 |
|
|
|
L11 |
0 |
|
U11 |
U12 |
|
|
|
L11U11 |
L11U12 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
from which the following expressions are obtained:
L11U11 = A11,
U21 = L−111A12,
L21 = A21U11−1,
A22 − L21U12 = L22U22.
Thus, the recursive blocked algorithm for the calculation of the LU factorization of A, can be formulated as follows:
1. Partition A = |
A11 |
A12 |
, with A11 of size b × b. |
A21 |
A22 |
2.A11 := {L\U}11 such that A11 = L11U11 (that is, A11 is overwritten with its LU factorization).
3.AT12 := U12T = L−11T A12.
4.A21 := L21 = A21/U21.
5.A22 := A22 − L21U12T .
6.Continue recursively with A = A22 in step 1.
The cost of this blocked formulation is 2n3/3 flops. In this algorithm, the calculation of the panels A12 and A21, and the update of the sub-matrix A22 are cast in terms of Level-3 BLAS routines. The latter operation involves the major part of the calculations (O(n2b) operations on O(n2) data). As typically b 1, the ratio between arithmetic operations and data accesses is increased, and thus data reutilization is improved.
4.4.3.LU factorization with partial pivoting
As noted before, the solution of a linear system of the form Ax = b exists and is unique if the coe cient matrix A is invertible. However, this is not the only requisite for the existence of the LU factorization. For example, if A is invertible, but any of the elements in the diagonal is zero, the factorization is not feasible, as a division by zero appears during the algorithm (see step 4 of the scalar algorithm presented above).
94
4.4. LU FACTORIZATION
Row pivoting is a strategy that can be applied to solve this issue: if the row with the zero diagonal element is swapped with a di erent row below that in the matrix (without a zero element in the corresponding entry), the factorization can continue normally. Note that these swappings must be applied as well to vector b before solving the linear system.
Another problem that can appear during the LU factorization is caused by diagonal elements with a small magnitude compared with the rest of the elements in its column. This situation involves a growth in the magnitude of the elements of U, with the consequent rounding errors in the presence of limited precision architectures, and a catastrophic impact on the accuracy of the results [144].
A row pivoting strategy that swaps the diagonal element with that of largest magnitude from those in and below the diagonal in the current column, guarantees that every element in L is equal or smaller than 1 in magnitude, and handles properly the growth of the elements in U, limiting the impact of the round-o errors. This technique is known as partial row pivoting, and will be implemented in our GPU codes.
Scalar algorithm for the LU factorization with partial pivoting
The introduction of the partial pivoting in the recursive scalar algorithm asks for two new steps that follow the initial partitioning of the coe cient matrix. First, the element with the largest magnitude in the first column of the current sub-matrix to be factorized has to be found. Second, the first row of the sub-matrix and the row of the selected element must be swapped.
The resulting algorithm is formulated as follows:
|
|
α11 |
a12T |
|
|||
1. |
Partition A = |
|
|
|
. |
|
|
a21 |
A22 |
|
|||||
2. |
π := index of the element with the largest magnitude in |
α11 |
. |
||||
a21 |
|||||||
|
|
a12T ) with the corresponding elements of the π-th row. |
|||||
3. |
Swap the row ( α11 |
|
4.α11 := µ11 = α11.
5.aT12 := uT12 = aT12.
6.a21 := l21 = a21/µ21.
7.A22 := A22 − l21uT12.
8.Continue recursively with A = A22 in step 1.
The same sequence of swappings (or permutations) must be applied to vector b in the solution of the linear system. This application is usually performed after the LU factorization of the coe cient matrix A is computed; thus, it is necessary to store the information that allows the application of this sequence. In practice, a vector of n elements is enough to store the permutation information and recover it afterwards.
To formally include row swapping in the LU factorization, we next introduce permutation matrices. The e ect of a permutation matrix is to rearrange the elements of vectors and entire rows or columns of matrices.
95
CHAPTER 4. LAPACK-LEVEL ROUTINES ON SINGLE-GPU ARCHITECTURES
Definition P Rn×n is a permutation matrix if, when applied to the vector x = (χ0, χ1, . . . ,χn−1)T , rearranges the order of the elements in that vector. Such a permutation can be represented by a vector of integers p = (π0, π1, . . . ,πn−1)T , where {π0, π1, . . . ,πn−1} is a permutation of {0,1, . . . ,n − 1}, and the scalars πi indicate that the permuted vector is given by P x = (χπ0 ,χπ1 , . . . ,χπn−1 ).
A permutation matrix is equivalent to the identity matrix with permuted rows.
Definition If P is the identity matrix with rows i and j swapped, then applying P to a matrix A as P A swaps rows i and j of A.
The algorithm shown above is implemented in LINPACK [56]. In this algorithm, the application of the Gauss transformations and permutations on b must be interleaved, exactly as has been shown for A to obtain U. Thus, if Pk Rn×n and Lk Rn×n are, respectively, the permutation matrix and the Gauss transformation applied in the k-th stage of the algorithm, it follows that A = (P1L0P1L1 . . . Pn−1Ln−1)U and, thus, the solution of the linear system Ax = b can be obtained from Ux = (L−n−1 1Pn−1 . . . L−1 1P1L−0 1P0)b. A side e ect of the interleaving of permutations and Gauss transformations is the impossibility of the derivation of an analogous blocked algorithm, as each Gauss transformation is applied in terms of a rank-1 update (GER routine in BLAS-2).
An alternative approach for row pivoting is implemented in LAPACK [9]. In the implementation proposed in this library, each interchange a ects all elements in the involved rows. Thus, in the step 3 of the scalar algorithm, the elements to the left of the diagonal entry α11, together with ( α11 aT12 ), are permuted with all the elements in the row of index π. Pivoting with complete rows is equivalent to grouping the permutations, and hence it is not necessary to apply them in an interleaved manner with the Gauss transformations. More precisely, if matrix P = Pn−1 . . . P1P0 contains the permutations applied during the factorization, then the solution of the linear system Ax = b is obtained by solving the systems Ly = P b and Ux = y.
Adapting the recursive formulation of the algorithm for the LU factorization to this new pivoting scheme is not straightforward, mainly because the pivoting a ects the sub-matrix on which the calculations are performed at each iteration (A22 in the previous iteration), as well as to the elements that have already been factored in previous iterations. In this case it is more convenient the formulation of the algorithm using the FLAME notation for both the scalar and the blocked algorithms.
Algorithms in FLAME notation for the scalar and blocked LU factorization with partial pivoting
The formal derivation process proposed by the FLAME project drives to five di erent algorithmic variants for the LU factorization. From these variants, only three can be combined with partial pivoting [138]. The scalar and blocked algorithms for these three variants are shown in Figure 4.9 (left and right, respectively). In the FLAME algorithms, n(·) represents the number of columns of a matrix, and P (π1) denotes the permutation matrix that, applied to a second matrix, swaps the first row of it with row π1. If p1 = (π1, π2, . . . , πb), then P (p1) is the permutation matrix that, applied to a second matrix, swaps its first row with row π1, the second row with row π2, and so on, until the last swapping of row b with row πb. This sequence of permutations is gathered by vector p. The procedure PivIndex returns the index of the element of largest magnitude of the vector passed as a parameter.
TRILU(·) returns the strictly lower triangular matrix stored in the matrix passed as a parameter, transforming it to a unit lower triangular matrix. Its usage implies that the operation A12 := TRILU(A11)−T A12 is the solution of a triangular system (routine TRSM in BLAS). The rest of the operations in the algorithm are cast in terms of GEMM-like operations.
96
4.4. LU FACTORIZATION
Algorithm: [A, p] := LU PIV UNB(A)
Partition |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||
|
|
|
|
|
|
|
ABR ! |
, p → |
|
pB ! |
|
|
|
|
|
|
|
|
|
||||||||||||||
A → ABL |
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||||||||
|
|
|
|
AT L |
AT R |
|
|
|
|
|
|
|
|
|
pT |
|
|
|
|
|
|
|
|
|
|||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
where |
|
AT L is 0 × 0 and pT has 0 elements |
|
|
|
|
||||||||||||||||||||||||||
while n(AT L) < n(A) |
|
do |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||
|
Repartition |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||
|
|
|
AT L AT R → |
|
|
|
|
|
|
|
|
T , |
pT |
→ |
|||||||||||||||||||
|
|
|
T |
|
01 |
|
|
|
|||||||||||||||||||||||||
|
|
|
|
|
|
|
|
|
! |
|
|
A00 |
|
a |
A02 |
|
|
|
! |
|
|
p0 |
|
||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
a10 |
|
α11 |
a12 |
|
|
|
|
|
π1 |
|
|||||||||||
|
|
|
ABL |
ABR |
|
|
|
|
|
pB |
|
|
|
||||||||||||||||||||
|
|
|
|
|
|
|
|
|
|
A20 |
|
a21 |
A22 |
|
|
|
|
p2 |
|
||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||
|
|
|
|
|
|
|
|
11 and |
1 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||
|
where α |
|
π |
|
|
are 1×1 |
|
|
|
|
|
|
|
|
|
||||||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||
|
Variant 1: |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||
|
|
|
|
|
|
|
|
|
|
|
α11 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||
|
π1 := PivIndex |
|
|
|
|
! |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||
|
|
a21 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||
|
|
|
A20 |
a21 |
|
A22 |
! |
:= P (π1) |
|
A20 |
a21 |
A22 |
! |
|
|
|
|
||||||||||||||||
|
|
|
a10T |
α11 |
|
a12T |
|
|
|
|
|
|
|
|
|
|
|
|
|
a10T |
α11 |
a12T |
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
a21 := a21/α11 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||
|
A22 := A22 − a21a12T |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||
|
Variant 2: |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||
|
α11 := P (p0) α11 |
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||||||||||||
|
|
|
a01 |
|
|
|
|
|
|
|
|
|
a01 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||
|
a21 |
|
|
|
|
|
a21 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||
|
|
|
|
|
|
|
|
|
1 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||
|
a01 := TRILU(A00)− |
a01 |
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||||
|
α11 := α11 − a10T a01 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||
|
a21 := a21 − A20a01 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||
|
|
|
|
|
|
|
|
|
|
|
α11 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||
|
π1 := PivIndex |
|
|
|
! |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||
|
|
a21 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||
|
|
|
A20 |
a21 ! := P (π1) |
A20 |
a21 |
! |
|
|
|
|
|
|
|
|||||||||||||||||||
|
|
|
aT |
α11 |
|
|
|
|
|
|
|
|
|
|
|
aT |
|
α11 |
|
|
|
|
|
|
|
||||||||
|
10 |
|
|
|
|
|
|
|
|
|
|
|
|
|
10 |
|
|
|
|
|
|
|
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
a21 := a21/α11 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||
|
Variant 3: |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||
|
α11 := α11 − aT a01 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||
|
|
|
|
|
|
|
10 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||
|
a21 := a21 − A20a01 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||
|
a12T := a12T |
|
− a10T A02 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||
|
|
|
|
|
|
|
|
|
|
|
α11 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||
|
π1 := PivIndex |
|
|
|
! |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||
|
|
a21 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||
|
|
|
A20 |
a21 |
|
A22 |
! |
:= P (π1) |
|
A20 |
a21 |
A22 |
! |
|
|
|
|
||||||||||||||||
|
|
|
a10T |
α11 |
|
a12T |
|
|
|
|
|
|
|
|
|
|
|
|
|
a10T |
α11 |
a12T |
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
a21 := a21/α11 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||
|
Continue with |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||
|
|
|
AT L AT R ← aT |
|
α11 |
|
aT , |
pT |
← π1 |
||||||||||||||||||||||||
|
|
|
|
|
|||||||||||||||||||||||||||||
|
|
|
|
|
|
|
|
|
! |
|
|
A00 |
|
a01 |
|
A02 |
|
|
|
! |
|
|
p0 |
|
|||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
10 |
|
|
|
|
|
12 |
|
|
|
|
|
|
|
||||||||
|
|
|
ABL |
ABR |
|
|
|
|
|
|
|
|
|
|
pB |
|
|
|
|
||||||||||||||
|
|
|
|
|
|
|
|
|
|
A20 |
|
a21 |
|
A22 |
|
|
|
|
p2 |
|
|||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
endwhile
Algorithm: [A, p] := LU PIV BLK(A)
Partition |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||
|
|
|
|
|
|
|
|
|
|
|
! |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
! |
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||
A → |
|
|
AT L |
AT R |
|
, |
|
p → |
|
|
|
|
pT |
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||||
|
|
ABL |
ABR |
|
|
|
|
|
pB |
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||||||||||
|
where |
AT L is 0 × 0 and pT has 0 elements |
|
|
|
|
|
||||||||||||||||||||||||||||||||||||||
while |
|
n(AT L) < n(A) |
|
|
do |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||
|
Determine block size b |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||||||
|
Repartition |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||
|
|
AT L |
|
AT R → |
A00 |
|
|
|
|
A01 |
|
|
A02 |
, |
|
pT |
|
→ |
p0 |
|
|||||||||||||||||||||||||
|
|
|
|
|
|
|
|
|
|||||||||||||||||||||||||||||||||||||
|
|
|
|
|
|||||||||||||||||||||||||||||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||||||||
|
|
|
|
|
|
|
|
! |
|
|
|
|
A10 |
|
|
|
|
A11 |
|
|
A12 |
|
|
|
|
! |
|
|
p1 |
|
|
|
|||||||||||||
|
|
ABL |
|
ABR |
|
|
|
|
|
|
|
pB |
|
|
|
|
|
||||||||||||||||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
A20 |
|
|
|
|
A21 |
|
|
A22 |
|
|
|
|
p2 |
|
||||||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||
|
|
|
|
|
|
|
A11 is b |
|
|
|
|
|
|
|
|
|
|
1 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||
|
where |
|
|
|
|
|
|
|
×b and p |
|
|
|
has b elements |
|
|
|
|
|
|
|
|||||||||||||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||
|
Variant 1: |
# |
:= LU PIV UNB |
A21 |
! |
|
|
|
|
|
|
|
|
||||||||||||||||||||||||||||||||
|
" |
|
A21 |
! , p1 |
|
|
|
|
|
|
|
|
|||||||||||||||||||||||||||||||||
|
|
|
A11 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A11 |
|
|
|
|
|
|
|
|
|
|
|||
|
|
|
|
|
|
A22 ! |
|
:= P (p1) |
|
|
|
A20 |
|
|
A22 ! |
|
|
|
|
|
|
|
|
||||||||||||||||||||||
|
|
A20 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||||||||||||||||||
|
|
A10 |
|
|
|
A12 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A10 |
|
|
A12 |
|
|
|
|
|
|
|
|
|||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A12 := TRILU(A11)−1A12 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||||||||||||||
|
A22 := A22 − A21A12 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||
|
Variant 2: |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||
|
A11 := P (p0) A11 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||||||||||||||||
|
|
A01 |
|
|
|
|
|
|
|
|
|
|
|
A01 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||
|
A21 |
|
|
|
A21 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||
|
|
|
|
|
|
|
|
|
|
1 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||
|
A01 := TRILU(A00)− |
A01 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||||||
|
A11 := A11 − A10A01 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||||
|
A21 := A21 − A20A01 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
! |
|
|
|
|
|
|
|
|
||||||||||||||||||||
|
" |
|
A21 |
! , p1 |
# |
:= LU PIV UNB |
A21 |
|
|
|
|
|
|
|
|
||||||||||||||||||||||||||||||
|
|
|
A11 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A11 |
|
|
|
|
|
|
|
|
|
|
|||
|
|
! |
|
|
|
|
|
|
|
|
A10 |
! |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||||
|
|
A10 |
|
|
|
:= P (p1) |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||
|
|
A20 |
|
|
|
A20 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||
|
Variant 3: |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||
|
A11 := A11 − A10A01 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||||
|
A21 := A21 − A20A01 |
|
|
|
|
|
|
|
|
|
|
|
|
|
A21 |
! |
|
|
|
|
|
|
|
|
|||||||||||||||||||||
|
" |
|
A21 |
! , p1 |
# |
:= LU PIV UNB |
|
|
|
|
|
|
|
|
|||||||||||||||||||||||||||||||
|
|
|
A11 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A11 |
|
|
|
|
|
|
|
|
|
|
|||
|
|
|
|
|
A22 ! |
|
:= P (p1) |
|
|
|
|
|
|
|
A22 ! |
|
|
|
|
|
|
|
|
||||||||||||||||||||||
|
|
A20 |
|
|
|
|
|
|
|
A20 |
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||||||||||
|
|
A10 |
|
|
|
A12 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A10 |
|
|
A12 |
|
|
|
|
|
|
|
|
|||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A12 := A12 − A10A02 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||||
|
A12 := TRILU(A11)−1A12 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||||||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||
|
Continue with |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||
|
|
AT L AT R ← A10 |
|
A11 |
|
|
|
A12 |
, |
|
pT |
← p1 |
|||||||||||||||||||||||||||||||||
|
|
|
|
|
|
|
|||||||||||||||||||||||||||||||||||||||
|
|
|
|
|
|
|
|
|
|
! |
|
|
|
|
A00 |
|
A01 |
|
|
|
A02 |
|
|
|
|
! |
|
p0 |
|
|
|
|
|||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|||||||||||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
A20 |
|
A21 |
|
A22 |
|
|
|
p2 |
|
||||||||||||||||||||||||
|
|
ABL |
|
ABR |
|
|
|
pB |
|||||||||||||||||||||||||||||||||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||||
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
endwhile
Figure 4.9: Algorithmic variants for the scalar (left) and blocked (right) algorithms for the LU factorization with partial pivoting.
97