Добавил:

Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.

Вуз:

Санкт-Петербургский государственный электротехнический университет "ЛЭТИ"

Предмет:

[НЕСОРТИРОВАННОЕ]

Файл:

MatrixCUDAFranDissertation.pdf

Скачиваний:

Добавлен:

22.03.2016

Размер:

2.18 Mб

Скачать

☆

<<< < Предыдущая 12 13 14 15 16 17 18 19 20 21 22 2324 / 4724 25 26 27 28 29 30 31 32 33 34 35 36 > Следующая >>>

CHAPTER 4. LAPACK-LEVEL ROUTINES ON SINGLE-GPU ARCHITECTURES

4.4.2.Blocked algorithm for the LU factorization

Consider a partitioning of matrices A, L, and U:

A =

A21

A22

, L =

L21

L22

, U =

U22

A11

A12

L11

U11

U12

where A11, L11 and U11 are sub-matrices of size b × b (with b much smaller than the dimension of A); and A22, L22 and U22 are matrices with (n − b) × (n − b) elements. From A = LU:

A =

A21

A22

L21

L22

U22

L21U11

L21U12 + L22U22

A11

A12

L11

U11

U12

L11U11

L11U12

from which the following expressions are obtained:

L11U11 = A11,

U21 = L−111A12,

L21 = A21U11−1,

A22 − L21U12 = L22U22.

Thus, the recursive blocked algorithm for the calculation of the LU factorization of A, can be formulated as follows:

1. Partition A =	A11	A12	, with A11 of size b × b.
	A21	A22

2.A11 := {L\U}11 such that A11 = L11U11 (that is, A11 is overwritten with its LU factorization).

3.AT12 := U12T = L−11T A12.

4.A21 := L21 = A21/U21.

5.A22 := A22 − L21U12T .

6.Continue recursively with A = A22 in step 1.

The cost of this blocked formulation is 2n3/3 ﬂops. In this algorithm, the calculation of the panels A12 and A21, and the update of the sub-matrix A22 are cast in terms of Level-3 BLAS routines. The latter operation involves the major part of the calculations (O(n2b) operations on O(n2) data). As typically b 1, the ratio between arithmetic operations and data accesses is increased, and thus data reutilization is improved.

4.4.3.LU factorization with partial pivoting

As noted before, the solution of a linear system of the form Ax = b exists and is unique if the coe cient matrix A is invertible. However, this is not the only requisite for the existence of the LU factorization. For example, if A is invertible, but any of the elements in the diagonal is zero, the factorization is not feasible, as a division by zero appears during the algorithm (see step 4 of the scalar algorithm presented above).

4.4. LU FACTORIZATION

Row pivoting is a strategy that can be applied to solve this issue: if the row with the zero diagonal element is swapped with a di erent row below that in the matrix (without a zero element in the corresponding entry), the factorization can continue normally. Note that these swappings must be applied as well to vector b before solving the linear system.

Another problem that can appear during the LU factorization is caused by diagonal elements with a small magnitude compared with the rest of the elements in its column. This situation involves a growth in the magnitude of the elements of U, with the consequent rounding errors in the presence of limited precision architectures, and a catastrophic impact on the accuracy of the results [144].

A row pivoting strategy that swaps the diagonal element with that of largest magnitude from those in and below the diagonal in the current column, guarantees that every element in L is equal or smaller than 1 in magnitude, and handles properly the growth of the elements in U, limiting the impact of the round-o errors. This technique is known as partial row pivoting, and will be implemented in our GPU codes.

Scalar algorithm for the LU factorization with partial pivoting

The introduction of the partial pivoting in the recursive scalar algorithm asks for two new steps that follow the initial partitioning of the coe cient matrix. First, the element with the largest magnitude in the ﬁrst column of the current sub-matrix to be factorized has to be found. Second, the ﬁrst row of the sub-matrix and the row of the selected element must be swapped.

The resulting algorithm is formulated as follows:

		α11	a12T
1.	Partition A =			.
		a21	A22
2.	π := index of the element with the largest magnitude in				α11	.
					a21
			a12T ) with the corresponding elements of the π-th row.
3.	Swap the row ( α11

4.α11 := µ11 = α11.

5.aT12 := uT12 = aT12.

6.a21 := l21 = a21/µ21.

7.A22 := A22 − l21uT12.

8.Continue recursively with A = A22 in step 1.

The same sequence of swappings (or permutations) must be applied to vector b in the solution of the linear system. This application is usually performed after the LU factorization of the coe cient matrix A is computed; thus, it is necessary to store the information that allows the application of this sequence. In practice, a vector of n elements is enough to store the permutation information and recover it afterwards.

To formally include row swapping in the LU factorization, we next introduce permutation matrices. The e ect of a permutation matrix is to rearrange the elements of vectors and entire rows or columns of matrices.

CHAPTER 4. LAPACK-LEVEL ROUTINES ON SINGLE-GPU ARCHITECTURES

Deﬁnition P Rn×n is a permutation matrix if, when applied to the vector x = (χ0, χ1, . . . ,χn−1)T , rearranges the order of the elements in that vector. Such a permutation can be represented by a vector of integers p = (π0, π1, . . . ,πn−1)T , where {π0, π1, . . . ,πn−1} is a permutation of {0,1, . . . ,n − 1}, and the scalars πi indicate that the permuted vector is given by P x = (χπ0 ,χπ1 , . . . ,χπn−1 ).

A permutation matrix is equivalent to the identity matrix with permuted rows.

Deﬁnition If P is the identity matrix with rows i and j swapped, then applying P to a matrix A as P A swaps rows i and j of A.

The algorithm shown above is implemented in LINPACK [56]. In this algorithm, the application of the Gauss transformations and permutations on b must be interleaved, exactly as has been shown for A to obtain U. Thus, if Pk Rn×n and Lk Rn×n are, respectively, the permutation matrix and the Gauss transformation applied in the k-th stage of the algorithm, it follows that A = (P1L0P1L1 . . . Pn−1Ln−1)U and, thus, the solution of the linear system Ax = b can be obtained from Ux = (L−n−1 1Pn−1 . . . L−1 1P1L−0 1P0)b. A side e ect of the interleaving of permutations and Gauss transformations is the impossibility of the derivation of an analogous blocked algorithm, as each Gauss transformation is applied in terms of a rank-1 update (GER routine in BLAS-2).

An alternative approach for row pivoting is implemented in LAPACK [9]. In the implementation proposed in this library, each interchange a ects all elements in the involved rows. Thus, in the step 3 of the scalar algorithm, the elements to the left of the diagonal entry α11, together with ( α11 aT12 ), are permuted with all the elements in the row of index π. Pivoting with complete rows is equivalent to grouping the permutations, and hence it is not necessary to apply them in an interleaved manner with the Gauss transformations. More precisely, if matrix P = Pn−1 . . . P1P0 contains the permutations applied during the factorization, then the solution of the linear system Ax = b is obtained by solving the systems Ly = P b and Ux = y.

Adapting the recursive formulation of the algorithm for the LU factorization to this new pivoting scheme is not straightforward, mainly because the pivoting a ects the sub-matrix on which the calculations are performed at each iteration (A22 in the previous iteration), as well as to the elements that have already been factored in previous iterations. In this case it is more convenient the formulation of the algorithm using the FLAME notation for both the scalar and the blocked algorithms.

Algorithms in FLAME notation for the scalar and blocked LU factorization with partial pivoting

The formal derivation process proposed by the FLAME project drives to ﬁve di erent algorithmic variants for the LU factorization. From these variants, only three can be combined with partial pivoting [138]. The scalar and blocked algorithms for these three variants are shown in Figure 4.9 (left and right, respectively). In the FLAME algorithms, n(·) represents the number of columns of a matrix, and P (π1) denotes the permutation matrix that, applied to a second matrix, swaps the ﬁrst row of it with row π1. If p1 = (π1, π2, . . . , πb), then P (p1) is the permutation matrix that, applied to a second matrix, swaps its ﬁrst row with row π1, the second row with row π2, and so on, until the last swapping of row b with row πb. This sequence of permutations is gathered by vector p. The procedure PivIndex returns the index of the element of largest magnitude of the vector passed as a parameter.

TRILU(·) returns the strictly lower triangular matrix stored in the matrix passed as a parameter, transforming it to a unit lower triangular matrix. Its usage implies that the operation A12 := TRILU(A11)−T A12 is the solution of a triangular system (routine TRSM in BLAS). The rest of the operations in the algorithm are cast in terms of GEMM-like operations.

4.4. LU FACTORIZATION

Algorithm: [A, p] := LU PIV UNB(A)

Partition

ABR !

, p →

pB !

A → ABL

AT L

AT R

where

AT L is 0 × 0 and pT has 0 elements

while n(AT L) < n(A)

Repartition

AT L AT R →

T ,

→

A00

A02

a10

α11

a12

π1

ABL

ABR

A20

a21

A22

11 and

where α

are 1×1

Variant 1:

α11

π1 := PivIndex

a21

A20

a21

A22

:= P (π1)

A20

a21

A22

a10T

α11

a12T

a10T

α11

a12T

a21 := a21/α11

A22 := A22 − a21a12T

Variant 2:

α11 := P (p0) α11

a01

a21

a01 := TRILU(A00)−

a01

α11 := α11 − a10T a01

a21 := a21 − A20a01

α11

π1 := PivIndex

a21

A20

a21 ! := P (π1)

A20

a21

α11

a21 := a21/α11

Variant 3:

α11 := α11 − aT a01

a21 := a21 − A20a01

a12T := a12T

− a10T A02

α11

π1 := PivIndex

a21

A20

a21

A22

:= P (π1)

A20

a21

A22

a10T

α11

a12T

a10T

α11

a12T

a21 := a21/α11

Continue with

AT L AT R ← aT

α11

aT ,

← π1

A00

a01

A02

ABL

ABR

A20

a21

A22

endwhile

Algorithm: [A, p] := LU PIV BLK(A)

Partition

A →

AT L

AT R

p →

ABL

ABR

where

AT L is 0 × 0 and pT has 0 elements

while

n(AT L) < n(A)

Determine block size b

Repartition

AT L

AT R →

A00

A01

A02

→

A10

A11

A12

ABL

ABR

A20

A21

A22

A11 is b

where

×b and p

has b elements

Variant 1:

:= LU PIV UNB

A21

! , p1

A11

A22 !

:= P (p1)

A20

A22 !

A20

A10

A12

A10

A12

A12 := TRILU(A11)−1A12

A22 := A22 − A21A12

Variant 2:

A11 := P (p0) A11

A01

A21

A01 := TRILU(A00)−

A01

A11 := A11 − A10A01

A21 := A21 − A20A01

A21

! , p1

:= LU PIV UNB

A21

A11

A10

:= P (p1)

A20

Variant 3:

A11 := A11 − A10A01

A21 := A21 − A20A01

A21

! , p1

:= LU PIV UNB

A11

A22 !

:= P (p1)

A22 !

A20

A10

A12

A10

A12

A12 := A12 − A10A02

A12 := TRILU(A11)−1A12

Continue with

AT L AT R ← A10

A11

A12

← p1

A00

A01

A02

A20

A21

A22

ABL

ABR

endwhile

Figure 4.9: Algorithmic variants for the scalar (left) and blocked (right) algorithms for the LU factorization with partial pivoting.

<<< < Предыдущая 12 13 14 15 16 17 18 19 20 21 22 2324 / 4724 25 26 27 28 29 30 31 32 33 34 35 36 > Следующая >>>

Соседние файлы в предмете [НЕСОРТИРОВАННОЕ]

#
22.03.20161.06 Mб224MATER_3.doc
#
18.11.2019295.42 Кб0MATLAB-1.doc
#
19.11.2019203.78 Кб0MATLAB-2.doc
#
09.02.20153.49 Mб22MATLAB-3.doc
#
09.02.2015344.3 Кб10Matrices.pdf
#
22.03.20162.18 Mб14MatrixCUDAFranDissertation.pdf
#
21.09.2019139.22 Кб2matved.docx
#
24.04.201933.9 Mб2maximum.docx
#
09.02.2015360.31 Кб63MA_1_пособие.pdf
#
09.02.201534.57 Mб8MA_Kudriav1.pdf
#
09.02.201526.97 Mб11MA_Kudriav2.pdf