CHAPTER 4
LAPACK-level routines on single-GPU architectures
The optimization of BLAS-3 routines on graphics processors naturally leads to a direct optimization of the higher-level libraries built on top of them, such as LAPACK (Linear Algebra PACKage). However, given the complexity of the routines in this type of library, other strategies can be applied to further improve their performance.
In the case of GPU-based implementations, the optimizations applied to the BLAS routines in Chapter 3 can have a direct impact on LAPACK-level implementations, but we advocate alternative strategies to gain further insight and improve the performance of those implementations by using the GPU as an accelerating co-processor.
In this chapter, we propose a set of improved GPU-based implementations of some representative and widely used LAPACK-level routines for matrix decompositions. New implementations of the Cholesky and LU (with partial pivoting) decompositions, and of the reduction to tridiagonal form, are proposed and evaluated in depth.
In addition, a systematic evaluation of algorithmic variants, similar to that presented in the previous chapter for the BLAS, is performed for the LAPACK-level routines, together with a set of techniques to boost performance. One of the most innovative techniques introduced in this chapter is the treatment of the GPU as an accelerating co-processor, not merely as an isolated functional unit as in the previous chapter. Thus, hybrid, collaborative approaches are proposed in which each operation is performed on the most suitable architecture, depending on the particular characteristics of the task.
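To make the hybrid approach concrete, the following sketch outlines a blocked right-looking Cholesky factorization in which the small, strongly sequential diagonal-block factorization is performed on the CPU with LAPACK, while the BLAS-3 trailing updates run on the GPU through CUBLAS. This is illustrative code, not the actual implementation evaluated in this chapter: the routine name hybrid_spotrf is hypothetical, and the modern cublas_v2 interface is used for concreteness.

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* LAPACK single-precision Cholesky factorization (Fortran interface). */
extern void spotrf_(const char *uplo, const int *n, float *A,
                    const int *lda, int *info);

/* Factor the n x n SPD matrix dA (device pointer, column-major, leading
 * dimension ldda) as A = L * L^T (lower triangular), with block size nb. */
int hybrid_spotrf(cublasHandle_t h, int n, float *dA, int ldda, int nb)
{
    const float one = 1.0f, mone = -1.0f;
    float *hblk;                       /* pinned host buffer, diagonal block */
    cudaMallocHost((void **)&hblk, (size_t)nb * nb * sizeof(float));

    for (int k = 0; k < n; k += nb) {
        int b = (n - k < nb) ? n - k : nb, info = 0;

        /* 1. Bring A(k,k) to the CPU and factor it there. */
        cublasGetMatrix(b, b, sizeof(float),
                        dA + k + (size_t)k * ldda, ldda, hblk, b);
        spotrf_("L", &b, hblk, &b, &info);
        if (info != 0) { cudaFreeHost(hblk); return k + info; }
        cublasSetMatrix(b, b, sizeof(float), hblk, b,
                        dA + k + (size_t)k * ldda, ldda);

        if (k + b < n) {
            int m = n - k - b;
            /* 2. Panel update on the GPU: A21 := A21 * L11^{-T}. */
            cublasStrsm(h, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                        CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT, m, b,
                        &one, dA + k + (size_t)k * ldda, ldda,
                        dA + (k + b) + (size_t)k * ldda, ldda);
            /* 3. Trailing update on the GPU: A22 := A22 - A21 * A21^T. */
            cublasSsyrk(h, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, m, b,
                        &mone, dA + (k + b) + (size_t)k * ldda, ldda,
                        &one, dA + (k + b) + (size_t)(k + b) * ldda, ldda);
        }
    }
    cudaFreeHost(hblk);
    return 0;
}
```

Note that only the nb x nb diagonal block crosses the PCI-Express bus at each iteration, while all the BLAS-3 work, which dominates the cost, remains on the GPU.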
Single- and double-precision results are presented for the new implementations and, as a novelty, a mixed-precision iterative-refinement approach for the solution of systems of linear equations is presented and validated. The goal of this technique is to exploit the higher performance delivered by modern graphics processors when operating in single-precision arithmetic, while retaining full accuracy in the solution of the system.
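The idea behind mixed-precision iterative refinement can be summarized with the following simplified, CPU-side sketch; the variant studied in this chapter offloads the single-precision factorization and triangular solves to the GPU. The expensive O(n^3) LU factorization is performed once in fast single precision, and double-precision accuracy is then recovered through cheap O(n^2) residual corrections. The routine name mixed_refine and the convergence test are illustrative assumptions.

```c
#include <stdlib.h>
#include <math.h>

/* LAPACK single-precision LU factorization and triangular solves. */
extern void sgetrf_(const int *m, const int *n, float *A, const int *lda,
                    int *ipiv, int *info);
extern void sgetrs_(const char *trans, const int *n, const int *nrhs,
                    const float *A, const int *lda, const int *ipiv,
                    float *b, const int *ldb, int *info);

/* Solve A x = b (A: n x n, column-major, double precision) to double
 * accuracy using a single-precision factorization. */
int mixed_refine(int n, const double *A, const double *b, double *x,
                 int maxit, double tol)
{
    int info, one = 1, *ipiv = malloc(n * sizeof(int));
    float *As = malloc((size_t)n * n * sizeof(float));
    float *rs = malloc(n * sizeof(float));
    double *r = malloc(n * sizeof(double));

    for (size_t i = 0; i < (size_t)n * n; i++)
        As[i] = (float)A[i];                 /* demote A once */
    sgetrf_(&n, &n, As, &n, ipiv, &info);    /* O(n^3), single precision */

    for (int i = 0; i < n; i++) x[i] = 0.0;
    for (int it = 0; info == 0 && it < maxit; it++) {
        double nrm = 0.0;
        for (int i = 0; i < n; i++) {        /* r = b - A*x, in double */
            double s = b[i];
            for (int j = 0; j < n; j++) s -= A[i + (size_t)j * n] * x[j];
            r[i] = s;
            nrm = fmax(nrm, fabs(s));
        }
        if (nrm < tol) break;                /* converged to double accuracy */
        for (int i = 0; i < n; i++) rs[i] = (float)r[i];
        sgetrs_("N", &n, &one, As, &n, ipiv, rs, &n, &info);  /* O(n^2) */
        for (int i = 0; i < n; i++) x[i] += (double)rs[i];    /* correct x */
    }
    free(ipiv); free(As); free(rs); free(r);
    return info;
}
```

Provided A is not too ill-conditioned (roughly, its condition number must remain well below the inverse of the single-precision unit roundoff), a handful of iterations suffices, so the overall cost is dominated by the fast single-precision factorization.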
As a result, a full family of implementations for widely used LAPACK-level routines is presented, which attains significant speedups compared with optimized, multi-threaded implementations on modern general-purpose multi-core processors.
The chapter is organized as follows. Section 4.1 surveys the nomenclature and the most important routines in the LAPACK library. Sections 4.2 and 4.3 introduce the theory underlying the Cholesky and LU factorizations, and the approaches and optimizations taken to implement them on the graphics processor.