
CHAPTER 1. MATRIX COMPUTATIONS ON SYSTEMS EQUIPPED WITH GPUS

1.2. About this document. Motivation and structure

1.2.1. State-of-the-art of linear algebra computation on GPUs

Since the emergence of the term “GPGPU” in 2001, linear algebra and GPU computing have been closely bound. The first efforts towards efficient linear algebra implementations on graphics processors began with the emergence of programmable graphics processors. Larsen et al. [94] introduced the first strategies for the efficient implementation of the matrix-matrix multiplication using graphics-oriented APIs. Fatahalian et al. [63] went one step further in 2004, analyzing and proposing novel optimized strategies towards efficient implementations of the same operation. This contribution was, in fact, the origin of the growing interest in linear algebra implementations on the GPU. More complex routines were also studied before the formulation of the CUDA architecture. For example, the authors of [66] proposed the first factorizations implemented on the GPU using graphics-oriented APIs.

With the introduction of the CUDA architecture in 2006, the number of works related to linear algebra on GPUs increased dramatically. Among them, Volkov and Demmel’s [142] is one of the most referenced papers; there the authors thoroughly analyze the performance of modern GPUs and design ad-hoc CUDA implementations that exploit the full potential of the graphics processor for some basic linear algebra operations and the most common factorizations (Cholesky, LU with partial pivoting, and QR).

Simultaneously with the evolution of the architectures towards systems with multiple GPUs, the scientific community also pursued the adaptation of the libraries to novel multi-GPU systems. In [139], the authors proposed one of the first implementations of the LU factorization with partial pivoting on a machine with two GPUs. The approach adopted in the paper was based on a column-cyclic layout of the matrix among the GPUs. This static approach has also been taken by other authors, and it is one of the most recurrent solutions, mainly because of its simplicity.

However, for this type of architecture, an alternative solution is to develop intelligent run-time systems that automatically extract the inherent parallelism existing in many dense linear algebra algorithms, while transparently dealing with data transfers between the different memory address spaces. In this line, the author of this thesis was one of the precursors of this approach in 2008 [39, 114]. Other studies have followed this preliminary proposal for linear algebra implementations. In [96], the authors propose scalable hybrid implementations of the Cholesky factorization for multi-core systems with several GPUs. This work was extended in [133] to cover a wider range of solvers in the framework of the MAGMA project [6]. The aim of the project is to develop LAPACK-like implementations for novel architectures based on GPU and multi-GPU systems. However, in its latest release (version 1.0, December 2010), the MAGMA library does not provide multi-GPU support. Also, many of the approaches followed by the library for single-GPU and multi-GPU implementations were previously investigated and published independently in the framework of this thesis.

The solutions contributed by the MAGMA project, and other studies related to GPU computing for dense linear algebra, usually do not address one particular characteristic: programmability. Most of them pursue raw performance, without taking into account how easy it is for the programmer to attain that performance. In fact, many of the implementations and improvements presented are ad-hoc designs for a specific architecture, and there is little guarantee that the underlying insights will remain valid for future architectures. Thus, each architectural change will in general require a deep redesign of the algorithms and implementations in these solutions, and a full reevaluation of the new codes. In addition, no systematic methodologies for the evaluation of key parameters in the new implementations were proposed in these works.


Few efforts have been made towards the transformation of the GPUs into a friendly environment for the library developer. To cover this field, FLAME (Formal Linear Algebra Methods Environment) [136, 75, 76, 28] is presented as an effective solution in the quest for high-performance, easy-to-develop dense linear algebra libraries. Remarkable results have been obtained from the application of the FLAME methodology to multi-core architectures [120, 114]. Many of the works that combine the FLAME project and GPU computing have been developed in the context of this thesis. As an additional contribution of this thesis, APIs for fast and reliable code development have been implemented [21, 148] with the GPU as the underlying architecture in the framework of the FLAME project. Even out-of-core codes have exploited the conjunction of both approaches [101, 100] in the linear algebra arena for large problems, which demonstrates the flexibility of the FLAME approach to adapt to novel architectures without traumatic changes from the programmer’s perspective.

Large-scale linear algebra problems have historically been solved on large computing clusters. With the appearance of GPUs, there is a good reason to equip each node of the cluster with a graphics processor. However, the porting of existing distributed-memory linear algebra libraries to clusters of GPUs is a novel and barely explored area, though GPU-based clusters are nowadays emerging as the top performance machines in HPC rankings [134]. Given the increasing interest in this type of architecture, new solutions are demanded in order to port well-known libraries such as ScaLAPACK [11]. Some works have demonstrated that ad-hoc implementations of linear algebra operations on distributed-memory machines with hardware accelerators can be very efficient [69, 68]. Nevertheless, once more, no research is focused on a transparent port of existing libraries to these novel architectures.

In response to these deficiencies, this thesis covers the simultaneous exploitation of programmability and performance on single-GPU architectures, multi-GPU systems and clusters with GPUs.

1.2.2. Motivation and goals

Linear algebra operations lie at the bottom of the “food chain” of many scientific and engineering applications. The performance that can be attained for these operations is often the key factor that determines the global performance of those applications. Also, the performance attained by some basic linear algebra operations is often employed as an illustrative example of the potential of a given architecture. Widely spread benchmarks, like LINPACK [56], base their functionality on the analysis of the performance attained by some well-known linear algebra methods.

The usage of linear algebra applications for exploring the capabilities of novel architectures is frequent. Thus, vendors usually devote considerable efforts to developing highly-tuned implementations of a few basic linear algebra operations to illustrate the capabilities of their new products.

In this dissertation we use linear algebra operations as the guiding thread for the investigation of the possibilities offered by novel architectures based on graphics processors in HPC. Three different architectures are targeted: systems with one GPU, systems with multiple GPUs, and clusters of GPUs. For each architecture, we propose a number of techniques to obtain high-performance linear algebra operations.

A second relevant factor considered in the development of the proposed solutions is programmability. This has become a key feature of today’s implementations on multi-core and many-core architectures. Some efforts have been made in the past few years towards the solution of the programmability problem for GPUs, multi-core processors, or specialized processors like the Cell B.E. [39, 109, 110].


To further advance in this area, and as an additional contribution of this work, each technique and strategy proposed throughout the thesis takes programmability into account, keeping an eye on reducing the development effort for the programmer while maintaining performance.

The main goal of this thesis is thus to design, develop and evaluate programming strategies to improve the performance of existing dense linear algebra libraries on systems based on graphics processors. This general goal is divided into the following specific goals:

To design and develop blocked implementations for the main Level-3 BLAS based on a few highly tuned routine implementations, and to derive a set of algorithmic variants for each one of them.

To develop a full set of algorithmic variants for common dense factorizations and to apply hybrid approaches to improve performance and accuracy.

To evaluate new implementations for the Level-3 BLAS and LAPACK-level routines on systems equipped with one GPU, and to compare them with tuned implementations on general-purpose processors and vendor-specific implementations on graphics processors (NVIDIA CUBLAS).

To propose an efficient run-time system, focused on systems with multiple GPUs, which deals with data transfers and task scheduling transparently from the programmer’s point of view. The run-time system will implement different data-transfer policies with the goal of reducing the amount of data transfers.

To evaluate the run-time system on a multi-GPU architecture for a representative set of BLAS and LAPACK-level routines, and to analyze the impact of the different alternatives on the performance of the solution.

To select and modify a well-known dense linear algebra library for distributed-memory machines (PLAPACK) so that it can be used on clusters in which each node is equipped with one or more GPUs.

To evaluate the library on a large GPU cluster, and to compare the attained performance with that of a purely CPU-based implementation.

These specific goals are developed under two common premises:

Programmability The codes must be easy to develop for the library programmer. Algorithmic variants must be easy to derive and to evaluate systematically when necessary. If possible (e.g., in multi-GPU systems), work scheduling and data transfers must be transparent to the programmer. In general, the development of optimized routines cannot be traumatic for the programmer. In this sense, the goal is to show how high-level approaches help in exploring the capabilities of novel architectures and in exploiting them optimally.

Performance The target of our implementations is efficiency. That is the major reason for targeting GPUs first, and systems with multiple GPUs and clusters of GPUs as their natural evolution. Our aim is to obtain efficient codes for dense linear algebra operations, as well as methodologies and techniques to develop future codes without major programming efforts.

Although these two concepts may seem incompatible, this is not necessarily the case: a good development methodology, with systematic derivation and evaluation of different alternatives (in other words, programmability), is the key to rapidly testing all the possible parameters that influence the execution times of the codes (that is, performance).

On the hardware side, the aim is to address three different architectures, representative of the state-of-the-art in GPU computing. These systems comprise platforms with one GPU, systems with multiple GPUs, and clusters with one or more GPUs per node. We will often select a reduced set of operations to illustrate the techniques presented for each architecture. However, we believe that those routines are illustrative enough of the ideas and techniques introduced in the thesis. In fact, in general these ideas can be adapted to many other linear algebra operations without significant effort or conceptual changes.

In summary, in this dissertation we will not pursue an exhaustive analysis of a full set of linear algebra routines. Instead, we will propose methodologies, techniques, and implementations that are general enough to be adapted to any linear algebra routine on each type of GPU-based architecture, with minor impact on the programmer’s productivity.

1.2.3. Structure of the document

The manuscript is structured in four parts. Part I offers the basic concepts necessary to understand the rest of the document. In this part, Chapter 1 introduces the reasons underlying the usage of the GPU as a general-purpose computing architecture. In addition, it describes the motivation, main goals, and structure of the document. Chapter 2 describes the architecture of modern graphics processors and briefly introduces the evolution of this type of accelerator.

The three remaining parts naturally correspond to the three different classes of platforms targeted in our work: systems with one GPU, systems with more than one GPU, and clusters of GPUs.

Part II addresses the study of basic linear algebra subroutines (Chapter 3) and linear factorizations (Chapter 4) on systems equipped with a single GPU. Chapter 3 addresses the evaluation and tuning of the Level-3 BLAS, comparing their performance with that of existing, optimized BLAS implementations for graphics processors (NVIDIA CUBLAS). The approach taken in this chapter casts BLAS operations in terms of a highly-tuned matrix-matrix multiplication implementation in NVIDIA CUBLAS, while exploring the performance of several algorithmic variants for each routine. Chapter 4 presents GPU implementations for the Cholesky and LU factorizations, and the reduction to tridiagonal form. The chapter also introduces hybrid algorithms in which CPU and GPU cooperate to improve performance, and revisits the iterative refinement technique to regain full accuracy while exploiting the capabilities of modern GPUs when operating in single precision.
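As an informal illustration of the idea pursued in Chapter 3, the following minimal sketch (plain C, column-major storage) shows how a blocked symmetric rank-k update, C := C - A*A^T with only the lower triangle of C referenced, can be cast so that most of the arithmetic is performed by large matrix-matrix products. The helpers gemm_gpu and syrk_gpu are hypothetical stand-ins for tuned GPU kernels (e.g., CUBLAS-based wrappers), not part of any real API; gemm_gpu(m, n, k, X, ldx, Y, ldy, C, ldc) is assumed to compute C := C - X * Y^T.

/* Blocked SYRK sketch: C (n x n, lower triangle) := C - A * A^T, with A of size n x k.
 * Column-major storage; gemm_gpu() and syrk_gpu() are hypothetical tuned kernels. */
void syrk_blocked(int n, int k, int nb,
                  const float *A, int lda,
                  float *C, int ldc)
{
    for (int i = 0; i < n; i += nb) {
        int ib = (n - i < nb) ? n - i : nb;

        /* Small diagonal block: C(i:i+ib, i:i+ib) -= A(i:i+ib, :) * A(i:i+ib, :)^T */
        syrk_gpu(ib, k, &A[i], lda, &C[i + i * ldc], ldc);

        /* Off-diagonal panel, the bulk of the work, cast as one large GEMM:
         * C(i+ib:n, i:i+ib) -= A(i+ib:n, :) * A(i:i+ib, :)^T */
        if (i + ib < n)
            gemm_gpu(n - (i + ib), ib, k,
                     &A[i + ib], lda,
                     &A[i], lda,
                     &C[(i + ib) + i * ldc], ldc);
    }
}

With a moderately large block size nb, the dominant cost is the gemm_gpu call, which corresponds to the operation for which vendor libraries are typically most highly tuned; the algorithmic variants explored in Chapter 3 differ essentially in how the matrices are traversed and partitioned around such calls.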

Part III pursues the design and development of similar implementations on multi-GPU systems. The approach described in Chapter 5 moves the burden of task scheduling and data-transfer management away from the programmer. A run-time system keeps track of the availability of data in the GPU memories, while controlling which tasks have all their dependencies resolved and, therefore, are ready to be dispatched to the different processing units. The run-time operation and related mechanisms are illustrated for the Cholesky factorization, and experimental results for common BLAS-3 operations are also given in this chapter.
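The following is a minimal sketch, with hypothetical types and names and all synchronization omitted, of the dependency-tracking mechanism that this kind of run-time relies on: each task records how many of its inputs are still being produced; when a task completes, the counters of its successors are decremented, and any task whose counter reaches zero becomes ready for dispatch to an idle GPU.

/* Hypothetical, highly simplified task structure and completion handler.
 * A real run-time would additionally protect the counters with atomics or
 * locks, and track in which GPU memory each data block currently resides. */
#include <stddef.h>

#define MAX_SUCC 8

typedef struct task {
    void (*run)(void *args);      /* e.g., a wrapper around a GPU kernel call */
    void *args;
    int  unresolved;              /* input dependencies not yet produced      */
    struct task *succ[MAX_SUCC];  /* tasks that consume this task's output    */
    int  nsucc;
} task_t;

/* Invoked by the worker thread bound to a GPU when it finishes a task. */
static void task_completed(task_t *t, task_t **ready, size_t *nready)
{
    for (int i = 0; i < t->nsucc; i++) {
        task_t *s = t->succ[i];
        if (--s->unresolved == 0)    /* all dependencies resolved      */
            ready[(*nready)++] = s;  /* now eligible for dispatch      */
    }
}

The programmer only expresses the algorithm (e.g., a blocked Cholesky factorization) as a sequence of tasks on blocks; building the successor lists, deciding where each task runs, and moving data between address spaces are responsibilities of the run-time described in Chapter 5.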

Part IV presents our programming solution for clusters with one or more GPUs attached to each node. The strategy ports a well-known dense linear algebra infrastructure (PLAPACK) to clusters of GPUs. As in the other two platform classes, the reasons underlying this decision are programmability and performance. Chapter 6 overviews the fundamentals of PLAPACK, and details the necessary changes to adapt its contents to a GPU-accelerated cluster. Results are reported on a large GPU cluster for two well-known linear algebra operations: the matrix-matrix multiplication and the Cholesky factorization.

