
CHAPTER 2. THE ARCHITECTURE OF MODERN GRAPHICS PROCESSORS

in the table. Each generation follows a different scalability approach: the GT200 increases the number of SMs in the chip, keeping the number of SPs per multiprocessor constant, whereas the Fermi architecture increases the number of SPs per multiprocessor. These are the differentiating features between the two evolutions.

The GT200 was the first evolution of the G80 architecture in response to the growth in transistor counts since the first appearance of the unified architecture. The main improvements were the introduction of double-precision support in hardware and the increase in the number of multiprocessors on the chip (from 16 to 30). No other significant changes were introduced with the new processor. Still, the peak single-precision performance of the GPU was doubled. Double-precision arithmetic was 8 times slower than single precision.

Fermi [3] represents the most significant revision of the unified architecture since its introduction in 2006. Many of the novelties introduced by this microarchitecture have a direct impact on, or exclusively target, general-purpose computations and, more specifically, scientific computing.

The main improvements in the Fermi architecture appear in the new design and capabilities of the SPA. Each Streaming Multiprocessor features 32 SPs (both the G80 and GT200 featured 8 SPs per multiprocessor). The amount of on-chip memory per SM increases accordingly, to 64 Kbytes. A major change in this generation is the introduction of an L1 cache, mapped to the same SRAM as the shared space. The on-chip memory can be configured as 48 Kbytes of user-managed shared memory and 16 Kbytes of L1 cache, or as 16 Kbytes of shared memory and 48 Kbytes of L1 cache.
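For illustration, the following minimal CUDA sketch shows how a program may request one of the two configurations for a given kernel. The kernel name stencil_kernel is a hypothetical stand-in for any code that stages data through a user-managed shared-memory buffer; cudaFuncSetCacheConfig is the standard runtime call for expressing this preference.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel standing in for any code that stages data in a
    // user-managed shared-memory buffer.
    __global__ void stencil_kernel(const float *in, float *out, int n)
    {
        __shared__ float tile[256];                 // on-chip staging buffer
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        if (i < n) out[i] = 2.0f * tile[threadIdx.x];
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        // Request the 48 Kbyte shared / 16 Kbyte L1 split for this kernel;
        // cudaFuncCachePreferL1 would select the opposite 16/48 split.
        cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);

        stencil_kernel<<<n / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }

Kernels that mostly reuse data through the cache rather than through explicit staging would typically prefer the larger L1 configuration instead.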

The number of load/store units per SM grows to 16, allowing source and destination addresses to be calculated for 16 threads per clock cycle. The register file is further enlarged, to 32,768 32-bit registers per SM. The number of SFUs increases to 4 per SM.
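These per-SM resources can be inspected at run time through the CUDA runtime API. The following host-only sketch, which assumes device 0 is present, prints some of the figures discussed above; on Fermi hardware the reported register count per block matches the 32,768 registers mentioned.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // device 0, assumed present

        printf("Multiprocessors      : %d\n",        prop.multiProcessorCount);
        printf("Registers per block  : %d\n",        prop.regsPerBlock);
        printf("Shared mem per block : %zu bytes\n", prop.sharedMemPerBlock);
        printf("Warp size            : %d\n",        prop.warpSize);
        return 0;
    }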

Each SP in the Fermi architecture provides a fully pipelined integer arithmetic logic unit (ALU) and floating-point unit (FPU). The architecture implements the fused multiply-add (FMA) instruction for both single and double precision, unlike previous implementations, which only offered MAD for single precision, losing accuracy in the results. In addition, the integer ALU is now optimized for 32- and 64-bit operations; previous implementations were based on 24-bit precision and needed software emulation to perform full 32-bit integer arithmetic.
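The numerical difference between the two schemes can be observed directly on the device. The following sketch (with illustrative input values chosen by us, not taken from the dissertation) contrasts the fused instruction with an unfused multiply followed by an add, which reproduces the double rounding of the older MAD.

    #include <cstdio>
    #include <cuda_runtime.h>

    // __fmaf_rn rounds once, after the exact product and sum; the
    // __fmul_rn/__fadd_rn pair rounds twice, as the older MAD did.
    __global__ void compare_fma(float a, float b, float c, float *out)
    {
        out[0] = __fmaf_rn(a, b, c);            // round(a*b + c)
        out[1] = __fadd_rn(__fmul_rn(a, b), c); // round(round(a*b) + c)
    }

    int main()
    {
        float *d, h[2];
        cudaMalloc(&d, 2 * sizeof(float));
        // Values chosen so that the exact product differs from 1.0f by an
        // amount smaller than single-precision resolution.
        compare_fma<<<1, 1>>>(1.0f + 1e-7f, 1.0f - 1e-7f, -1.0f, d);
        cudaMemcpy(h, d, 2 * sizeof(float), cudaMemcpyDeviceToHost);
        printf("fused:   %.10e\nunfused: %.10e\n", h[0], h[1]);
        cudaFree(d);
        return 0;
    }

With these values, the unfused version cancels to exactly zero, while the fused version retains the small residual of the exact product.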

The improvement in double-precision performance between the GT200 and Fermi is dramatic. In the former, the double- to single-precision performance ratio was 1/8; in Fermi this ratio improves to 1/2, much in line with modern CPUs. Other features relevant to HPC, though not strictly necessary for graphics computations, include support for ECC memory and the execution of concurrent kernels on the same SPA.
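As a simple illustration of kernel concurrency, the sketch below launches two independent, hypothetical kernels in separate CUDA streams; on Fermi they may overlap on the SPA, whereas the G80 and GT200 would serialize them.

    #include <cuda_runtime.h>

    // Two independent, hypothetical kernels; any pair would serve.
    __global__ void kernel_a(float *x) { x[threadIdx.x] += 1.0f; }
    __global__ void kernel_b(float *y) { y[threadIdx.x] *= 2.0f; }

    int main()
    {
        float *x, *y;
        cudaMalloc(&x, 64 * sizeof(float));
        cudaMalloc(&y, 64 * sizeof(float));
        cudaMemset(x, 0, 64 * sizeof(float));
        cudaMemset(y, 0, 64 * sizeof(float));

        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Kernels launched in different streams may run concurrently on
        // Fermi if resources allow; earlier GPUs would serialize them.
        kernel_a<<<1, 64, 0, s1>>>(x);
        kernel_b<<<1, 64, 0, s2>>>(y);

        cudaStreamSynchronize(s1);
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }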

2.7. Conclusions and implications on GPU computing

Many of the architectural details presented in this chapter have direct implications for the design decisions and techniques introduced in this dissertation. The following are representative examples of these implications:

Modern GPUs have evolved into complex architectures in order to satisfy the strict requirements of current graphical applications. Additionally, they follow a radically different design approach compared with general-purpose processors. Although novel programming paradigms have facilitated the development task, the programmer still needs to be aware of many of the architectural details. Explicit management of on-chip memories (shared memory), memory access patterns (global memory coalescing and elimination of shared memory bank conflicts), and divergence control are some examples of the programming decisions that the developer must face. Multi-GPU systems present additional problems, such as the management of multiple memory address spaces.

                         G80                  GT200                Fermi
Year                     2006                 2008                 2010
Transistors              681 million          1.4 billion          3.0 billion
Total SPs                128                  240                  512
DP Capabilities          -                    30 FMA ops/clock     256 FMA ops/clock
SP Capabilities          128 MADD ops/clock   240 MADD ops/clock   512 FMA ops/clock
Total SFUs per SM        2                    2                    4
Warp schedulers per SM   1                    1                    2
Shared Memory per SM     16 Kbytes            16 Kbytes            48 or 16 Kbytes
L1 Cache per SM          -                    -                    16 or 48 Kbytes
L2 Cache                 -                    -                    768 Kbytes
ECC Memory Support       No                   No                   Yes
Concurrent Kernels       No                   No                   Up to 16
Load/Store Addr. Width   32-bit               32-bit               64-bit

Table 2.2: Summary of the main features of the three generations of unified GPUs by NVIDIA.
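As a concrete illustration of the coalescing and bank-conflict issues listed in the first example above, the following sketch implements the classic shared-memory matrix transpose. The kernel is our own illustrative code, not taken from the dissertation; the extra padding column is the standard device for avoiding bank conflicts, and the tile staging turns otherwise strided global accesses into coalesced ones.

    #include <cuda_runtime.h>

    // 32x32 tiled transpose; n is assumed to be a multiple of 32.
    __global__ void transpose32(const float *in, float *out, int n)
    {
        // 33 columns instead of 32: the padding shifts each row to a
        // different bank, avoiding shared-memory bank conflicts when a
        // warp reads a column of the tile.
        __shared__ float tile[32][33];

        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read

        __syncthreads();

        x = blockIdx.y * 32 + threadIdx.x;                // transposed block origin
        y = blockIdx.x * 32 + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
    }

    int main()
    {
        const int n = 1024;
        float *in, *out;
        cudaMalloc(&in,  n * n * sizeof(float));
        cudaMalloc(&out, n * n * sizeof(float));
        dim3 block(32, 32), grid(n / 32, n / 32);
        transpose32<<<grid, block>>>(in, out, n);
        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(out);
        return 0;
    }

Without the tile, either the reads or the writes would be strided by n and could not be coalesced; without the padding column, reading tile[threadIdx.x][threadIdx.y] would make all 32 threads of a warp hit the same memory bank.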

Thus, the design, implementation, and validation of high-level approaches that hide these details from the programmer constitute an important step towards the rapid development of high-performance codes. We introduce such approaches in the framework of single-GPU systems (Chapter 3), multi-GPU systems (Chapter 5), and clusters of GPUs (Chapter 6).

The bottleneck introduced by the PCI-Express bus becomes more significant as the number of GPUs in the system increases. In this type of architecture, a strategy that reduces the number of data transfers is mandatory if high performance is required.

Therefore, the development of run-time systems that carefully orchestrate data movement between different memory spaces is a necessary approach, both to reduce the number of transfers and to hide them from the programmer. The run-time system presented in Chapter 5 provides a solution to these requirements.
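The underlying principle can be illustrated without any run-time system. The hand-written sketch below contrasts a naive pattern, which crosses the PCI-Express bus twice per operation, with a transfer-aware pattern that keeps the data resident on the GPU; the Chapter 5 run-time automates this kind of orchestration, so the code here is only a manual illustration with a hypothetical kernel.

    #include <cuda_runtime.h>

    // Hypothetical kernel; stands in for any operation applied repeatedly.
    __global__ void scale(float *x, int n, float alpha)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= alpha;
    }

    // Naive pattern: two PCI-Express transfers per operation.
    void scale_many_naive(float *h_x, float *d_x, int n, int reps)
    {
        for (int r = 0; r < reps; ++r) {
            cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
            scale<<<(n + 255) / 256, 256>>>(d_x, n, 1.0001f);
            cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
        }
    }

    // Transfer-aware pattern: the data stays resident on the GPU and
    // crosses the bus only twice in total, regardless of reps.
    void scale_many_resident(float *h_x, float *d_x, int n, int reps)
    {
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        for (int r = 0; r < reps; ++r)
            scale<<<(n + 255) / 256, 256>>>(d_x, n, 1.0001f);
        cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    }

    int main()
    {
        const int n = 1 << 20;
        float *h_x, *d_x;
        cudaMallocHost(&h_x, n * sizeof(float));  // pinned host buffer
        cudaMalloc(&d_x, n * sizeof(float));
        for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

        scale_many_resident(h_x, d_x, n, 100);

        cudaFree(d_x);
        cudaFreeHost(h_x);
        return 0;
    }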

GPUs are processors targeted at the gaming market, so some of their capabilities are not perfectly suited to the HPC arena. The trade-off between performance and precision is one of them. Although double-precision floating-point arithmetic is possible on modern GPUs, the performance gap with respect to single-precision arithmetic is remarkable. Strategies that combine the single-precision performance of GPUs with the accuracy of double precision are therefore welcome.

In this dissertation, we apply an iterative-refinement approach (Chapter 4, Section 4.6) as a successful solution to this problem. This approach combines single-precision performance with double-precision accuracy for the solution of linear systems, although similar guidelines can be applied to other linear algebra operations.
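For reference, the following host-only sketch outlines the structure of mixed-precision iterative refinement on a tiny dense system. It is a didactic illustration, not the code of Section 4.6: the single-precision LU factorization (without pivoting, assumed numerically safe here) stands in for the fast GPU solver, while residuals and corrections are accumulated in double precision.

    #include <cstdio>
    #include <vector>

    // Solve (LU)x = b given a packed single-precision LU factorization
    // (unit lower triangle, no pivoting). x holds b on entry, x on exit.
    static void lu_solve(const std::vector<float> &LU, std::vector<float> &x,
                         int n)
    {
        for (int i = 0; i < n; ++i)          // forward substitution
            for (int j = 0; j < i; ++j) x[i] -= LU[i*n+j] * x[j];
        for (int i = n - 1; i >= 0; --i) {   // backward substitution
            for (int j = i + 1; j < n; ++j) x[i] -= LU[i*n+j] * x[j];
            x[i] /= LU[i*n+i];
        }
    }

    int main()
    {
        const int n = 3;
        const double A[n*n] = {4, 1, 0,  1, 4, 1,  0, 1, 4};
        const double b[n]   = {1, 2, 3};

        // Cheap, low-precision factorization A ~ L*U (the role played by
        // the GPU factorization in the dissertation).
        std::vector<float> LU(A, A + n*n);
        for (int k = 0; k < n; ++k)
            for (int i = k + 1; i < n; ++i) {
                LU[i*n+k] /= LU[k*n+k];
                for (int j = k + 1; j < n; ++j)
                    LU[i*n+j] -= LU[i*n+k] * LU[k*n+j];
            }

        // Initial solve entirely in single precision.
        std::vector<float> xs(b, b + n);
        lu_solve(LU, xs, n);
        std::vector<double> x(xs.begin(), xs.end());

        // Refinement: residual in double, correction via the float solver.
        for (int it = 0; it < 5; ++it) {
            std::vector<float> r(n);
            for (int i = 0; i < n; ++i) {
                double ri = b[i];
                for (int j = 0; j < n; ++j) ri -= A[i*n+j] * x[j];
                r[i] = (float)ri;            // residual demoted for the solve
            }
            lu_solve(LU, r, n);              // correction d: (LU)d = r
            for (int i = 0; i < n; ++i) x[i] += r[i];
        }

        printf("x = %.15f %.15f %.15f\n", x[0], x[1], x[2]);
        return 0;
    }

Each iteration reuses the cheap factorization; only the residual is computed in the higher precision, which is what makes the scheme attractive when double-precision arithmetic is comparatively slow.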


The particularities of modern GPUs make them suitable only for certain types of applications. Only those applications that exhibit high arithmetic intensity, a large degree of data parallelism, and few divergent paths fit well on these architectures.

In the framework of this dissertation, we have selected those dense linear algebra algorithms that best adapt to the GPU (for example, Level-3 BLAS in Chapter 3).

However, in situations in which certain parts of an algorithm are not well suited to the GPU architecture, we propose hybrid algorithms, in which the strengths of the different execution units in the system are exploited in different parts of the algorithm. Some examples of these strategies are shown in Chapter 4 for common linear algebra operations.

The experiments reported in this dissertation employ NVIDIA hardware. However, it is important to note that all the techniques and methodology introduced in our work can potentially be adapted to other, similar graphics architectures. In general, only a reduced set of optimized BLAS routines is necessary to port the insights gained from our study to other kinds of graphics devices.

Moreover, many of the topics covered in this dissertation are applicable not only to graphics architectures, but to any accelerator-based architecture similar to the one described for GPUs in this chapter. This generality is one of the main benefits of using a high-level approach for the development of accelerated linear algebra implementations.


Part II

Matrix computations on single-GPU systems

