Wen-mei W. Hwu is a Distinguished Research Scientist and Senior Director of Research at NVIDIA. He is also a Professor Emeritus at the University of Illinois at Urbana-Champaign and chief scientist of its Parallel Computing Institute. He has made outstanding contributions to compiler design, computer architecture, microarchitecture, and parallel computing. He is an IEEE Fellow and an ACM Fellow and has received numerous honors, including the ACM-IEEE CS Eckert-Mauchly Award, the ACM Grace Murray Hopper Award, and the ACM SIGARCH Maurice Wilkes Award. He holds a PhD in computer science from the University of California, Berkeley.
David B. Kirk is a member of the US National Academy of Engineering, an NVIDIA Fellow, and a former chief scientist of NVIDIA. In 2002 he received the ACM SIGGRAPH Computer Graphics Achievement Award for his outstanding contributions to bringing high-performance computer graphics systems to the mass market. He holds a PhD in computer science from the California Institute of Technology.
Izzat El Hajj is an Assistant Professor in the Department of Computer Science at the American University of Beirut. His research focuses on application acceleration and programming support for emerging parallel processors and memory technologies, with particular interest in GPUs and processing-in-memory. He holds a PhD in electrical and computer engineering from the University of Illinois at Urbana-Champaign.
Table of Contents:
Contents
Foreword
Preface
Acknowledgments
CHAPTER 1 Introduction 1
1.1 Heterogeneous parallel computing 3
1.2 Why more speed or parallelism 7
1.3 Speeding up real applications 9
1.4 Challenges in parallel programming 11
1.5 Related parallel programming interfaces 13
1.6 Overarching goals 14
1.7 Organization of the book 15
References 19
Part I Fundamental Concepts
CHAPTER 2 Heterogeneous data parallel computing 23
With special contribution from David Luebke
2.1 Data parallelism 23
2.2 CUDA C program structure 27
2.3 A vector addition kernel 28
2.4 Device global memory and data transfer 31
2.5 Kernel functions and threading 35
2.6 Calling kernel functions 40
2.7 Compilation 42
2.8 Summary 43
Exercises 44
References 46
CHAPTER 3 Multidimensional grids and data 47
3.1 Multidimensional grid organization 47
3.2 Mapping threads to multidimensional data 51
3.3 Image blur: a more complex kernel 58
3.4 Matrix multiplication 62
3.5 Summary 66
Exercises 67
CHAPTER 4 Compute architecture and scheduling 69
4.1 Architecture of a modern GPU 70
4.2 Block scheduling 70
4.3 Synchronization and transparent scalability 71
4.4 Warps and SIMD hardware 74
4.5 Control divergence 79
4.6 Warp scheduling and latency tolerance 83
4.7 Resource partitioning and occupancy 85
4.8 Querying device properties 87
4.9 Summary 90
Exercises 90
References 92
CHAPTER 5 Memory architecture and data locality 93
5.1 Importance of memory access efficiency 94
5.2 CUDA memory types 96
5.3 Tiling for reduced memory traffic 103
5.4 A tiled matrix multiplication kernel 107
5.5 Boundary checks 112
5.6 Impact of memory usage on occupancy 115
5.7 Summary 118
Exercises 119
CHAPTER 6 Performance considerations 123
6.1 Memory coalescing 124
6.2 Hiding memory latency 133
6.3 Thread coarsening 138
6.4 A checklist of optimizations 141
6.5 Knowing your computation’s bottleneck 145
6.6 Summary 146
Exercises 146
References 147
Part II Parallel Patterns
CHAPTER 7 Convolution
An introduction to constant memory and caching 151
7.1 Background 152
7.2 Parallel convolution: a basic algorithm 156
7.3 Constant memory and caching 159
7.4 Tiled convolution with halo cells 163
7.5 Tiled convolution using caches for halo cells 168
7.6 Summary 170
Exercises 171
CHAPTER 8 Stencil 173
8.1 Background 174
8.2 Parallel stencil: a basic algorithm 178
8.3 Shared memory tiling for stencil sweep 179
8.4 Thread coarsening 183
8.5 Register tiling 186
8.6 Summary 188
Exercises 188
CHAPTER 9 Parallel histogram 191
9.1 Background 192
9.2 Atomic operations and a basic histogram kernel 194
9.3 Latency and throughput of atomic operations 198
9.4 Privatization 200
9.5 Coarsening 203
9.6 Aggregation 206
9.7 Summary 208
Exercises 209
References 210
CHAPTER 10 Reduction
And minimizing divergence 211
10.1 Background 211
10.2 Reduction trees 213
10.3 A simple reduction kernel 217
10.4 Minimizing control divergence 219
10.5 Minimizing memory divergence 223
10.6 Minimizing global memory accesses
Excerpt:
Preface
We are proud to introduce to you the fourth edition of Programming Massively Parallel Processors: A Hands-on Approach.
Mass market computing systems that combine multicore CPUs and many-thread GPUs have brought terascale computing to laptops and exascale computing to clusters. Armed with such computing power, we are at the dawn of the widespread use of computational experiments in the science, engineering, medical, and business disciplines. We are also witnessing the wide adoption of GPU computing in key industry vertical markets, such as finance, e-commerce, oil and gas, and manufacturing. Breakthroughs in these disciplines will be achieved by using computational experiments that are of unprecedented levels of scale, accuracy, safety, controllability, and observability. This book provides a critical ingredient for this vision: teaching parallel programming to millions of graduate and undergraduate students so that computational thinking and parallel programming skills will become as pervasive as calculus skills.
The primary target audience of this book consists of graduate and undergraduate students in all science and engineering disciplines in which computational thinking and parallel programming skills are needed to achieve breakthroughs. The book has also been used successfully by industry professional developers who need to refresh their parallel computing skills and keep up with the ever-increasing speed of technology evolution. These professional developers work in fields such as machine learning, network security, autonomous vehicles, computational finance, data analytics, cognitive computing, mechanical engineering, civil engineering, electrical engineering, bioengineering, physics, chemistry, astronomy, and geography, and they use computation to advance their fields. Thus these developers are both experts in their domains and programmers. The book takes the approach of teaching parallel programming by building up an intuitive understanding of the techniques. We assume that the reader has at least some basic C programming experience. We use CUDA C, a parallel programming environment that is supported on NVIDIA GPUs. There are more than 1 billion of these processors in the hands of consumers and professionals, and more than 400,000 programmers are actively using CUDA. The applications that you will develop as part of your learning experience will be runnable by a very large user community.
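For readers who want a first taste of the CUDA C environment the preface refers to, the following is a minimal, self-contained sketch in the spirit of the vector addition kernel that the table of contents lists under Section 2.3. The names used here (vecAddKernel, h_A, d_A, and so on) are illustrative assumptions, not the book's exact listings:

```cuda
// A minimal sketch of a complete CUDA C program: vector addition.
// Names are illustrative, not the book's exact listings.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Kernel: each thread computes one element of the output vector.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: grid may overshoot n
        C[i] = A[i] + B[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t size = n * sizeof(float);

    // Allocate and initialize host vectors.
    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    for (int i = 0; i < n; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    // Allocate device global memory and copy the inputs over.
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);

    // Copy the result back (this cudaMemcpy synchronizes with the kernel).
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("C[0] = %.1f\n", h_C[0]);  // expect 3.0

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```

The bounds check if (i < n) is needed because the grid size is rounded up to a whole number of blocks, so the last block may contain threads whose indices fall past the end of the vectors.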
Since the third edition came out in 2016, we have received numerous comments from our readers and instructors. Many of them told us about the existing features they value. Others gave us ideas about how we should expand the book’s contents to make it even more valuable. Furthermore, the hardware and software for heterogeneous parallel computing have advanced tremendously since 2016. In the hardware arena, three more generations of GPU computing architectures, namely, Volta, Turing, and Ampere, have been introduced since the third edition. In the software domain, CUDA 9 through CUDA 11 have allowed programmers to access new hardware and system features. New algorithms have also been developed. Accordingly, we added four new chapters and rewrote a substantial number of the existing chapters.
The four newly added chapters include one new foundational chapter, namely, Chapter 4 (Compute Architecture and Scheduling), and three new parallel patterns and applications chapters: Chapter 8 (Stencil), Chapter 10 (Reduction and Minimizing Divergence), and Chapter 13 (Sorting). Our motivation for adding these chapters is as follows:
• Chapter 4 (Compute Architecture and Scheduling): In the previous edition the discussions on architecture and scheduling considerations were scattered across multiple chapters. In this edition, Chapter 4 consolidates these discussions into one focused chapter that serves as a centralized reference for readers who are particularly interested in this topic.
• Chapter 8 (Stencil): In the previous edition the stencil pattern…