Wednesday, August 5, 2009

GPGPU (A Historical Perspective)

In the immortal words of George Santayana, “Those who cannot remember the past are condemned to repeat it”. So let’s take a look at the past. GPGPU is nothing new. Developers have been doing it for a long time, but without any standard GPGPU API they were forced to express their non-graphics algorithms in terms of a graphics API like OpenGL or DirectX. This is NOT easy to do. A few GPGPU research projects took on the task of building a GPGPU API layer on top of the raw graphics APIs to simplify GPGPU programming for mainstream developers. The two most notable projects were BrookGPU and Sh (libsh).

BrookGPU is the Stanford University Graphics Lab’s compiler and runtime implementation of the Brook stream programming language, which uses modern graphics hardware for general-purpose computation. Brook is an extension of standard ANSI C designed to incorporate the ideas of data-parallel computing and arithmetic intensity into a familiar and efficient language. Brook has been in beta for a long time. The driving force behind BrookGPU was Ian Buck; BrookGPU was the result of his PhD thesis. The last major beta release was all the way back in October 2004, and development briefly resumed in November 2007 before stopping again.

Sh is the result of research at the University of Waterloo Computer Graphics Lab, where it was conceived and implemented by several members of the lab. Sh is a metaprogramming language for programmable GPUs, embedded in C++. As of August 2006, Sh is no longer maintained; RapidMind Inc. was formed to commercialize the research behind it.

Sometime in 2005, Nvidia decided to get into the GPGPU game. It hired Ian Buck, the man behind BrookGPU, to head up the development of CUDA (Tom’s Hardware has an interesting interview with Ian Buck). CUDA stands for Compute Unified Device Architecture, and it is Nvidia’s proprietary solution to the GPGPU programming problem. Nvidia was in a unique position: it was providing a general-purpose programming solution for GPUs that it also designed. That allowed Nvidia to change the design of its GPUs and drivers to better support GPGPU programming, and to provide GPGPU APIs that were not layered on top of the raw graphics APIs, resulting in a more streamlined product.

Well, if I were a CPU company I would have been getting a bit nervous right about now. With all of these GPGPU APIs floating around, people are going to need fewer CPUs to get their number-crunching work done, so I might start looking for a way to get into the game. This is just about the time AMD bought ATI (Nvidia’s arch-rival). By the time the acquisition was finished, AMD was playing catch-up to Nvidia in the GPGPU market segment; Nvidia already had a usable product (hardware and software) available to developers. To try to leapfrog Nvidia, AMD/ATI based its GPGPU offering on a modified version of BrookGPU. This toolkit became known as Brook+ and is maintained by AMD as part of its Stream SDK.

While all of this was going on, a tiny little company called Apple was trying to figure out how to let developers on Mac OS easily adapt their code to the multi-core CPU revolution that is currently under way. They came up with something they call Grand Central. In a nutshell, Grand Central frees developers from many of the difficult tasks involved in multi-threaded parallel programming. As I mentioned in my “Hitting the Wall” post, parallel programming is the only way developers are going to be able to make their applications run faster now that CPUs are becoming more parallel instead of faster. Someone at Apple noticed that the Grand Central work also mapped nicely onto the GPGPU problem space. So, out of the goodness of their hearts, Apple wrapped the work up into an API specification and released it to the Khronos Group (an industry consortium that creates open standards) as OpenCL (Open Computing Language).

The Khronos Group got all the major players (Intel, AMD, ATI, Nvidia, RapidMind, ClearSpeed, Apple, IBM, Texas Instruments, Toshiba, Los Alamos National Laboratory, Motorola, QNX, and others) together, and they hammered out the 1.0 version of the OpenCL specification in record time. Programs written for OpenCL are parallel by construction: the work is expressed as many small, independent pieces. This means they can take advantage not only of the multiple cores on a CPU but also of the hundreds of cores on a GPU. And it doesn’t stop there. OpenCL drivers are being written for IBM Cell processors (you know… the little fellow that powers your PS3), ATI GPUs, Nvidia GPUs, AMD CPUs, Intel CPUs, and a host of other computing devices.
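To give a flavor of what “parallel by construction” means, here is a minimal sketch of an OpenCL C kernel. The name vec_add and its arguments are just my illustration, not anything from the spec: each work-item computes a single element, and the runtime decides how those work-items are spread across whatever cores the device happens to have.

    /* Minimal OpenCL C kernel sketch: each work-item adds one element.
     * The runtime launches one work-item per element, so the same source
     * spreads across a few CPU cores or hundreds of GPU cores. */
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float       *c,
                          const unsigned int    n)
    {
        size_t i = get_global_id(0);  /* this work-item's index in the 1-D range */
        if (i < n)                    /* guard in case the range was rounded up */
            c[i] = a[i] + b[i];
    }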

This concludes our history lesson for today. So what does the future hold? Well, if you want to do GPGPU today you don’t really have a decision to make. Both AMD and Nvidia have OpenCL drivers out, but they are still in beta and not ready for prime time... yet. The only technology that is robust and mature enough today is Nvidia’s CUDA. If you want to be prepared for the future, though, OpenCL is the way to go. With OpenCL you can build a parallel program that will run on a CPU, a GPU (Nvidia’s or ATI’s), and a Cell processor. The same binary! Well, not exactly the same binary: OpenCL has a runtime compilation component, but you only pay that penalty the first time you run your program on a new device. With OpenCL you are also well positioned for new devices. When Intel releases its Larrabee processors next year, it’s a sure bet that it will provide OpenCL drivers for them.
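For the curious, that runtime compilation step boils down to a couple of host-side calls. Here is a rough sketch; error handling is omitted, and the wrapper function (and the kernel name it looks up) is my own framing rather than part of the OpenCL API:

    #include <CL/cl.h>

    /* Rough sketch: hand OpenCL the kernel source as a plain C string and
     * let the driver compile it, at run time, for whatever device you were
     * given (CPU, GPU, or Cell). Error handling omitted for brevity. */
    cl_kernel build_vec_add(cl_context ctx, cl_device_id dev, const char *src)
    {
        cl_int err;
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);  /* the compile happens here */
        return clCreateKernel(prog, "vec_add", &err);     /* look up the compiled kernel by name */
    }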

So you want to get your firm moving down the GPGPU path...?
Stay tuned for "Change... It's a Good Thing"