Friday, August 14, 2009

Hitting the Wall

No, this isn't going to be another fabricated memoir about drug addiction. This is my attempt to shed some light on what I believe is a rapidly approaching paradigm shift in parallel computing: General Purpose Graphics Processing Unit computing (GPGPU). I am a Systems Architect at a large financial institution. As you have probably read on the web, we robber barons of Wall Street use computer models concocted by PhD'ed rocket scientists and programmed by people like me. These sophisticated models enable us to build systems that virtually print money (well, not really, but pretty damn close). These systems require an enormous amount of computational resources to function. We typically satisfy those needs by parallelizing our computations across large farms of servers.

At its core, parallel computing means using multiple computational resources simultaneously to solve a single problem. The goal is to save time, money, or both. Additionally, some problems are so large that parallel computing is their only practical solution. Parallel computing is used in areas such as graphics processing, computational finance, data mining, seismology, mathematics, and physics, just to name a few.

Parallel computing is becoming more critical to the computing world due to the impending collapse of Moore's Law. Gordon Moore, a co-founder of Intel, observed that the number of transistors that can be placed inexpensively on an integrated circuit increases exponentially, doubling roughly every two years. More transistors mean faster, more powerful CPUs. This observation became known as "Moore's Law". There is also a lesser known observation credited to Moore, "Moore's Second Law".

Moore's Second Law states that the R&D, manufacturing, and QA costs associated with semiconductor fabrication also increase exponentially over time. The hypothesis is that at some point the rising fabrication costs, coupled with the physical limitations of semiconductor materials, will erect a brick wall halting the exponential advance of CPU clock rates. While Gordon Moore did not predict when this would occur, recent products from Intel and AMD suggest that they (we) have hit it. Both companies seem to have at least temporarily abandoned their relentless quest for faster clock speeds in favor of CPUs consisting of many cores.

In the past, if a program's performance was less than adequate you could typically just buy more powerful hardware to make it faster. With the impending collapse of Moore's Law this will no longer be true. So until someone comes up with subatomic computing, programmers will need to embrace parallel computing to squeeze further performance gains from their applications.

So why am I posting this? Because GPGPU is, in my opinion, going to have a major impact on the way programmers parallelize their applications in the future. By adopting technologies like CUDA and OpenCL, programmers will be able to tap into a vast amount of computational resources at a very low cost. Those who quickly adapt to this paradigm shift will thrive and prosper… those who don't will wither and die.

So you want to know more about this GPGPU thing...?
See my next post “GPGPU 101”.

Change... It's a Good Thing

So now you know what GPGPU is, what it is not, the magnitude of the performance gains you can expect to get now and in the near future, and a bit of history as to how we got here. But how can you effect the necessary change in your organization to benefit from this new technology?

Well take it from me… It ain’t easy. I’ve been following GPGPU since Nvidia released their first CUDA SDK. Given that I currently manage several computational grids for my firm, I was quick to see the potential benefits. I naively assumed that all I would need to do to get my firm moving in the GPGPU direction was to build some of the Nvidia CUDA option valuation examples and integrate them into our grid. Once I showed people the potential gains that could be achieved they would happily start moving down the path to GPGPU nirvana.

Boy was I "mistaken". The first complaint was, "It doesn't support double precision so it's useless to us. Get back to me when it supports doubles." Six months later, after it supported double precision, the complaint was, "This requires me to rewrite my code so I'm not interested. Get back to me when I can just recompile my code and it is magically 100X faster." Well, that silver bullet may come sometime down the road, but I'm not going to hold my breath (and you shouldn't either).

The most frequent reaction was, "I already have the computational capacity to get my work done. Why should I spend the necessary development resources just to save a few million dollars?" To people outside of the financial industry this must seem like an odd reaction. Why wouldn't you spend six months of development time to save a few million? Well, when you consider that many of these systems make millions of dollars a day, any savings could be wiped out (and then some) by a single mistake in rolling out GPGPU-based versions of the models.

With all of these impediments to change, how do you get the ball rolling? Well, you have to get involved in the initial stages of something new. Get involved when they don't already have a system. Get involved before they spend $10,000,000 on hardware. The resistance to change seems to disappear when you can show them their model running on a GPU side by side with a CPU-based version, getting a 30X–100X speedup. Point out that instead of paying $10,000,000 for hardware they can do the same number of calculations for $250,000.


This concludes my GPGPU overview series. I hope you found it insightful. The example sections will walk you through some CUDA, OpenCL, and PyCUDA code to get your feet wet. Under Products you will find useful information on current offerings from many different vendors (hardware and software). The Resource section will point you to documentation and tutorials to make you productive quickly. The News section will keep you up to date on various GPGPU topics.

If you would like to follow my site, be alerted to updated information, and hear about interesting events in the GPGPU industry, just go to my News section and subscribe to my site feed.

Live Free or Die
UNIX

Wednesday, August 5, 2009

GPGPU (A Historical Perspective)

In the immortal words of George Santayana, "Those who cannot remember the past are condemned to repeat it." So let's take a look at the past. GPGPU is nothing new. Developers have been doing it for a long time, but due to the lack of any standard GPGPU API they were forced to express their non-graphics algorithms in terms of a graphics API like OpenGL or DirectX. This is NOT easy to do. A few GPGPU research projects took on the task of building a GPGPU API layer on top of the raw graphics APIs to simplify GPGPU programming for mainstream developers. The two most notable projects were BrookGPU and Lib Sh.

BrookGPU is the Stanford University Graphics group's compiler and runtime implementation of the Brook stream programming language, which uses modern graphics hardware for general purpose computations. Brook is an extension of standard ANSI C designed to incorporate the ideas of data-parallel computing and arithmetic intensity into a familiar and efficient language. The driving force behind BrookGPU was Ian Buck; it grew out of his PhD thesis. Brook has been in beta for a long time. The last major beta release was all the way back in October 2004, but renewed development began, and stopped again, in November 2007.

Lib Sh is a metaprogramming language for programmable GPUs that came out of research at the University of Waterloo Computer Graphics Lab (CGL), where it was conceived and implemented by several members of the lab. As of August 2006, Lib Sh is no longer maintained. RapidMind Inc. was formed to commercialize the research behind Lib Sh.

Sometime in 2005 Nvidia decided to get into the GPGPU game. Nvidia hired Ian Buck, the man behind BrookGPU, to head up the development of CUDA (Tom's Hardware has an interesting interview with Ian Buck). CUDA stands for Compute Unified Device Architecture, and it is Nvidia's proprietary solution to the GPGPU programming problem. Nvidia was in a unique position in that it was providing a general purpose programming solution for GPUs that it also designed. This allowed Nvidia to make changes in the design of its GPUs and drivers to better facilitate GPGPU programming of their devices. They also provided GPGPU APIs that were not layered on top of the raw graphics APIs, resulting in a more streamlined product.

Well, if I were a CPU company I would have been getting a bit nervous right about now. With all of these GPGPU API implementations floating around, people are going to need fewer CPUs to get their number crunching done. I might start investigating a way to get into the game. This is just about the time AMD bought ATI (Nvidia's arch rival). By the time the acquisition was finished AMD was playing catch-up to Nvidia in the GPGPU market segment; Nvidia already had a usable product (hardware and software) available for developers. To try to leapfrog Nvidia, AMD/ATI based their GPGPU offering on a modified version of BrookGPU. This toolkit became known as Brook+.

While all of this was going on, a tiny little company called Apple was trying to figure out a way to help developers on Mac OS easily adapt their code to the multi-core CPU revolution that is currently under way. They came up with something they call Grand Central. In a nutshell, Grand Central frees developers from many of the difficult tasks involved in multi-threaded parallel programming. As I mentioned in my "Hitting the Wall" post, parallel programming is the only way developers are going to be able to make their applications run faster now that CPUs are becoming more parallel instead of faster. Someone at Apple happened to notice that the Grand Central work they were doing also mapped nicely onto the GPGPU problem space. So, out of the goodness of their hearts, Apple wrapped that work up into an API specification and released it to the Khronos Group (an industry consortium that creates open standards) as OpenCL (Open Computing Language).

The Khronos Group got all the major players (Intel, AMD, ATI, Nvidia, RapidMind, ClearSpeed, Apple, IBM, Texas Instruments, Toshiba, Los Alamos National Laboratory, Motorola, QNX, and others) together, and they hammered out the 1.0 version of the OpenCL specification in record time. Programs written for OpenCL are by definition expressed in a parallel manner. This means that not only can they take advantage of the multiple cores on a CPU, they can also take advantage of the hundreds of cores on a GPU. But it doesn't stop there. OpenCL drivers are being written for IBM Cell processors (you know… the little fellow that powers your PS3), ATI GPUs, Nvidia GPUs, AMD CPUs, Intel CPUs, and a host of other computing devices.

This concludes our history lesson for today. So what does the future hold? Well, if you want to do GPGPU today you don't really have a decision to make. While both AMD and Nvidia have OpenCL drivers out, they are still in beta and not ready for prime time... yet. The only technology that is robust and mature enough today is Nvidia's CUDA. If you want to be prepared for the future, then OpenCL is the way to go. With OpenCL you can build a parallel program that will run on a CPU, a GPU (Nvidia's or ATI's), and a Cell processor. The same binary! Well, not exactly the same binary: there is a runtime compilation component to OpenCL, but you only pay that penalty the first time you run your program on a new device. With OpenCL you are well positioned for new devices. When Intel releases its Larrabee processors next year it's a sure bet that they will be providing OpenCL drivers for them.

So you want to get your firm moving down the GPGPU path...?
Stay tuned for "Change... It's a Good Thing"

Tuesday, August 4, 2009

Don’t Hammer in Screws

My shop teacher in high school once told me that it is very important to pick the right tool for the job. To a hammer, the entire world looks like a nail. Don't get caught hammering in screws. While GPGPU can achieve amazing performance gains, it is only good at certain things. The key is to know what it is good at and use it for that, and to know what it is not good at and NOT use it for that. It seems so obvious when you say "Don't hammer in screws", but it is not so obvious when someone suggests you should build a high speed trading system in Java. Now, don't get me wrong, Java is a fine language for building certain types of systems. But perhaps it is not the best choice for building high speed trading systems. Let's explore what types of algorithms work best on GPUs.

Gene Amdahl, a computer architect at IBM, is the namesake of "Amdahl's Law". Amdahl's Law states that the speedup of a program using multiple processors is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours using a single processor core, and a particular 1-hour portion cannot be parallelized while the remaining 19 hours (95%) can, then no matter how many processors we devote to a parallelized execution of this program, the minimal execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20X. The flip side of this analysis is that if the 19-hour portion cannot be parallelized and only the remaining 1 hour (5%) can, then the minimal execution time cannot be less than 19 hours, and the best possible speedup is barely more than 1X, no matter how much hardware you throw at it. A quick way to check these numbers is sketched below.
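
If you want to play with the arithmetic yourself, here is a tiny throwaway helper (entirely my own, not from any toolkit) that evaluates the usual closed form of Amdahl's Law, speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n is the number of processors:

#include <stdio.h>

/* Amdahl's Law: best-case speedup when a fraction p of the runtime
   can be parallelized across n processors. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* 95% parallelizable: approaches the 20X ceiling as n grows. */
    printf("p = 0.95, n = 1000 -> %.1fX\n", amdahl_speedup(0.95, 1000));
    /* Only 5% parallelizable: stuck just above 1X forever. */
    printf("p = 0.05, n = 1000 -> %.2fX\n", amdahl_speedup(0.05, 1000));
    return 0;
}

With 1,000 processors the two scenarios above come out to roughly 19.6X and 1.05X, which is why the serial fraction, not the processor count, is the number to worry about.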

Michael Flynn, a professor emeritus at Stanford University, came up with a method of classifying computer architectures. This method of classification became known as "Flynn's Taxonomy". The taxonomy is based on the number of concurrent instruction (or control) and data streams available in the architecture. Flynn's breakdown is as follows:

Single Instruction, Single Data (SISD)
Single Instruction, Multiple Data (SIMD)
Multiple Instruction, Single Data (MISD)
Multiple Instruction, Multiple Data (MIMD)

SISD is a sequential computer which exploits no parallelism in either the instruction or data streams. Examples of the SISD architecture are traditional uniprocessor machines like an old PC or an old mainframe. SIMD describes a computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized; an array processor or a GPU, for example.

Let’s dig a little deeper into SIMD. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.
For instance, consider a 2-processor system (CPUs A and B) in a parallel environment, and suppose we wish to perform a task on some data d. We can tell CPU A to do the task on one part of d and CPU B on another part simultaneously, thereby reducing the duration of the execution. The data can be divided up with simple conditional logic, as in the sketch below. As a specific example, consider adding two matrices. In a SIMD implementation, CPU A could add all elements from the top half of the matrices, while CPU B could add all elements from the bottom half. Since the two processors work in parallel, the matrix addition would take roughly half the time of performing the same operation serially on one CPU alone.

Now imagine that instead of two processors you have 200 (as many modern GPUs do). In theory you should be able to achieve a 200X speedup of this algorithm by running it on a GPU. In practice you will have to copy data onto and off of the device, and possibly shuffle data back and forth between shared memory and global memory on the GPU, so you are not going to get the full 200X. But you will still get a very significant speedup.
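
To make that concrete, here is a minimal sketch of what the same element-wise addition might look like in CUDA, with one GPU thread per element instead of two CPUs. The names (matAdd and friends) are mine and purely illustrative, not taken from any SDK sample:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// One thread adds one element; the conditional keeps surplus threads
// in the last block from running off the end of the arrays.
__global__ void matAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int N = 1 << 20;                       // 1M elements (matrices stored flat)
    size_t bytes = N * sizeof(float);

    float *ha = (float*)malloc(bytes);
    float *hb = (float*)malloc(bytes);
    float *hc = (float*)malloc(bytes);
    for (int i = 0; i < N; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (N + threads - 1) / threads;   // enough blocks to cover all N elements
    matAdd<<<blocks, threads>>>(da, db, dc, N);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);                // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}

The partitioning that the CPU example did with a conditional ("top half goes to A, bottom half to B") falls out of the thread index itself: each of the million elements gets its own thread, and the hardware schedules those threads across whatever cores the card has.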

So before you go porting all your code to GPUs, you need to make sure that the code you are porting is the right code… the code that can be sped up significantly. Now how do you define significantly? Well, that may differ from situation to situation. If it is a program that you run once a month and it takes 4 hours, a 100X speedup might not be enough to warrant porting it. If it is an algorithm that runs 1,000,000,000 times a day, a 2X speedup might warrant porting it.

Stay tuned for “GPGPU (A Historical Perspective)”

Monday, August 3, 2009

Cage Match (CPU vs. GPU)

In computing, most central processing units are labeled in terms of their clock speed expressed in gigahertz. This number refers to the frequency of the CPU's master clock signal ("clock speed"), which is simply an electrical voltage that changes from low to high and back again at regular intervals. Hertz has become the unit of measurement the general populace uses to judge the speed of a CPU. While it may make sense to compare CPUs of like architecture in terms of their clock speed, it does not make sense to compare the clock speeds of disparate CPU architectures. For example, if we compare a subscalar CPU with a clock speed of 3.2 GHz to a superscalar CPU with a clock speed of 2.2 GHz, we cannot really tell which one is faster based on their GHz ratings alone.

To compare heterogeneous CPU architectures in terms of their speed we need a different metric than clock speed. The industry accepted measure is FLOPS (FLoating point Operations Per Second). For example, if a system is rated at one GFLOPS of computational capacity, it can perform 10^9 (1,000,000,000) floating point operations per second. The fastest computer in existence today is capable of roughly 1.4 petaFLOPS (1,400,000,000,000,000 floating point operations per second).

So now that we have a universal measure of computational capacity, how does a CPU stack up against a GPU?

From the figure above we can see that Intel's Harpertown chips have a rating of about 80 GFLOPS, while Nvidia's most powerful card shipping today offers just shy of a teraFLOP of computational capacity. According to this chart the GPU is roughly 10 times faster than the CPU. While this is interesting, I don't really care how fast they can run some esoteric benchmark designed to measure peak performance. What I really care about is how many stock options they can run a theoretical value calculation on, using a binomial model, in comparable time intervals.
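
For readers who have never seen one, here is a rough sketch of what a per-option binomial pricer can look like on the GPU. This is a generic CRR-style European call example written purely for illustration; it is not the program used in the tests below, and every identifier and parameter in it is my own:

#include <cuda_runtime.h>
#include <math.h>

#define MAX_STEPS 128   // cap on tree depth so each thread can use a local array

// Each thread prices one European call with a recombining binomial tree.
__global__ void binomialCall(const float *spot, const float *strike, const float *expiry,
                             float rate, float vol, int steps, float *price, int nOptions)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nOptions || steps > MAX_STEPS) return;

    float dt   = expiry[i] / steps;
    float u    = expf(vol * sqrtf(dt));             // up move
    float d    = 1.0f / u;                          // down move
    float p    = (expf(rate * dt) - d) / (u - d);   // risk-neutral up probability
    float disc = expf(-rate * dt);                  // one-step discount factor

    // Payoffs at expiry for each terminal node of the tree.
    float v[MAX_STEPS + 1];
    for (int j = 0; j <= steps; ++j) {
        float st = spot[i] * powf(u, (float)j) * powf(d, (float)(steps - j));
        v[j] = fmaxf(st - strike[i], 0.0f);
    }

    // Walk the tree backwards from expiry to today.
    for (int n = steps - 1; n >= 0; --n)
        for (int j = 0; j <= n; ++j)
            v[j] = disc * (p * v[j + 1] + (1.0f - p) * v[j]);

    price[i] = v[0];
}

Each option is completely independent of the others, which is exactly the kind of work a GPU eats up. A production kernel would typically use one thread block per option and keep the tree in shared memory rather than a per-thread local array, but the overall shape of the computation is the same.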

Well, as luck would have it, I happen to have just such a program available for both the CPU and the GPU, so let's run a few tests and see what we "see". The first graph compares the CPU model running on a 3.2 GHz Intel Xeon W5580 (codenamed Nehalem) to the GPU model running on an Nvidia Tesla card. The Intel CPU is the fastest CPU shipping today from any vendor.

As the chart clearly shows, the GPU runs this particular algorithm close to 10X faster than the CPU. I'm using a double precision algorithm for this comparison, and the Nvidia GPU is not very fast at double precision math right now. According to information leaked to the web, Nvidia is currently working on a new chip, codenamed Fermi, containing more double precision ALUs, which should yield the performance results in the chart below. (I'm not getting into the details of my calculation, so until they start shipping the new card you'll just have to take my word for it.)



By adding more double precision ALUs, Nvidia's next generation GPUs should run this algorithm 30X faster than the fastest CPU shipping today. So what does this mean…? To a programmer in the financial industry it means that if I port my option valuation algorithm to run on the GPU, I will need roughly 30X fewer machines to perform the same number of calculations, which reduces the cost of my computation farm by about 30X. Now, if my computation farm only cost $30,000, I'm only going to save $29,000, so it's probably not worth it. But if I am pricing every option in the American and European universe in real time (theoretical values and implied volatility), my computation farm might cost somewhere around half a billion dollars. Running this algorithm on the GPU instead of the CPU could then save me around $480,000,000.

Nvidia CEO Jen-Hsun Huang, speaking at the Hot Chips symposium at Stanford University, predicted that GPU computing will experience a rapid performance boost over the next six years. According to Huang, GPU compute is likely to increase its current capabilities by 570X, while 'pure' CPU performance will progress by a limited 3X. Since GPUs are already 30X faster at running our binomial option model, and they are projected to get 570X faster than that while the CPU only gets 3X faster, we end up with something on the order of 30 × 570 / 3 = 5700X faster than running on a CPU. Now, these are just his predictions, but perhaps six years from now we will be talking about "Huang's Law" instead of Moore's.

Pretty amazing… right…? Well before you run off and tell your boss you are going to port your entire code base to run on GPUs you’re going to need to consider a few more data points.

See the next post “Don’t Hammer in Screws”

GPGPU 101

GPGPU is the practice of using graphics processors for general purpose computing. Why on earth would you want to jump through all the hoops required to do that? To answer this we first need to look at how CPUs and GPUs really work.

A CPU (Central Processing Unit) functions by executing a sequence of instructions. These instructions reside in some sort of main memory and typically go through four distinct phases during their CPU lifecycle: fetch, decode, execute, and writeback. During the fetch phase the instruction is retrieved from main memory and loaded onto the CPU. Once the instruction is fetched it is decoded, or broken down into an opcode (the operation to be performed) and operands containing values or memory locations to be operated on by the opcode. Once it is determined what operation needs to be performed, the operation is executed. This may involve copying data to or from the memory locations specified in the instruction's operands, or having the ALU (arithmetic logic unit) perform a mathematical operation. The final phase is the writeback of the result to either main memory or a CPU register. After the writeback the entire process repeats. This simple form of CPU operation is referred to as subscalar, in that the CPU executes one instruction operating on one or two pieces of data at a time. Given the sequential nature of this design, it takes four clock cycles to process a single instruction; that is why this type of operation is called "sub"scalar, or less than one instruction per clock cycle.
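
Just to make the "one instruction at a time" idea concrete, here is a toy, completely made-up processor in a few lines of C. The instruction set and program are invented for illustration and have nothing to do with any real CPU:

#include <stdio.h>
#include <stdint.h>

// A toy instruction: an opcode plus two operands.
enum { OP_LOAD, OP_ADD, OP_STORE, OP_HALT };
typedef struct { uint8_t op; uint8_t a; uint8_t b; } Instr;

int main(void)
{
    int mem[8] = {3, 4, 0};            // "main memory" data
    int reg[2] = {0, 0};               // CPU registers
    Instr program[] = {                // tiny program: mem[2] = mem[0] + mem[1]
        {OP_LOAD,  0, 0},              // reg[0] = mem[0]
        {OP_LOAD,  1, 1},              // reg[1] = mem[1]
        {OP_ADD,   0, 1},              // reg[0] = reg[0] + reg[1]
        {OP_STORE, 2, 0},              // mem[2] = reg[0]
        {OP_HALT,  0, 0},
    };

    for (int pc = 0; ; ++pc) {
        Instr ins = program[pc];       /* fetch: pull the next instruction      */
        switch (ins.op) {              /* decode: figure out which operation    */
        case OP_LOAD:  reg[ins.a] = mem[ins.b];  break;   /* execute + writeback */
        case OP_ADD:   reg[ins.a] += reg[ins.b]; break;
        case OP_STORE: mem[ins.a] = reg[ins.b];  break;
        case OP_HALT:  printf("mem[2] = %d\n", mem[2]); return 0;
        }
    }
}

Each trip around the loop handles exactly one instruction touching at most two pieces of data, which is the subscalar behavior described above. Everything that follows, pipelining, superscalar dispatch, and ultimately GPUs, is about getting more of this done per clock cycle.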

To make their CPUs faster, chip manufacturers started to create parallel execution paths in their CPUs by pipelining their instructions. Pipelining allows more than one step in the instruction lifecycle to be performed at any given time by breaking the pathway down into discrete stages. This separation can be compared to an assembly line, in which an instruction becomes more complete at each stage until it exits the execution pipeline and is retired. While pipelining instructions results in faster CPUs, the best performance that can be achieved this way is scalar, or one complete instruction per cycle.

To achieve speeds faster than scalar (that is, superscalar), chip manufacturers started to embed multiple execution units in their designs, increasing the degree of parallelism even more. In a superscalar pipeline, multiple instructions are read and passed to a dispatcher, which decides whether or not the instructions can be executed in parallel. If so, they are dispatched to available execution units, allowing several instructions to execute simultaneously. In general, the more instructions a superscalar CPU is able to dispatch simultaneously to waiting execution units, the more instructions will be completed in a given clock cycle. By using techniques like instruction pipelining and multiple execution units, modern CPUs have significantly increased their degree of instruction parallelism; however, they still lag far behind GPUs, as discussed below.

A GPU is a special purpose processor, known as a stream processor, specifically designed to perform a very large number of floating point operations in parallel. These processors may be integrated on the motherboard or attached via a PCI Express card. Modern GPUs typically contain several multiprocessors, each containing many processing cores. Today's high end cards typically have gigabytes of dedicated memory and several hundred cores running thousands of threads, all dedicated to performing floating point math.
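
If you have a CUDA-capable card and the toolkit installed, a few lines of host code using the CUDA runtime's cudaGetDeviceProperties call will show you exactly what you are working with. This is just a quick illustrative query, not part of any of the examples later on:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    // Query the first CUDA device and print the numbers discussed above.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found.\n");
        return 1;
    }
    printf("Device:             %s\n", prop.name);
    printf("Multiprocessors:    %d\n", prop.multiProcessorCount);
    printf("Global memory:      %lu MB\n", (unsigned long)(prop.totalGlobalMem >> 20));
    printf("Clock rate:         %d MHz\n", prop.clockRate / 1000);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}

Current Tesla cards, for example, report 30 multiprocessors with 8 scalar cores each, which is where the "several hundred cores" figure above comes from.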

Stream processing is a technique used to achieve a limited form of parallelism known as data level parallelism. The concepts behind stream processing originated back in the heyday of the supercomputer. Applications running on a stream processor can use multiple computational units, such as the floating point units on a GPU, without explicitly managing allocation, synchronization, or communication among those units. Not every algorithm can be expressed as a data parallel solution, but those that can will realize significant performance gains by running on a GPU, taking advantage of the massive parallelism of the device compared to the much more limited parallelism of modern CPUs.

While this explains what GPGPU is, it still doesn't answer the question "Why should I do it?"
See my next post “Cage Match (CPU vs GPU)”