Monday, August 3, 2009

Cage Match (CPU vs. GPU)

In computing, most central processing units are marketed in terms of their clock speed, expressed in gigahertz. This number refers to the frequency of the CPU's master clock signal ("clock speed"): an electrical voltage that changes from low to high and back again at regular intervals. Hertz has become the measurement the general populace accepts as the speed of a CPU. While it may make sense to compare CPUs of the same architecture by their clock speed, it does not make sense to compare clock speeds across disparate architectures, because different architectures complete different amounts of work per clock cycle. For example, if we compare a subscalar CPU clocked at 3.2 GHz with a superscalar CPU clocked at 2.2 GHz, we cannot really tell which one is faster from their GHz ratings alone.
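To make that concrete, here is a tiny sketch showing that throughput is clock speed times instructions completed per cycle, so the lower-clocked superscalar part can still come out ahead. The IPC figures are made-up illustrative numbers, not measurements of any real chips.

// Illustrative only: the IPC numbers below are invented to show why clock
// speed alone does not determine throughput.
#include <cstdio>

int main()
{
    // Hypothetical subscalar CPU: 3.2 GHz, but less than one instruction per cycle.
    double clock_a_ghz = 3.2, ipc_a = 0.8;
    // Hypothetical superscalar CPU: 2.2 GHz, but several instructions per cycle.
    double clock_b_ghz = 2.2, ipc_b = 3.0;

    // Throughput (billions of instructions per second) = clock (GHz) * IPC.
    printf("CPU A: %.2f billion instructions/sec\n", clock_a_ghz * ipc_a);  // 2.56
    printf("CPU B: %.2f billion instructions/sec\n", clock_b_ghz * ipc_b);  // 6.60
    return 0;
}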

To compare heterogeneous CPU architectures in terms of their speed, we need a different metric than clock speed. The industry-accepted measure is FLOPS (FLoating point Operations Per Second). For example, if a system is rated at one GFLOPS of computational capacity, it can perform 10^9 (1,000,000,000) floating point operations per second. The fastest computer in existence today is capable of about 1.4 petaFLOPS (1,400,000,000,000,000 floating point operations per second).
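Peak FLOPS ratings are usually derived on paper rather than measured. The sketch below shows the usual back-of-the-envelope formula; the machine parameters are illustrative assumptions of my own, not the specs of any system mentioned in this post.

// Back-of-the-envelope peak FLOPS estimate with made-up machine parameters.
#include <cstdio>

int main()
{
    double sockets         = 2;     // CPUs per node
    double cores           = 4;     // cores per CPU
    double clock_ghz       = 3.0;   // clock speed in GHz
    double flops_per_cycle = 4;     // e.g. one 2-wide SIMD add and one 2-wide SIMD multiply per cycle

    // Peak GFLOPS = sockets * cores * clock (GHz) * floating point ops per cycle.
    double peak_gflops = sockets * cores * clock_ghz * flops_per_cycle;
    printf("Peak: %.0f GFLOPS = %.0f floating point operations per second\n",
           peak_gflops, peak_gflops * 1e9);
    return 0;
}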

So now that we have a universal measure of computational capacity, how does a CPU stack up against a GPU?

From the figure above we can see that Intel’s Harpertown chips have roughly an 80 GFLOPS rating, while Nvidia’s most powerful card shipping today offers just shy of a teraFLOPS of computational capacity. According to this chart, the GPU is about 10 times faster than the CPU. While this is interesting, I don’t really care how fast they can run some esoteric benchmark designed to measure raw computational capacity. What I really care about is how many stock options each can value with a binomial model (theoretical value) in a comparable time interval.

Well, as luck would have it, I happen to have just such a program available for both the CPU and the GPU, so let’s run a few tests and see what we “see”. The first graph compares the CPU model running on a 3.2 GHz Intel Xeon W5580 (codenamed Nehalem) with the GPU model running on an Nvidia Tesla card. The Intel CPU is the fastest CPU shipping today from any vendor.
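For readers who have not seen one, here is a minimal sketch of how a per-option binomial valuation can be expressed as a CUDA kernel, with one thread pricing one European call using a Cox-Ross-Rubinstein tree in double precision. This is not the program behind the charts here; the structure, names, and fixed step count are illustrative assumptions of my own.

// Minimal sketch: one CUDA thread prices one European call with a
// Cox-Ross-Rubinstein binomial tree in double precision.
#include <cuda_runtime.h>
#include <cstdio>
#include <cmath>

#define STEPS 256   // tree depth, fixed so each thread can use a local array

struct Option { double S, K, T, r, sigma; };

__global__ void binomial_price(const Option* opts, double* prices, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    Option o    = opts[i];
    double dt   = o.T / STEPS;
    double u    = exp(o.sigma * sqrt(dt));        // up factor
    double d    = 1.0 / u;                        // down factor
    double p    = (exp(o.r * dt) - d) / (u - d);  // risk-neutral up probability
    double disc = exp(-o.r * dt);                 // one-step discount factor

    // Option values at the terminal nodes of the tree.
    double v[STEPS + 1];
    double s = o.S * pow(d, (double)STEPS);       // stock price at the lowest terminal node
    for (int j = 0; j <= STEPS; ++j) {
        v[j] = fmax(s - o.K, 0.0);                // call payoff at this node
        s *= u * u;                               // move to the next (higher) node
    }

    // Step backwards through the tree to the root.
    for (int step = STEPS - 1; step >= 0; --step)
        for (int j = 0; j <= step; ++j)
            v[j] = disc * (p * v[j + 1] + (1.0 - p) * v[j]);

    prices[i] = v[0];
}

int main()
{
    Option h_opt = { 100.0, 100.0, 1.0, 0.05, 0.2 };   // one sample option
    Option* d_opt;  double* d_price;  double h_price;

    cudaMalloc(&d_opt, sizeof(Option));
    cudaMalloc(&d_price, sizeof(double));
    cudaMemcpy(d_opt, &h_opt, sizeof(Option), cudaMemcpyHostToDevice);

    binomial_price<<<1, 32>>>(d_opt, d_price, 1);
    cudaMemcpy(&h_price, d_price, sizeof(double), cudaMemcpyDeviceToHost);

    printf("price = %f\n", h_price);               // ~10.45 for these inputs
    cudaFree(d_opt); cudaFree(d_price);
    return 0;
}

A real pricing farm would batch thousands of options per kernel launch, which is exactly where the GPU's wide parallelism pays off.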

As the chart clearly shows, the GPU can run this particular algorithm close to 10X faster than the CPU. I’m using a double precision algorithm for this comparison, and the Nvidia GPU is not very fast at double precision math right now. According to information leaked to the web, Nvidia is currently working on a new chip, codenamed Fermi, containing more double precision ALUs, which should yield the performance results in the chart below. (I’m not getting into the details of my calculation, so until they start shipping their new card you’ll just have to take my word for it.)

By adding more double precision ALUs, Nvidia’s next generation GPUs should be 30X faster at running this algorithm than the fastest CPU shipping today. So what does this mean…? To a programmer in the financial industry it means that if I port my option valuation algorithm to the GPU, I will need one thirtieth as many machines to perform the same number of calculations, which cuts the cost of my computation farm by roughly a factor of 30. Now, if my computation farm only cost $30,000, I’m only going to save about $29,000, so it’s probably not worth it. But if I am pricing every option in the American and European universe in real time (theoretical values and implied volatility), my computation farm might cost somewhere around half a billion dollars. Running this algorithm on the GPU instead of the CPU could then save me around $480,000,000.
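For anyone who wants to check the arithmetic, here it is spelled out. The farm costs and the 30X speedup are the round numbers from this post, not actual price quotes.

// The savings arithmetic from the paragraph above, spelled out.
#include <cstdio>

int main()
{
    double speedup = 30.0;
    double farms[] = { 30000.0, 500000000.0 };   // small farm, "whole universe" farm

    for (double cost : farms) {
        double gpu_cost = cost / speedup;        // one thirtieth as much hardware
        printf("farm $%.0f -> GPU farm $%.0f, savings $%.0f\n",
               cost, gpu_cost, cost - gpu_cost);
    }
    return 0;
}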

Nvidia CEO Jen-Hsun Huang, speaking at the Hot Chips symposium at Stanford University, predicted that GPU computing will experience a rapid performance boost over the next six years. According to Huang, GPU compute is likely to increase its current capabilities by 570X, while 'pure' CPU performance will progress by a limited 3X. Since GPUs are already 30X faster at running our binomial option model, a further 570X gain on the GPU against a 3X gain on the CPU works out to roughly 30 × 570 / 3 = 5700X faster than running on a CPU. Now, these are just his predictions, but perhaps six years from now we will be talking about "Huang's Law" instead of Moore's.

Pretty amazing… right…? Well, before you run off and tell your boss you are going to port your entire code base to run on GPUs, you’re going to need to consider a few more data points.

See the next post “Don’t Hammer in Screws”