Tuesday, August 4, 2009

Don’t Hammer in Screws

My shop teacher in high school once told me that it is very important to pick the right tool for the job. To a hammer, the entire world looks like a nail. Don’t get caught hammering in screws. While GPGPU can achieve amazing performance gains, it is only good at certain things. The key is to know what it is good at and use it for that, and to know what it is not good at and NOT use it for that. It seems so obvious when you say “don’t hammer in screws,” but it is not so obvious when someone suggests you should build a high-speed trading system in Java. Now, don’t get me wrong, Java is a fine language for building certain types of systems. But perhaps it is not the best choice for building high-speed trading systems. Let’s explore what types of algorithms work best on GPUs.

Gene Amdahl, a computer architect at IBM, formulated what is now known as “Amdahl’s Law.” Amdahl’s Law states that the speedup of a program using multiple processors is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours on a single processor core, and a particular 1-hour portion cannot be parallelized while the remaining 19 hours (95%) can be, then no matter how many processors we devote to a parallelized execution of the program, the minimal execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20X. The flip side of this analysis is that if the 19-hour portion cannot be parallelized and only the remaining 1 hour (5%) can be, then no matter how many processors we devote to it, the minimal execution time cannot be less than 19 hours, a speedup of barely 1.05X.
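Written as a formula (with p the parallelizable fraction and N the number of processors, my choice of symbols rather than Amdahl’s original notation), the law is:

\[ \mathrm{Speedup}(N) = \frac{1}{(1 - p) + \frac{p}{N}} \]

With p = 0.95, letting N grow without bound drives p/N toward zero, so the speedup approaches 1/0.05 = 20, the 20X ceiling from the example above. With p = 0.05 it can never exceed 1/0.95 ≈ 1.05.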

Michael Flynn, a professor emeritus at Stanford University, came up with a method of classifying computer architectures. This classification became known as “Flynn’s Taxonomy.” The taxonomy is based on the number of concurrent instruction (or control) streams and data streams available in the architecture. Flynn’s breakdown is as follows:

Single Instruction, Single Data (SISD)
Single Instruction, Multiple Data (SIMD)
Multiple Instruction, Single Data (MISD)
Multiple Instruction, Multiple Data (MIMD)

SISD describes a sequential computer that exploits no parallelism in either the instruction or data streams; traditional uniprocessor machines, such as older PCs and mainframes, are examples. SIMD describes a computer that applies a single instruction stream to multiple data streams to perform operations that are naturally parallelizable; array processors and GPUs are examples.

Let’s dig a little deeper into SIMD. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.
For instance, consider a 2-processor system (CPUs A and B) in a parallel environment, and suppose we wish to perform a task on some data d. We can tell CPU A to do the task on one part of d and CPU B on another part simultaneously, thereby reducing the duration of the execution. The data can be assigned to processors using conditional statements, as in the sketch below. As a specific example, consider adding two matrices. In a SIMD implementation, CPU A could add all the elements in the top half of the matrices, while CPU B adds all the elements in the bottom half. Since the two processors work in parallel, the matrix addition takes half the time of performing the same operation serially on one CPU alone.

Now imagine that instead of two processors you have 200 (as many modern GPUs do). In theory you should be able to achieve a 200X speedup of this algorithm by running it on a GPU. In practice you will have to copy data onto and off of the device, and possibly swap data back and forth between shared memory and global memory on the GPU, so you are not going to actually get a 200X speedup. But you will get a very significant one.
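To make the matrix-addition example concrete, here is a minimal CUDA sketch. Treat it as illustrative rather than definitive: the kernel name matrixAdd, the matrix size, and the launch configuration are my own choices, and the two matrices are flattened into one-dimensional arrays so that each thread adds exactly one pair of elements.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Every thread runs this same instruction stream on different data: SIMD.
__global__ void matrixAdd(const float *a, const float *b, float *c, int n)
{
    // Each thread computes the index of the one element it owns.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // conditional: threads past the end do nothing
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int N = 1024 * 1024;            // both matrices flattened to N elements
    const size_t bytes = N * sizeof(float);

    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    // Copying data onto the device: part of the overhead that keeps
    // the real speedup below the ideal 200X.
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256;
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    matrixAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, N);

    // ...and copying the result back off the device.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f (expected 3.0)\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

Note the two directions of cudaMemcpy: that transfer overhead is precisely why the ideal 200X never fully materializes in practice.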

So before you go porting all your code to GPUs, you need to make sure the code you are porting is the right code… the code that can be sped up significantly. Now, how do you define “significantly”? That may differ from situation to situation. If it is a program you run once a month and it takes 4 hours, even running 100X faster might not be enough to warrant porting it. If it is an algorithm that runs 1,000,000,000 times a day, a 2X speedup might warrant porting it.
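As a back-of-the-envelope check on that trade-off (with R runs per day, original per-run time t, and speedup S, symbols of my own choosing), the total time recovered is roughly:

\[ \text{time saved per day} = R \cdot t \cdot \left(1 - \frac{1}{S}\right) \]

For the monthly job, R · t totals just 4 hours per month, so even S = 100 recovers at most about 3.96 hours a month. For the algorithm run 1,000,000,000 times a day, S = 2 recovers half of all the machine time that algorithm consumes, every day, however large that total is.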

Stay tuned for “GPGPU (A Historical Perspective)”