Monday, August 3, 2009

GPGPU 101

GPGPU is the practice of using graphics processors for general-purpose computing. Why on earth would you want to jump through all the hoops required to do that? To answer this we first need to look at how CPUs and GPUs really work.

A CPU (Central Processing Unit) functions by executing a sequence of instructions. These instructions reside in some sort of main memory and typically go through four distinct phases during their CPU lifecycle: fetch, decode, execute, and writeback. During the fetch phase the instruction is retrieved from main memory and loaded onto the CPU. Once the instruction is fetched it is decoded, or broken down into an opcode (the operation to be performed) and operands containing the values or memory locations the opcode operates on. Once it is determined what operation needs to be performed, the operation is executed. This may involve copying memory to locations specified in the instruction's operands or having the ALU (arithmetic logic unit) perform a mathematical operation. The final phase is the writeback of the result to either main memory or a CPU register. After the writeback the entire process repeats. This simple form of CPU operation is referred to as subscalar, in that the CPU executes one instruction operating on one or two pieces of data at a time. Given the sequential nature of this design, and assuming each phase takes one clock cycle, it takes four clock cycles to process a single instruction. That is why this type of operation is referred to as “sub”scalar: it completes less than one instruction per clock cycle.
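To make the four phases concrete, here is a minimal sketch of a toy subscalar CPU in Python. The instruction format, the LOAD and ADD opcodes, and the one-cycle-per-phase cost are all assumptions invented for illustration; no real instruction set looks like this.

```python
# Toy model of the fetch/decode/execute/writeback cycle.
# The opcodes (LOAD, ADD) and tuple layout are hypothetical.

def run_subscalar(program, registers):
    """Run each instruction through all four phases before starting the next."""
    cycles = 0
    pc = 0  # program counter: index of the next instruction
    while pc < len(program):
        instruction = program[pc]           # fetch
        cycles += 1
        opcode, dest, a, b = instruction    # decode into opcode and operands
        cycles += 1
        if opcode == "LOAD":                # execute
            result = a                      # 'a' is an immediate value
        elif opcode == "ADD":
            result = registers[a] + registers[b]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        cycles += 1
        registers[dest] = result            # writeback
        cycles += 1
        pc += 1
    return cycles

program = [
    ("LOAD", "r0", 2, None),
    ("LOAD", "r1", 3, None),
    ("ADD", "r2", "r0", "r1"),
]
registers = {}
cycles = run_subscalar(program, registers)
print(registers["r2"], cycles)  # 5 12
```

Three instructions cost twelve cycles here, or one quarter of an instruction per cycle, which is the "less than one" in subscalar.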

To make their CPUs faster, chip manufacturers started to create parallel execution paths in their CPUs by pipelining their instructions. Pipelining allows more than one step in the CPU lifecycle to be performed at any given time by breaking down the pathway into discrete stages. This separation can be compared to an assembly line, in which an instruction is made more complete at each stage until it exits the execution pipeline and is retired. While pipelining results in faster CPUs, the best performance that can be achieved is scalar, or one complete instruction per cycle.
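The assembly-line effect can be checked with a little arithmetic. The sketch below assumes an ideal four-stage pipeline with one cycle per stage and no stalls (real pipelines do stall); the function names are made up for this example.

```python
def unpipelined_cycles(n_instructions, n_stages=4):
    """Each instruction finishes all stages before the next one starts."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=4):
    """Ideal pipeline: the first instruction takes n_stages cycles to drain,
    then one further instruction retires on every following cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

for n in (1, 10, 1000):
    slow = unpipelined_cycles(n)
    fast = pipelined_cycles(n)
    print(n, slow, fast, round(n / fast, 3))  # last column: instructions per cycle
```

For 1000 instructions the pipelined count is 1003 cycles against 4000, so throughput approaches, but never exceeds, one instruction per cycle.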

To achieve speeds faster than scalar (that is, superscalar performance), chip manufacturers started to embed multiple execution units in their designs, increasing their degree of parallelism even more. In a superscalar pipeline, multiple instructions are read and passed to a dispatcher, which decides whether or not the instructions can be executed in parallel. If so, they are dispatched to available execution units, allowing several instructions to be executed simultaneously. In general, the more instructions a superscalar CPU is able to dispatch simultaneously to waiting execution units, the more instructions will be completed in a given clock cycle. By using techniques like instruction pipelining and adding multiple execution units, modern CPUs have significantly increased their degree of instruction parallelism. However, they still lag far behind GPUs, as discussed below.
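A rough sketch of what such a dispatcher decides is shown below. It assumes a simple in-order design where an instruction may join the current cycle's group only if an execution unit is free and it neither reads nor overwrites a register written earlier in that same group. The (destination, sources) instruction format is hypothetical, and real dispatchers are considerably more sophisticated.

```python
def dispatch(program, width=2):
    """Group instructions, in program order, into per-cycle issue groups.

    program: list of (dest_register, source_registers) tuples.
    width:   number of execution units available each cycle.
    """
    groups = []
    current = []     # instructions issued together this cycle
    written = set()  # registers written by the current group
    for dest, sources in program:
        depends = dest in written or any(s in written for s in sources)
        if current and (len(current) == width or depends):
            groups.append(current)  # close this cycle, start the next
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        groups.append(current)
    return groups

program = [
    ("r0", ()),            # r0 = constant
    ("r1", ()),            # r1 = constant
    ("r2", ("r0", "r1")),  # r2 = r0 + r1, must wait for r0 and r1
    ("r3", ("r0",)),       # r3 = r0 * 2, independent of r2
    ("r4", ("r2", "r3")),  # r4 = r2 + r3, must wait for both
]
groups = dispatch(program, width=2)
print([len(g) for g in groups])  # [2, 2, 1]
```

Five instructions issue in three cycles, better than one per cycle, but the dependency on r2 and r3 forces the last instruction to go alone.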

A GPU is a special-purpose processor, known as a stream processor, specifically designed to perform a very large number of floating-point operations in parallel. These processors may be integrated on the motherboard or attached via a PCI Express card. Modern GPUs typically contain several multiprocessors, each containing many processing cores. Today’s high-end cards typically have gigabytes of dedicated memory and several hundred processors running thousands of threads, all dedicated to performing floating-point math.

Stream processing is a technique used to achieve a limited form of parallelism known as data-level parallelism. The concepts behind stream processing originated back in the heyday of the supercomputer. Applications running on a stream processor can use multiple computational units, such as the floating-point units on a GPU, without explicitly managing allocation, synchronization, or communication among those units. Not all algorithms can be expressed as a data-parallel solution, but those that can may realize significant performance gains by running on a GPU, taking advantage of the device's massive parallelism compared to the much more limited parallelism of modern CPUs.
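The shape of a data-parallel computation can be sketched in plain Python. This runs on the CPU and only imitates the programming model: saxpy_kernel and launch are hypothetical names, not part of any GPU API. The point is that each element is computed independently, so the work items can run in any order, or all at once.

```python
import random

def saxpy_kernel(i, a, x, y, out):
    """Work for one element: reads only x[i] and y[i], writes only out[i]."""
    out[i] = a * x[i] + y[i]

def launch(kernel, n, *args):
    """Stand-in for a stream processor: runs the kernel once per index.
    The order is shuffled on purpose to show that it does not matter."""
    indices = list(range(n))
    random.shuffle(indices)
    for i in indices:
        kernel(i, *args)

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)
launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0]
```

Because no work item depends on another, the caller never has to manage synchronization or communication between them. An algorithm where element i needs the result for element i-1 would not fit this model without being restructured.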

While this explains what GPGPU is, it still doesn't answer the question "Why should I do it?"
See my next post, “Cage Match (CPU vs GPU)”.