Sunday, June 01, 2008

CUDA and GPUs

I've read a lot about the architecture of GPUs and how to take advantage of them. The design of a neural network running on a GPU is substantially different from one that runs on a single-core CPU. I've worked on a small network framework to process neurons on a dual-core CPU, but haven't been very impressed with the results so far.

The GPU, however, seems very promising. It actually requires a lot of threads and blocks to become efficient. My 8600GT has 32 little stream processors ready to work, and the high-end 8800 GTX has 128. The GPU basically organizes work into blocks, so that the processors can cooperate on the small units of work inside a block. Each block in turn consists of threads. A thread is the smallest unit of work within a block, comparable to a single iteration of a loop that the block as a whole has to execute.
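To make the block and thread layout concrete, here is a minimal CPU-side sketch in Python that emulates how a thread finds "its" item. It is not GPU code. The block size of 128 and the helper names are assumptions made up for illustration; the index formula mirrors the usual CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`.

```python
# CPU-side emulation of CUDA's block/thread indexing (not GPU code).
THREADS_PER_BLOCK = 128  # assumed block size, chosen only for illustration


def blocks_needed(n_items, threads_per_block=THREADS_PER_BLOCK):
    """Number of blocks required so that every item gets its own thread."""
    return (n_items + threads_per_block - 1) // threads_per_block


def global_index(block_idx, thread_idx, threads_per_block=THREADS_PER_BLOCK):
    """Mirror of the CUDA idiom blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * threads_per_block + thread_idx


def covered_items(n_items, threads_per_block=THREADS_PER_BLOCK):
    """List the item indices touched by the emulated grid, in launch order."""
    items = []
    for block_idx in range(blocks_needed(n_items, threads_per_block)):
        for thread_idx in range(threads_per_block):
            i = global_index(block_idx, thread_idx, threads_per_block)
            if i < n_items:  # threads past the end of the data do nothing
                items.append(i)
    return items
```

For 1,000 items this gives 8 blocks of 128 threads, where the last 24 threads of the final block have nothing to do. On a real GPU the two loops disappear, because every (block, thread) pair runs as its own hardware thread.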

Nvidia's GPU is mostly data-parallel. You decide what you want to do and then have each thread run the same very simple function on its own piece of data. The activation of a neuron by another neuron is an example of such a very simple function.
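As a rough illustration of such a per-thread function, the sketch below emulates "one thread per neuron" in plain Python. The threshold model, the dense weight table and all names are assumptions for the example, not a description of my framework. On a GPU the loop in `launch` would not exist, since each index would be handled by its own thread.

```python
# Emulates "one thread per neuron"; each kernel call stands in for one thread.

def activation_kernel(i, potentials, weights, inputs, threshold, fired):
    """Work done by the thread that owns neuron i: weighted sum, then threshold."""
    total = potentials[i]
    for j, x in enumerate(inputs):
        total += weights[i][j] * x  # contribution of source neuron j
    fired[i] = total >= threshold   # each thread writes only its own slot


def launch(n_neurons, potentials, weights, inputs, threshold):
    """Stand-in for a kernel launch over n_neurons threads."""
    fired = [False] * n_neurons
    for i in range(n_neurons):  # on a GPU these iterations run in parallel
        activation_kernel(i, potentials, weights, inputs, threshold, fired)
    return fired
```

Because every thread writes only to `fired[i]`, no two threads touch the same output, which is what keeps the function free of synchronization.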

There are a lot of hardware-related optimizations that need to be taken into account. Ideally, a parallel design synchronizes threads within a block, but never between blocks: waiting on another block can deadlock, and synchronization is a killer for performance anyway.
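One common way to respect that rule is to split the work into two passes, so that the only global synchronization point is the boundary between them. The following is a CPU emulation in Python under that assumption. The function names and the block size of 4 are invented for the example, and summing excitation values is just a convenient stand-in workload.

```python
# CPU emulation of a two-pass sum that never synchronizes between blocks.

def block_partial_sums(values, threads_per_block):
    """Pass 1: each block sums its own slice; cooperation stays inside the block."""
    partials = []
    for start in range(0, len(values), threads_per_block):
        partials.append(sum(values[start:start + threads_per_block]))
    return partials


def total_excitation(values, threads_per_block=4):
    """Pass 2: a separate step combines the per-block results.

    The boundary between the two passes plays the role of a global barrier,
    so no block ever has to wait on another block while it is running.
    """
    return sum(block_partial_sums(values, threads_per_block))
```

In CUDA terms, pass 1 would be one kernel launch and pass 2 a second, much smaller launch (or a loop on the host).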

The biggest problem for making graphics cards very useful for A.I. is the memory storage capacity *or* the bandwidth between the host memory / disk and the graphics card memory. It's basically 8 GB/s on a standard PC with a PCIe card, whilst internally on the card the bandwidth from its memory to the GPU is roughly an order of magnitude higher, about 60 to 80 GB/s on the high-end cards. So staying on the card for calculations is definitely better for performance. The bandwidth to CPU memory is about 6.4 GB/s by the way, so on paper writing to the graphics card is faster than the CPU reading or writing its own memory.
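A quick back-of-the-envelope check of what those figures mean in practice, as a small Python sketch. It assumes the bandwidths are gigabytes per second, that they are sustained rather than peak, and that the payload is a full 256 MB of card memory. Real transfers will be slower.

```python
# Rough transfer-time estimate; all bandwidths are assumed sustained GB/s.
MIB = 1024 ** 2
GB = 10 ** 9


def transfer_seconds(n_bytes, bandwidth_gb_per_s):
    """Time to move n_bytes at the given bandwidth, ignoring latency."""
    return n_bytes / (bandwidth_gb_per_s * GB)


payload = 256 * MIB                           # one full card's worth of data
pcie_time = transfer_seconds(payload, 8)      # host <-> card over PCIe
onboard_time = transfer_seconds(payload, 60)  # card memory <-> GPU, low estimate
ratio = pcie_time / onboard_time              # how much slower the PCIe hop is
```

With these numbers, refilling the whole card over PCIe takes about 34 ms, against about 4.5 ms to stream the same data from on-board memory, a factor of 7.5.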

The card contains 256MB of memory. If 4 bytes per neuron are used for state such as fatigue, threshold and excitation, then one card can store 67 million neurons. Connection information is where most of the memory goes, so it might be possible to use an extremely clever scheme to store it. If you assume 1,000 connections per neuron at 4 bytes per pointer, that is 4,000 bytes per neuron spent on connections alone. Maybe a clever scheme where the neurons are repositioned after each cycle would help to reduce the need for such capacity.

Thus, assuming about 4,000 bytes for each neuron on average without such optimization or clever scheming, the network can only be roughly 67,000 neurons in size at a maximum (256MB divided by 4,000 bytes).
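The capacity arithmetic is easy to check with a few lines of Python. The sketch reads 256MB as 256 MiB and assumes one 4-byte pointer per connection. The constant names are made up for the example.

```python
# Memory budget for neurons on a single card.
CARD_BYTES = 256 * 1024 ** 2  # 256MB read as 256 MiB (assumption)
STATE_BYTES = 4               # fatigue, threshold, excitation packed in 4 bytes
CONNECTIONS = 1000            # assumed average connections per neuron
POINTER_BYTES = 4             # one 32-bit pointer or index per connection

# Neurons that fit if only the 4 bytes of state are stored.
neurons_state_only = CARD_BYTES // STATE_BYTES

# Neurons that fit once every neuron also carries its connection list.
bytes_per_connected_neuron = STATE_BYTES + CONNECTIONS * POINTER_BYTES
neurons_with_connections = CARD_BYTES // bytes_per_connected_neuron
```

This gives 67,108,864 neurons for state alone and 67,041 neurons once the connection lists are included, so the connections cost about three orders of magnitude in capacity.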

The design is the most interesting part of this challenge. Coding in CUDA isn't very special, although someone showed that you can go from an initial 2x speedup to a 20x speedup if you really know how to optimize the CUDA code. So it's worth understanding the architecture of the GPU thoroughly, otherwise you're just hitting walls.
