I'm trying to understand your conclusion. The GPU is more than 3 times faster than the CPU, yet you don't think it's worth it. You also say the card has only 9% occupancy.

I know next to nothing about GPU programming, so take my questions in stride.

>> Optimal speed was at 80 threads per block, with 256 blocks.

What does that mean? 80*256 threads? If parallelism is what you're looking for, why not have one thread per move candidate? Use that to collect AMAF statistics. 16 KB is not a lot to work with, so the statistics may have to be shared.

Another question I'd have is whether putting in two graphics cards would double the capacity.

Did you try this for 9x9 or 19x19?

Thanks for sharing your experiment.

Mark
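For readers who, like Mark, are new to CUDA: "80 threads per block, with 256 blocks" does indeed mean 80*256 = 20,480 threads in flight, one playout per thread. A minimal sketch of such a launch follows; playout_kernel is a hypothetical stand-in, not Christian's code.

    #include <cuda_runtime.h>

    // Hypothetical stand-in for the playout kernel discussed below:
    // in the real code each thread would run one complete playout.
    __global__ void playout_kernel(int *results) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // 0 .. 20,479
        results[tid] = tid;  // placeholder for a playout result
    }

    int main() {
        const int blocks = 256, threads_per_block = 80;  // 20,480 threads
        int *results;
        cudaMalloc(&results, blocks * threads_per_block * sizeof(int));
        playout_kernel<<<blocks, threads_per_block>>>(results);
        cudaDeviceSynchronize();
        cudaFree(results);
        return 0;
    }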
On Wed, Sep 9, 2009 at 5:54 AM, Christian Nentwich
<christ...@modeltwozero.com> wrote:
>
> I did quite a bit of testing earlier this year on running playout
> algorithms on GPUs. Unfortunately, I am too busy to write up a tech
> report on it, but I finally brought myself to take the time to write
> this e-mail at least. See the bottom for conclusions.
>
> For performance testing, I used my CPU board representation, and a
> CUDA port of the same (with adjustments), to test the following
> algorithm:
> - Clear the board
> - Fill the board according to a uniform random policy
> - Avoid filling eyes, according to a simple neighbour check
> - Avoid simple ko
> - Count the score and determine the winner
>
> In other words: no tree search is involved, and this is the lightest
> possible playout. The raw numbers are as follows:
> - CPU search: 47,000 playouts per CPU core per second, on an Intel
>   6600 Core-2 Duo
> - GPU search: 170,000 playouts per second, on an NVidia GeForce 285
>   card
>
> The algorithm running on the GPU is a straight port, with several
> optimisations then made to severely restrict memory access. This
> means the algorithm is a "naive" sort of parallel algorithm, parallel
> at the per-board level like the CPU implementation, rather than
> per-intersection or some other highly parallel sort of algorithm.
>
> Memory access other than shared processor memory carries a severe
> penalty on the GPU, so all threads running on the GPU at any one time
> have to make do with a fast shared memory of 16,384 bytes. So:
> - The board was compressed into a bit board, using 2*21 unsigned
>   ints per thread
> - The counts of empty, white and black intersections and the ko
>   position were also in shared memory, per thread
> - String/group/block information was in global memory, as there was
>   no way to store it in 16,384 bytes
>
> Optimal speed was at 80 threads per block, with 256 blocks. The card
> had only 9% processor occupancy, due to the shared memory being
> almost exhausted. However, branch divergence was at only 2%, which is
> not bad at all - suggesting that this form of parallelism may not be
> a blocker. This may be because the "usual" case, where a point is
> either illegal to play or is a simple play with no need to merge or
> remove strings, is by far the most common.
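For concreteness, here is one plausible per-thread shared-memory layout matching the sizes given above. The field names are guesses, and the 21 suggests a 19x19 board with a one-point border, which the post does not confirm.

    #include <cuda_runtime.h>

    // One plausible reading of "2*21 unsigned ints": one 21-bit row
    // mask per unsigned int, 21 rows, two colours. Names are mine.
    struct ThreadBoard {
        unsigned int stones[2][21];   // 2 * 21 uints = 168 bytes
        unsigned short counts[3];     // empty / white / black totals
        short ko;                     // simple-ko point, or -1
    };                                // 176 bytes per thread in total

    // 80 threads/block * 176 bytes = 14,080 of the 16,384 bytes of
    // shared memory -- "almost exhausted", which is what pins the
    // card at 9% occupancy.
    __global__ void playout_kernel_sketch() {
        __shared__ ThreadBoard boards[80];  // one board per thread
        ThreadBoard *mine = &boards[threadIdx.x];
        (void)mine;  // ... the playout would run against *mine ...
    }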
>
> Conclusions:
>
> I see these results as broadly negative with the current generation
> of technology. Per-board parallelism on a GPU is not worth it
> compared to the CPU speed, given the severe drawbacks of working on
> a GPU (testing is hard, the environment is unfamiliar to most
> programmers, a lot of time goes into optimisation, etc.).
>
> The problems would be severely compounded by trying to integrate any
> tree search, or heavy playouts. Trees are almost impossible to
> construct on a GPU, because pointers cannot be transferred from the
> host to the GPU. They could still be represented using arrays, but
> the random nature of tree access would cause huge penalties, as it
> would prevent coalesced memory access.
>
> Highly parallel algorithms (e.g. one thread per intersection) can
> still be investigated, but my (unproven!) intuition is that they are
> not worth it, as most intersections will be idle on any given move,
> wasting processor time.
>
> My feeling is that GPUs may have some potential in this area, but
> possibly in a supplementary role, such as running additional pattern
> matching in the background or driving machine learning components.
>
> This e-mail is a bit hurried, so... questions are welcome!!
>
> Christian

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
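As an illustration of the array-based tree representation Christian mentions: the idea is that "pointers" become indices into one flat node array, so the whole tree can be copied to the device in a single transfer. The node layout and the helper below are a sketch, not from the post.

    #include <cuda_runtime.h>

    // Nodes live in one flat array; child/sibling links are integer
    // indices into the same array, so no pointer translation is
    // needed when moving the tree to the device.
    struct Node {
        short move;         // move that created this node
        int visits;         // playout count
        int wins;           // wins for the player to move
        int first_child;    // index into the same array, -1 if none
        int next_sibling;   // index of next child of our parent
    };

    void upload_tree(const Node *host_tree, Node *device_tree, int n) {
        // One contiguous copy for the whole tree.
        cudaMemcpy(device_tree, host_tree, n * sizeof(Node),
                   cudaMemcpyHostToDevice);
    }

    // The catch, as the post says: walking the tree on the device
    // means chasing indices (tree[tree[i].first_child], ...), a
    // scattered access pattern that defeats coalesced memory reads.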