Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

Christian Nentwich Thu, 10 Sep 2009 04:26:29 -0700

Petr,

wow, I didn't expect to see so much experimentation being performed!It's great that you have taken the time to implement this approach,because this now shows people both alternatives for implementing aplayout algorithm on the GPU.

I strongly suspect the low performance in the per-intersection case isdown to two reasons - please let me know what you think:- A big proportion of moves place one stone on an empty intersection.In this case 80 of your 81 threads are asleep doing nothing. This meansGPU time is wasted.- In quite a few other cases, we get to the old problem with Go: thatthe intersections are not independent. Thus when you process captures ormerges, you possibly have to lock 80 of 81 threads.

You can't do much about the first. Did you find a solution for thesecond? I was thinking for a while about whether it would be possible tocome up with a highly parallel algorithm for captures/merges thatrequires no lock. That's quite hard so I concentrated on my ownexperiment instead.

I agree with you that your approach could be very performant in patternmatching scenarios. In fact, your work somewhat strengthens myconclusion that doing just straight-forward playouts on the GPU is notthe way to go.


Christian


Petr Baudis wrote:

  I have written a quick CUDA implementation of the per-intersection
GPGPU apprach; Christian's e-mail finally made me polish it up to
a somewhat presentable form.

  In this implementation, each GPU thread maintains a single
intersection.

  The implementation uses 9x9 board (10x11 internally); expanding to
19x19 would probably mean either moving some data to global memory
or splitting the playout of single board to multiple blocks or waiting
for GPUs with larger shared memory. ;-)

  The speed is unfortunately very disappointing; on GTX260, I see
17,300 playouts per second, with 50 playouts running in parallel.
There are several "but"s:

  - With some further obvious optimizations, I'm thinking it should be
    possible to get it to at least 22,000 playouts per second easily.

  - Bit boards are an obvious thing to try, but I'm quite doubtful
    they will be an improvement

  - The playouts are uniformly random now, but they have been
    implemented in a way to make heavier playouts easy; precise liberty
    count is maintained and it should be easy to add pattern matching;
    the move selection can deal with arbitrary probabilities for
    different intersections (I wonder, what are the standard
    high-performance numbers for heavier playouts with pattern matching?)

  - Christian is playing 2650 games in parallel; I'm playing only 50
    games in parallel, in the meantime the CPU can pick next games to
    play from the tree

  - My code already accounts for transferring board images from CPU to
    the GPU (different one for each playout) and getting scores back

  - Christian's GPU seems quite better than mine

  - Apparently I still suck at CUDA optimizations; this has been my
    first real small CUDA project, it would be awesome if someone more
    experienced could look at the code and suggest obvious improvements

  Still, I'm pretty unhappy with the slowness; I wonder how Christian
achieved such a high speed. One problem with my approach is that I have
to make use of a lot of atomic instrinsics (some of them could be worked
around) and __syncthreads() all the time to ensure that all threads have
consistent board image at each stage.

  I think with pattern matching, the per-intersection approach could
shine more, since I can easily match patterns everywhere on the board
at once. Still, I wonder if it will become at least on par with good
CPU implementations.

On Wed, Sep 09, 2009 at 04:54:23PM +0100, Christian Nentwich wrote:

In other words: no tree search is involved, and this is the lightest
possible playout. The raw numbers are as follows:
 - CPU Search: 47,000 playouts per CPU core per second, on an Intel
6600 Core-2 Duo
 - GPU Search: 170,000 playouts per second, on an NVidia Geforce 285 card


  I still find this quite astonishing; since you consider this
experiment a failure anyway, would you mind publishing the source code?
:-)

------------------------------------------------------------------------


_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/



--

Christian Nentwich

Director, Model Two Zero Ltd.
+44-(0)7747-061302
http://www.modeltwozero.com

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach

Reply via email to