Christian,
Would you care to provide some more detail on your playout implementation?
Your results are very impressive: at 19x19 Go using bitboards, your
implementation is roughly 7x as fast as the bitboard implementation I
presented just a few weeks ago, and it also outperforms libEgo by about a
factor of two.

René

On Wed, Sep 9, 2009 at 2:57 PM, Christian Nentwich <christ...@modeltwozero.com> wrote:

> Mark,
>
> let me try to add some more context to answer your questions. When I say in
> my conclusion that "it's not worth it", I mean it's not worth using the GPU
> to run playout algorithms of the sort that are in use today. There may be
> many other algorithms that form part of Go engines where the GPU can provide
> an order-of-magnitude speedup, and still more where the GPU can run in
> parallel with the CPU to help.
>
> In my experiments, a CPU core got 47,000 playouts per second and the GPU
> 170,000. But:
>  - My computer has two cores (so it gets 94,000 playouts per second with 2
> threads)
>  - My computer's processor (an Intel Core 2 Duo E6600) is 3 years old and
> far from state of the art
>  - My graphics card (a GeForce GTX 285), on the other hand, was purchased
> recently and is one of the top graphics cards
>
> That means that my old CPU already gets more than half the speed of the
> GPU. An Intel Nehalem processor would surely beat it, let alone an 8-core
> system. Bearing in mind the severe drawbacks of the GPU - these are not
> general-purpose processors, and there is much you can't do on them - this
> limits their usefulness with this algorithm. Compare this speedup to truly
> highly parallel algorithms: random number generation, matrix
> multiplication, Monte Carlo simulation of options (which are highly
> parallel because there is no branching and little data per thread); with
> those you see speedups of 10x to 100x over the CPU.
>
> The 9% occupancy may be puzzling, but there is little that can be done
> about it. This, and the talk about threads and blocks, would take a while
> to explain, because GPUs don't work like general-purpose CPUs. They are
> SIMD processors, meaning that each processor can run many threads in
> parallel on different items of data, but only if *all threads are
> executing the same instruction*: there is only one instruction decoding
> stage per processor cycle. If any "if" statements or loops diverge, the
> threads are serialised until they join again. The 9% occupancy is a
> function of the amount of data needed to perform the task and of the
> branch divergence (caused by the playouts being different). There is
> little that can be done about it other than using a completely different
> algorithm.
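>
> To make the divergence point concrete, here is a contrived sketch (not my
> playout kernel) of the kind of branch that gets serialised:
>
>   __global__ void divergent(const int *in, int *out, int n)
>   {
>       int i = blockIdx.x * blockDim.x + threadIdx.x;
>       if (i >= n)
>           return;
>       /* Threads in the same warp take different sides of this branch,
>          so the hardware executes the two paths one after the other
>          rather than in parallel. */
>       if (in[i] & 1)
>           out[i] = 3 * in[i] + 1;   /* odd elements  */
>       else
>           out[i] = in[i] / 2;       /* even elements */
>   }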
>
> If you google "CUDA block threads", you will find out more. In short, the
> GPU runs like a grid cluster. In each block, 64 threads run in parallel,
> conceptually. On the actual hardware, in each processor 16 threads from one
> block execute, followed by 16 from another ("half-warps"). If any threads
> are blocked (a memory read costs ~400 cycles!), then threads from another
> block are scheduled instead. So the answer is: yes, there are 64 * 80
> threads conceptually, but they are not all scheduled at the same time.
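>
> In code, that launch configuration looks roughly like this (the kernel and
> argument names are illustrative, not my actual code):
>
>   /* 80 blocks x 64 threads = 5120 logical threads; the hardware
>      interleaves half-warps of 16 to hide memory latency. */
>   dim3 grid(80);
>   dim3 block(64);
>   playout_kernel<<<grid, block>>>(dev_boards, dev_results);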
>
> Comments on specific questions below.
>
>> If parallelism is what you're looking for, why not have one thread per
>> move candidate? Use that to collect AMAF statistics. 16 KB is not a lot
>> to work with, so the statistics may have to be shared.
>>
> One thread per move candidate is feasible with the architecture I used,
> since every thread has its own board. I have not implemented AMAF, so I
> cannot comment on the statistics bit, but the "output" of your algorithm
> typically does not live in the 16 KB of shared memory anyway; you would
> write it to global memory (1 GB). Would uniform random playouts be good
> enough for this, though?
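>
> As a sketch of what I mean (hypothetical names; I have not implemented
> this), per-move counters would live in global memory and be updated
> atomically, so that concurrent threads do not lose updates:
>
>   /* One counter per board point, in global (device) memory. */
>   __device__ unsigned int amaf_wins[19 * 19];
>
>   __device__ void record_win(int point)
>   {
>       /* atomicAdd serialises only the colliding updates, not the
>          whole warp. */
>       atomicAdd(&amaf_wins[point], 1u);
>   }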
>
>> Another question I'd have is whether putting in two graphics cards would
>> double the capacity.
>>
> Yes, it would - pretty much exactly double it (the "grid" to schedule over
> just gets larger, but there is no additional overhead).
>
>> Did you try this for 9x9 or 19x19?
>>
> I used 19x19. If you do it for 9x9, you can probably run 128 threads per
> block because of the smaller board representation. The speedup would be
> correspondingly larger (4x or more). I chose 19x19 because of the severe
> memory limitations of the architecture; it seemed that 9x9 would just make
> my life a bit too easy for comfort...
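>
> The arithmetic behind that guess, working back from the 16 KB of shared
> memory per block (the per-board byte counts are assumptions, not
> measurements):
>
>   16384 bytes / 64 threads  = 256 bytes available per board at 19x19
>   16384 bytes / 128 threads = 128 bytes available per board at 9x9
>
> A 9x9 board has 81 points against 361, so it packs into well under half
> the space, which is why 128 threads per block should fit.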
>
>
> Christian
>
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
