Christian,

Would you care to provide some more detail on your implementation of the
playouts? Your results are very impressive. At 19x19 Go using bit-boards,
your implementation is roughly 7x as fast as the bit-board implementation
I presented just a few weeks back, and it also outperforms libEgo by about
a factor of two.
René

On Wed, Sep 9, 2009 at 2:57 PM, Christian Nentwich
<christ...@modeltwozero.com> wrote:

> Mark,
>
> let me try to add some more context to answer your questions. When I say
> in my conclusion that "it's not worth it", I mean it's not worth using
> the GPU to run playout algorithms of the sort that are in use today.
> There may be many other algorithms that form part of Go engines where
> the GPU can provide an order-of-magnitude speedup, and still more where
> the GPU can run in parallel with the CPU to help.
>
> In my experiments, a CPU core got 47,000 playouts per second and the GPU
> 170,000. But:
> - My computer has two cores (so it gets 94,000 playouts with 2 threads)
> - My computer's processor (Intel Core Duo 6600) is 3 years old, and far
>   from state of the art
> - My graphics card (GeForce 285), on the other hand, was purchased
>   recently and is one of the top graphics cards
>
> That means that my old CPU already gets more than half the speed of the
> GPU. An Intel Nehalem processor would surely beat it, let alone an
> 8-core system. Bearing in mind the severe drawbacks of the GPU - these
> are not general-purpose processors, and there is much you can't do on
> them - this limits their usefulness with this algorithm. Compare this
> speedup to truly highly parallel algorithms - random number generation,
> matrix multiplication, Monte Carlo simulation of options (which are
> highly parallel because there is no branching and little data) - where
> you see speedups of 10x to 100x over the CPU.
>
> The 9% occupancy may be puzzling, but there is little that can be done
> about it. This, and the talk about threads and blocks, would take a
> while to explain, because GPUs don't work like general-purpose CPUs.
> They are SIMD processors, meaning that each processor can run many
> threads in parallel on different items of data, but only if *all
> threads are executing the same instruction*. There is only one
> instruction-decoding stage per processor per cycle. If any "if"
> statements or loops diverge, threads will be serialised until they join
> again. The 9% occupancy is a function of the amount of data needed to
> perform the task, and of the branch divergence (caused by the playouts
> being different). There is little that can be done about it other than
> use a completely different algorithm.
>
> If you google "CUDA block threads" you will find out more. In short,
> the GPU runs like a grid cluster. In each block, 64 threads run in
> parallel, conceptually. On the actual hardware, in each processor 16
> threads from one block will execute, followed by 16 from another
> ("half-warps"). If any threads are blocked (memory reads cost ~400
> cycles!), then threads from another block are scheduled instead. So the
> answer is: yes, there are 64 * 80 threads conceptually, but they're not
> always scheduled at the same time.
>
> Comments on specific questions below.
>
>> If parallelism is what you're looking for, why not have one thread per
>> move candidate? Use that to collect AMAF statistics. 16KB is not a lot
>> to work with, so the statistics may have to be shared.
>
> One thread per move candidate is feasible with the architecture I used,
> since every thread has its own board. I have not implemented AMAF, so I
> cannot comment on the statistics bit, but the "output" of your algorithm
> is typically not in the 16KB shared memory anyway. You'd write that to
> global memory (1GB). Would uniform random playouts be good enough for
> this, though?
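For anyone who has not written CUDA before, here is a minimal sketch of
the one-playout-per-thread pattern described above. This is not
Christian's code: the loop body is a dummy random walk standing in for a
real playout, and the XORShift generator and seeding scheme are my own
illustrative choices. Only the 80-blocks-by-64-threads launch shape is
taken from the figures in his message.

    #include <cstdio>
    #include <cstdint>
    #include <cuda_runtime.h>

    // Per-thread XORShift RNG: every thread executes the same
    // instructions here, so this part causes no divergence.
    __device__ uint32_t xorshift32(uint32_t &s)
    {
        s ^= s << 13;
        s ^= s >> 17;
        s ^= s << 5;
        return s;
    }

    // One "playout" per thread; a real engine would update a private
    // board here instead of a counter.
    __global__ void playouts(const uint32_t *seeds, uint32_t *results)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        uint32_t rng = seeds[tid];
        uint32_t stones = 0;

        for (int mv = 0; mv < 400; ++mv) {
            // Data-dependent branch: the threads of a warp take
            // different sides, so the hardware serialises the paths
            // until they re-converge. This is the divergence that
            // different playouts cause.
            if (xorshift32(rng) & 1)
                stones++;          // stand-in for "move legal, play it"
        }
        results[tid] = stones;     // results are written to global memory
    }

    int main()
    {
        const int blocks = 80, threads = 64, n = blocks * threads;

        uint32_t *seeds, *results;
        cudaMalloc(&seeds, n * sizeof(uint32_t));
        cudaMalloc(&results, n * sizeof(uint32_t));

        // Give each thread a distinct seed from the host.
        uint32_t *h = new uint32_t[n];
        for (int i = 0; i < n; ++i)
            h[i] = 0x9E3779B9u * (i + 1);
        cudaMemcpy(seeds, h, n * sizeof(uint32_t), cudaMemcpyHostToDevice);

        playouts<<<blocks, threads>>>(seeds, results);  // 80 blocks x 64 threads
        cudaDeviceSynchronize();

        cudaMemcpy(h, results, n * sizeof(uint32_t), cudaMemcpyDeviceToHost);
        printf("thread 0 'played' %u stones\n", h[0]);

        delete[] h;
        cudaFree(seeds);
        cudaFree(results);
        return 0;
    }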
>> Another question I'd have is whether putting in two graphics cards
>> would double the capacity.
>
> Yes, it would. It would pretty much precisely double it (the "grid" to
> schedule over just gets larger, but there is no additional overhead).
>
>> Did you try this for 9x9 or 19x19?
>
> I used 19x19. If you do it for 9x9, you can probably run 128 threads
> per block because of the smaller board representation. The speedup
> would be correspondingly larger (4x or more). I chose 19x19 because of
> the severe memory limitations of the architecture; it seemed that 9x9
> would just make my life a bit too easy for comfort...
>
> Christian
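The shared-memory arithmetic behind the 64-versus-128 threads-per-block
figures, and the way a second card scales, can be sketched as follows.
The per-board byte counts are illustrative guesses on my part, not
Christian's actual representation, and the kernel is just a stand-in:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Stand-in for the playout kernel; one result slot per thread.
    __global__ void dummy_playouts(unsigned *out)
    {
        out[blockIdx.x * blockDim.x + threadIdx.x] = 0;
    }

    int main()
    {
        // 16KB of shared memory per block caps how many per-thread
        // boards fit. Byte counts below are assumptions for illustration.
        const int shared_bytes  = 16 * 1024;
        const int board19_bytes = 240;  // assumed 19x19 board footprint
        const int board9_bytes  = 60;   // assumed 9x9 board footprint
        printf("19x19: up to %d threads/block\n", shared_bytes / board19_bytes);
        printf("9x9:   up to %d threads/block\n", shared_bytes / board9_bytes);

        // Two cards double throughput: blocks are independent, so each
        // device simply runs its own grid, with no cross-device
        // synchronisation needed.
        int devices = 0;
        cudaGetDeviceCount(&devices);
        for (int d = 0; d < devices; ++d) {
            cudaSetDevice(d);
            unsigned *out;
            cudaMalloc(&out, 80 * 64 * sizeof(unsigned));
            dummy_playouts<<<80, 64>>>(out);  // 80 blocks x 64 threads
            cudaDeviceSynchronize();
            cudaFree(out);
        }
        return 0;
    }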
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/