Mark,

You are right: I meant to type "more than half the speed" in my mail. That would be enough to rule it out for me; it's just not worth it.

Your idea of using this for RAVE values may be useful. But I think some flexible thinking about using the GPU as a resource for tangential tasks, rather than as a wholesale replacement, may also be very fruitful.

For example, what would be the value of an "oracle" that can match every node in the top 5 layers of your tree against patterns for additional scoring? I have no idea, but that may be the sort of thing the GPU is fast at.

The limitations that cause GPU bottlenecks in the playout algorithms would be less severe for pattern matching. Say, for example, that each thread only needs the raw board (no string information, etc.), and that the patterns sit in global memory. Assume also that you use a super-compressed form of the board, at two bits per point: (19 * 19 * 2) / 8 = 90.25, so 91 bytes per thread. That means about 180 threads fit in your 16KB of shared memory (fewer in practice, because the compiler will also put some of its own data in shared memory).
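
As a rough sketch of what that packing might look like (the two-bits-per-point, row-major layout and all names here are my assumptions, not something from the experiments):

#include <stdint.h>

#define BOARD_SIZE 19
#define POINTS (BOARD_SIZE * BOARD_SIZE)        /* 361 points */
#define PACKED_BYTES ((POINTS * 2 + 7) / 8)     /* 91 bytes at 2 bits/point */

enum { EMPTY = 0, BLACK = 1, WHITE = 2 };       /* 2-bit point states */

/* Read the 2-bit state of point (row, col) from a packed board. */
__host__ __device__ static inline int get_point(const uint8_t *board,
                                                int row, int col)
{
    int idx = row * BOARD_SIZE + col;           /* linear point index */
    return (board[idx / 4] >> ((idx % 4) * 2)) & 3;
}

/* Write the 2-bit state of point (row, col) into a packed board. */
__host__ __device__ static inline void set_point(uint8_t *board,
                                                 int row, int col, int state)
{
    int idx = row * BOARD_SIZE + col;
    int shift = (idx % 4) * 2;
    board[idx / 4] = (uint8_t)((board[idx / 4] & ~(3 << shift))
                               | ((state & 3) << shift));
}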

If your threads are well behaved and all search the same pattern list in order, performance may be good because of memory read coalescing. A quick calculation with the CUDA occupancy calculator predicts a processor occupancy of 25%. That's about as good as you're going to get without getting more sophisticated and breaking the board itself down across threads.
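
A sketch of the access pattern I mean, with made-up names (matches_board is only a toy stand-in here; a real matcher would test a local neighbourhood against the pattern bits):

#include <stdint.h>

#define PACKED_BYTES 91    /* packed 19x19 board, as in the sketch above */

/* Toy stand-in for a real pattern matcher: compares the pattern word
 * against the first eight packed bytes of the board. */
__device__ static int matches_board(const uint8_t *board, uint64_t pattern)
{
    uint64_t first = 0;
    for (int b = 0; b < 8; b++)
        first |= (uint64_t)board[b] << (8 * b);
    return first == pattern;
}

/* Every thread scans the same pattern list in the same order, so all
 * threads read patterns[i] at the same step and the global-memory
 * reads coalesce instead of being serviced one by one. */
__global__ void match_patterns(const uint64_t *patterns, int num_patterns,
                               const uint8_t *boards,   /* one board/thread */
                               int *match_counts)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const uint8_t *board = boards + tid * PACKED_BYTES;

    int matches = 0;
    for (int i = 0; i < num_patterns; i++)
        matches += matches_board(board, patterns[i]);
    match_counts[tid] = matches;
}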

Christian


Mark Boon wrote:
Thank you Christian, for taking the time to write an extensive reply.

I still don't understand how you come to conclude that the CPU, at 94K
playouts with two cores, is twice as fast as the GPU doing 170K
playouts per second. Sounds like the reverse to me. Or did you mean to
say "more than half the speed"?

There's a lot more I don't quite understand, but I guess I'd have to
get a text-book on GPU processing to answer those.

With regard to your question whether uniformly random playouts would
be good enough to collect AMAF statistics, I'd have to give a somewhat
speculative answer. In principle it's good enough: the ref-bots play
quite a decent game with just a few thousand playouts based on AMAF.
But it's obvious that heavy playouts give much stronger play. Still, I can
see a situation where the CPU does heavy playouts while the GPU
gathers AMAF statistics for each node to compute a RAVE value to help
in the move ordering. Like a two-staged playout.

When I consider my own implementation of playouts, it does a lot of
branching while figuring out the liberties. If that can be made an order
of magnitude faster by a devious implementation that minimizes branching,
there might be something to the two-staged playout approach.
And since it doesn't burden the CPU (much), you basically get it for free.

Mark


On Wed, Sep 9, 2009 at 11:57 AM, Christian Nentwich
<christ...@modeltwozero.com> wrote:
Mark,

let me try to add some more context to answer your questions. When I say in
my conclusion that "it's not worth it", I mean it's not worth using the GPU
to run playout algorithms of the sort that are in use today. There may be
many other algorithms that form part of Go engines where the GPU can provide
an order-of-magnitude speedup. Still more where the GPU can run in parallel
with the CPU to help.

In my experiments, a CPU core got 47,000 playouts per second and the GPU
170,000. But:
 - My computer has two cores (so it gets 94,000 playouts with two threads)
 - My computer's processor (an Intel Core 2 Duo 6600) is three years old
and far from state of the art
 - My graphics card (a GeForce 285), on the other hand, was purchased
recently and is one of the top graphics cards

That means that my old CPU already gets more than twice the speed of the
GPU. An Intel Nehalem processor would surely beat it, let alone an 8-core
system. Bearing in mind the severe drawbacks of the GPU - these are not
general purpose processors, there is much you can't do on them - this limits
their usefulness with this algorithm. Compare this speedup to truly highly
parallel algorithms: random number generation, matrix multiplication,
Monte Carlo simulation of options (which are highly parallel because there
is no branching and little data); you see speedups of 10x to 100x over the
CPU with those.

The 9% occupancy may be puzzling, but there is little that can be done about
it. This, and the talk of threads and blocks, would take a while to
explain, because GPUs don't work like general-purpose CPUs. They are SIMD
processors, meaning each processor can run many threads in parallel on
different items of data, but only if *all threads are executing the same
instruction*. There is only one instruction-decoding stage per processor
cycle. If any "if" statements or loops diverge, the threads are serialised
until they join again. The 9% occupancy is a function of the amount of data
needed to perform the task and of the branch divergence (caused by the
playouts all being different). There is little to be done about it other
than using a completely different algorithm.
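
A contrived illustration of that serialisation (my example, not code from the experiments):

/* Threads with even and odd tid take different branches. Within a warp
 * the two paths run one after the other: the even threads execute while
 * the odd ones idle, then the reverse. Every move decision in a random
 * playout is a branch like this, which is where the occupancy goes. */
__global__ void divergent_example(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid % 2 == 0)
        data[tid] = data[tid] * 2.0f;   /* path A: even threads */
    else
        data[tid] = data[tid] + 1.0f;   /* path B: odd threads  */
}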

If you google "CUDA block threads" you will find out more. In short, the GPU
runs like a grid cluster. In each block, 64 threads run in parallel,
conceptually. On the actual hardware, each processor executes 16 threads
from one block, then 16 from another ("half-warps"). If any threads are
blocked (a memory read costs ~400 cycles!), threads from another block are
scheduled instead. So the answer is: yes, there are 64 * 80 threads
conceptually, but they are not all scheduled at the same time.
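
For concreteness, the 64-threads-per-block, 80-block arrangement corresponds to a launch like this (the stub kernel and names are mine):

#include <cuda_runtime.h>

/* Stub standing in for the real playout kernel. */
__global__ void run_playouts(int *results)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    results[tid] = tid;   /* a real kernel would run one playout here */
}

int main(void)
{
    int *dev_results;
    cudaMalloc(&dev_results, 80 * 64 * sizeof(int));

    /* 80 blocks of 64 threads: 64 * 80 = 5120 threads conceptually in
     * flight. The hardware issues them a half-warp (16 threads) at a
     * time and schedules another block whenever one stalls on memory. */
    run_playouts<<<80, 64>>>(dev_results);
    cudaDeviceSynchronize();

    cudaFree(dev_results);
    return 0;
}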

Comments on specific questions below.
If parallelism is what you're looking for, why not have one thread per
move candidate? Use that to collect AMAF statistics. 16KB is not a lot
to work with, so the statistics may have to be shared.

One thread per move candidate is feasible with the architecture I used,
since every thread has its own board. I have not implemented AMAF, so I
cannot comment on the statistics bit, but the "output" of your algorithm is
typically not in the 16KB shared memory anyway. You'd write it to global
memory (1GB). Would uniform random playouts be good enough for this, though?
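
As a hedged sketch of what writing such statistics to global memory might look like (everything here, the AMAF layout included, is my own guess at an implementation, not what was benchmarked):

#include <stdint.h>

#define POINTS 361   /* 19 * 19 */

/* Hypothetical AMAF accumulation: each thread has finished one playout
 * and holds a per-point bitmap of the moves it played plus a win/loss
 * flag. The shared statistics live in global memory (plenty of room in
 * 1GB) and are updated with atomics, not kept in shared memory. */
__global__ void accumulate_amaf(const uint8_t *played,  /* POINTS flags per thread */
                                const int *won,         /* one flag per thread */
                                unsigned int *wins,     /* POINTS counters */
                                unsigned int *visits)   /* POINTS counters */
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const uint8_t *my_moves = played + tid * POINTS;

    for (int p = 0; p < POINTS; p++) {
        if (my_moves[p]) {
            atomicAdd(&visits[p], 1u);
            if (won[tid])
                atomicAdd(&wins[p], 1u);
        }
    }
}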

Another question I'd have is whether putting two graphics card would
double the capacity.

Yes, it would. It would pretty much exactly double it (the "grid" to
schedule over just gets larger, but there is no additional overhead).

Did you try this for 9x9 or 19x19?

I used 19x19. If you do it for 9x9, you can probably run 128 threads per
block because of the smaller board representation. The speedup would be
correspondingly larger (4x or more). I chose 19x19 because of the severe
memory limitations of the architecture; it seemed that 9x9 would just make
my life a bit too easy for comfort...

Christian




--

Christian Nentwich

Director, Model Two Zero Ltd.
+44-(0)7747-061302
http://www.modeltwozero.com

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
