Re: [computer-go] CUDA and GPU Performance

Christian Nentwich Thu, 10 Sep 2009 03:39:07 -0700

Rene,

you're absolutely right, it's completely fishy! But don't worry, your work is not in vain :) I noticed this morning, when I read your mail, that I had included the 9x9 results in my original mail instead of 19x19! Indeed, for 19x19 the results are even worse. Here's a complete rundown:

- 9x9 CPU: 47,000 playouts per core per second
- 9x9 GPU: 170,000 playouts per second

- 19x19 CPU: 9,800 playouts per core per second
- 19x19 GPU: 11,000 playouts per second

I did mention in another mail that the performance difference for 9x9 should be larger, I think. What I didn't realise was that I had reported the 9x9 numbers by mistake!
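
As a rough cross-check, the ratios implied by the figures above can be worked out with a few lines of Python. This is only a sketch: the names `CPU_PER_CORE`, `GPU_TOTAL` and `gpu_speedup` are made up for illustration, and the two-core comparison assumes the CPU scales linearly with cores (the quoted mail below reports 94,000 playouts with 2 threads on 9x9; carrying that over to 19x19 is an assumption).

```python
# Playout rates (playouts per second) as reported in this mail.
CPU_PER_CORE = {"9x9": 47_000, "19x19": 9_800}
GPU_TOTAL = {"9x9": 170_000, "19x19": 11_000}
CPU_CORES = 2  # dual-core machine; linear scaling over cores is assumed


def gpu_speedup(board: str, cores: int = 1) -> float:
    """Return GPU throughput divided by the throughput of `cores` CPU cores."""
    return GPU_TOTAL[board] / (CPU_PER_CORE[board] * cores)


for board in ("9x9", "19x19"):
    print(f"{board}: {gpu_speedup(board):.2f}x vs one core, "
          f"{gpu_speedup(board, CPU_CORES):.2f}x vs {CPU_CORES} cores")
```

A ratio below 1 means the CPU is ahead, which under these assumptions is the case for 19x19 once both cores are used.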

Additional statistics:
 - Processor occupancy for 19x19 was 6% instead of 9%

- Branch divergence was less than half a percent. It was 2% for 9x9. This is perhaps because of the larger board size causing more moves to fall onto empty intersections, or fewer moves requiring merges/captures.
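
To illustrate why the mix of move types matters, here is a toy Python model of threads running in lockstep. It is not the actual kernel: the branch names and cycle costs in `BRANCH_COST` are hypothetical, and the model simply charges a group for every distinct branch that any of its threads takes, which is the serialisation behaviour described in the quoted mail below.

```python
# Hypothetical per-branch costs in cycles; these are NOT measured values.
BRANCH_COST = {"empty": 10, "merge": 40, "capture": 90}
GROUP_SIZE = 16  # threads issued together (a "half-warp" in the quoted mail)


def lockstep_cost(branches):
    """Cycles for one lockstep group to complete a single move.

    Each distinct branch taken by any thread in the group is executed once,
    one branch after another, while threads on other branches sit idle.
    """
    return sum(BRANCH_COST[b] for b in set(branches))


uniform = ["empty"] * GROUP_SIZE
one_capture = ["empty"] * (GROUP_SIZE - 1) + ["capture"]
mixed = ["empty"] * 8 + ["merge"] * 4 + ["capture"] * 4

for name, group in [("uniform", uniform), ("one capture", one_capture), ("mixed", mixed)]:
    print(name, lockstep_cost(group))
```

In this model a single thread that needs a capture makes the whole group pay for the capture branch, so a board on which most moves land on empty intersections keeps groups uniform and divergence low.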

Christian



René van de Veerdonk wrote:

Christian,

Would you care to provide some more detail on your implementation for the playouts? Your results are very impressive. At 19x19 Go using bit-boards, your implementation is roughly 7x as fast as the bitboard implementation I presented just a few weeks back, and also outperforms libEgo by about a factor of two.

René

On Wed, Sep 9, 2009 at 2:57 PM, Christian Nentwich <christ...@modeltwozero.com> wrote:

    Mark,

    let me try to add some more context to answer your questions. When
    I say in my conclusion that "it's not worth it", I mean it's not
    worth using the GPU to run playout algorithms of the sort that are
    in use today. There may be many other algorithms that form part of
    Go engines where the GPU can provide an order-of-magnitude
    speedup. Still more where the GPU can run in parallel with the CPU
    to help.

    In my experiments, a CPU core got 47,000 playouts per second and
    the GPU 170,000. But:
     - My computer has two cores (so it gets 94,000 playouts with 2
    threads)
     - My computer's processor (Intel Core Duo 6600) is 3 years old,
    and far from state of the art
     - My graphics card (GeForce 285), on the other hand, is recently
    purchased and one of the top graphics cards

    That means that my old CPU already gets more than half the speed
    of the GPU. An Intel Nehalem processor would surely beat it, let
    alone an 8-core system. Bearing in mind the severe drawbacks of
    the GPU - these are not general purpose processors, there is much
    you can't do on them - this limits their usefulness with this
    algorithm. Compare this speedup to truly highly parallel
    algorithms: random number generation, matrix multiplication,
    monte-carlo simulation of options (which are highly parallel
    because there is no branching and little data); you see speedups
    of 10x to 100x over the CPU with those.

    The 9% occupancy may be puzzling but there is little that can be
    done about that. This, and the talk about threads and blocks would
    take a while to explain, because GPUs don't work like general
    purpose CPUs. They are SIMD processors meaning that each processor
    can run many threads in parallel on different items of data but
    only if *all threads are executing the same instruction*. There is
    only one instruction decoding stage per processor cycle. If any
    "if" statements or loops diverge, threads will be serialised until
    they join again. The 9% occupancy is a function of the amount of
    data needed to perform the task, and the branch divergence (caused
    by the playouts being different). There is little that can be done
    about it other than use a completely different algorithm.

    If you google "CUDA block threads" you will find out more. In
    short, the GPU runs like a grid cluster. In each block, 64 threads
    run in parallel, conceptually. On the actual hardware, in each
    processor 16 threads from one block will execute followed by 16
    from another ("half-warps"). If any threads are blocked (memory
    reads cost ~400 cycles!) then threads from another block are
    scheduled instead. So the answer is: yes, there are 64 * 80
    threads conceptually but they're not always scheduled at the same
    time.

    Comments on specific questions below.

        If parallelism is what you're looking for, why not have one
        thread per move candidate? Use that to collect AMAF
        statistics. 16 KB is not a lot to work with, so the
        statistics may have to be shared.

    One thread per move candidate is feasible with the architecture I
    used, since every thread has its own board. I have not implemented
    AMAF, so I cannot comment on the statistics bit, but the "output"
    of your algorithm is typically not in the 16k shared memory
    anyway. You'd write that to global memory (1GB). Would uniform
    random playouts be good enough for this though?


        Another question I'd have is whether putting in two graphics
        cards would double the capacity.

    Yes it would. It would pretty much precisely double it (the "grid"
    to schedule over just gets larger, but there is no additional
    overhead).


        Did you try this for 9x9 or 19x19?

    I used 19x19. If you do it for 9x9, you can probably run 128
    threads per block because of the smaller board representation. The
    speedup would be correspondingly larger (4x or more). I chose
    19x19 because of the severe memory limitations of the
    architecture; it seemed that 9x9 would just make my life a bit too
    easy for comfort...


    Christian

    _______________________________________________
    computer-go mailing list
    computer-go@computer-go.org
    http://www.computer-go.org/mailman/listinfo/computer-go/



--

Christian Nentwich

Director, Model Two Zero Ltd.
+44-(0)7747-061302
http://www.modeltwozero.com