On Sep 9, 2009, at 11:57 PM, Christian Nentwich wrote:
Mark,
let me try to add some more context to answer your questions. When
I say in my conclusion that "it's not worth it", I mean it's not
worth using the GPU to run playout algorithms of the sort that are
in use today. There may be many other algorithms that form part of
Go engines where the GPU can provide an order-of-magnitude speedup.
Still more where the GPU can run in parallel with the CPU to help.
In my experiments, a CPU core got 47,000 playouts per second and
the GPU 170,000. But:
- My computer has two cores (so it gets 94,000 playouts with 2
threads)
Later generation quad cores hardly have a higher IPC than Core 2; the
Core 2 Duo has a brilliant IPC.
Nehalem is hardly faster than Phenom 2 or Core 2 for Diep. It's really
the compiler quality and tricks with Turbo Boost
that lure the audience (for example, the experimental test machine at
test sites gets cooled down to far below 20°C, the entire machine,
all components, as there is a power increase of 10% when moving from
25°C to 50°C; ever seen a machine at home that's
cooler than 50°C? Additionally the Turbo Boost gets manually
overclocked/pushed to +600 MHz, and the RAM is a type of RAM
you can't really afford, bla bla bla). Of course these are multibillion
companies, and every single one of them tries to outdo the others
to look better.
So really you should compare it 1 to 1 power-wise.
The GTX 285 then is not so impressive: it's on par with quad-core
Nehalems in terms of gflops per watt.
I wouldn't say it's an outdated GPU, as it is a fast GPU, but for
GPGPU it obviously is slow.
The latest AMD GPU is, however, 4 times better here.
So your result is at most a factor of 2 off for the Core 2 playouts
there, against a chip that in other areas is on par with Nehalem.
You beat it by a factor of 2 there.
An 8-core machine is not a fair comparison, as those have 2 sockets. You
should compare that with the 4-Tesla setup, which has 960 stream cores.
The only fair Nvidia comparison with quad cores is when using Tesla. Now
I realize it is nearly $1800 apiece, which is a lot for a GPU
on steroids,
yet that's a fair comparison to be honest.
If we compare things, let's compare fairly. An 8-core Nehalem is the
maximum number of cores Intel can deliver in a single machine as a fast
machine.
I'm skipping the single-memory-controller 24-core box from a year
ago (Dunnington).
The 8-core setup you really should compare against the Tesla times 4,
so that's 960 cores.
In reality you take a 300 euro card now. What's there for 300 euro
from Intel or AMD? Not a 3.2 GHz i7-965, that's for sure,
as that thing costs 1000+ euro.
So effectively you lose at most a factor of 2, and your thing on the
Nvidia is still scaling better then.
As for the parallel speedup one would get out of game tree search
with so many threads versus 4 fast cores, this is a reality.
Yes, that's not so efficient yet.
However, there is a solution there to get a good speedup (at least for
chess) that I figured out on paper. If there is 1 solution, I bet
there are more,
and also solutions for computer Go. The problem, as always, is getting
funded to carry out something like that, as software on a GPU doesn't
sell,
of course.
- My computer's processor (Intel Core 2 Duo E6600) is 3 years old,
and far from state of the art
- My graphics card (GeForce GTX 285), on the other hand, was recently
purchased and is one of the top graphics cards
That means that my old CPU already gets more than half the speed
of the GPU. An Intel Nehalem processor would surely beat it, let
alone an 8-core system. Bearing in mind the severe drawbacks of the
GPU - these are not general purpose processors, there is much you
can't do on them - this limits their usefulness with this
algorithm. Compare this speedup to truly highly parallel
algorithms: random number generation, matrix multiplication, Monte
Carlo simulation of options (which are highly parallel because
there is no branching and little data); you see speedups of 10x to
100x over the CPU with those.
Matrix multiplication is not THAT easy to solve well. Initial attempts
were 20% efficient on Nvidia GPUs when using faster approaches
(so not the stupidly simple manner, which is
ugly slow, but FFT-wise), and that was for ideally sized matrices. So if
it is already that inefficient in the lab, that means there are
problems everywhere.
It's maturing rapidly now, however.
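
To make the "stupidly simple manner" concrete, here is a minimal naive
CUDA matrix multiply (a generic illustration, not code from this thread):
every thread re-reads 2*N values from global memory with no reuse at all,
which is why the naive kernel lands far below peak and why tuned versions
stage tiles in shared memory.

// Naive matrix multiply, C = A * B, all N x N, row-major. Each thread
// computes one element of C and reads 2*N values straight from global
// memory with no data reuse -- the "ugly slow" baseline. Tuned kernels
// stage tiles of A and B in shared memory to get anywhere near peak.
__global__ void matmul_naive(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < N; ++k)
        sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;
}
// Launch, e.g. for N = 1024:
//   dim3 block(16, 16);
//   dim3 grid((N + 15) / 16, (N + 15) / 16);
//   matmul_naive<<<grid, block>>>(dA, dB, dC, N);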
The 9% occupancy may be puzzling but there is little that can be
done about that. This, and the talk about threads and blocks would
take a while to explain, because GPUs don't work like general
purpose CPUs. They are SIMD processors meaning that each processor
can run many threads in parallel on different items of data but
only if *all threads are executing the same instruction*. There is
only one instruction decoding stage per processor cycle. If any
"if" statements or loops diverge, threads will be serialised until
they join again. The 9% occupancy is a function of the amount of
data needed to perform the task, and the branch divergence (caused
by the playouts being different). There is little that can be done
about it other than use a completely different algorithm.
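
As a minimal illustration of that divergence (a toy kernel, not the
playout code discussed here): when threads in the same warp disagree on
a branch, the hardware runs both sides one after the other with the
inactive lanes masked off.

// Toy kernel showing branch divergence. Threads in the same warp that
// disagree on the condition are serialised: the hardware runs the
// "then" side with some lanes masked off, then the "else" side, so the
// warp pays for both paths.
__global__ void divergent(const int *in, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    if (in[tid] & 1)              // lanes take different paths per element
        out[tid] = in[tid] * 3;   // run first, the other lanes sit idle
    else
        out[tid] = in[tid] / 2;   // run second, the first group sits idle
}
// In a playout kernel the branches depend on each thread's own game
// state, so far more of the code diverges than in this toy example.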
If you google "CUDA block threads" you will find out more. In
short, the GPU runs like a grid cluster. In each block, 64 threads
run in parallel, conceptually. On the actual hardware, in each
processor 16 threads from one block will execute followed by 16
from another ("half-warps"). If any threads are blocked (memory
reads cost ~400 cycles!)
400 cycles is a rather optimistic guess.
Nvidia's own quote for the 8800 was 600 cycles,
yet I assume that's when all other stream cores idle and with
ideal sequential reads from all stream cores,
not criss-cross random reads as you do in reality.
then threads from another block are scheduled instead. So the
answer is: yes, there are 64 * 80 threads conceptually but they're
not always scheduled at the same time.
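
For a concrete picture of those numbers (the kernel name and body below
are placeholders, not the actual playout code): 80 blocks of 64 threads
gives the 64 * 80 threads mentioned above, issued 16 lanes at a time,
with the scheduler switching to another block whenever one stalls on a
long global memory read.

#include <cuda_runtime.h>

// Placeholder kernel: each thread would run one playout and record its
// result; the body is elided because only the launch shape matters here.
__global__ void playout_kernel(unsigned int *results)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    results[tid] = 0;  // stand-in for a real playout result
}

// 80 blocks * 64 threads = the 64 * 80 threads mentioned above. The
// hardware issues them 16 lanes (a half-warp) at a time per processor,
// and whenever a block stalls on a ~400-600 cycle global memory read
// the scheduler runs threads from another block instead; that is how
// the latency gets hidden.
void launch_playouts(unsigned int *d_results)   // must hold 80 * 64 entries
{
    playout_kernel<<<80, 64>>>(d_results);
    cudaDeviceSynchronize();
}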
Comments on specific questions below.
If parallelism is what you're looking for, why not have one thread
per
move candidate? Use that to collect AMAF statistics. 16 KB is not a
lot
to work with, so the statistics may have to be shared.
One thread per move candidate is feasible with the architecture I
used, since every thread has its own board. I have not implemented
AMAF, so I cannot comment on the statistics bit, but the "output"
of your algorithm is typically not in the 16 KB shared memory anyway.
You'd write that to global memory (1 GB). Would uniform random
playouts be good enough for this though?
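
A rough sketch of what that could look like (all names here are
hypothetical, and the playout itself is faked with a tiny RNG so the
example stands alone): one thread per move candidate, AMAF-style
win/visit counters accumulated in global memory with atomics, leaving
the 16 KB of shared memory for the working boards only.

#include <cuda_runtime.h>

#define NUM_POINTS (19 * 19)   // one move candidate per board point

// Hypothetical "one thread per move candidate" kernel: each thread runs
// a playout biased towards its own candidate move and accumulates
// AMAF-style win/visit counters in global memory.
__global__ void amaf_kernel(unsigned int seed,
                            unsigned int *wins,    // [NUM_POINTS]
                            unsigned int *visits)  // [NUM_POINTS]
{
    int move = blockIdx.x * blockDim.x + threadIdx.x;
    if (move >= NUM_POINTS) return;

    // Stand-in for a real uniform random playout starting with 'move'.
    unsigned int rng = seed ^ ((unsigned int)move * 2654435761u);
    rng = rng * 1664525u + 1013904223u;
    int won = (int)(rng & 1u);

    // Global-memory atomics keep the shared statistics consistent
    // across all threads and blocks.
    atomicAdd(&visits[move], 1u);
    if (won)
        atomicAdd(&wins[move], 1u);
}
// Launch example: amaf_kernel<<<6, 64>>>(seed, d_wins, d_visits)
// covers the 361 candidate points of a 19x19 board.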
Another question I'd have is whether putting two graphics card would
double the capacity.
Yes it would. It would pretty much precisely double it (the "grid"
to schedule over just gets larger, but there is no additional
overhead).
Well, here you hit a problem. Two GPUs have no shared memory between
each other, so for computer chess it's not so easy to have two GPUs
cooperate (say the X2 versions or the GTX 295).
Note that there are memory links between the two GPUs. Nvidia and AMD
each have their own solution there, but it's again a lot of programming
work to get it working, I'd assume, and the speedup is not
so very brilliant....
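
For what it's worth, a minimal host-side sketch of driving two cards
(generic CUDA runtime calls; the playout kernel is again a placeholder):
each device gets its own buffers and kernel launches, and any tighter
cooperation has to go through the host, since the cards don't see each
other's memory.

#include <cuda_runtime.h>

// Placeholder playout kernel, as in the earlier sketches.
__global__ void playout_kernel(unsigned int *results, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        results[tid] = 0;  // stand-in for a real playout result
}

// Each GPU gets its own buffer and its own launch; the host merges the
// results afterwards. Any shared state (a transposition table, say)
// would have to live on the host or be copied between devices explicitly.
void run_on_two_gpus(unsigned int *h_results0, unsigned int *h_results1, size_t n)
{
    unsigned int *d0 = 0, *d1 = 0;

    cudaSetDevice(0);
    cudaMalloc(&d0, n * sizeof(unsigned int));
    playout_kernel<<<80, 64>>>(d0, (int)n);   // runs asynchronously on device 0

    cudaSetDevice(1);
    cudaMalloc(&d1, n * sizeof(unsigned int));
    playout_kernel<<<80, 64>>>(d1, (int)n);   // runs concurrently on device 1

    cudaSetDevice(0);
    cudaMemcpy(h_results0, d0, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d0);

    cudaSetDevice(1);
    cudaMemcpy(h_results1, d1, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d1);
}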
Did you try this for 9x9 or 19x19?
I used 19x19. If you do it for 9x9, you can probably run 128
threads per block because of the smaller board representation. The
speedup would be correspondingly larger (4x or more). I chose 19x19
because of the severe memory limitations of the architecture; it
seemed that 9x9 would just make my life a bit too easy for comfort...
Well, we can also play chess on a 4x4 chessboard, or play checkers on
an 8x8 board, but it isn't much fun, is it? 19x19 is more realistic for
Go.
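
A quick back-of-the-envelope view of why the board size matters so much
here (the block sizes are the ones mentioned above; the conclusion about
packing is an assumption, since the actual board representation isn't
given in the thread):

#include <stdio.h>

/* Shared-memory budget per thread for a private board, at the block
   sizes discussed above. */
int main(void)
{
    const int shared_mem = 16 * 1024;  /* 16 KB of shared memory per block */

    printf("19x19, 64 threads/block: %d bytes per board\n", shared_mem / 64);   /* 256 */
    printf("9x9, 128 threads/block:  %d bytes per board\n", shared_mem / 128);  /* 128 */

    /* 256 bytes cannot hold 361 points at one byte per point, so the
       19x19 board presumably has to be packed; 128 bytes covers the 81
       points of a 9x9 board much more comfortably, which is why ~128
       threads per block looks feasible there. */
    return 0;
}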
Christian
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/