Hi!

On Tue, Jan 08, 2013 at 01:26:00AM +0400, Alexander Kozlovsky wrote:
> 2) What is a typical data size of structures used in modern Monte-Carlo Go
> engines?
> - size of UCT tree node;

  You will need at least the UCT and RAVE values (# of playouts and
value, or # of wins), the move coordinates and, say, an extra byte for
some flags per child, plus a pointer to the children list.
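
  For illustration, a minimal node layout along those lines might look
like this (just a sketch; the field names and sizes are my assumptions,
not anyone's actual engine):

#include <stdint.h>

struct uct_node {
        struct uct_node *children;  /* first child; siblings adjacent */
        uint32_t uct_playouts;      /* # of playouts through this node */
        float    uct_value;         /* mean playout result */
        uint32_t rave_playouts;     /* # of AMAF/RAVE samples */
        float    rave_value;        /* mean RAVE result */
        uint16_t coord;             /* move this child represents */
        uint8_t  flags;             /* e.g. "solved" / "pruned" bits */
};

  With 8-byte pointer alignment that pads to 32 bytes per child, which
gives you a feel for the per-node budget.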

> - size of board representation;

  The bare minimum of course is two bits per point. Usually, an extra
edge frame (or half-frame) is added so that you do not have to
special-case border areas.
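
  Concretely, something like this (a sketch; one byte per point here
for clarity, the two-bit packing is a further refinement):

#include <stdint.h>

enum { S_NONE, S_BLACK, S_WHITE, S_OFFBOARD };  /* 4 values = 2 bits */
#define BOARD  19
#define STRIDE (BOARD + 2)              /* one-point edge frame */
#define COORD(x, y) ((y) * STRIDE + (x))

static uint8_t board[STRIDE * STRIDE];  /* frame preset to S_OFFBOARD */

/* Thanks to the frame, neighbor walks never need bounds checks: */
static const int delta[4] = { -STRIDE, -1, +1, +STRIDE };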

  Then again, for an efficient implementation you will also need at
least some representation of stone chains and chain-membership
information for each point. For each chain, you will want to keep at
least an approximate number of liberties and a fast way to get the
liberties of low-liberty chains (at least the last two libs). You will
need to be able to quickly iterate over all points belonging to a chain
(for captures), e.g. by having another "next-stone-in-chain" field
per point.
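
  Continuing the sketch above (again, all names and sizes are just
illustrative):

#define MAX_CHAINS 256                  /* assumption; plenty on 19x19 */

struct chain {
        uint16_t libs;                  /* (approximate) liberty count */
        uint16_t lib[2];                /* last two liberties cached */
};
static struct chain chains[MAX_CHAINS];
static uint16_t chain_at[STRIDE * STRIDE];      /* chain id per point */
static uint16_t next_stone[STRIDE * STRIDE];    /* circular stone list */

/* Removing a captured chain is then just a walk around the cycle: */
static void chain_remove(int stone)
{
        int s = stone;
        do {
                int next = next_stone[s];
                board[s] = S_NONE;
                chain_at[s] = 0;
                s = next;
        } while (s != stone);
}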

  You may also want to maintain a queue of free positions for random
move selection, and some extra cached data (e.g. I cache the number of
neighbors of various colors).
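
  The usual trick for that queue is O(1) swap-removal; a sketch, with
a point -> slot index kept alongside:

static uint16_t freepos[STRIDE * STRIDE];   /* list of empty points */
static uint16_t freeslot[STRIDE * STRIDE];  /* point -> its slot */
static int nfree;

static void freepos_remove(int pt)
{
        int slot = freeslot[pt];
        int last = freepos[--nfree];
        freepos[slot] = last;               /* swap the last one in */
        freeslot[last] = slot;
}

  Random move candidates are then just freepos[rng() % nfree], with
rng being whatever generator you use.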

  Then, you'll need to guard against superko by having a hash code per
point+color and a hash map of encountered positions.
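
  The standard way is Zobrist hashing; a sketch, reusing STRIDE from
the board sketch above (the set of seen positions here is a toy
open-addressed table, and SEEN_SLOTS is my assumption):

#include <stdbool.h>
#include <stdint.h>

static uint32_t zobrist[STRIDE * STRIDE][3];    /* [point][color];
                                                   slot S_NONE unused */
static uint32_t hash;                           /* current position */

static void hash_toggle(int pt, int color)
{
        hash ^= zobrist[pt][color];     /* same op to place and remove */
}

#define SEEN_SLOTS 1024                 /* power of two, by assumption */
static uint32_t seen[SEEN_SLOTS];

static bool superko_violation(uint32_t h)
{
        unsigned i = h & (SEEN_SLOTS - 1);
        while (seen[i]) {
                if (seen[i] == h)
                        return true;
                i = (i + 1) & (SEEN_SLOTS - 1);
        }
        seen[i] = h;                    /* record this new position */
        return false;
}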

> - size of patterns library used for playout;

  The bare minimum for a good computer player is 3x3 patterns around
the last move, two bits per point, i.e. 16 bits per pattern. (Of
course, you can use more bits or slightly larger patterns, or use a
decision tree instead of pattern matching.) You can extend this to much
larger patterns, more moves, more features and so on, but that's not so
crucial. You'd have about 8-16 such patterns (up to isomorphism).
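
  Matching can then be a straight table lookup; a sketch, reusing the
board layout from above (checking the 8 isomorphisms is left out):

static const int neigh8[8] = {
        -STRIDE - 1, -STRIDE, -STRIDE + 1,
        -1,                   +1,
        +STRIDE - 1, +STRIDE, +STRIDE + 1,
};

static uint16_t pattern3x3(const uint8_t *board, int last_move)
{
        uint16_t pat = 0;
        for (int i = 0; i < 8; i++)    /* 8 neighbors x 2 bits = 16 bits */
                pat = (uint16_t)((pat << 2) | board[last_move + neigh8[i]]);
        return pat;                    /* index into a 65536-entry table */
}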

> Below I describe the purpose of this second question.
> 
> I have a strong suspicion it is possible to write a strong GPU-based
> Go engine. Specifically, I want to use the GTX 580 graphics card,
> based on the Nvidia Fermi architecture.

  Great that someone else wants to try! I'm curious what comes
out of it.

  I'm one of the two people who tried before and reported negative
results here. I'm still sceptical, but I don't want to squash your
enthusiasm. Quite possibly, with sufficient effort, you will find a
few ingenuities we missed.

> Each of these playouts will use a single warp (32 GPU threads) and
> typically will use only the first thread of the warp, but for several
> tasks, such as Zobrist hash calculation, pattern matching and
> relatively good selection of the next random playout move, all warp
> threads will be used.

  You should also be able to fully utilize all threads for tree node
selection. On the other hand, I worry a little about memory latency
when walking the tree.
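
  For instance, in-node selection can be a strided scan followed by an
in-warp reduction; a CUDA sketch (Fermi has no warp shuffle, so this
goes through a per-warp scratch area in shared memory; all names here
are mine):

/* Each warp picks the argmax over its node's precomputed UCB scores.
   sval/sidx point at 32 floats/ints of this warp's shared memory. */
__device__ int select_child(const float *ucb, int nchildren,
                            volatile float *sval, volatile int *sidx)
{
        int lane = threadIdx.x & 31;
        float best = -1.0f;
        int best_i = 0;
        for (int i = lane; i < nchildren; i += 32)
                if (ucb[i] > best) { best = ucb[i]; best_i = i; }
        sval[lane] = best; sidx[lane] = best_i;
        /* Tree reduction; warp lanes run in lock-step on Fermi. */
        for (int off = 16; off > 0; off >>= 1)
                if (lane < off && sval[lane + off] > sval[lane]) {
                        sval[lane] = sval[lane + off];
                        sidx[lane] = sidx[lane + off];
                }
        return sidx[0];                 /* all lanes read the winner */
}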

> To achieve the goal of fitting all the necessary data inside the GPU
> die, I must satisfy the following restrictions:
> 
> - A single playout must use no more than 6 kilobytes of memory
> (technically speaking, it will be the shared memory of an SM
> (streaming multiprocessor); 8 simultaneous independent playouts
> executed on a single SM may use up to 48 kilobytes of shared memory
> in total)

  A lot of this memory will be overhead for local variables of the
threads; you also need to store the tree descent stack etc. Even if all
6 KB of memory were for the board structure, that's 15 bytes per
intersection on 19x19 (or rather 20x20). The trouble here, I think, is
the superko detection; you would need some clever way to cheat. If you
don't care about 9x9 strength but just about large boards, perhaps some
clever band-aid trick would suffice instead of real superko
detection... but you would still be cutting it very thin.

> I suspect a light playout for 13x13 can fit in 1 kilobyte of memory,

  Zobrist hashes alone would occupy 2352 bytes: (14*14)*3*4. Maybe
2028 if you forego the edge frame (but you'll have to do expensive
index recomputation). You may want to think about reducing the 32-bit
entropy, but that's a dangerous balancing of collisions vs. memory.
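
  Spelled out (13x13 plus a one-point half-frame gives a 14x14 array;
dropping to 16-bit keys would halve it, at a real collision risk):

static uint32_t zobrist13[14 * 14][3];       /* 196*3*4 = 2352 bytes */
static uint16_t zobrist13_tiny[14 * 14][3];  /* 196*3*2 = 1176 bytes */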

  And this is 13x13. I think engines that can't work on 19x19 aren't
really very exciting nowadays, but maybe that's just me.

  The situation really seems very dire here. The tiny shared memory
is the main reason I gave up on GPU Go. It will require some original
thinking to get over these limitations.

> so 6 kilobytes leave room for ladder evaluation.

  You don't need much memory for ladder evaluation, I think; you need
to store a move stack only in case of branches, which are rare when
reading ladders.
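
  To illustrate the point (a sketch only: Board and every helper here
are assumed names, not a real API; the recursion depth is the ladder
length, and the search only really branches at the two-liberty
ataris):

#include <stdbool.h>

/* Is the chain containing `stone' (currently in atari) laddered? */
static bool is_laddered(Board *b, int stone)
{
        int prey = color_at(b, stone);
        int esc = nth_lib(b, stone, 0);         /* the single liberty */
        if (!board_play(b, prey, esc))
                return true;                    /* cannot even escape */
        bool dead;
        int n = libs_of(b, stone);
        if (n >= 3)
                dead = false;                   /* broke out */
        else if (n <= 1)
                dead = true;                    /* self-atari escape */
        else {
                /* Two liberties: the hunter tries each atari in turn;
                   this is the only branch point of the whole search. */
                dead = false;
                for (int i = 0; i < 2 && !dead; i++) {
                        int atari = nth_lib(b, stone, i);
                        if (!board_play(b, other_color(prey), atari))
                                continue;
                        if (libs_of(b, stone) == 1 && libs_of(b, atari) > 1)
                                dead = is_laddered(b, stone);
                        board_undo(b);
                }
        }
        board_undo(b);
        return dead;
}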

> - The hash table with the pattern library must fit in 512 kilobytes
> of GPU L2 cache (out of the total 768 kilobytes of L2)

  You will also want the cache to hold some popular parts of the tree,
a "base" board with the current position, etc.

  Patterns are optional. Maybe it's a good idea to design your program
around patterns from day zero, but you should be able to attain at
least KGS 1d strength with just the 3x3 patterns if you forego the
larger ones.

> - The total UCT tree size must fit in 1 gigabyte of GPU memory (of
> the 1.5 gigabytes available). I hope a single UCT node will take
> around 512 bytes for a 13x13 board, so GPU memory may hold up to
> 2 million UCT nodes.

  I think this shouldn't be a severe strength limit; if you implement
some (not necessarily even very aggressive) pruning and store your
nodes in an efficient manner, you shouldn't have trouble. Pachi, even
in fairly long games, would rarely hit a 3 GiB memory limit, and it
has no pruning and an extremely stupid and wasteful tree
representation.

  If you ever get bored of GPU Go, I think the hardware future lies in
two directions:

  (i) Xeon Phi (Larrabee). I'm so envious of any Go programmers that get
their hands on this beauty first.

  (ii) FPGAs. I think there was a paper on this a while back, but it
didn't leave me impressed; I don't remember the details. The issue here
is quick access to a large external memory for tree storage, and I'm
not sure how easy it would be to create the "board memory". So maybe I
see promise here just because I never really got around to learning
enough about real FPGA programming.

  (There's - or will be - also Parallella, but so far it's interesting
more as an education platform or embedded DSP; they would have to take
it in the right direction for it to get really interesting.)

  But getting Go to work well on a GPU would really be a breakthrough.

-- 
                                Petr "Pasky" Baudis
        For every complex problem there is an answer that is clear,
        simple, and wrong.  -- H. L. Mencken