Re: [computer-go] Bitmap Go revisited and mockup implementation

René van de Veerdonk Mon, 24 Aug 2009 17:55:19 -0700

Oops,

Speed drops more that the (per move) estimate below due to the increase in
length of the game as well, of course.


René

On Mon, Aug 24, 2009 at 5:36 PM, René van de Veerdonk <
rene.vandeveerd...@gmail.com> wrote:

> Lukasz,
> I tested Libego version 0.125 on my labtop, compiled using Visual Studio
> 2008 under Windows XP with slight modifications:
>
> #ifdef NDEBUG section commented out in "testing.h"
> #include <time.h> added to "fast_random.cpp"
>
> From the command-line with "engine -b" I observed 67-71 kpps on the same
> labtop as I used to benchmark the 9x9 bitmap mockup achieving 75-80 kpps.
>
> It seems Libego performs much better (going from 70 to 105 kpps is an
> impressive 50% improvement) with other compilers and OS. For sure it was not
> developed for MS Visual Studio, so my numbers are not an apples to apples
> comparison. But I think it is fair to say that both implementations are at
> least in the same ball-park when compiled as is under Visual Studio in
> Windows XP. I do not have an explanation for why my results are 50% below
> the numbers you quoted other than the difference in compiler (inlining?) and
> OS (32 vs 64-bit?). It would be really interesting to have someone else do a
> comparison. Given that I was natively working and optimizing in Visual
> Studio, I am not expecting a 50% improvement for the bitmap mockup (but
> hopefully it will not decrease 50% either).
>
> As for testing on a 19x19 board, this is not as straightforward for me as
> modifying a single parameter in the program. The basic data-type needs to be
> changed from a single 128-bit variable to an array of three such elements.
> This requires a redefinition of most basic operations, in particular the
> shift operations will take some work as these go across array-elements. In
> the generated assembly I could already see that there are not enough 128-bit
> registers for 9x9 in some functions, and this will only get worse with
> arrays. Dereferencing array elements potentially also add overhead. I am
> going to give it a try, but do not expect results overnight, as this is a
> low intensity hobby for me (note that Antoine's e-mail was in November
> 2006). Optimistically, a 3x speed reduction may be expected, but the
> additional overhead and and switch-logic could easily push the drop to over
> 10x unless branching can be minimized.
>
> If someone could look over the mockup and suggest ways to reduce the amount
> of branching there I would really appreciate it. If someone would eventually
> try to port something like this to a GPU, that would also benefit
> tremendously from a reduction in branching. I find that I need to be very
> conscious about branching during coding, as I have observed a tendency to
> put ever more "if", "do" and "while" statements in my code.
>
> Also, does anybody know if gcc allows the same interface to the
> SSE-instructions as I have adopted in bitmap.cpp?
>
> Thanks,
>
> René
>
> On Mon, Aug 24, 2009 at 4:18 PM, Łukasz Lew <lukasz....@gmail.com> wrote:
>
>> 2009/8/24 René van de Veerdonk <rene.vandeveerd...@gmail.com>:
>> > David,
>> > Confession: I have not tested 19x19. As you have noted, and others
>> before
>> > you over the years, a 19x19 board does not fit in one but three 128-bit
>> > registers and there would be a rather big penalty as a result, perhaps
>> > (likely?) wiping out all of the benefits of bitmaps. Antoine voiced his
>> > disappointment about the speed advantage even on 9x9 in the e-mail to
>> the
>> > list that served as my basis. My hope, as Hideko Kato points out, is in
>> > longer registers or perhaps being able to port this to a GPU. A 64-bit
>> OS
>> > with twice the number registers would also surely help out.
>> Nevertheless, I
>> > was surprised that I was able to get to almost the same level of speed
>> as
>> > Libego provides.
>>
>> Libego provides 105-108 kpps on a single Core2 2.4 GHz on 9x9.
>> and 22-23 kpps on 19x19
>>
>> Can you test your bitmap implementation on 19x19?
>>
>> Lukasz
>>
>> > The goals for this mockup were very humble: (1) to provide a basic
>> > implementation to see what it can do (and if I could do it), and (2)
>> learn
>> > how to work with assembly and/or intrinsic functions (I had never done
>> that
>> > before). It may not be too hard to try 19x19 and just see what happens.
>> I
>> > may attempt this as a follow-up.
>> > René van de Veerdonk
>> > 2009/8/23 David Fotland <fotl...@smart-games.com>
>> >>
>> >> How much would you lose for 19x19 board?  A board representation is not
>> >> very interesting unless it scales to 19 line boards.
>> >>
>> >>
>> >>
>> >> David
>> >>
>> >>
>> >>
>> >> From: computer-go-boun...@computer-go.org
>> >> [mailto:computer-go-boun...@computer-go.org] On Behalf Of René van de
>> >> Veerdonk
>> >> Sent: Sunday, August 23, 2009 1:11 PM
>> >> To: computer-go@computer-go.org
>> >> Subject: [computer-go] Bitmap Go revisited and mockup implementation
>> >>
>> >>
>> >>
>> >> Hi all,
>> >>
>> >>
>> >>
>> >> After years of reading this very interesting list, I decided to make my
>> >> first contribution this week after reading once again about bitmap go.
>> There
>> >> is no freely available implementation of a bitmap based engine that I
>> know
>> >> of, so here is a start. Note that I am not a professional programmer,
>> >> self-taught, and this is a part-time hobby, so do not expect perfect
>> form.
>> >>
>> >>
>> >>
>> >> The attachment is an attempt to create an efficient implementation of a
>> >> bitmap based light playout engine, following mostly the recommendations
>> as
>> >> spelled out by Antoine de Maricourt in his November 2006 message to
>> this
>> >> list. The application is single threaded and achieves around 75-80 kpps
>> on
>> >> my Centrino Duo 2.4 GHz Intel labtop. The mockup was developed in C/C++
>> >> using the Microsoft Visual Studio 2008 IDE for Windows XP (32-bit) and
>> >> undoubtedly suffers from portability issues as a result. The 128-bit
>> SSE
>> >> instructions are called through intrinsic functions, hopefully at least
>> >> these interfaces are the same for other compilers. The single goal of
>> this
>> >> mockup was to do some benchmarking, so there is no gtp-interface, but
>> that
>> >> would be rather easy to add for those skilled in the art. The standard
>> rules
>> >> are used with one exception, the Brown eye-rule is used. Multi-stone
>> suicide
>> >> is explicitly forbidden. No superko checks are performed during
>> playouts,
>> >> and there is no mercy-rule either.
>> >>
>> >>
>> >>
>> >> I would be particularly interested to know if a 64-bit OS will improve
>> the
>> >> performance significantly, as it does have a lot less register
>> pressure.
>> >> Moving to a 64-bit OS will also make some instructions available that
>> allow
>> >> for further simplifications/efficiency improvements in the code. As
>> will
>> >> enabling SSE3 or SSE4.1 instructions.
>> >>
>> >>
>> >>
>> >> One note, the random number generator used is from Agner Fog's public
>> >> library and was not included in the attachment. You will have to
>> provide a
>> >> random number generator yourself or download the appropriate version
>> from
>> >> his well-known website (www.agner.org/optimize)
>> >>
>> >>
>> >>
>> >> I hope this is the start of a discussion about further improvements
>> that
>> >> can be made, but you may use this code as you wish with no restrictions
>> or
>> >> guarantees. See the readme.txt file included in the attachment for more
>> >> details (also partially included below).
>> >>
>> >>
>> >>
>> >> Comments and feedback welcome,
>> >>
>> >>
>> >>
>> >> René van de Veerdonk
>> >>
>> >>
>> >>
>> >> Excerpts from the readme.txt file:
>> >>
>> >>
>> >>
>> >> Introduction:
>> >>
>> >> =============
>> >>
>> >>
>> >>
>> >> Mockup of a high performance implementation of the game of Go using
>> >> bitmaps
>> >>
>> >> as the principle datastructures. This implementation is limited to
>> >> measuring
>> >>
>> >> the performance of the principle components by playing random playouts
>> >> from
>> >>
>> >> an empty 9x9 board, using Chinese scoring, not allowing pass moves
>> until
>> >>
>> >> no other moves are available. Suicide is not allowed, neither for
>> single
>> >>
>> >> or multi-stone groups. No checks are implemented for superko, but
>> simple
>> >>
>> >> ko's are disallowed explicitly.
>> >>
>> >>
>> >>
>> >> Some implementation details:
>> >>
>> >> ============================
>> >>
>> >>
>> >>
>> >> Benchmarking is done in test_board.cpp, where the number of playouts
>> (n)
>> >> and
>> >>
>> >> repeats (j) for each test is set. Benchmarking is done starting from an
>> >> empty
>> >>
>> >> 9x9 board until two (three with ko) passes are played in sequence. The
>> >>
>> >> benchmark playout follow the same scheme as method
>> >> board_t::play_random_game,
>> >>
>> >> but with timing instrumentation added.
>> >>
>> >>
>> >>
>> >> Clock counts (cc) are measured using the cc_time_of macro. Total
>> run-time
>> >>
>> >> measurement relies on the windows library.
>> >>
>> >>
>> >>
>> >> For each move during a playout, the program goes through a sequence of
>> >>
>> >> (1) identify all legal moves (~50cc)
>> >>
>> >>     this follows Brown's standard definition; this definition differs
>> from
>> >>
>> >>     most Monte-Carlo engines and has a different set of pros and cons;
>> >>
>> >>     bitmaps allow all moves to be processed in parallel
>> >>
>> >> (2) pick a uniformly random move from the available choices (~75cc)
>> >>
>> >>     relies on a modulus operation; is not 100% uniform to save speed
>> >>
>> >> (3) playing it on the board (~170cc)
>> >>
>> >>     lengthiest method; but rather straightforward
>> >>
>> >>
>> >>
>> >> Playouts end after 256 moves or whenever two passes are played in
>> >> succession
>> >>
>> >> (three in case of a ko). The routines for each of these basic steps are
>> >>
>> >> contained in the file board.cpp. The underlying datastructures are
>> 128-bit
>> >>
>> >> bitmaps, for which the sse2 instructions used to manipulate them are
>> >> contained
>> >>
>> >> in the file bitmap.cpp.
>> >>
>> >>
>> >>
>> >> Further improvements can be expected from using any 64-bit OS that
>> >> supports
>> >>
>> >> several more assembly instructions, and from using sse4.1 instructions.
>> >>
>> >>
>> >>
>> >> Example output (on a Intel 2.40 GHz Centrino Duo labtop):
>> >>
>> >> =========================================================
>> >>
>> >>
>> >>
>> >> [game] = 30469.7, [moves] = 111.071
>> >>
>> >> [game] = 30245.4, [moves] = 111.068
>> >>
>> >> [game] = 30221.7, [moves] = 111.089
>> >>
>> >> [game] = 30264.8, [moves] = 111.122
>> >>
>> >> [game] = 30205.8, [moves] = 111.101
>> >>
>> >>
>> >>
>> >> [game] = 77.1084 kpps, [moves] = 111.023
>> >>
>> >> [game] = 78.0488 kpps, [moves] = 111.045
>> >>
>> >> [game] = 77.1084 kpps, [moves] = 111.046
>> >>
>> >> [game] = 78.0488 kpps, [moves] = 111.131
>> >>
>> >> [game] = 78.0488 kpps, [moves] = 111.082
>> >>
>> >>
>> >>
>> >> [legal] 110/51, [pick] 110/74, [play] 106/169, [score] 40, [win] 0.4187
>> >>
>> >> [legal] 111/51, [pick] 111/74, [play] 106/168, [score] 42, [win] 0.4201
>> >>
>> >> [legal] 111/52, [pick] 111/74, [play] 106/169, [score] 43, [win] 0.4276
>> >>
>> >> [legal] 111/52, [pick] 111/75, [play] 106/170, [score] 41, [win] 0.4092
>> >>
>> >> [legal] 111/51, [pick] 111/74, [play] 106/171, [score] 45, [win] 0.4221
>> >>
>> >>
>> >>
>> >> finished, press any key to exit
>> >>
>> >>
>> >>
>> >> Explanation of output:
>> >>
>> >> ======================
>> >>
>> >>
>> >>
>> >> There are three sets of repeated (5x) measurements, performed and timed
>> >>
>> >> independently to avoid cross-contamination of inserted serialization
>> >>
>> >> instructions and timing overhead (which still does not quite prevent
>> >>
>> >> reordering of instructions across timing boundaries).
>> >>
>> >>
>> >>
>> >> [game]  cc's per full playout
>> >>
>> >> [moves] average number of moves per playout
>> >>
>> >>
>> >>
>> >> [game]  thousands of playouts per second (kpps)
>> >>
>> >> [moves] average number of moves per playout
>> >>
>> >>
>> >>
>> >> "calls per playout" / "cc's per call"
>> >>
>> >> [legal] identifies all legal moves on the board
>> >>
>> >> [pick]  picks a random move
>> >>
>> >> [play]  processes a move
>> >>
>> >> [score] scores a final position
>> >>
>> >> [win]   winrate for black
>> >>
>> >>
>> >>
>> >> Acknowledgements:
>> >>
>> >> =================
>> >>
>> >>
>> >>
>> >> Antoine de Maricourt, for his very helpful description of his bitmap go
>> >>
>> >>   implementation in detail on the computer-go mailing list:
>> >>
>> >>
>> http://computer-go.org/pipermail/computer-go/2006-November/007136.html
>> >>
>> >> Gunnar Farneback, for providing the well documented Brown engine,
>> >>
>> >>   whoms legal move definition is used in this mockup
>> >>
>> >> Lukasz Lew, for the high performance Libego implementation, which has
>> >> inspired
>> >>
>> >>   various parts of this mockup as well
>> >>
>> >> Agner Fog, for the Mersenne Twister and clock-cycle timing library
>> >>
>> >> Computer-go mailing list, where many ideas are posted and discussed
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> computer-go mailing list
>> >> computer-go@computer-go.org
>> >> http://www.computer-go.org/mailman/listinfo/computer-go/
>> >
>> >
>> > _______________________________________________
>> > computer-go mailing list
>> > computer-go@computer-go.org
>> > http://www.computer-go.org/mailman/listinfo/computer-go/
>> >
>> _______________________________________________
>> computer-go mailing list
>> computer-go@computer-go.org
>> http://www.computer-go.org/mailman/listinfo/computer-go/
>>
>
>

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

Re: [computer-go] Bitmap Go revisited and mockup implementation

Reply via email to