Oops,

Speed drops more that the (per move) estimate below due to the increase in
length of the game as well, of course.

René

On Mon, Aug 24, 2009 at 5:36 PM, René van de Veerdonk <
rene.vandeveerd...@gmail.com> wrote:

> Lukasz,
> I tested Libego version 0.125 on my labtop, compiled using Visual Studio
> 2008 under Windows XP with slight modifications:
>
> #ifdef NDEBUG section commented out in "testing.h"
> #include <time.h> added to "fast_random.cpp"
>
> From the command-line with "engine -b" I observed 67-71 kpps on the same
> labtop as I used to benchmark the 9x9 bitmap mockup achieving 75-80 kpps.
>
> It seems Libego performs much better (going from 70 to 105 kpps is an
> impressive 50% improvement) with other compilers and OS. For sure it was not
> developed for MS Visual Studio, so my numbers are not an apples to apples
> comparison. But I think it is fair to say that both implementations are at
> least in the same ball-park when compiled as is under Visual Studio in
> Windows XP. I do not have an explanation for why my results are 50% below
> the numbers you quoted other than the difference in compiler (inlining?) and
> OS (32 vs 64-bit?). It would be really interesting to have someone else do a
> comparison. Given that I was natively working and optimizing in Visual
> Studio, I am not expecting a 50% improvement for the bitmap mockup (but
> hopefully it will not decrease 50% either).
>
> As for testing on a 19x19 board, this is not as straightforward for me as
> modifying a single parameter in the program. The basic data-type needs to be
> changed from a single 128-bit variable to an array of three such elements.
> This requires a redefinition of most basic operations, in particular the
> shift operations will take some work as these go across array-elements. In
> the generated assembly I could already see that there are not enough 128-bit
> registers for 9x9 in some functions, and this will only get worse with
> arrays. Dereferencing array elements potentially also add overhead. I am
> going to give it a try, but do not expect results overnight, as this is a
> low intensity hobby for me (note that Antoine's e-mail was in November
> 2006). Optimistically, a 3x speed reduction may be expected, but the
> additional overhead and and switch-logic could easily push the drop to over
> 10x unless branching can be minimized.
>
> If someone could look over the mockup and suggest ways to reduce the amount
> of branching there I would really appreciate it. If someone would eventually
> try to port something like this to a GPU, that would also benefit
> tremendously from a reduction in branching. I find that I need to be very
> conscious about branching during coding, as I have observed a tendency to
> put ever more "if", "do" and "while" statements in my code.
>
> Also, does anybody know if gcc allows the same interface to the
> SSE-instructions as I have adopted in bitmap.cpp?
>
> Thanks,
>
> René
>
> On Mon, Aug 24, 2009 at 4:18 PM, Łukasz Lew <lukasz....@gmail.com> wrote:
>
>> 2009/8/24 René van de Veerdonk <rene.vandeveerd...@gmail.com>:
>> > David,
>> > Confession: I have not tested 19x19. As you have noted, and others
>> before
>> > you over the years, a 19x19 board does not fit in one but three 128-bit
>> > registers and there would be a rather big penalty as a result, perhaps
>> > (likely?) wiping out all of the benefits of bitmaps. Antoine voiced his
>> > disappointment about the speed advantage even on 9x9 in the e-mail to
>> the
>> > list that served as my basis. My hope, as Hideko Kato points out, is in
>> > longer registers or perhaps being able to port this to a GPU. A 64-bit
>> OS
>> > with twice the number registers would also surely help out.
>> Nevertheless, I
>> > was surprised that I was able to get to almost the same level of speed
>> as
>> > Libego provides.
>>
>> Libego provides 105-108 kpps on a single Core2 2.4 GHz on 9x9.
>> and 22-23 kpps on 19x19
>>
>> Can you test your bitmap implementation on 19x19?
>>
>> Lukasz
>>
>> > The goals for this mockup were very humble: (1) to provide a basic
>> > implementation to see what it can do (and if I could do it), and (2)
>> learn
>> > how to work with assembly and/or intrinsic functions (I had never done
>> that
>> > before). It may not be too hard to try 19x19 and just see what happens.
>> I
>> > may attempt this as a follow-up.
>> > René van de Veerdonk
>> > 2009/8/23 David Fotland <fotl...@smart-games.com>
>> >>
>> >> How much would you lose for 19x19 board?  A board representation is not
>> >> very interesting unless it scales to 19 line boards.
>> >>
>> >>
>> >>
>> >> David
>> >>
>> >>
>> >>
>> >> From: computer-go-boun...@computer-go.org
>> >> [mailto:computer-go-boun...@computer-go.org] On Behalf Of René van de
>> >> Veerdonk
>> >> Sent: Sunday, August 23, 2009 1:11 PM
>> >> To: computer-go@computer-go.org
>> >> Subject: [computer-go] Bitmap Go revisited and mockup implementation
>> >>
>> >>
>> >>
>> >> Hi all,
>> >>
>> >>
>> >>
>> >> After years of reading this very interesting list, I decided to make my
>> >> first contribution this week after reading once again about bitmap go.
>> There
>> >> is no freely available implementation of a bitmap based engine that I
>> know
>> >> of, so here is a start. Note that I am not a professional programmer,
>> >> self-taught, and this is a part-time hobby, so do not expect perfect
>> form.
>> >>
>> >>
>> >>
>> >> The attachment is an attempt to create an efficient implementation of a
>> >> bitmap based light playout engine, following mostly the recommendations
>> as
>> >> spelled out by Antoine de Maricourt in his November 2006 message to
>> this
>> >> list. The application is single threaded and achieves around 75-80 kpps
>> on
>> >> my Centrino Duo 2.4 GHz Intel labtop. The mockup was developed in C/C++
>> >> using the Microsoft Visual Studio 2008 IDE for Windows XP (32-bit) and
>> >> undoubtedly suffers from portability issues as a result. The 128-bit
>> SSE
>> >> instructions are called through intrinsic functions, hopefully at least
>> >> these interfaces are the same for other compilers. The single goal of
>> this
>> >> mockup was to do some benchmarking, so there is no gtp-interface, but
>> that
>> >> would be rather easy to add for those skilled in the art. The standard
>> rules
>> >> are used with one exception, the Brown eye-rule is used. Multi-stone
>> suicide
>> >> is explicitly forbidden. No superko checks are performed during
>> playouts,
>> >> and there is no mercy-rule either.
>> >>
>> >>
>> >>
>> >> I would be particularly interested to know if a 64-bit OS will improve
>> the
>> >> performance significantly, as it does have a lot less register
>> pressure.
>> >> Moving to a 64-bit OS will also make some instructions available that
>> allow
>> >> for further simplifications/efficiency improvements in the code. As
>> will
>> >> enabling SSE3 or SSE4.1 instructions.
>> >>
>> >>
>> >>
>> >> One note, the random number generator used is from Agner Fog's public
>> >> library and was not included in the attachment. You will have to
>> provide a
>> >> random number generator yourself or download the appropriate version
>> from
>> >> his well-known website (www.agner.org/optimize)
>> >>
>> >>
>> >>
>> >> I hope this is the start of a discussion about further improvements
>> that
>> >> can be made, but you may use this code as you wish with no restrictions
>> or
>> >> guarantees. See the readme.txt file included in the attachment for more
>> >> details (also partially included below).
>> >>
>> >>
>> >>
>> >> Comments and feedback welcome,
>> >>
>> >>
>> >>
>> >> René van de Veerdonk
>> >>
>> >>
>> >>
>> >> Excerpts from the readme.txt file:
>> >>
>> >>
>> >>
>> >> Introduction:
>> >>
>> >> =============
>> >>
>> >>
>> >>
>> >> Mockup of a high performance implementation of the game of Go using
>> >> bitmaps
>> >>
>> >> as the principle datastructures. This implementation is limited to
>> >> measuring
>> >>
>> >> the performance of the principle components by playing random playouts
>> >> from
>> >>
>> >> an empty 9x9 board, using Chinese scoring, not allowing pass moves
>> until
>> >>
>> >> no other moves are available. Suicide is not allowed, neither for
>> single
>> >>
>> >> or multi-stone groups. No checks are implemented for superko, but
>> simple
>> >>
>> >> ko's are disallowed explicitly.
>> >>
>> >>
>> >>
>> >> Some implementation details:
>> >>
>> >> ============================
>> >>
>> >>
>> >>
>> >> Benchmarking is done in test_board.cpp, where the number of playouts
>> (n)
>> >> and
>> >>
>> >> repeats (j) for each test is set. Benchmarking is done starting from an
>> >> empty
>> >>
>> >> 9x9 board until two (three with ko) passes are played in sequence. The
>> >>
>> >> benchmark playout follow the same scheme as method
>> >> board_t::play_random_game,
>> >>
>> >> but with timing instrumentation added.
>> >>
>> >>
>> >>
>> >> Clock counts (cc) are measured using the cc_time_of macro. Total
>> run-time
>> >>
>> >> measurement relies on the windows library.
>> >>
>> >>
>> >>
>> >> For each move during a playout, the program goes through a sequence of
>> >>
>> >> (1) identify all legal moves (~50cc)
>> >>
>> >>     this follows Brown's standard definition; this definition differs
>> from
>> >>
>> >>     most Monte-Carlo engines and has a different set of pros and cons;
>> >>
>> >>     bitmaps allow all moves to be processed in parallel
>> >>
>> >> (2) pick a uniformly random move from the available choices (~75cc)
>> >>
>> >>     relies on a modulus operation; is not 100% uniform to save speed
>> >>
>> >> (3) playing it on the board (~170cc)
>> >>
>> >>     lengthiest method; but rather straightforward
>> >>
>> >>
>> >>
>> >> Playouts end after 256 moves or whenever two passes are played in
>> >> succession
>> >>
>> >> (three in case of a ko). The routines for each of these basic steps are
>> >>
>> >> contained in the file board.cpp. The underlying datastructures are
>> 128-bit
>> >>
>> >> bitmaps, for which the sse2 instructions used to manipulate them are
>> >> contained
>> >>
>> >> in the file bitmap.cpp.
>> >>
>> >>
>> >>
>> >> Further improvements can be expected from using any 64-bit OS that
>> >> supports
>> >>
>> >> several more assembly instructions, and from using sse4.1 instructions.
>> >>
>> >>
>> >>
>> >> Example output (on a Intel 2.40 GHz Centrino Duo labtop):
>> >>
>> >> =========================================================
>> >>
>> >>
>> >>
>> >> [game] = 30469.7, [moves] = 111.071
>> >>
>> >> [game] = 30245.4, [moves] = 111.068
>> >>
>> >> [game] = 30221.7, [moves] = 111.089
>> >>
>> >> [game] = 30264.8, [moves] = 111.122
>> >>
>> >> [game] = 30205.8, [moves] = 111.101
>> >>
>> >>
>> >>
>> >> [game] = 77.1084 kpps, [moves] = 111.023
>> >>
>> >> [game] = 78.0488 kpps, [moves] = 111.045
>> >>
>> >> [game] = 77.1084 kpps, [moves] = 111.046
>> >>
>> >> [game] = 78.0488 kpps, [moves] = 111.131
>> >>
>> >> [game] = 78.0488 kpps, [moves] = 111.082
>> >>
>> >>
>> >>
>> >> [legal] 110/51, [pick] 110/74, [play] 106/169, [score] 40, [win] 0.4187
>> >>
>> >> [legal] 111/51, [pick] 111/74, [play] 106/168, [score] 42, [win] 0.4201
>> >>
>> >> [legal] 111/52, [pick] 111/74, [play] 106/169, [score] 43, [win] 0.4276
>> >>
>> >> [legal] 111/52, [pick] 111/75, [play] 106/170, [score] 41, [win] 0.4092
>> >>
>> >> [legal] 111/51, [pick] 111/74, [play] 106/171, [score] 45, [win] 0.4221
>> >>
>> >>
>> >>
>> >> finished, press any key to exit
>> >>
>> >>
>> >>
>> >> Explanation of output:
>> >>
>> >> ======================
>> >>
>> >>
>> >>
>> >> There are three sets of repeated (5x) measurements, performed and timed
>> >>
>> >> independently to avoid cross-contamination of inserted serialization
>> >>
>> >> instructions and timing overhead (which still does not quite prevent
>> >>
>> >> reordering of instructions across timing boundaries).
>> >>
>> >>
>> >>
>> >> [game]  cc's per full playout
>> >>
>> >> [moves] average number of moves per playout
>> >>
>> >>
>> >>
>> >> [game]  thousands of playouts per second (kpps)
>> >>
>> >> [moves] average number of moves per playout
>> >>
>> >>
>> >>
>> >> "calls per playout" / "cc's per call"
>> >>
>> >> [legal] identifies all legal moves on the board
>> >>
>> >> [pick]  picks a random move
>> >>
>> >> [play]  processes a move
>> >>
>> >> [score] scores a final position
>> >>
>> >> [win]   winrate for black
>> >>
>> >>
>> >>
>> >> Acknowledgements:
>> >>
>> >> =================
>> >>
>> >>
>> >>
>> >> Antoine de Maricourt, for his very helpful description of his bitmap go
>> >>
>> >>   implementation in detail on the computer-go mailing list:
>> >>
>> >>
>> http://computer-go.org/pipermail/computer-go/2006-November/007136.html
>> >>
>> >> Gunnar Farneback, for providing the well documented Brown engine,
>> >>
>> >>   whoms legal move definition is used in this mockup
>> >>
>> >> Lukasz Lew, for the high performance Libego implementation, which has
>> >> inspired
>> >>
>> >>   various parts of this mockup as well
>> >>
>> >> Agner Fog, for the Mersenne Twister and clock-cycle timing library
>> >>
>> >> Computer-go mailing list, where many ideas are posted and discussed
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> computer-go mailing list
>> >> computer-go@computer-go.org
>> >> http://www.computer-go.org/mailman/listinfo/computer-go/
>> >
>> >
>> > _______________________________________________
>> > computer-go mailing list
>> > computer-go@computer-go.org
>> > http://www.computer-go.org/mailman/listinfo/computer-go/
>> >
>> _______________________________________________
>> computer-go mailing list
>> computer-go@computer-go.org
>> http://www.computer-go.org/mailman/listinfo/computer-go/
>>
>
>
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

Reply via email to