Oops, Speed drops more that the (per move) estimate below due to the increase in length of the game as well, of course.
René On Mon, Aug 24, 2009 at 5:36 PM, René van de Veerdonk < rene.vandeveerd...@gmail.com> wrote: > Lukasz, > I tested Libego version 0.125 on my labtop, compiled using Visual Studio > 2008 under Windows XP with slight modifications: > > #ifdef NDEBUG section commented out in "testing.h" > #include <time.h> added to "fast_random.cpp" > > From the command-line with "engine -b" I observed 67-71 kpps on the same > labtop as I used to benchmark the 9x9 bitmap mockup achieving 75-80 kpps. > > It seems Libego performs much better (going from 70 to 105 kpps is an > impressive 50% improvement) with other compilers and OS. For sure it was not > developed for MS Visual Studio, so my numbers are not an apples to apples > comparison. But I think it is fair to say that both implementations are at > least in the same ball-park when compiled as is under Visual Studio in > Windows XP. I do not have an explanation for why my results are 50% below > the numbers you quoted other than the difference in compiler (inlining?) and > OS (32 vs 64-bit?). It would be really interesting to have someone else do a > comparison. Given that I was natively working and optimizing in Visual > Studio, I am not expecting a 50% improvement for the bitmap mockup (but > hopefully it will not decrease 50% either). > > As for testing on a 19x19 board, this is not as straightforward for me as > modifying a single parameter in the program. The basic data-type needs to be > changed from a single 128-bit variable to an array of three such elements. > This requires a redefinition of most basic operations, in particular the > shift operations will take some work as these go across array-elements. In > the generated assembly I could already see that there are not enough 128-bit > registers for 9x9 in some functions, and this will only get worse with > arrays. Dereferencing array elements potentially also add overhead. I am > going to give it a try, but do not expect results overnight, as this is a > low intensity hobby for me (note that Antoine's e-mail was in November > 2006). Optimistically, a 3x speed reduction may be expected, but the > additional overhead and and switch-logic could easily push the drop to over > 10x unless branching can be minimized. > > If someone could look over the mockup and suggest ways to reduce the amount > of branching there I would really appreciate it. If someone would eventually > try to port something like this to a GPU, that would also benefit > tremendously from a reduction in branching. I find that I need to be very > conscious about branching during coding, as I have observed a tendency to > put ever more "if", "do" and "while" statements in my code. > > Also, does anybody know if gcc allows the same interface to the > SSE-instructions as I have adopted in bitmap.cpp? > > Thanks, > > René > > On Mon, Aug 24, 2009 at 4:18 PM, Łukasz Lew <lukasz....@gmail.com> wrote: > >> 2009/8/24 René van de Veerdonk <rene.vandeveerd...@gmail.com>: >> > David, >> > Confession: I have not tested 19x19. As you have noted, and others >> before >> > you over the years, a 19x19 board does not fit in one but three 128-bit >> > registers and there would be a rather big penalty as a result, perhaps >> > (likely?) wiping out all of the benefits of bitmaps. Antoine voiced his >> > disappointment about the speed advantage even on 9x9 in the e-mail to >> the >> > list that served as my basis. My hope, as Hideko Kato points out, is in >> > longer registers or perhaps being able to port this to a GPU. A 64-bit >> OS >> > with twice the number registers would also surely help out. >> Nevertheless, I >> > was surprised that I was able to get to almost the same level of speed >> as >> > Libego provides. >> >> Libego provides 105-108 kpps on a single Core2 2.4 GHz on 9x9. >> and 22-23 kpps on 19x19 >> >> Can you test your bitmap implementation on 19x19? >> >> Lukasz >> >> > The goals for this mockup were very humble: (1) to provide a basic >> > implementation to see what it can do (and if I could do it), and (2) >> learn >> > how to work with assembly and/or intrinsic functions (I had never done >> that >> > before). It may not be too hard to try 19x19 and just see what happens. >> I >> > may attempt this as a follow-up. >> > René van de Veerdonk >> > 2009/8/23 David Fotland <fotl...@smart-games.com> >> >> >> >> How much would you lose for 19x19 board? A board representation is not >> >> very interesting unless it scales to 19 line boards. >> >> >> >> >> >> >> >> David >> >> >> >> >> >> >> >> From: computer-go-boun...@computer-go.org >> >> [mailto:computer-go-boun...@computer-go.org] On Behalf Of René van de >> >> Veerdonk >> >> Sent: Sunday, August 23, 2009 1:11 PM >> >> To: computer-go@computer-go.org >> >> Subject: [computer-go] Bitmap Go revisited and mockup implementation >> >> >> >> >> >> >> >> Hi all, >> >> >> >> >> >> >> >> After years of reading this very interesting list, I decided to make my >> >> first contribution this week after reading once again about bitmap go. >> There >> >> is no freely available implementation of a bitmap based engine that I >> know >> >> of, so here is a start. Note that I am not a professional programmer, >> >> self-taught, and this is a part-time hobby, so do not expect perfect >> form. >> >> >> >> >> >> >> >> The attachment is an attempt to create an efficient implementation of a >> >> bitmap based light playout engine, following mostly the recommendations >> as >> >> spelled out by Antoine de Maricourt in his November 2006 message to >> this >> >> list. The application is single threaded and achieves around 75-80 kpps >> on >> >> my Centrino Duo 2.4 GHz Intel labtop. The mockup was developed in C/C++ >> >> using the Microsoft Visual Studio 2008 IDE for Windows XP (32-bit) and >> >> undoubtedly suffers from portability issues as a result. The 128-bit >> SSE >> >> instructions are called through intrinsic functions, hopefully at least >> >> these interfaces are the same for other compilers. The single goal of >> this >> >> mockup was to do some benchmarking, so there is no gtp-interface, but >> that >> >> would be rather easy to add for those skilled in the art. The standard >> rules >> >> are used with one exception, the Brown eye-rule is used. Multi-stone >> suicide >> >> is explicitly forbidden. No superko checks are performed during >> playouts, >> >> and there is no mercy-rule either. >> >> >> >> >> >> >> >> I would be particularly interested to know if a 64-bit OS will improve >> the >> >> performance significantly, as it does have a lot less register >> pressure. >> >> Moving to a 64-bit OS will also make some instructions available that >> allow >> >> for further simplifications/efficiency improvements in the code. As >> will >> >> enabling SSE3 or SSE4.1 instructions. >> >> >> >> >> >> >> >> One note, the random number generator used is from Agner Fog's public >> >> library and was not included in the attachment. You will have to >> provide a >> >> random number generator yourself or download the appropriate version >> from >> >> his well-known website (www.agner.org/optimize) >> >> >> >> >> >> >> >> I hope this is the start of a discussion about further improvements >> that >> >> can be made, but you may use this code as you wish with no restrictions >> or >> >> guarantees. See the readme.txt file included in the attachment for more >> >> details (also partially included below). >> >> >> >> >> >> >> >> Comments and feedback welcome, >> >> >> >> >> >> >> >> René van de Veerdonk >> >> >> >> >> >> >> >> Excerpts from the readme.txt file: >> >> >> >> >> >> >> >> Introduction: >> >> >> >> ============= >> >> >> >> >> >> >> >> Mockup of a high performance implementation of the game of Go using >> >> bitmaps >> >> >> >> as the principle datastructures. This implementation is limited to >> >> measuring >> >> >> >> the performance of the principle components by playing random playouts >> >> from >> >> >> >> an empty 9x9 board, using Chinese scoring, not allowing pass moves >> until >> >> >> >> no other moves are available. Suicide is not allowed, neither for >> single >> >> >> >> or multi-stone groups. No checks are implemented for superko, but >> simple >> >> >> >> ko's are disallowed explicitly. >> >> >> >> >> >> >> >> Some implementation details: >> >> >> >> ============================ >> >> >> >> >> >> >> >> Benchmarking is done in test_board.cpp, where the number of playouts >> (n) >> >> and >> >> >> >> repeats (j) for each test is set. Benchmarking is done starting from an >> >> empty >> >> >> >> 9x9 board until two (three with ko) passes are played in sequence. The >> >> >> >> benchmark playout follow the same scheme as method >> >> board_t::play_random_game, >> >> >> >> but with timing instrumentation added. >> >> >> >> >> >> >> >> Clock counts (cc) are measured using the cc_time_of macro. Total >> run-time >> >> >> >> measurement relies on the windows library. >> >> >> >> >> >> >> >> For each move during a playout, the program goes through a sequence of >> >> >> >> (1) identify all legal moves (~50cc) >> >> >> >> this follows Brown's standard definition; this definition differs >> from >> >> >> >> most Monte-Carlo engines and has a different set of pros and cons; >> >> >> >> bitmaps allow all moves to be processed in parallel >> >> >> >> (2) pick a uniformly random move from the available choices (~75cc) >> >> >> >> relies on a modulus operation; is not 100% uniform to save speed >> >> >> >> (3) playing it on the board (~170cc) >> >> >> >> lengthiest method; but rather straightforward >> >> >> >> >> >> >> >> Playouts end after 256 moves or whenever two passes are played in >> >> succession >> >> >> >> (three in case of a ko). The routines for each of these basic steps are >> >> >> >> contained in the file board.cpp. The underlying datastructures are >> 128-bit >> >> >> >> bitmaps, for which the sse2 instructions used to manipulate them are >> >> contained >> >> >> >> in the file bitmap.cpp. >> >> >> >> >> >> >> >> Further improvements can be expected from using any 64-bit OS that >> >> supports >> >> >> >> several more assembly instructions, and from using sse4.1 instructions. >> >> >> >> >> >> >> >> Example output (on a Intel 2.40 GHz Centrino Duo labtop): >> >> >> >> ========================================================= >> >> >> >> >> >> >> >> [game] = 30469.7, [moves] = 111.071 >> >> >> >> [game] = 30245.4, [moves] = 111.068 >> >> >> >> [game] = 30221.7, [moves] = 111.089 >> >> >> >> [game] = 30264.8, [moves] = 111.122 >> >> >> >> [game] = 30205.8, [moves] = 111.101 >> >> >> >> >> >> >> >> [game] = 77.1084 kpps, [moves] = 111.023 >> >> >> >> [game] = 78.0488 kpps, [moves] = 111.045 >> >> >> >> [game] = 77.1084 kpps, [moves] = 111.046 >> >> >> >> [game] = 78.0488 kpps, [moves] = 111.131 >> >> >> >> [game] = 78.0488 kpps, [moves] = 111.082 >> >> >> >> >> >> >> >> [legal] 110/51, [pick] 110/74, [play] 106/169, [score] 40, [win] 0.4187 >> >> >> >> [legal] 111/51, [pick] 111/74, [play] 106/168, [score] 42, [win] 0.4201 >> >> >> >> [legal] 111/52, [pick] 111/74, [play] 106/169, [score] 43, [win] 0.4276 >> >> >> >> [legal] 111/52, [pick] 111/75, [play] 106/170, [score] 41, [win] 0.4092 >> >> >> >> [legal] 111/51, [pick] 111/74, [play] 106/171, [score] 45, [win] 0.4221 >> >> >> >> >> >> >> >> finished, press any key to exit >> >> >> >> >> >> >> >> Explanation of output: >> >> >> >> ====================== >> >> >> >> >> >> >> >> There are three sets of repeated (5x) measurements, performed and timed >> >> >> >> independently to avoid cross-contamination of inserted serialization >> >> >> >> instructions and timing overhead (which still does not quite prevent >> >> >> >> reordering of instructions across timing boundaries). >> >> >> >> >> >> >> >> [game] cc's per full playout >> >> >> >> [moves] average number of moves per playout >> >> >> >> >> >> >> >> [game] thousands of playouts per second (kpps) >> >> >> >> [moves] average number of moves per playout >> >> >> >> >> >> >> >> "calls per playout" / "cc's per call" >> >> >> >> [legal] identifies all legal moves on the board >> >> >> >> [pick] picks a random move >> >> >> >> [play] processes a move >> >> >> >> [score] scores a final position >> >> >> >> [win] winrate for black >> >> >> >> >> >> >> >> Acknowledgements: >> >> >> >> ================= >> >> >> >> >> >> >> >> Antoine de Maricourt, for his very helpful description of his bitmap go >> >> >> >> implementation in detail on the computer-go mailing list: >> >> >> >> >> http://computer-go.org/pipermail/computer-go/2006-November/007136.html >> >> >> >> Gunnar Farneback, for providing the well documented Brown engine, >> >> >> >> whoms legal move definition is used in this mockup >> >> >> >> Lukasz Lew, for the high performance Libego implementation, which has >> >> inspired >> >> >> >> various parts of this mockup as well >> >> >> >> Agner Fog, for the Mersenne Twister and clock-cycle timing library >> >> >> >> Computer-go mailing list, where many ideas are posted and discussed >> >> >> >> >> >> >> >> >> >> >> >> _______________________________________________ >> >> computer-go mailing list >> >> computer-go@computer-go.org >> >> http://www.computer-go.org/mailman/listinfo/computer-go/ >> > >> > >> > _______________________________________________ >> > computer-go mailing list >> > computer-go@computer-go.org >> > http://www.computer-go.org/mailman/listinfo/computer-go/ >> > >> _______________________________________________ >> computer-go mailing list >> computer-go@computer-go.org >> http://www.computer-go.org/mailman/listinfo/computer-go/ >> > >
_______________________________________________ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/