Hi Rene,

Our design is fully pipelined, so we are able to simulate multiple games
simultaneously. The way simulations are run on an FPGA and on a CPU is
quite different, so a direct comparison is not easy. If we want to simulate
just one game, the FPGA implementation is not 10x faster. However, if we want
thousands of games simulated for a single board position, then the FPGA is
10x faster. So we are getting 1500k GAMES/sec, but only in the second sense.
The clock rate of our FPGA board is only 125 MHz, so with a better board/chip
we could still get a 10-100x improvement over the current result.
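
As a rough back-of-the-envelope check of the throughput figures in this
thread, here is a small Python sketch. It only restates the arithmetic; it
is not our FPGA code. The function and variable names are made up for
illustration, the 110 moves/playout figure is René's 9x9 estimate applied to
the FPGA as an assumption, and the "games in flight" bound further assumes
that a single game needs at least one clock cycle per move.

```python
import math


def cycles_per_move(clock_hz, playouts_per_sec, moves_per_playout):
    """Average clock cycles available per move at a given playout throughput."""
    return clock_hz / (playouts_per_sec * moves_per_playout)


# Assumption: René's 9x9 estimate, reused for the FPGA numbers as well.
MOVES_PER_PLAYOUT = 110

# CPU figures quoted below: 2.4 GHz, 100k playouts/sec.
cpu = cycles_per_move(2.4e9, 100e3, MOVES_PER_PLAYOUT)

# FPGA figures from this message: 125 MHz board clock, 1500k games/sec.
fpga = cycles_per_move(125e6, 1500e3, MOVES_PER_PLAYOUT)

# Clock cycles that elapse per finished game at that throughput.
fpga_cycles_per_game = 125e6 / 1500e3

# Assumption: one game needs at least one cycle per move. Then this is a
# lower bound on how many games must be in progress at the same time.
min_games_in_flight = math.ceil(MOVES_PER_PLAYOUT / fpga_cycles_per_game)

print(f"CPU:  {cpu:.1f} cycles/move")
print(f"FPGA: {fpga:.2f} cycles/move, {fpga_cycles_per_game:.1f} cycles/game")
print(f"At least {min_games_in_flight} games in flight")
```

Under these assumptions the FPGA figure comes out below one clock cycle per
move, which only makes sense as aggregate throughput over several games in
progress at once, not as the speed of a single game.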

best,
Fuming

On Wed, Jun 16, 2010 at 1:28 AM, René van de Veerdonk <
[email protected]> wrote:

> Fuming,
>
> Could you please explain your approach a little bit? From the numbers you
> quote, this sounds extremely positive, but I have a hard time understanding
> how you achieve them. Taking 100k playouts/sec for 9x9 on my 2.4 GHz laptop
> for my single-threaded, bitmap-based light-playout implementation as an
> example, with 110 moves/playout, this results in a little less than 240
> clock-cycles/move. When I quickly looked up the Cyclone III specification, I
> saw that the clock speed for this FPGA tops out around 240 MHz, yet you
> achieve 15x the throughput, i.e., you are 150x more efficient. This means
> 1.8 clock-cycles/move. Without being able to make use of pipelining inside
> the CPU (someone measured ~2 assembly instructions/clock-cycle for my bitmap
> approach), this leads me to questions. First, are you running a
> single-threaded application, or playing on multiple boards at once? Second,
> are you just replaying moves, or also generating them on the fly (about half
> of the time is spent there in my implementation, more if you include updating
> the data structure to make that possible)? Third, are we using the same
> definitions?
>
> For instance, I would find it much easier to believe that you managed
> 1500k moves/second instead of 1500k playouts/sec (with each
> playout being ~110 moves). 200 clock-cycles/move sounds doable if you can
> avoid branching, memory lookups, or miscellaneous calculations by creating
> fine-grained parallelism in your FPGA code and specializing functions on a
> per-grid-point basis. In a CPU-based application, this results in code bloat
> that will become counterproductive at some stage, may not be feasible in
> all instances, and is more difficult to maintain. For an FPGA-based
> application, however, this sounds entirely possible (not knowing anything
> about FPGAs).
>
> Thanks,
>
> René van de Veerdonk
>
>
> On Sat, Jun 12, 2010 at 10:37 AM, Fuming Wang <[email protected]> wrote:
>
>>
>> Cyclone III
>> 120,000 logic elements
>> Cycle time is linear in the number of moves needed to finish a game, which
>> is approximately linear in the square of the board size.
>>
>> Fuming
>>
>>
>>> - What FPGA? Virtex-6? Spartan-6?
>>> - What size is the core in LUT's?
>>> - Is your cycle time linear in the board size or in the number of
>>> squares (i.e. quadratic in board size)? Or something else?
>>>
>>> --
>>> GCP
>>> _______________________________________________
>>> Computer-go mailing list
>>> [email protected]
>>> http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
>>>
>>
>>
>>
>
>