On 20-nov-08, at 14:42, Magnus Persson wrote:
I think you understand the basic principle of RAVE correctly. The hard part is how to weight the AMAF value (which I call the *virtual win-visit* ratio) together with the actual win-visit ratio. And I have to admit that my implementation of RAVE in Valkyria might be a quite unique interpretation of the somewhat hard-to-understand information that has been posted here. I just implemented something that seemed to work and tested the parameters thoroughly. But I think I need to run new tests, since there have been so many changes. In particular, I should test on large boards.
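To make the weighting question concrete: one simple, generic scheme (a published idea in the spirit of Gelly and Silver's UCT-RAVE, not exactly what Valkyria does) blends the two ratios with a weight beta that starts near 1, so the virtual statistics dominate early, and decays toward 0 as real visits accumulate. Here k is just a hypothetical tuning constant:

    def rave_value(wins, visits, virtual_wins, virtual_visits, k=1000):
        # Blend the real win rate with the virtual (AMAF) win rate.
        beta = k / (k + visits)  # near 1 with few real visits, decays to 0
        real = wins / visits if visits else 0.0
        virtual = virtual_wins / virtual_visits if virtual_visits else 0.0
        return beta * virtual + (1.0 - beta) * real

Other schedules have also been published, e.g. beta = sqrt(k / (3 * visits + k)); finding the right one is exactly the tuning problem I mean.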
Valkyria postpones expansion of new nodes for about 10 visits, and an expansion creates all children of a position at once. This is necessary in order to update virtual win-visit scores for all moves played by the color to move in the current simulation, even if there is only one. Thus a node in the tree is a list of children for a given position. This simplifies the code that biases the tree search, since all the necessary algorithms are executed for all children at once during expansion. This is perhaps not efficient, but it is less prone to bugs.
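In pseudo-Python, the scheme looks roughly like this (the names and bookkeeping are illustrative, not Valkyria's actual code):

    EXPANSION_THRESHOLD = 10  # expansion is postponed about this long

    class Node:
        def __init__(self, virtual_wins=0, virtual_visits=0):
            self.visits = 0                       # real visits
            self.wins = 0                         # real wins
            self.virtual_wins = virtual_wins      # AMAF stats for the move
            self.virtual_visits = virtual_visits  # leading to this node
            self.virtual_stats = {}  # move -> [wins, visits], pre-expansion
            self.children = None     # None until the node is expanded

        def maybe_expand(self, legal_moves):
            # Expand only after enough simulations have passed through;
            # then create *all* children at once, each seeded with the
            # virtual win-visit scores accumulated so far via AMAF.
            if self.children is None and self.visits >= EXPANSION_THRESHOLD:
                self.children = {}
                for move in legal_moves:
                    vw, vv = self.virtual_stats.get(move, (0, 0))
                    self.children[move] = Node(vw, vv)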
The effect of RAVE with a complete expansion is that after a few simulations (let us say 40) we have a rather good sample of virtual win-visit scores for all moves. Also, if pattern biases guide exploration, real win-visit scores have been collected for those moves (often a single move) that are most likely to be the best. Thus the win rate for this position is already close to that of the best move. This is different from pure MC-UCT, where all candidate moves are searched once and the win rate is initially the average over all moves, bad as well as good ones. It takes a long time for MC-UCT to converge to the win rate of the best move in each position.
Simply put, combining virtual and real win-visit values when searching the tree lets us safely ignore the worst moves in all positions; the search basically examines only moves that our pattern heuristics predicted to be good, or moves that were discovered thanks to AMAF.
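Concretely, descending the tree then just means taking the child with the best blended value (reusing rave_value and the Node fields from the sketches above); moves that neither AMAF nor the pattern priors ever credited keep a value near zero and are effectively never searched:

    def select_move(node, k=1000):
        # Pick the move whose child has the best blended win rate.
        return max(node.children.items(),
                   key=lambda mc: rave_value(mc[1].wins, mc[1].visits,
                                             mc[1].virtual_wins,
                                             mc[1].virtual_visits, k))[0]

A real implementation would add an exploration term on top of this.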
The search is very selective when it works, and of course it often misses the best move. But it works in practice because (a) often the second-best move wins as well, and (b) the selective search effectively goes so deep that it corrects itself quickly.
Again, thanks for sharing. So it seems that, by and large, I understood it in a similar fashion to you :-)
What is not exactly clear to me is what you mean by 'postponing expansion'. Let me put it in my own words to see if this is what you mean. When the node you have selected based on the UCT + wins/visits value has no children yet, you first simply run a simulation and record the playout result in the current node, including the AMAF value that you call the 'virtual win-visit ratio'. Only when that has been done a certain number of times (in your case 10) do you create all the children at once, weighting them by the virtual win-visit ratio and possibly also by other move priorities that resulted from 'heavy' playout selection?
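In code, my reading of the pre-expansion phase would be roughly the sketch below (the names are mine, and counting only the first occurrence of each move is the usual AMAF convention, which I am assuming here):

    def update_from_playout(node, playout_moves, won):
        # One simulation through a not-yet-expanded node: update the real
        # statistics, plus virtual (AMAF) statistics for every move our
        # color played anywhere in the playout. Assumes playout_moves[0]
        # was played by the side to move at this node and colors alternate.
        node.visits += 1
        node.wins += won
        seen = set()
        for i, move in enumerate(playout_moves):
            if i % 2 == 0 and move not in seen:  # our color, first time only
                seen.add(move)
                stats = node.virtual_stats.setdefault(move, [0, 0])
                stats[0] += won   # virtual wins
                stats[1] += 1     # virtual visits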
Thinking about it this way, it makes sense. I already had a nagging feeling about the requirement of visiting each move at least once; especially on 19x19 I could see that becoming a burden.
Mark