On 20-nov-08, at 14:42, Magnus Persson wrote:

I think you understand the basic principle of RAVE correctly. The hard part is how to weight the AMAF value (which I call the *virtual win-visit* ratio) together with the actual win-visit ratio. And I have to admit that my implementation of RAVE in Valkyria might be a quite unique interpretation of the somewhat hard-to-understand information that has been posted here. I just implemented something that seemed to work and tested the parameters thoroughly. But I think I need to do new tests, since there have been so many changes. In particular I should test on large boards.
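
For concreteness, one published way of blending the two values (Gelly & Silver, 2007) looks roughly like the Python sketch below; my actual weighting in Valkyria is different, and the "equivalence" parameter k here is just an illustrative number that would have to be tuned:

    import math

    def rave_value(wins, visits, virtual_wins, virtual_visits, k=1000):
        # Real and virtual (AMAF) win rates; 0.5 as a neutral prior.
        real = wins / visits if visits else 0.5
        virtual = virtual_wins / virtual_visits if virtual_visits else 0.5
        # Weight of the virtual estimate decays as real visits accumulate.
        beta = math.sqrt(k / (3 * visits + k))
        return beta * virtual + (1 - beta) * real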

Valkyria postpones expansion of new nodes by about 10 visits, and an expansion includes all children of a position. This is necessary in order to update virtual win-visit scores for all moves of that color played in the current simulation, even if there is only one. Thus a node in the tree is a list of children for a given position. This simplifies the code that biases the tree search, since all the necessary algorithms are executed simultaneously for all children during expansion. This is perhaps not efficient, but it is probably less prone to bugs.
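
In rough Python the structure could look like the sketch below; the names and the position interface (play, legal_moves) are mine, purely for illustration:

    EXPAND_THRESHOLD = 10  # visits before a leaf is expanded

    class Node:
        def __init__(self, move, position):
            self.move = move            # move that led to this node
            self.position = position
            self.visits = 0             # real visits
            self.wins = 0               # real wins
            self.virtual_visits = 0     # AMAF ("virtual") visits
            self.virtual_wins = 0       # AMAF ("virtual") wins
            self.children = None        # stays None until expansion

        def maybe_expand(self):
            # All children are created at once, so pattern biases and
            # virtual win-visit priors can be applied in a single pass.
            if self.children is None and self.visits >= EXPAND_THRESHOLD:
                self.children = [Node(m, self.position.play(m))
                                 for m in self.position.legal_moves()]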

The effect of RAVE with a complete expansion is that after a few simulations (let us say 40) we have a rather good sample of virtual win-visit scores for all moves. Also, if patterns bias exploration, real win-visit scores have been collected for those moves (often a single move) that are most likely to be the best. Thus the win rate for this position is already close to that of the best move. This is different from pure MC-UCT, where all candidate moves are searched once and the win rate initially is the average over all moves, bad as well as good ones. It takes a long time until MC-UCT converges to the win rate of the best move in each position.
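
The virtual update after each playout would then be something like this sketch (my names; the playout is assumed to return the list of (colour, move) pairs it played):

    def update_virtual(node, playout_moves, colour_to_move, won):
        # Every expanded child whose move was later played by the same
        # colour gets AMAF credit, even if it was never selected in the tree.
        played = {mv for (colour, mv) in playout_moves
                  if colour == colour_to_move}
        for child in node.children or []:
            if child.move in played:
                child.virtual_visits += 1
                if won:
                    child.virtual_wins += 1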

Simply put, combining virtual and real win-visit values when searching the tree lets us safely ignore the worst moves in all positions; the search basically only considers moves that our pattern heuristics predicted to be good, or moves that were discovered thanks to AMAF.
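
As a sketch, the descent can then rank children by the blended value from the earlier rave_value sketch plus a small UCT-style exploration term (the constant is illustrative, not Valkyria's):

    import math

    def select_child(node, c=0.2):
        log_total = math.log(sum(ch.visits for ch in node.children) + 1)
        def score(ch):
            exploit = rave_value(ch.wins, ch.visits,
                                 ch.virtual_wins, ch.virtual_visits)
            explore = c * math.sqrt(log_total / (ch.visits + 1))
            return exploit + explore
        return max(node.children, key=score)

Because the virtual visits give every child a meaningful score right away, even moves with zero real visits can be ranked, and the worst ones are simply never searched.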

The search is very selective when it works, and of course it often misses the best move. But it works well enough in practice because a) often the second-best move wins as well, and b) the selective search effectively searches so deep that it corrects itself quickly.


Again, thanks for sharing. So it seems that, by and large, I understood it in a similar fashion to you :-)

What is not exactly clear to me is what you mean by 'postponing expansion'. Let me put it in my own words to see if that's what you mean. When you have selected a best node based on the UCT + wins/visits value and it has no children yet, you first simply do a simulation and collect the playout result in the current node, including the AMAF value that you call the 'virtual win-visit ratio'. Only when that has been done a certain number of times (in your case 10) do you suddenly create all the children, weight them based on the virtual win-visit ratio, and possibly also weight them based on other move priorities that resulted from 'heavy' playout selection?

When thinking about it this way it makes sense. I already had a nagging feeling about the requirement of visiting each move at least once. Especially on 19x19 I could see that becoming a burden.

Mark

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
