Hi!

I know that RAVE data is typically used during tree traversal.
But is it possible to use it during random playouts as well, in
order to increase playout quality?

At first sight it seems like a dangerous idea, because
RAVE statistics are gathered incrementally from the same
playouts, and this can lead to a problematic positive feedback
loop, as in the saying "the rich get richer and the poor get poorer".
That is, random initial fluctuations can grow stronger over time
and skew the statistics, because good moves that
receive unlucky initial RAVE data will be ignored
in future random playouts.
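
To make the feedback loop concrete, here is a minimal sketch, assuming a
purely greedy playout policy that always picks the move with the best
empirical RAVE win rate. The move names, the true win rates and the helper
functions are all hypothetical and only serve as an illustration.

```python
import random

def greedy_pick(stats):
    # stats maps move -> [rave_wins, rave_playouts];
    # pick the move with the highest empirical win rate, no exploration.
    return max(stats, key=lambda m: stats[m][0] / stats[m][1])

def simulate_greedy(true_rate, stats, n_playouts, rng):
    # Each "playout" is reduced to one biased coin flip (an assumption
    # made to keep the sketch small); the result updates the same
    # statistics that drive the next choice.
    for _ in range(n_playouts):
        move = greedy_pick(stats)
        win = rng.random() < true_rate[move]
        stats[move][0] += int(win)
        stats[move][1] += 1
    return stats

# Hypothetical numbers: "good" is truly better, but its first sample was a loss.
true_rate = {"good": 0.6, "bad": 0.5}
stats = {"good": [0, 1], "bad": [1, 1]}
result = simulate_greedy(true_rate, stats, 1000, random.Random(0))
```

Because the win rate of "bad" can never drop to zero once it has one win,
the greedy rule never revisits "good", so its single unlucky sample is
never corrected.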

But what if we treat move selection during a random playout
as a typical multi-armed bandit problem? Then the algorithm
for selecting the next playout move could be as follows:

1) Select several (say, 4) valid candidate moves for the playout.

2) Choose the next move among them using a multi-armed bandit formula.
We can do this because for each candidate move we
know (a) the number of RAVE wins for this move, (b) the number
of playouts containing this move, and (c) the total number of playouts
(all of these numbers are tied to the current UCT node).
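
The two steps above could be sketched roughly as follows. This is only an
illustration under several assumptions: the bandit formula is taken to be
UCB1-style with a tunable constant c, the candidates in step 1 are drawn
uniformly at random, and the names ucb1_score, pick_playout_move and the
rave dictionary are hypothetical, not taken from any existing engine.

```python
import math
import random

def ucb1_score(wins, playouts, total, c=1.0):
    # UCB1-style value: empirical RAVE win rate plus an exploration bonus.
    # Moves without any RAVE samples are tried first (assumption).
    if playouts == 0:
        return math.inf
    return wins / playouts + c * math.sqrt(math.log(max(total, 1)) / playouts)

def pick_playout_move(legal_moves, rave, total, k=4, c=1.0, rng=random):
    # Step 1: draw up to k candidate moves (uniformly at random here).
    candidates = rng.sample(legal_moves, min(k, len(legal_moves)))

    # Step 2: choose the candidate with the highest bandit score.
    # rave maps move -> (rave_wins, rave_playouts) for the current UCT node;
    # total is the number of playouts through that node.
    def score(move):
        wins, playouts = rave.get(move, (0, 0))
        return ucb1_score(wins, playouts, total, c)

    return max(candidates, key=score)

# Hypothetical statistics for one UCT node.
rave = {"a": (0, 1), "b": (1, 1), "c": (50, 100)}
chosen = pick_playout_move(["a", "b", "c"], rave, total=102, k=3)
```

With these numbers the rarely tried moves "a" and "b" both score above the
well-sampled "c", which is the exploration effect the bandit formula is
meant to provide. The constant c and the candidate count k would need
tuning in a real program.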

I think this should add an element of exploration to next-move
selection and prevent the RAVE statistics from becoming skewed.
I suspect that using RAVE data can improve playout strength
significantly.

Has anybody tried something like this, or is it just a crazy idea?
_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
