Hi! I know that RAVE data is typically used during tree traversal. But is it possible to use it during the random playout as well, in order to improve playout quality?
At first sight it seems a dangerous idea, because the RAVE statistics are gathered incrementally from those same playouts, and this can lead to a problematic positive feedback loop, as in the saying "The rich get richer and the poor get poorer". That is, random initial fluctuations can get amplified over time and the statistics become skewed, because good moves that receive unlucky initial RAVE data will be ignored in future random playouts.

But what if we treat move selection during the random playout as a typical multi-armed bandit problem? Then the algorithm for selecting the next playout move could be as follows:

1) Select several (say, 4) valid candidate moves for the playout.
2) Choose the next move among them using a multi-armed bandit formula.

We can do this because for each candidate move we know (a) the number of RAVE wins for this move, (b) the number of playouts containing this move, and (c) the total number of playouts (all of these numbers are tied to the current UCT node).

I think this should add an element of exploration to next-move selection and prevent skewing of the RAVE statistics. I suspect using RAVE data can improve playout strength significantly. Has anybody tried something like this, or is it just a crazy idea?
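
To make the two steps concrete, here is a minimal sketch in Python. It assumes UCB1 as the bandit formula (I have not settled on a particular one), that candidates are drawn uniformly at random, and that the RAVE statistics of the current UCT node are available as a dictionary mapping a move to a (wins, visits) pair. The names ucb1_score, pick_playout_move and rave_stats and the exploration constant c are made up for illustration and are not taken from any existing program.

```python
import math
import random


def ucb1_score(rave_wins, rave_visits, total_playouts, c=1.0):
    """UCB1 value of one candidate move (assumed bandit formula).

    rave_wins      -- (a) number of RAVE wins for the move
    rave_visits    -- (b) number of playouts containing the move
    total_playouts -- (c) total number of playouts through the current UCT node
    c              -- exploration constant (hypothetical tuning parameter)
    """
    if rave_visits == 0:
        # A move with no data yet is always worth trying once.
        return math.inf
    mean = rave_wins / rave_visits
    # max(..., 1) keeps log() defined even if the node counter is still 0.
    bonus = c * math.sqrt(math.log(max(total_playouts, 1)) / rave_visits)
    return mean + bonus


def pick_playout_move(legal_moves, rave_stats, total_playouts,
                      k=4, c=1.0, rng=random):
    """Step 1: sample up to k candidates. Step 2: take the best UCB1 score.

    rave_stats maps move -> (wins, visits) for the current UCT node;
    moves missing from it are treated as unvisited.
    """
    candidates = rng.sample(legal_moves, min(k, len(legal_moves)))
    return max(
        candidates,
        key=lambda m: ucb1_score(*rave_stats.get(m, (0, 0)), total_playouts, c),
    )


if __name__ == "__main__":
    stats = {"D4": (9, 10), "Q16": (1, 10), "K10": (5, 10)}
    print(pick_playout_move(["D4", "Q16", "K10", "C3"], stats, total_playouts=30))
```

The exploration bonus grows for moves that have been tried rarely, so a good move with unlucky early RAVE data still gets picked again from time to time. Whether this is enough to break the feedback loop in practice is exactly what I am unsure about.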
_______________________________________________ Computer-go mailing list [email protected] http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
