We place a lot of faith in RAVE. Our faith is justified 99% of the time. But when does
faith become overconfidence? Perhaps we just need an exploration approach that is more "fail-safe": for example, an epsilon-greedy strategy applied in states that have had a lot of exploration but still have some moves that have never been examined.

The bad case occurs when we think a node is losing but there is actually a winning move. So let's wait for the score of the highest-scoring move to fall below 50% under UCT+RAVE. Or, if we want to be really conservative, wait for the highest UCB to fall below 50%. Once a node passes that threshold, I am open to alternative exploration strategies.

The problem is that there are three cases:

1) The node is losing.
2) The node is winning, and we have not examined the winning move.
3) The node is winning, but the winning move has not been tried enough to find the win.

In the first case we have nothing to lose by exploring alternatives, since all moves are losing. In the other two cases, though, there is a risk/reward trade-off. If the winning move has not been searched at all, then we can benefit from trying new moves. But if the winning move is being tried and simply has not been searched deeply enough, then trying new moves only slows us down.

RAVE balances these risks by strictly following the rewards from experience, and in most cases this is valid. But it is quite rigid. I am not trying to be theoretically solid, in the way that RAVE is an estimate of winning percentage and can therefore be used as a proxy in the UCT formula. Instead, I would like a better worst case.

To review: the worst-case behavior of UCT is to allocate O(log T) trials to each move, where T is the number of trials done at this node. Since log T increases without limit, the node eventually converges to the game-theoretic value. However, the worst-case speed of convergence is exponential. And that is just for one node. In a UCT tree, the worst-case speed of convergence is a tower of exponentials. So: really, really bad convergence.
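To make the fail-safe idea concrete, here is a minimal sketch of a selection rule in Python. Everything here is an assumption for illustration: the per-child win/visit bookkeeping, the 50% threshold, the epsilon value, and the omission of the RAVE term from the UCB formula.

```python
import math
import random

def select_child(children, total_visits, epsilon=0.1, threshold=0.5, c=1.0):
    """Pick a child move to sample next.

    `children` is a list of dicts with 'wins' and 'visits' keys
    (a hypothetical representation). Normally we follow plain UCT;
    once the best child's win rate falls below `threshold`, an
    `epsilon` fraction of trials goes to never-examined moves.
    """
    def win_rate(ch):
        return ch['wins'] / ch['visits'] if ch['visits'] else 0.0

    best_rate = max(win_rate(ch) for ch in children)
    untried = [ch for ch in children if ch['visits'] == 0]

    # Fail-safe branch: the node looks lost, so gamble a small
    # fraction of effort on moves that have never been examined.
    if best_rate < threshold and untried and random.random() < epsilon:
        return random.choice(untried)

    # Otherwise, standard UCT (RAVE mixing omitted for brevity).
    def ucb(ch):
        if ch['visits'] == 0:
            return float('inf')  # plain UCT prioritizes untried moves
        return win_rate(ch) + c * math.sqrt(
            math.log(total_visits) / ch['visits'])

    return max(children, key=ucb)
```

The point of the extra branch is only to change the worst case: once the node is judged losing, unexamined moves get a guaranteed slice of trials instead of waiting on the log T exploration term.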
In the case of a node that is believed to be bad anyway, can we change the policy so that a fraction of effort goes to alternatives? Perhaps a policy of allocating 10% of trials to the least-taken alternative, after the node is determined to be losing?
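That allocation policy might look something like the following sketch. The 10% fraction, the 50% "believed losing" test, and the (wins, visits) layout are all just the assumptions from the paragraph above, not tuned values.

```python
import random

def select_with_floor(stats, fraction=0.10, losing_threshold=0.5):
    """stats: list of (wins, visits) pairs, one per move
    (hypothetical layout). Returns the index of the chosen move.

    Once the node is judged losing, route `fraction` of trials to
    the least-taken move; all other trials still go to the move
    with the best empirical win rate.
    """
    rates = [w / v if v else 0.0 for w, v in stats]
    best = max(range(len(stats)), key=lambda i: rates[i])

    node_is_losing = rates[best] < losing_threshold
    if node_is_losing and random.random() < fraction:
        # Least-taken alternative: nothing to lose if every move
        # loses, and a chance to stumble on an unexamined win.
        return min(range(len(stats)), key=lambda i: stats[i][1])
    return best
```

Note the floor only activates in the "believed losing" regime, so a healthy node pays no exploration tax at all; the cost is bounded at roughly 10% of trials, and only in the case where, per case 1 above, those trials were likely wasted anyway.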
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/