I don't quite understand this. Suppose I try move m0, which we will assume is the only winning move, and on the first simulation it turns out to lose. From what you seem to be saying, I would never try it again?
Also, which move do you try in the 56/100 example:

  A) wins 56/100 = 0.560
  B) wins 5/9   = 0.555...

Are you saying that you should try move A because 0.560 > 0.555...?

- Don   2009/4/28

Brian Sheppard <sheppar...@aol.com> wrote:

> The Mogo team has published that their UCT "exploration coefficient" is
> zero, and further state that this is the optimal setting for Mogo. Other
> studies have confirmed that finding. Yet, the suspicion persists that this
> result is somehow related to Mogo's structure, perhaps because it runs
> massively parallel or because of some twist in its internals.
>
> This post provides theoretical and heuristic justification for c==0. First
> the theoretical:
>
> Theorem: In a finite game tree with no cycles, with binary rewards, the
> UCT algorithm with c==0 converges (in the absence of computational
> limitations) to the game-theoretic optimal policy.
>
> The proof is by induction on the depth of the tree. The base case is one
> ply before a terminal state. In the base case, UCT will eventually try a
> winning move if one exists. Thereafter, UCT will repeat that move
> indefinitely because there is no exploration. It follows that the UCT
> value of the base case will converge to the game-theoretic value for both
> winning and losing states. For the induction step, assume that we have
> N > 1 plies remaining. Each trial produces a node at depth N-1 at most.
> (Note: for this to be true, we have to count ply depth according to the
> longest path to a terminal node.) With sufficient numbers of trials, each
> of those nodes will converge to the game-theoretic value. This implies
> that if there is a winning play, it will eventually be discovered.
>
> Note that the "binary rewards" condition is required. Without it, the UCT
> policy cannot know that winning is the best possible outcome, so it would
> have to explore.
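For readers unfamiliar with the "exploration coefficient" being discussed, here is a minimal sketch of the standard UCB1 child-selection rule used in UCT (the function names and data layout are illustrative, not Mogo's or Pebbles' actual code). It shows that with c == 0 the exploration bonus vanishes and selection becomes greedy on win rate:

```python
import math

def ucb1_score(wins, visits, parent_visits, c):
    """UCB1 value of a child: mean reward plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # untried children are expanded first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits, c):
    """Pick the child maximizing UCB1; with c == 0 this is pure exploitation."""
    return max(children, key=lambda ch: ucb1_score(ch["wins"],
                                                   ch["visits"],
                                                   parent_visits, c))

children = [
    {"name": "A", "wins": 56, "visits": 100},  # win rate 0.560
    {"name": "B", "wins": 5,  "visits": 9},    # win rate 0.555...
]

# With c == 0 the bonus term is zero, so the higher-win-rate move is chosen.
print(select_child(children, 109, 0.0)["name"])  # prints "A"
# With c == 1 the rarely visited move gets a large bonus and is explored.
print(select_child(children, 109, 1.0)["name"])  # prints "B"
```

Note that with c == 0, move B would only be revisited if repeated trials dragged A's win rate below B's, which is the behavior Don's question above is probing.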
> The point of this theorem is that Mogo is safe; its exploration policy
> does not prevent it from eventually playing perfectly.
>
> Now, there is no implication in this proof that the c==0 policy is
> computationally optimal, or even efficient. But we do have Mogo's
> experimental result, so it is worthwhile to speculate whether c==0
> should be optimal. Some heuristic reasoning follows.
>
> If UCT has to choose between trying a move that wins 55% and a move that
> wins 54%, then why *shouldn't* it try the move that wins more frequently?
> What we are trying to do (at an internal node) is to prove that our
> opponent's last play was losing, and we would do this most efficiently
> by sampling the move that has the highest success rate.
>
> Another angle: at the root of the tree, we will choose the move that has
> the largest number of trials. We would like that to be a winning move.
> From the theorem above, we know that the values of the moves will
> converge to either 0 or 1. By spending more effort on the move with the
> higher reward, we provide the maximum confirmation of the quality of the
> chosen move. If the reward of that move starts to drift downward, then
> it is good that we spent the effort.
>
> Another angle: we can spend time on either move A or move B, with A
> higher. If A is winning, then it is a waste of time to search B even one
> time. So in that case c==0 is optimal. The harder case is where A is
> losing: we have spent more time on A and it has a higher win rate, so we
> would choose move A unless something changes. There are two strategies:
> invest in A to prove that it is not as good as it looks, or invest in B
> to prove that it is better than it seems. With only two move choices,
> these alternatives are probably about equal. But what if we had hundreds
> of alternatives? We would have a hard time guessing the winning play.
> So even when move A is losing, we might be better off investing effort
> to disprove it, which would allow an alternative to rise.
>
> One more thought: Suppose that move A wins 56 out of 100 trials, and
> move B wins 5 out of 9. Which represents better evidence of superiority?
> Move A is more standard deviations over 50%. Does that suggest a new
> exploration policy?
>
> OK, so you don't have to worry if you set c==0. It might even be best.
> Just a note: in very preliminary experiments, c==0 is not best for
> Pebbles. If longer experiments confirm that, I presume it is because
> Pebbles runs on a very slow computer and searches only small trees. So
> your mileage may vary. But if c==0 tests well, then there is no reason
> not to use it.
>
> Best,
> Brian
>
> _______________________________________________
> computer-go mailing list
> computer-go@computer-go.org
> http://www.computer-go.org/mailman/listinfo/computer-go/
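Brian's "standard deviations over 50%" comparison can be checked with a quick normal-approximation z-score against a null win rate of 0.5 (the function name is mine, and this is only a rough binomial approximation, not a proposed exploration policy):

```python
import math

def z_over_half(wins, trials):
    """Standard deviations by which the observed win rate exceeds 50%,
    using the binomial standard error under the null p0 = 0.5."""
    p = wins / trials
    se = math.sqrt(0.25 / trials)  # sqrt(p0 * (1 - p0) / n) with p0 = 0.5
    return (p - 0.5) / se

print(round(z_over_half(56, 100), 2))  # move A: 1.2
print(round(z_over_half(5, 9), 2))     # move B: 0.33
```

So although B has the slightly higher raw win rate, A's 56/100 sits 1.2 standard deviations above 50% versus about 0.33 for B's 5/9, which is the sense in which A is the stronger evidence of superiority.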