I will set aside Magnus's comments about AMAF and respond directly to your
comments.

If you do one or two simulations from a leaf node and they happen to result in losses, would you never simulate that node again? And never expand it into its child nodes? It is very possible that the winning move will result in a playout loss the first time it is tried.
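To make that concern concrete, here is a minimal sketch (my illustration, with made-up names, not anyone's actual engine code) of pure greedy selection by observed mean. A winning move that loses its single playout is starved for as long as some sibling keeps a positive mean:

    # Hypothetical stats: the winning move lost its one playout.
    moves = {
        "winning_move":  {"wins": 0, "visits": 1},
        "mediocre_move": {"wins": 5, "visits": 10},  # 50% so far
    }

    def select(moves):
        # c == 0: pure argmax over mean win rate, no exploration bonus.
        return max(moves, key=lambda m: moves[m]["wins"] / moves[m]["visits"])

    print(select(moves))  # always "mediocre_move"

This is exactly the scenario the question above describes: with no exploration term, revisiting the once-burned move depends on the sibling's mean being driven back down.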


Brian Sheppard wrote:
The Mogo team has published that their UCT "exploration coefficient" is zero, and further states that this is the optimal setting for Mogo. Other studies have confirmed that finding. Yet the suspicion persists that this result is somehow tied to Mogo's structure, perhaps because it
runs massively parallel or because of some twist in its internals.
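For reference (my notation, stating the standard UCB1-style rule that UCT uses; Mogo's exact formula may differ in constants): at each node, UCT selects the child maximizing

    \[ \frac{w_i}{n_i} + c \sqrt{\frac{\ln N}{n_i}} \]

where w_i and n_i are child i's win and visit counts and N is the parent's visit count. Setting c==0 reduces selection to a pure argmax over observed win rate.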
This post provides theoretical and heuristic justification for c==0. First the theoretical:

Theorem: In a finite game tree with no cycles, with binary rewards, the UCT algorithm with c==0 converges (in the absence of computational limitations) to the game-theoretic optimal policy.

The proof is by induction on the depth of the tree. The base case is one ply before a terminal state. In the base case, UCT will eventually try a winning move if one exists: with binary rewards, a losing move's observed value falls to 0 as soon as it is tried, so selection moves on to the untried alternatives (assuming, as is standard, that unvisited moves are preferred over moves with observed value 0). Thereafter, UCT will repeat the winning move indefinitely, because there is no exploration. It follows that the UCT value of the base case will converge to the game-theoretic value for both winning and losing states.

For the induction step, assume that we have N > 1 plies remaining. Each trial produces a node at depth N-1 at most. (Note: for this to hold, we have to count ply depth according to the longest path to a terminal node.) With a sufficient number of trials, each of those nodes will converge to its game-theoretic value. This implies that if there is a winning play, it will eventually be discovered.

Note that the "binary rewards" condition is required. Without it, the UCT policy cannot know that winning is the best possible outcome, so it would have to explore.

The point of this theorem is that Mogo's policy is safe; its lack of an exploration term does not prevent it from
eventually playing perfectly.
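To make the resulting policy concrete, here is a sketch of c==0 selection at a single node (my code, not Mogo's internals; it assumes the common convention that unvisited children are tried before any visited one, which the base case above quietly relies on):

    import random

    class Node:
        # Minimal UCT node; field names are illustrative.
        def __init__(self, children=()):
            self.children = list(children)
            self.wins = 0
            self.visits = 0

        def mean(self):
            return self.wins / self.visits

    def select_child(node):
        # Unvisited children first (first-play urgency, in effect).
        unvisited = [c for c in node.children if c.visits == 0]
        if unvisited:
            return random.choice(unvisited)
        # c == 0: pure exploitation, no sqrt(ln N / n) bonus term.
        return max(node.children, key=lambda c: c.mean())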
Now, there is no implication in this proof that the c==0 policy is computationally optimal, or even efficient. But we do have Mogo's experimental result, so it is worthwhile to speculate whether
c==0 should be optimal. Some heuristic reasoning follows.
If UCT has to choose between a move that wins 55% of the time and a move that wins 54%, then why *shouldn't* it try the move that wins more frequently? What we are trying to do (at an internal node) is to prove that our opponent's last play was losing, and we do that most efficiently by
sampling the move with the highest success rate.
Another angle: at the root of the tree, we will choose the move that has the largest number of trials. We would like that to be a winning move. From the theorem above, we know that the value of the moves will converge to either 0 or 1. By spending more effort on the move with higher reward, we provide the maximum confirmation of the quality of the chosen move. If the reward of that move starts
to drift downward, then it is good that we spent the effort.
Another angle: we can spend time on either move A or move B, with A currently higher. If A is winning, then it is a waste of time to search B even once, so in that case c==0 is optimal. The harder case is where A is losing: we have spent more time on A and it has a higher win rate, so we would choose move A unless something changes. There are two strategies: invest in A to prove that it is not as good as it looks, or invest in B to prove that it is better than it seems. With only two move choices, these alternatives are probably about equal. But what if we had hundreds of alternatives? We would have a hard time guessing the winning play. So even when move A is losing, we might be better off investing effort to disprove it, which would allow an alternative to rise.

One more thought: suppose that move A wins 56 out of 100 trials, and move B wins 5 out of 9. Which represents better evidence of superiority? Move A is more standard deviations above 50%.
Does that suggest a new exploration policy?
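Putting numbers to that last comparison (the arithmetic here is mine): under a binomial model, the number of standard deviations above 50% is (p - 0.5) / sqrt(0.25 / n):

    import math

    def sd_above_half(wins, trials):
        # Null standard deviation of the win rate for a fair coin: sqrt(0.25/n).
        p = wins / trials
        return (p - 0.5) / math.sqrt(0.25 / trials)

    print(round(sd_above_half(56, 100), 2))  # move A: 1.2
    print(round(sd_above_half(5, 9), 2))     # move B: 0.33

So move A is the stronger evidence, roughly 1.2 sd above 50% versus about 0.33 for move B, even though the raw win rates are nearly identical.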
OK, so you don't have to worry if you set c==0. It might even be best. Just a note: in very preliminary experiments, c==0 is not best for Pebbles. If longer experiments confirm that, I presume it is because Pebbles runs on a very slow computer and searches only small trees. So your mileage may vary.
But if c==0 tests well, then there is no reason not to use it.
Best,
Brian


------------------------------------------------------------------------

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
