I don't quite understand this. Suppose I try move m0, which we will assume
is the only winning move, but it loses on the first simulation. From what
you seem to be saying, I would never try it again?

Also, which move do you try in the 56/100 example:

  A) wins 56/100 = 0.56
  B) wins  5/9   = 0.555...

Are you saying that you should try move A because 0.56 > 0.555...?

- Don




2009/4/28 Brian Sheppard <sheppar...@aol.com>

>  The Mogo team has published that their UCT "exploration coefficient" is
> zero, and further state
> that this is the optimal setting for Mogo. Other studies have confirmed
> that finding. Yet, the
> suspicion persists that this result is somehow related to Mogo's structure,
> perhaps because it
> runs massively parallel or because of some twist in its internals.
>
> This post provides theoretical and heuristic justification for c==0. First
> the theoretical:
>
> Theorem: In a finite game tree with no cycles, with binary rewards, the UCT
> algorithm with c==0 converges (given unlimited computation) to the
> game-theoretic optimal policy.
>
> The proof is by induction on the depth of the tree. The base case is one
> ply before a terminal state.
> In the base case, UCT will eventually try a winning move if one exists.
> Thereafter, UCT will repeat
> that move indefinitely because there is no exploration. It follows that the
> UCT value of the base
> case will converge to the game theoretic value for both winning and losing
> states. For the induction
> step, assume that we have N > 1 plies remaining. Each trial expands a node
> at depth at most N-1. (Note: for this to be true, ply depth must be counted
> along the longest path to a terminal node.) With a sufficient number of
> trials, each of those nodes converges to its game-theoretic value. This
> implies that if there is a winning play, it will eventually be discovered,
> and thereafter repeated just as in the base case.
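The greedy selection the proof relies on is easiest to see in code. The sketch below is my own illustration, not Mogo's or Pebbles' actual implementation; the node representation and names are invented:

```python
import math

def ucb_score(wins, visits, parent_visits, c):
    """UCB1 value of a child; with c == 0 this is just the observed win rate."""
    if visits == 0:
        return float("inf")  # unvisited children are tried first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits, c=0.0):
    """Pick the child maximizing the UCB score. With c == 0 this is pure
    greed: the move with the best observed win rate is replayed until its
    rate drops below another move's, which (with binary rewards) is how a
    winning move eventually gets locked in."""
    return max(children, key=lambda ch: ucb_score(ch["wins"], ch["visits"],
                                                  parent_visits, c))

# Toy statistics: with c == 0 the 5/6 move is chosen outright; with a
# positive c (say c == 2) the less-visited 2/3 move earns an exploration
# bonus large enough to be chosen instead.
kids = [{"name": "m0", "wins": 2, "visits": 3},
        {"name": "m1", "wins": 5, "visits": 6}]
```

The contrast between c == 0 and c > 0 on these toy numbers is the whole difference at issue in this thread.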
>
> Note that the "binary rewards" condition is required. Without it, the UCT
> policy cannot know that
> winning is the best possible outcome, so it would have to explore.
>
> The point of this theorem is that Mogo's setting is safe; a zero exploration
> coefficient does not prevent it from eventually playing perfectly.
>
> Now, there is no implication in this proof that the c==0 policy is
> computationally optimal, or even
> efficient. But we do have Mogo's experimental result, so it is worthwhile
> to speculate whether
> c==0 should be optimal. Some heuristic reasoning follows.
>
> If UCT has to choose between trying a move that wins 55% and a move that
> wins 54%, then why
> *shouldn't* it try the move that wins more frequently? What we are trying
> to do (at an internal node)
> is to prove that our opponent's last play was losing, and we would do this
> most efficiently by
> sampling the move that has the highest success rate.
>
> Another angle: at the root of the tree, we will choose the move that has
> the largest number of trials.
> We would like that to be a winning move. From the theorem above, we know
> that the value of the
> moves will converge to either 0 or 1. By spending more effort on the move
> with higher reward, we
> provide the maximum confirmation of the quality of the chosen move. If the
> reward of that move starts
> to drift downward, then it is good that we spent the effort.
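The root-level rule described here is just an argmax over trial counts. A minimal sketch (the statistics and dict layout are invented for illustration):

```python
def choose_root_move(stats):
    """At the root, play the move with the most trials rather than the best
    raw win rate; under greedy (c == 0) selection the two usually agree,
    because trials pile up on the move whose reward stays highest."""
    return max(stats, key=lambda move: stats[move]["visits"])

# Hypothetical root statistics after a search:
root = {"A": {"visits": 700, "wins": 392},   # 56% win rate, heavily sampled
        "B": {"visits": 300, "wins": 168}}   # 56% win rate, lightly sampled
```

Here both moves score 56%, but A's score rests on far more evidence, so playing the most-visited move is the conservative choice.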
>
> Another angle: we can spend time on either move A or move B, where A
> currently shows the higher win rate. If A is winning, then
> it is a waste of time to search B even one time. So in that case c==0 is
> optimal. The harder case
> is where A is losing: we have spent more time on A and it has a higher win
> rate, so we would
> choose move A unless something changes. There are two strategies: invest
> in A to prove that it is
> not as good as it looks, or invest in B to prove that it is better than it
> seems. With only two move
> choices, these alternatives are probably about equal. But what if we had
> hundreds of alternatives?
> We would have a hard time guessing the winning play. So even when move A
> is losing we might
> be better off investing effort to disprove it, which would allow an
> alternative to rise.
>
> One more thought: Suppose that move A wins 56 out of 100 trials, and move B
> wins 5 out of 9.
> Which represents better evidence of superiority? Move A is more standard
> deviations over 50%.
> Does that suggest a new exploration policy?
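The closing question can be checked numerically with the normal approximation to the binomial. This is a rough sketch (the approximation is shaky at n = 9, and the function name is mine):

```python
import math

def z_over_half(wins, trials):
    """Standard deviations by which the observed win rate exceeds 0.5,
    under the null hypothesis p = 0.5 (normal approximation)."""
    p = wins / trials
    se = math.sqrt(0.5 * 0.5 / trials)   # std error of the mean when p = 0.5
    return (p - 0.5) / se

# Move A: 56/100 -> z = (0.56  - 0.5) / 0.05   = 1.2
# Move B:  5/9   -> z = (0.556 - 0.5) / 0.167 ~= 0.33
```

So move A, despite the nearly identical win rate, is indeed more standard deviations over 50%, which is the asymmetry the post is pointing at.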
>
> OK, so you don't have to worry if you set c==0. It might even be best. Just
> a note: in very preliminary
> experiments, c==0 is not best for Pebbles. If longer experiments confirm
> that, I presume it is because
> Pebbles runs on a very slow computer and searches only small trees. So
> your mileage may vary.
> But if c==0 tests well, then there is no reason not to use it.
>
> Best,
> Brian
>
> _______________________________________________
> computer-go mailing list
> computer-go@computer-go.org
> http://www.computer-go.org/mailman/listinfo/computer-go/
>
