The Mogo team has published that their UCT "exploration coefficient" is
zero, and further state
that this is the optimal setting for Mogo. Other studies have confirmed that
finding. Yet, the
suspicion persists that this result is somehow related to Mogo's structure,
perhaps because it
runs massively parallel or because of some twist in its internals.
 
This post provides theoretical and heuristic justification for c==0. First
the theoretical:
 
Theorem: In a finite game tree with no cycles, with binary rewards, the UCT
algorithm with c==0
converges (in the absence of computational limitations) to the game
theoretic optimal policy.
 
The proof is by induction on the depth of the tree. The base case is one ply
before a terminal state.
In the base case, UCT will eventually try a winning move if one exists.
Thereafter, UCT will repeat
that move indefinitely because there is no exploration. It follows that the
UCT value of the base
case will converge to the game theoretic value for both winning and losing
states. For the induction
step, assume that we have N > 1 plies remaining. Each trial produces a node
at depth N-1 at most.
(Note: for this to be true, we have to count ply depth according to the
longest path to terminal node.)
With sufficient numbers of trials, each of those nodes will converge to the
game-theoretic value.
This implies that if there is a winning play, it will eventually be
discovered.
 
Note that the "binary rewards" condition is required. Without it, the UCT
policy cannot know that
winning is the best possible outcome, so it would have to explore.
 
The point of this theorem is that Mogo's is safe; its exploration policy
does not prevent it from
eventually playing perfectly.
 
Now, there is no implication in this proof that the c==0 policy is
computationally optimal, or even
efficient. But we do have Mogo's experimental result, so it is worthwhile to
speculate whether
c==0 should be optimal. Some heuristic reasoning follows.
 
If UCT has to choose between trying a move that wins 55% and a move that
wins 54%, then why
*shouldn't* it try the move that wins more frequently? What we are trying to
do (at an internal node)
is to prove that our opponent's last play was losing, and we would do this
most efficiently by
sampling the move that has the highest success.
 
Another angle: at the root of the tree, we will choose the move that has the
largest number of trials.
We would like that to be a winning move. From the theorem above, we know
that the value of the
moves will converge to either 0 or 1. By spending more effort on the move
with higher reward, we
provide the maximum confirmation of the quality of the chosen move. If the
reward of that move starts
to drift downward, then it is good that we spent the effort.
 
Another angle: we can spend time on either move A or move B, with A higher.
If A is winning, then
it is a waste of time to search B even one time. So in that case c==0 is
optimal. The harder case
is where A is losing: we have spent more time on A and it has a higher win
rate, so we would
choose move A unless something changes. There are two strategies: invest in
A to prove that it is
not as good as it looks, or invest in B to prove that it is better than it
seems. With only two move
choices, these alternatives are probably about equal. But what if we had
hundreds of alternatives?
We would have a hard time guessing the winning play. So even when move A is
losing we might
be better off investing effort to disprove it, which would allow an
alternative to rise.
 
One more thought: Suppose that move A wins 56 out of 100 trials, and move B
wins 5 out of 9.
Which represents better evidence of superiority? Move A is more standard
deviations over 50%.
Does that suggest a new exploration policy?
 
OK, so you don't have to worry if you set c==0. It might even be best. Just
a note: in very preliminary
experiments, c==0 is not best for Pebbles. If longer experiments confirm
that, I presume it is because
Pebbles runs on a very slow computer and searches only small trees. So your
mileage may vary.
But if c==0 tests well, then there is no reason not to use it.
 
Best,
Brian
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

Reply via email to