I reread the original UCT paper, to make sure that I understood why UCT works. The proof has the following outline:
1) Show that the UCB formula allocates O(log T) trials to every move, where T is the number of trials of this node. 2) Show that under conditions of "slow drift" that a process having the property above converges (recursively) to the the true optimum. It seemed to me that after proving the O(log T) property, the exact form of the UCB formula did not enter into the proof. So I expect that other exploration policies that guarantee O(log T) exploration would also be asymptotically optimal. An example of such a policy is the epsilon-greedy policy with exploration rate O(1/T). One of the problems with UCT is that its convergence rate is very slow. In the worst case, convergence for a single node can require exponential time. This is a direct consequence of the O(log T) exploration bound. Note that for a tree of depth D, the convergence bound is a "tower" of exponentials. You can vary the exploration rate by changing to O(log(T) / T), for example, which would explore more aggressively. And I expect that exploring at a rate O(1/sqrt(T)) would also asymptotically converge. The balance that you have to strike is between exploration and exploitation. The choices that a child makes affects its parent (games imitating life :-). Children can slow the flow of information to parents by exploring either too much or too little. There is a domain- dependent design decision to be made here. Note that at the root of the tree there is no parent to worry about. This post has several points: 1) The UCB formula is being used for a side-effect. Alternatives that achieve that side-effect should have comparable properties. 2) You can make design choices in the exploration/exploitation to achieve asymptotic convergence with better worst-case behavior. Best, Brian _______________________________________________ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/