We place a lot of faith in RAVE. Our faith is justified 99% of the time. But
when does faith become overconfidence?

 

Perhaps we just need an exploration approach that is more "fail-safe": for
example, an epsilon-greedy strategy applied in states that have received a
lot of exploration but still have some moves that have never been examined.
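
Here is a minimal sketch, in Python, of the kind of fallback I mean. The node
interface (node.moves, node.total_visits, node.visits_of, node.best_by_uct_rave)
and the two thresholds are placeholders for whatever the engine already has,
not anything tuned:

import random

EPSILON = 0.05          # fraction of selections diverted to unexamined moves (arbitrary)
VISIT_THRESHOLD = 1000  # what counts as "a lot of exploration" at this node (arbitrary)

def select_move(node):
    """Normal UCT+RAVE selection, except that in a well-visited node we
    occasionally pick a move that has never been examined."""
    untried = [m for m in node.moves if node.visits_of(m) == 0]
    if node.total_visits >= VISIT_THRESHOLD and untried and random.random() < EPSILON:
        return random.choice(untried)
    return node.best_by_uct_rave()  # whatever selection the engine normally does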

 

The bad case occurs when we think a node is losing, but there is actually a
winning move. So let's wait for the score of the highest-ranked move to fall
below 50% when using UCT+RAVE. Perhaps we wait for the highest UCB to fall
below 50%, if we want to be really conservative. Once a node passes that
threshold, I am open to alternative exploration strategies.
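
As a concrete trigger, something like this (again just a sketch, with win
rates in [0, 1], and "score" meaning whatever blend of UCT and RAVE values
the engine already maximizes; ucb_value and uct_rave_score are assumed
accessors, not real APIs):

LOSING_THRESHOLD = 0.5

def looks_losing(node, conservative=False):
    """True once the best move at this node no longer looks like a win.
    conservative=True waits for even the optimistic UCB value to drop."""
    if conservative:
        best = max(node.ucb_value(m) for m in node.moves)       # highest UCB
    else:
        best = max(node.uct_rave_score(m) for m in node.moves)  # highest blended score
    return best < LOSING_THRESHOLD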

 

The problem is that there are three cases:

 

1) The node is losing.

2) The node is winning, and we have not examined the winning move.

3) The node is winning, and the winning move has been tried, but not deeply
enough to find the win.

 

Now, in the first case we have nothing to lose by exploring alternatives,
since all the moves are losing. In the other two cases, though, there is a
risk/reward trade-off. If the winning move has not been searched at all, then
we can benefit from trying new moves. But if the winning move is being tried,
just not searched deeply enough, then we slow ourselves down by trying new
moves.

 

RAVE balances the risks by strictly following the rewards from experience.
And in most cases this is valid. But it is quite rigid.

 

I am not trying to be theoretically solid, in the way that RAVE is an
estimate of winning percentage and can therefore be used as a proxy in the
UCT formula. Instead, I would like to have a better worst case.

 

To review, the worst-case behavior of UCT is to allocate O(log T) trials to
each move, where T trials have been done at this node. Since log T increases
without limit, eventually this node converges to the game-theoretic value.
However, the worst-case speed of convergence is exponential. And that is just
for one node. In a UCT tree, the worst-case speed of convergence is a tower
of exponentials. So: really, really bad convergence.
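
For reference, the O(log T) figure comes from the standard UCB selection rule
that UCT applies at each node: pick the move i maximizing
X_i + c * sqrt(ln(T) / n_i), where X_i is the move's estimated win rate, n_i
its trial count, and T the total trials at the node. Because the exploration
term keeps growing for any move whose n_i stays put, every move is revisited
forever, but a move that looks bad only collects on the order of log T of
those T trials.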

 

In the case of a node that is believed to be bad anyway, can we change the
policy so that a fraction of effort goes to alternatives?

 

Perhaps a policy of allocating 10% of trials to the least-taken alternative,
after the node is determined to be losing?
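
In code, that policy might look something like this (a sketch only;
looks_losing is the threshold test sketched above, and the 10% is the number
from this paragraph, not something tuned):

import random

ALTERNATIVE_FRACTION = 0.10  # share of trials diverted once the node looks lost

def select_move_with_fallback(node):
    """Standard selection, but once the node is believed to be losing,
    spend a fixed fraction of trials on the least-visited alternative."""
    if looks_losing(node) and random.random() < ALTERNATIVE_FRACTION:
        return min(node.moves, key=node.visits_of)  # least-taken alternative
    return node.best_by_uct_rave()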

 

 
