Re: [computer-go] Justification for c==0

2009-05-01 Thread compgo123
If one sets c=0, one must do something else to balance the width and depth. If
not, one will get very inconsistent results. By the way, I'm not sure c is the
best way to balance the width and depth. This could be an excellent subject
for research.

DL


-----Original Message-----
From: Brian Sheppard 
To: computer-go@computer-go.org
Sent: Tue, 28 Apr 2009 8:31 am
Subject: [computer-go] Justification for c==0



The Mogo team has published that their UCT "exploration coefficient" is zero,
and further states that this is the optimal setting for Mogo. Other studies
have confirmed that finding. Yet the suspicion persists that this result is
somehow related to Mogo's structure, perhaps because it runs massively
parallel or because of some twist in its internals.
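
For concreteness, here is a minimal sketch (in Python; the node layout is
illustrative, not Mogo's actual code) of the UCB1-style selection rule that
the coefficient c scales. With c == 0 the exploration bonus vanishes and
selection is simply greedy on the observed win rate:

    import math

    def ucb_select(children, c):
        # children: list of dicts with 'wins' and 'visits' counts; every
        # child is assumed to have been visited at least once.
        total = sum(ch["visits"] for ch in children)
        def score(ch):
            win_rate = ch["wins"] / ch["visits"]
            # With c == 0 this bonus is zero for every child.
            return win_rate + c * math.sqrt(math.log(total) / ch["visits"])
        return max(children, key=score)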

This post provides theoretical and heuristic justification for c==0. First the
theoretical:

Theorem: In a finite game tree with no cycles, with binary rewards, the UCT
algorithm with c==0 converges (in the absence of computational limitations) to
the game-theoretic optimal policy.

The proof is by induction on the depth of the tree. The base case is one ply
before a terminal state. In the base case, UCT will eventually try a winning
move if one exists. Thereafter, UCT will repeat that move indefinitely because
there is no exploration. It follows that the UCT value of the base case will
converge to the game-theoretic value for both winning and losing states. For
the induction step, assume that we have N > 1 plies remaining. Each trial
produces a node at depth at most N-1. (Note: for this to be true, we have to
count ply depth according to the longest path to a terminal node.) With a
sufficient number of trials, each of those nodes converges to the
game-theoretic value by the induction hypothesis. This implies that if there
is a winning play, it will eventually be discovered.

Note that the "binary rewards" condition is required. Without it, the UCT
policy cannot know that winning is the best possible outcome, so it would have
to explore.

The point of this theorem is that Mogo's setting is safe; its exploration
policy does not prevent it from eventually playing perfectly.

Now, there is no implication in this proof that the c==0 policy is
computationally optimal, or even efficient. But we do have Mogo's experimental
result, so it is worthwhile to speculate whether c==0 should be optimal. Some
heuristic reasoning follows.

If UCT has to choose between trying a move that wins 55% and a move that wins
54%, then why *shouldn't* it try the move that wins more frequently? What we
are trying to do (at an internal node) is to prove that our opponent's last
play was losing, and we would do this most efficiently by sampling the move
that has the highest success rate.

Another angle: at the root of the tree, we will choose the move that has the
largest number of trials. We would like that to be a winning move. From the
theorem above, we know that the values of the moves will converge to either 0
or 1. By spending more effort on the move with higher reward, we provide the
maximum confirmation of the quality of the chosen move. If the reward of that
move starts to drift downward, then it is good that we spent the effort.

Another angle: we can spend time on either move A or move B, with A higher. If
A is winning, then it is a waste of time to search B even one time. So in that
case c==0 is optimal. The harder case is where A is losing: we have spent more
time on A and it has a higher win rate, so we would choose move A unless
something changes. There are two strategies: invest in A to prove that it is
not as good as it looks, or invest in B to prove that it is better than it
seems. With only two move choices, these alternatives are probably about
equal. But what if we had hundreds of alternatives? We would have a hard time
guessing the winning play. So even when move A is losing we might be better
off investing effort to disprove it, which would allow an alternative to rise.

One more thought: suppose that move A wins 56 out of 100 trials, and move B
wins 5 out of 9. Which represents better evidence of superiority? Move A is
more standard deviations over 50%. Does that suggest a new exploration policy?
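
(Checking that arithmetic with a normal approximation: the standard error of a
win rate over n trials is sqrt(0.25/n), so move A is
(0.56 - 0.50) / sqrt(0.25/100) = 0.06 / 0.05 = 1.2 standard deviations above
50%, while move B is (5/9 - 0.50) / sqrt(0.25/9) ≈ 0.056 / 0.167 ≈ 0.33. A is
indeed the stronger evidence even though the two win rates are nearly
identical.)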

OK, so you don't have to worry if you set c==0. It might even be best. Just a
note: in very preliminary experiments, c==0 is not best for Pebbles. If longer
experiments confirm that, I presume it is because Pebbles runs on a very slow
computer and searches only small trees. So your mileage may vary. But if c==0
tests well, then there is no reason not to use it.

Best,

Brian



___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

Re: [computer-go] Monte-Carlo Simulation Balancing

2009-05-01 Thread David Silver

Hi Yamato,


If M and N are the same, is there any reason to run M simulations and
N simulations separately? What happens if you combine them and calculate
V and g in a single loop?


I think it gives the wrong answer to do it in a single loop. Note that  
the simulation outcomes z are used to compute both V and g, so that  
they are quite strongly correlated. In general, if random variables X  
and Y are correlated then E[X]E[Y] != E[XY].


A simple example of this problem would be sampling E[X]E[X] for some  
random variable X. Let's say X is actually just noise with mean zero.  
If you just take one sample x1, then x1*x1 is always +ve, and you  
would estimate E[X]E[X]=E[X^2]>0. But if you take two independent  
samples x1 and x2, then x1*x2 would be +ve half the time and -ve half  
the time, and you would correctly estimate E[X]E[X]=0.
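
A tiny numerical check (my sketch, with standard normal noise standing in for
the simulation outcomes z):

    import random

    random.seed(1)
    n = 100000

    # One loop, reusing each sample for both factors: this estimates
    # E[X^2] = 1, not E[X]E[X] = 0.
    single = sum(x * x for x in (random.gauss(0, 1) for _ in range(n))) / n

    # Independent samples for the two factors: unbiased for E[X]E[X] = 0.
    double = sum(random.gauss(0, 1) * random.gauss(0, 1)
                 for _ in range(n)) / n

    print(single)  # approximately 1.0
    print(double)  # approximately 0.0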


-Dave

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

Re: [computer-go] Justification for c==0

2009-05-01 Thread Olivier Teytaud
> Theorem: In a finite game tree with no cycles, with binary rewards, the UCT
> algorithm with c==0
> converges (in the absence of computational limitations) to the game
> theoretic optimal policy.
>

This is also true with RAVE instead of UCT, if you ensure that RAVE values
are never below some positive constant (this was posted in discussions
between Sylvain and me a long time ago on this mailing list, but never
published, I guess).

By the way:
- I guess there are several codes in which c=0, or c>0 but with no significant
  improvement over c=0. If you want to publish scientific papers, claim that
  c>0; UCT is so widely accepted that your paper is more likely to be accepted
  with c>0 :-)
- Even without parallelization, mogo is better with constant 0.
- On the other hand, and in particular for stochastic games with one player
  (which is industrially quite relevant), I find UCT quite good. Also,
  progressive widening as published by Rémi is efficient in this one-player
  setting (but with different constants). I'm definitely convinced that UCT
  provides a very great improvement, e.g. compared to Bellman's dynamic
  programming, in many cases.
- Another point: with the pattern-based exploration term, we still have an
  exploration term, but it is not at all UCT-like (something like
  Heuristic(move)/log(2+numberOfSimulations)); see the sketch below.
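
A minimal sketch of a selection rule with such a term (the function and field
names are my illustration, not actual mogo code):

    import math

    def select_with_pattern_bonus(children, heuristic):
        # Win rate plus a pattern-based bonus that decays as
        # log(2 + simulations) grows -- not a UCT-style sqrt bonus.
        # Every child is assumed to have at least one simulation.
        def score(ch):
            win_rate = ch["wins"] / ch["visits"]
            return win_rate + heuristic(ch["move"]) / math.log(2 + ch["visits"])
        return max(children, key=score)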

A somewhat philosophical point, related to the fact that UCT is not so
efficient for Go, is that clear, sound algorithms are probably not the main
elements in Go; plenty of parameters (the coefficients of patterns, the
choice of Monte-Carlo player) are necessary and do not follow short and
sound rules. If you want to publish scientific papers, do not say that - you
should preferably hide the details, and show only a small and well-chosen
part of your results, so that people can believe that a small sound idea is
the explanation for the fact that the program works. But, in the particular
case of Go, and somewhat disappointingly, I'm not sure that reality is like
that. From this point of view I prefer nonlinear optimization or applications
of UCT in one-player games - there, sound ideas usually work and outperform
dirty tricks :-)
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/

[computer-go] Incorporating a prior estimate

2009-05-01 Thread Brian Sheppard
In reading Sylvain Gelly's thesis, it seemed that incorporating a prior
estimate of winning percentage is very important to the practical strength of
Mogo.

E.g., with 1 trials, Mogo achieved a 2110 rating on CGOS, whereas my program
attempts to reproduce existing research and is (maybe) rated 1900 with 2 to 3
trials. The use of a prior is an important difference, so I want to understand
it more deeply.
 
Some questions:
 
1) When you create a node, do you initialize
 
number of simulations = C
number of wins = C * PriorEstimate()
 
where C is a constant > 0? In Sylvain's thesis, the optimal C = 50,
suggesting that
incorporating a prior estimate was the equivalent of 50 UCT-RAVE trials.
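
For reference, here is a minimal sketch of the initialization I mean in 1)
(the names are mine, not Mogo's):

    def make_node(move, prior_estimate, c=50):
        # Start the statistics as if C trials had already been played,
        # with C * PriorEstimate() wins among them (C = 50 per the thesis).
        return {
            "move": move,
            "visits": c,
            "wins": c * prior_estimate,
        }

The node's initial win rate is then exactly the prior, and real simulations
gradually override it, with the prior carrying the weight of C trials.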
 
2) Two variations were suggested. In one variation, the prior was incorporated
into the UCT statistics of the node. In the other, the prior was incorporated
into the RAVE statistics. Charts in the thesis do not make clear which was
actually being measured. In some cases it appears to be the UCT version, but
elsewhere it seems to be the RAVE version. Does anyone know what was really
done?
 
3) Elsewhere I have seen information suggesting that Mogo initializes RAVE
statistics to implement progressive widening. Does that conflict with the use
of a prior for RAVE initialization, or is it in addition to it?
 
4) When creating a node, do you estimate the prior for that node, or for that
node's children?
 
Thanks in advance,
Brian
 
___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/