I will try to explain it better:

What puzzled me until recently is how to combine two facts:

a. As Jason says, "A single MC playout corresponds to a Bernoulli
trial with probability p", and therefore p-hat follows a (scaled) binomial
distribution, which converges to the normal as the number of playouts grows.

b. The playout itself is a stochastic process with its own variance, and
that variance depends on the length of the playout.

These two _facts_ seem contradictory, and it is not obvious how to
introduce the variance (which is crucial in any MC, or more generally
any stochastic, evaluator) into our model. The variance of the
stochastic process must not be confused with the error distribution of
a repeated Bernoulli experiment!
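To pin down what fact "a" alone says, here is a minimal Python sketch
(p = 0.7 matches the example below; the sample sizes are arbitrary
choices of mine): each playout is simulated as a plain Bernoulli trial,
and p-hat spreads around p with the binomial standard error
sqrt(p*(1-p)/n).

  import random

  def p_hat(p, n):
      """Estimate p from n playouts, each a Bernoulli(p) trial."""
      wins = sum(1 for _ in range(n) if random.random() < p)
      return wins / n

  p = 0.7
  for n in (10, 100, 10_000):
      estimates = [p_hat(p, n) for _ in range(1_000)]
      mean = sum(estimates) / len(estimates)
      stderr = (p * (1 - p) / n) ** 0.5   # predicted by the binomial model
      print(n, round(mean, 3), round(stderr, 4))

Note that there is no playout variance anywhere in this picture: the
only randomness is the coin flip itself.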

Explanation "a", by itself, does not account for the variance of the
stochastic process. Therefore, I propose a different experiment:

Take a Bernoulli experiment with success probability p + N(0, e),
where N(0, e) is noise with expected value 0 and standard deviation e.
(Of course, clamp p + N(0, e) to the interval [0, 1].)

For very small e, e.g. e = 1/100, the result of the experiment with
p = 0.7 + N(0, e) is almost the same as that of the noiseless experiment
with p = 0.7.

For very big e, the result of p + N(0, e) is the same as the result for
p = 1/2. That's because the noise is larger than 1 in magnitude most of
the time, so after clamping, p + N is 0 or 1 about equally often.
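Here is a sketch of the noisy experiment (the e values are just for
illustration): draw the noise, clamp p + N(0, e) to [0, 1], run the
trial, and watch the empirical win rate drift towards 1/2 as e grows.

  import random

  def noisy_trial(p, e):
      """One Bernoulli trial with success probability
      p + N(0, e), clamped to [0, 1]."""
      q = min(1.0, max(0.0, p + random.gauss(0.0, e)))
      return random.random() < q

  p, trials = 0.7, 100_000
  for e in (0.01, 0.3, 3.0):
      wins = sum(noisy_trial(p, e) for _ in range(trials))
      print(e, wins / trials)   # ~0.70 for small e, close to 0.5 for large e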

In other words: the probability of winning under very noisy conditions
no longer depends much on the initial chances (the probability
conditioned on the move); it depends on the stochastic process (the
playout). And because the playout is balanced, its expected value is
1/2. That's why the larger the variance, the more p is biased
towards 1/2.

Example: Suppose the game ends after a 4-move playout and we are
studying the first move: the probability of a win given a "good" first
move is significantly bigger than the probability after a "bad" one.
But if the playout is 1000 moves long, the probability of winning after
a "good" first move is almost the same as after a "bad" one -> 1/2.

The p + N(0, e) model represents MC much better than the p model.

And then I go one step further: when the playouts are uniformly
random (= each legal move has the same probability of being
selected) and the number of legal moves for each side is balanced,
our MC evaluator measures a very simple ratio attached to each
node: the number of subpaths ending in a win divided by the number
of all subpaths through that node. I call this quantity the W of a node.
W always exists and, if the tree could be searched to its end, it
could be computed directly. Because the search cannot be completed,
it is estimated by the MC playout.
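Here is a sketch on a hypothetical tiny tree (the nested-list encoding
is mine): enumerate every path to get W exactly, then estimate it with
uniformly random playouts.

  import random

  # A hypothetical tiny game tree: internal nodes are lists of
  # children, leaves are 1 (win) or 0 (loss).
  tree = [[1, 0, 0], [1, [1, 1]], 0]

  def count_paths(node):
      """Return (#winning subpaths, #subpaths) below this node."""
      if not isinstance(node, list):        # leaf
          return node, 1
      wins = total = 0
      for child in node:
          w, t = count_paths(child)
          wins, total = wins + w, total + t
      return wins, total

  def playout(node):
      """One uniformly random playout: random child until a leaf."""
      while isinstance(node, list):
          node = random.choice(node)
      return node

  wins, total = count_paths(tree)
  W = wins / total                          # exact: 4/7 here
  est = sum(playout(tree) for _ in range(100_000)) / 100_000
  print(round(W, 3), round(est, 3))         # est -> 4/9, not 4/7

The playout average converges to 4/9 rather than W = 4/7 because a
uniform playout weights each path by the product of its branching
factors, not equally; with balanced branching the two coincide, which
is exactly why the balance condition above matters.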

In other words: a (uniformly random) MC playout is nothing but
_an estimator of W_. And it is a biased estimator, for the reasons
explained above.

Again:

 p-hat (= #wins / #playouts) is an unbiased estimator of p.
 p is an "estimator" of W, biased towards 1/2.
 W always exists (= #winning subpaths / #subpaths).

  (Except in trivial cases, #subpaths is astronomical => W cannot
be computed directly.)

I hope it is clear this time. For me, it makes so much sense that
it becomes obvious when understood, but my way of explaining it
is surely improvable. ;-)

Jacques.
