Based on my analysis, estimating a move's probability of winning by
taking the number of winning simulations (w) and dividing it by the
total number of simulations (n) is actually biased. I tried to break
this e-mail up into sections for easy digestion by the various people
who might read this. The sections are: overview, scope, assumptions,
and derivation. I hope the first two are enough to give a flavor of
what the e-mail is about. The third (assumptions) I expect to spark
some debate. The derivation will probably only be skimmed by most.
OVERVIEW
The generalized probability of winning is (w+alpha)/(n+alpha+beta).
Under the simplest assumption, alpha = beta = 1, the result becomes
(w+1)/(n+2). Notice how, with no simulations, the estimate is a 50/50
chance of the move being good or bad, and how a single simulation
(won or lost) doesn't force the estimate to either extreme.
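For the code-minded, here is a minimal Python sketch of the estimator
(the function name and the example counts are mine, purely for
illustration):

    def estimate_win_prob(w, n, alpha=1.0, beta=1.0):
        """Posterior mean win probability for a move with w winning
        simulations out of n, under a Beta(alpha, beta) prior."""
        return (w + alpha) / (n + alpha + beta)

    print(estimate_win_prob(0, 0))  # 0.5   -- no data, prior mean
    print(estimate_win_prob(1, 1))  # 0.667 -- one win, not 100%
    print(estimate_win_prob(0, 1))  # 0.333 -- one loss, not 0%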
SCOPE
These results are only relevant when doing non-uniform sampling (such
as UCT) and where the sample size is relatively small. Under uniform
sampling, the move with the most wins is still best. With non-uniform
sampling and a large n (and hence, usually, a large w), the extra
alpha and beta terms get washed out and these results become less
relevant, as the quick check below shows.
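A quick numeric check of that wash-out, under the simplest
alpha = beta = 1 assumption (the counts are made up):

    # Small sample: the prior term dominates the estimate.
    w, n = 2, 2
    print(w / n, (w + 1) / (n + 2))  # 1.0 vs 0.75
    # Large sample: the two estimates agree to within roughly 1/n.
    w, n = 6000, 10000
    print(w / n, (w + 1) / (n + 2))  # 0.6 vs 0.59998...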
ASSUMPTIONS
To use these results, you must make some assumption about the
underlying distribution of a move's probability of winning. If nothing
is known a priori, then assuming no probability is any more likely than
any other seems safe (resulting in a uniform distribution, which is
also a beta distribution with alpha = beta = 1). For mature MC bots, I
assume this is just curve fitting historically collected data to a beta
distribution (yielding an alpha and beta). Using the example pictures
at http://en.wikipedia.org/wiki/Beta_distribution and what I remember
from this mailing list, alpha=2 and beta=5 seems to peak in about the
right area (the mode is (alpha-1)/(alpha+beta-2) = 20%). I doubt that's
the correct distribution, but I will have to defer to the MC experts on
this list.
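A sketch of what that curve fit could look like, assuming scipy is
available (the win rates below are made up; a real bot would use rates
logged from its own games):

    from scipy.stats import beta as beta_dist

    historical_rates = [0.12, 0.18, 0.22, 0.25, 0.31, 0.09, 0.27]
    # Fix loc=0 and scale=1 so the fit stays on [0, 1].
    a, b, loc, scale = beta_dist.fit(historical_rates, floc=0, fscale=1)
    print(a, b)                   # fitted alpha and beta
    print((a - 1) / (a + b - 2))  # the mode, when a, b > 1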
DERIVATION
The underlying approach is to calculate the probability that the given
sample came from a move with a particular probability of winning (mu)
and then do the integral of mu*p(mu | w,n) for all mu (0% to 100%),
where p(mu | w,n) is the probability of mu given the observed w and n.
By Bayes' rule, p(mu | w,n) = p(w | mu,n) * p(mu) / p(w | n).
Thankfully, p(w | n) can be written off as a normalization constant:
p(w | n) = integral of p(w | mu,n)*p(mu) for all mu.
p(w | mu,n) can be found in any textbook as proportional to
(mu^w)*(1-mu)^(n-w); the binomial coefficient is constant in mu and
cancels against the normalization. Even when assuming p(mu) = uniform,
calculation of the normalization constant looked daunting. Cheating
with an online integrator led me to the beta distribution.
Notice that the integral of mu*p(mu | w,n) boils down to finding the
mean of p(mu | w,n). The solution for p(mu) = constant becomes finding
the mean of a beta distribution with alpha = w+1 and beta = n-w+1,
which yields the simplest solution given above: (w+1)/(n+2).
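That mean can be checked numerically; a small sketch, assuming scipy
for the quadrature and an arbitrary made-up sample:

    from scipy.integrate import quad

    w, n = 3, 10
    likelihood = lambda mu: mu**w * (1 - mu)**(n - w)

    # Normalization constant p(w | n) under a uniform prior p(mu) = 1.
    norm, _ = quad(likelihood, 0, 1)
    # Posterior mean: integral of mu * p(mu | w, n) over [0, 1].
    mean, _ = quad(lambda mu: mu * likelihood(mu) / norm, 0, 1)

    print(mean)               # ~0.33333
    print((w + 1) / (n + 2))  # 4/12 = 0.33333...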
Assuming a distribution for p(mu) could lead to some very messy
calculus. The best generalized solution I can come up with is when you
assume p(mu) follows a beta distribution, because then
p(w | mu,n)*p(mu) is still proportional to a beta density: the
posterior is a beta distribution with parameters w+alpha and n-w+beta,
whose mean is (w+alpha)/(n+alpha+beta). This yields what I put at the
top of this e-mail.
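That conjugacy claim can also be checked numerically; a sketch
assuming scipy, using my alpha=2, beta=5 guess from above and another
arbitrary sample:

    from scipy.stats import beta as beta_dist

    a0, b0 = 2.0, 5.0  # prior Beta(alpha, beta)
    w, n = 3, 10       # arbitrary sample

    # Likelihood times prior, left unnormalized.
    unnorm = lambda mu: mu**w * (1 - mu)**(n - w) * beta_dist.pdf(mu, a0, b0)
    # Claimed posterior: Beta(w + alpha, n - w + beta).
    post = beta_dist(w + a0, n - w + b0)

    # The ratio should be the same constant at every mu in (0, 1):
    for mu in (0.2, 0.5, 0.8):
        print(unnorm(mu) / post.pdf(mu))

    # And the posterior mean matches (w + alpha)/(n + alpha + beta):
    print(post.mean(), (w + a0) / (n + a0 + b0))  # both 5/17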
If one limits p(mu) to a polynomial, the integral of each term in the
polynomial becomes a small beta-function integral. For integer powers,
the result is a ratio of products of factorials. The factorials cancel
nicely in the mean, but once p(mu) is the sum of multiple beta
distributions, convenient cancellations are not guaranteed. I did not
try to work out any examples since I don't yet know that curve fitting
to a beta distribution is unreasonable.
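Still, here is a small sketch of the polynomial case, assuming scipy
(whose beta function handles the factorial bookkeeping) and made-up
coefficients:

    from scipy.special import beta as B  # Euler beta function

    def poly_prior_mean(coeffs, w, n):
        """Posterior mean when the prior p(mu) is proportional to
        sum_k coeffs[k] * mu**k.  Each term integrates to an Euler
        beta function (a ratio of factorials for integer powers)."""
        num = sum(c * B(w + k + 2, n - w + 1) for k, c in enumerate(coeffs))
        den = sum(c * B(w + k + 1, n - w + 1) for k, c in enumerate(coeffs))
        return num / den

    # Sanity check: a constant prior reproduces (w+1)/(n+2).
    print(poly_prior_mean([1.0], 3, 10))  # 0.33333...
    # A made-up quadratic prior:
    print(poly_prior_mean([1.0, 2.0, 3.0], 3, 10))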
Maybe other simple solutions exist, but I'll leave that up to those
who are even bigger math geeks than I am ;)