I wish I was smart :(
David Silver wrote:
Hi Remi,
> I understood this. What I find strange is that using -1/1 should be
> equivalent to using 0/1, but your algorithm behaves differently: it
> ignores lost games with 0/1, and uses them with -1/1.
>
> Imagine you add a big constant to z. One million, say. This does not
> change the problem. You get either 1000000 or 1000001 as the outcome of
> a playout. But then, your estimate of the gradient becomes complete noise.
>
> So maybe using -1/1 is better than 0/1? Since your algorithm depends so
> much on the definition of the reward, there must be an optimal way to
> set the reward. Or there must be a better way to define an algorithm
> that would not depend on an offset in the reward.
>
> There is still something wrong that I don't understand. There may be a
> way to quantify the amount of noise in the unbiased gradient estimate,
> and it would depend on the average reward. Probably setting the average
> reward to zero is what would minimize noise in the gradient estimate.
> This is just an intuitive guess.
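
To make the "add one million" point concrete, here is a rough numerical
sketch (not from the paper: the one-parameter logistic policy, the 0.6/0.4
win rates, and every constant below are invented for illustration). The
expected value of z * d/dtheta log pi(a) is unchanged by a constant shift
of z, but the per-playout estimate becomes so noisy that even 200,000
simulated playouts no longer recover it:

    import numpy as np

    rng = np.random.default_rng(0)
    theta = 0.3                        # single policy parameter (made up)
    p = 1.0 / (1.0 + np.exp(-theta))   # probability of playing move 1
    n = 200_000                        # number of simulated playouts

    a = (rng.random(n) < p).astype(float)         # sampled moves, 0 or 1
    win_prob = np.where(a == 1, 0.6, 0.4)         # move 1 wins slightly more often
    z = (rng.random(n) < win_prob).astype(float)  # game outcome in {0, 1}
    score = a - p                                 # d/dtheta log pi(a | theta)

    for name, r in [("{0,1}", z),
                    ("{-1,+1}", 2.0 * z - 1.0),   # same games, shifted and rescaled
                    ("z + 1e6", z + 1e6)]:        # "add one million" to the outcome
        g = r * score                             # per-playout gradient estimate
        print(f"{name:>9}: sample mean {g.mean():+12.4f}   std {g.std():.3g}")

The first two rows point in the same direction (the {-1,+1} row is just
twice as large, hence the step-size scaling mentioned below), while the
shifted row is drowned in noise.
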
Okay, now I understand your point :-) It's a good question - and I think
you're right. In REINFORCE, any baseline can be subtracted from the
reward without affecting the expected gradient, while possibly reducing
its variance. The baseline leading to the best estimate is indeed the
average reward. So it should be the case that {-1,+1} would estimate
the gradient g more efficiently than {0,1}, assuming that we see similar
numbers of black wins and white wins across the training set.
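
In symbols (my own notation, not fixed anywhere in the thread: pi_theta is
the simulation policy being trained, a is a playout drawn from it, z its
outcome, and b any constant baseline), the baseline is harmless because
the score function has zero mean:

    \mathbb{E}_{a \sim \pi_\theta}\!\big[(z - b)\,\nabla_\theta \log \pi_\theta(a)\big]
      = \mathbb{E}\big[z\,\nabla_\theta \log \pi_\theta(a)\big]
        - b \sum_a \pi_\theta(a)\,\nabla_\theta \log \pi_\theta(a)
      = g - b\,\nabla_\theta \sum_a \pi_\theta(a)
      = g - b\,\nabla_\theta 1
      = g
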
So to answer your question, we can safely modify the algorithm to use
(z-b) instead of z, where b is the average reward. This would then make
the {0,1} and {-1,+1} cases equivalent (with appropriate scaling of
step-size). I don't think this would have affected the results we
presented (because all of the learning algorithms converged anyway, at
least approximately, during training) but it could be an important
modification for larger boards.
-Dave
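
For concreteness, here is a minimal sketch of the (z - b) modification
described above, with b maintained as a running average of the observed
outcomes. The function, the step-size alpha, the tracking rate b_rate, and
the data layout are all invented for illustration; this is not code from
the paper.

    import numpy as np

    def reinforce_with_baseline(outcomes, scores, alpha=0.01, b_rate=0.05):
        """Accumulate REINFORCE updates using (z - b) in place of z.

        outcomes -- per-playout rewards z (e.g. 0/1 or -1/+1 game results)
        scores   -- per-playout numpy vectors d/dtheta log pi(playout)
        Returns the total parameter change over the batch.
        """
        delta = np.zeros_like(scores[0], dtype=float)
        b = 0.0                               # running estimate of the average reward
        for z, s in zip(outcomes, scores):
            delta += alpha * (z - b) * s      # baseline-corrected gradient step
            b += b_rate * (z - b)             # track the average reward
        return delta

Once b has settled, feeding in the same playouts coded as {0,1} or as
{-1,+1} gives approximately the same parameter change, provided alpha is
halved in the {-1,+1} case: the outcomes differ by z -> 2z - 1, so the
constant shift is absorbed into the baseline and the factor of two into
the step-size.
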
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/