I wish I was smart :(
David Silver wrote:
Hi Remi,
> I understood this. What I find strange is that using -1/1 should be
> equivalent to using 0/1, but your algorithm behaves differently: it
> ignores lost games with 0/1, and uses them with -1/1.
>
> Imagine you add a big constant to z. One million, say. This does not
> change the problem. You get either 1000000 or 1000001 as the outcome of
> a playout. But then, your estimate of the gradient becomes complete noise.
>
> So maybe using -1/1 is better than 0/1? Since your algorithm depends so
> much on the definition of the reward, there must be an optimal way to
> set the reward. Or there must be a better way to define an algorithm
> that would not depend on an offset in the reward.
>
> There is still something wrong that I don't understand. There may be a
> way to quantify the amount of noise in the unbiased gradient estimate,
> and it would depend on the average reward. Probably setting the average
> reward to zero is what would minimize noise in the gradient estimate.
> This is just an intuitive guess.
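
To make the "add one million" point concrete, here is a rough numerical
sketch (not from the paper: the one-parameter logistic policy, the 0.6/0.4
win rates, and every constant below are invented for illustration). The
expected value of z * d/dtheta log pi(a) is unchanged by a constant shift
of z, but the per-playout estimate becomes so noisy that even 200,000
simulated playouts no longer recover it:

    import numpy as np

    rng = np.random.default_rng(0)
    theta = 0.3                        # single policy parameter (made up)
    p = 1.0 / (1.0 + np.exp(-theta))   # probability of playing move 1
    n = 200_000                        # number of simulated playouts

    a = (rng.random(n) < p).astype(float)         # sampled moves, 0 or 1
    win_prob = np.where(a == 1, 0.6, 0.4)         # move 1 wins slightly more often
    z = (rng.random(n) < win_prob).astype(float)  # game outcome in {0, 1}
    score = a - p                                 # d/dtheta log pi(a | theta)

    for name, r in [("{0,1}", z),
                    ("{-1,+1}", 2.0 * z - 1.0),   # same games, shifted and rescaled
                    ("z + 1e6", z + 1e6)]:        # "add one million" to the outcome
        g = r * score                             # per-playout gradient estimate
        print(f"{name:>9}: sample mean {g.mean():+12.4f}   std {g.std():.3g}")

The first two rows point in the same direction (the {-1,+1} row is just
twice as large, hence the step-size scaling mentioned below), while the
shifted row is drowned in noise.
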
Okay, now I understand your point :-) It's a good question - and I think
you're right. In REINFORCE, any baseline can be subtracted from the
reward without affecting the expected gradient, while possibly reducing
its variance. The baseline leading to the best estimate is indeed the
average reward. So it should be the case that {-1,+1} would estimate
the gradient g more efficiently than {0,1}, assuming that we see similar
numbers of black wins and white wins across the training set.
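
In symbols (my own notation, not fixed anywhere in the thread: pi_theta is
the simulation policy being trained, a is a playout drawn from it, z its
outcome, and b any constant baseline), the baseline is harmless because
the score function has zero mean:

    \mathbb{E}_{a \sim \pi_\theta}\!\big[(z - b)\,\nabla_\theta \log \pi_\theta(a)\big]
      = \mathbb{E}\big[z\,\nabla_\theta \log \pi_\theta(a)\big]
        - b \sum_a \pi_\theta(a)\,\nabla_\theta \log \pi_\theta(a)
      = g - b\,\nabla_\theta \sum_a \pi_\theta(a)
      = g - b\,\nabla_\theta 1
      = g
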
So to answer your question, we can safely modify the algorithm to use
(z-b) instead of z, where b is the average reward. This would then make
the {0,1} and {-1,+1} cases equivalent (with appropriate scaling of
step-size). I don't think this would have affected the results we
presented (because all of the learning algorithms converged anyway, at
least approximately, during training) but it could be an important
modification for larger boards.
-Dave
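
For concreteness, here is a minimal sketch of the (z - b) modification
described above, with b maintained as a running average of the observed
outcomes. The function, the step-size alpha, the tracking rate b_rate, and
the data layout are all invented for illustration; this is not code from
the paper.

    import numpy as np

    def reinforce_with_baseline(outcomes, scores, alpha=0.01, b_rate=0.05):
        """Accumulate REINFORCE updates using (z - b) in place of z.

        outcomes -- per-playout rewards z (e.g. 0/1 or -1/+1 game results)
        scores   -- per-playout numpy vectors d/dtheta log pi(playout)
        Returns the total parameter change over the batch.
        """
        delta = np.zeros_like(scores[0], dtype=float)
        b = 0.0                               # running estimate of the average reward
        for z, s in zip(outcomes, scores):
            delta += alpha * (z - b) * s      # baseline-corrected gradient step
            b += b_rate * (z - b)             # track the average reward
        return delta

Once b has settled, feeding in the same playouts coded as {0,1} or as
{-1,+1} gives approximately the same parameter change, provided alpha is
halved in the {-1,+1} case: the outcomes differ by z -> 2z - 1, so the
constant shift is absorbed into the baseline and the factor of two into
the step-size.
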
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/