Re: [computer-go] Monte-Carlo Simulation Balancing

David Silver Wed, 29 Apr 2009 13:29:29 -0700

Hi Yamato,

> Could you give us the source code which you used?  Your algorithm is
> too complicated, so it would be very helpful if possible.

Actually I think the source code would be much harder to understand! It is written inside RLGO, and makes use of a substantial existing framework that would take a lot of effort to understand. (On a separate note I am considering making RLGO open source at some point, but I'd prefer to spend some effort cleaning it up before making it public).

But I think maybe Algorithm 1 is much easier than you think:

A: Estimate value V* of every position in a training set, using deep rollouts.

B: Repeat, for each position in the training set
        1. Run M simulations, estimate value of position (call this V)

        2. Run another N simulations, average the value of psi(s,a) over all positions and moves in games that black won (call this g)

        3. Adjust parameters by alpha * (V* - V) * g
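
As a rough illustration, steps 1 to 3 for a single training position might be sketched in Python as below. This is not the RLGO code. The simulate callback, the toy one-move game, and all parameter values are assumptions made up for the example, and dividing the accumulated psi by N is just one reading of "average" in step 2. The toy also assumes black is the player to move.

```python
import math
import random


def balance_step(theta, v_star, simulate, m, n, alpha):
    """One pass of steps 1-3 for a single training position.

    simulate(theta) is a hypothetical callback returning (z, psis):
    z is 1 if black won the simulated game and 0 otherwise, and psis is
    the list of psi(s, a) vectors for the moves played in that game.
    """
    # Step 1: estimate V from M simulations.
    v = sum(simulate(theta)[0] for _ in range(m)) / m

    # Step 2: accumulate psi over the games black won, scaled by 1/N
    # (assumed normalisation).
    g = [0.0] * len(theta)
    for _ in range(n):
        z, psis = simulate(theta)
        if z == 1:
            for psi in psis:
                for i, x in enumerate(psi):
                    g[i] += x / n

    # Step 3: adjust the parameters by alpha * (V* - V) * g.
    return [t + alpha * (v_star - v) * gi for t, gi in zip(theta, g)]


def make_toy_simulator(rng):
    """Toy one-move game (an assumption, not Go): black picks move 0
    (pattern 0, always wins) or move 1 (pattern 1, always loses)."""
    def simulate(theta):
        # Softmax over two moves that each match exactly one pattern.
        p0 = 1.0 / (1.0 + math.exp(theta[1] - theta[0]))
        if rng.random() < p0:
            # psi = observed features [1, 0] minus expected [p0, 1 - p0]
            return 1, [[1.0 - p0, -(1.0 - p0)]]
        # psi = observed features [0, 1] minus expected [p0, 1 - p0]
        return 0, [[-p0, p0]]
    return simulate


simulate = make_toy_simulator(random.Random(0))
theta = [0.0, 0.0]
for _ in range(200):
    theta = balance_step(theta, v_star=0.8, simulate=simulate,
                         m=100, n=100, alpha=1.0)
print(theta)
```

In this toy the probability of choosing the winning move plays the role of V, so repeated updates should push it towards the target V* of 0.8 rather than towards 1.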

The feature vector is the set of patterns you use, with value 1 if a pattern is matched and 0 otherwise. The simulation policy selects actions in proportion to the exponentiated, weighted sum of all matching patterns. For example let's say move a matches patterns 1 and 2, move b matches patterns 1 and 3, and move c matches patterns 2 and 4. Then move a would be selected with probability e^(theta1 + theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 + theta4)). The theta values are the weights on the patterns which we would like to learn. They are the log of the Elo ratings in Remi Coulom's approach.
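
The three-move example above can be checked with a few lines of Python. This is only a sketch: the function name and the theta values are made up for illustration, and subtracting the largest score before exponentiating is a standard numerical-stability trick that does not change the probabilities.

```python
import math


def move_probabilities(move_patterns, theta):
    """Softmax policy over moves.

    move_patterns maps each move to the list of pattern ids it matches;
    theta maps each pattern id to its weight.
    """
    scores = {m: sum(theta[p] for p in pats)
              for m, pats in move_patterns.items()}
    top = max(scores.values())  # subtracted only for numerical stability
    exps = {m: math.exp(s - top) for m, s in scores.items()}
    total = sum(exps.values())
    return {m: e / total for m, e in exps.items()}


# Hypothetical weights for patterns 1 to 4.
theta = {1: 0.5, 2: -0.2, 3: 0.1, 4: 0.0}
# Move a matches patterns 1 and 2, b matches 1 and 3, c matches 2 and 4.
moves = {"a": [1, 2], "b": [1, 3], "c": [2, 4]}

probs = move_probabilities(moves, theta)
print(probs)
```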

The only tricky part is computing the vector psi(s,a). Each component of psi(s,a) corresponds to a particular pattern, and is the difference between the observed feature (i.e. whether the pattern actually occurred after move a in position s) and the expected feature (the average value of the pattern, weighted by the probability of selecting each action).
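
In code, psi(s,a) for one position could be computed roughly as follows. Again this is a sketch under assumptions: the dictionary representation, the function names and the theta values are invented for the example, not taken from RLGO.

```python
import math


def softmax_policy(move_patterns, theta):
    """Probability of each move under the exponentiated pattern weights."""
    scores = {m: sum(theta[p] for p in pats)
              for m, pats in move_patterns.items()}
    top = max(scores.values())  # subtracted only for numerical stability
    exps = {m: math.exp(s - top) for m, s in scores.items()}
    total = sum(exps.values())
    return {m: e / total for m, e in exps.items()}


def psi(move_patterns, theta, chosen):
    """Observed feature minus expected feature, one component per pattern."""
    probs = softmax_policy(move_patterns, theta)
    # Expected feature: probability that the selected move matches the pattern.
    expected = {p: 0.0 for p in theta}
    for m, pats in move_patterns.items():
        for p in pats:
            expected[p] += probs[m]
    # Observed feature: 1 if the chosen move matches the pattern, else 0.
    matched = set(move_patterns[chosen])
    return {p: (1.0 if p in matched else 0.0) - expected[p] for p in theta}


# Hypothetical weights and the three-move example from the text.
theta = {1: 0.5, 2: -0.2, 3: 0.1, 4: 0.0}
moves = {"a": [1, 2], "b": [1, 3], "c": [2, 4]}

psi_a = psi(moves, theta, "a")
print(psi_a)
```

A useful sanity check is that psi averages to zero under the policy itself: summing pi(a) * psi(s,a) over all moves a should give zero in every component.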

It's also very important to be careful about signs and the colour to play - it's easy to make a mistake and follow the gradient in the wrong direction.

Is that any clearer?
-Dave

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
