Hi Yamato,
> Could you give us the source code which you used? Your algorithm is
> too complicated, so it would be very helpful if possible.
Actually I think the source code would be much harder to understand!
It is written inside RLGO, and makes use of a substantial existing
framework that would take a lot of effort to understand. (On a
separate note, I am considering making RLGO open source at some point,
but I'd prefer to spend some effort cleaning it up before making it
public.)
But Algorithm 1 may be much easier than you think:
A: Estimate the value V* of every position in a training set, using deep
   rollouts.
B: Repeat, for each position in the training set:
   1. Run M simulations and estimate the value of the position (call this V).
   2. Run another N simulations and average the value of psi(s,a) over all
      positions and moves in games that black won (call this g).
   3. Adjust the parameters by alpha * (V* - V) * g.
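
As a rough illustration, loop B might look like the Python sketch below. This is not the RLGO code. The one-move toy "game", the function names, and the numbers are all hypothetical. The policy and psi follow the descriptions later in this message. It assumes black is to move everywhere and that g is normalised by N, so lost games contribute zero. V* is simply supplied as an input, standing in for step A.

```python
import math
import random

def softmax_policy(theta, move_patterns):
    """Probability of each move: exp(sum of matching pattern weights), normalised."""
    scores = [math.exp(sum(theta[p] for p in pats)) for pats in move_patterns]
    total = sum(scores)
    return [s / total for s in scores]

def psi(theta, move_patterns, chosen):
    """Observed feature vector minus the policy-weighted expected feature vector."""
    probs = softmax_policy(theta, move_patterns)
    out = [0.0] * len(theta)
    for p in move_patterns[chosen]:
        out[p] += 1.0
    for prob, pats in zip(probs, move_patterns):
        for p in pats:
            out[p] -= prob
    return out

def simulate(theta, position, rng):
    """Hypothetical one-move game: returns (z, psi vectors), z = 1 if black won."""
    move_patterns, win_prob = position
    probs = softmax_policy(theta, move_patterns)
    chosen = rng.choices(range(len(probs)), weights=probs)[0]
    z = 1.0 if rng.random() < win_prob[chosen] else 0.0
    return z, [psi(theta, move_patterns, chosen)]

def balance(theta, training_set, alpha, M, N, sweeps, rng):
    """training_set holds (position, v_star) pairs; v_star comes from step A."""
    for _ in range(sweeps):
        for position, v_star in training_set:
            # Step 1: estimate V from M simulations.
            v = sum(simulate(theta, position, rng)[0] for _ in range(M)) / M
            # Step 2: estimate g from N further simulations.
            g = [0.0] * len(theta)
            for _ in range(N):
                z, psis = simulate(theta, position, rng)
                if z == 0.0:
                    continue  # only games that black won contribute
                for k in range(len(theta)):
                    # Assumption: average psi within the game, then divide by N.
                    g[k] += sum(ps[k] for ps in psis) / len(psis) / N
            # Step 3: adjust the parameters by alpha * (V* - V) * g.
            for k in range(len(theta)):
                theta[k] += alpha * (v_star - v) * g[k]
    return theta

# Toy data: move 0 matches pattern 0 and usually wins, move 1 matches pattern 1.
rng = random.Random(0)
position = ([[0], [1]], [0.9, 0.1])
training_set = [(position, 0.9)]  # made-up V* for the toy position
theta = balance([0.0, 0.0], training_set, alpha=1.0, M=100, N=100, sweeps=30, rng=rng)
```

In this toy setup the weight on pattern 0 should end up above the weight on pattern 1, since V starts below V* and winning games mostly contain move 0.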
The feature vector is the set of patterns you use, with value 1 if a
pattern is matched and 0 otherwise. The simulation policy selects
actions in proportion to the exponentiated, weighted sum of all
matching patterns. For example let's say move a matches patterns 1 and
2, move b matches patterns 1 and 3, and move c matches patterns 2 and
4. Then move a would be selected with probability e^(theta1 +
theta2) / (e^(theta1 + theta2) + e^(theta1 + theta3) + e^(theta2 +
theta4)). The theta values are the weights on the patterns which we
would like to learn. They play the role of the log of the strength
(gamma) values in Remi Coulom's Elo rating approach.
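
The three-move example above can be checked numerically with a small Python sketch. The theta values here are made up for illustration, and the function name is hypothetical.

```python
import math

def move_probabilities(theta, matches):
    """theta: pattern id -> weight; matches: move -> list of matching pattern ids."""
    scores = {move: math.exp(sum(theta[p] for p in pats))
              for move, pats in matches.items()}
    total = sum(scores.values())
    return {move: s / total for move, s in scores.items()}

theta = {1: 0.5, 2: -0.2, 3: 0.1, 4: 0.0}          # made-up weights
matches = {"a": [1, 2], "b": [1, 3], "c": [2, 4]}  # the example from the text
probs = move_probabilities(theta, matches)

# probs["a"] is e^(theta1 + theta2) divided by the sum over all three moves.
print(probs)
```

With these weights move b has the largest summed weight, so it is the most likely choice, followed by a and then c.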
The only tricky part is computing the vector psi(s,a). Each component
of psi(s,a) corresponds to a particular pattern, and is the difference
between the observed feature (i.e. whether the pattern actually
occurred after move a in position s) and the expected feature (the
average value of the pattern, weighted by the probability of selecting
each action).
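
A minimal Python sketch of psi(s,a) for the same three-move example follows, assuming the softmax policy described earlier. The weights and function names are hypothetical. A useful sanity check is that the policy-weighted average of psi over all moves is zero for every pattern.

```python
import math

def policy(theta, matches):
    """Softmax over the summed weights of each move's matching patterns."""
    scores = {m: math.exp(sum(theta[p] for p in pats)) for m, pats in matches.items()}
    total = sum(scores.values())
    return {m: s / total for m, s in scores.items()}

def psi(theta, matches, chosen):
    """Per pattern: observed feature for the chosen move minus the expected feature."""
    probs = policy(theta, matches)
    result = {}
    for p in theta:
        observed = 1.0 if p in matches[chosen] else 0.0
        # Expected feature: total probability of the moves that match pattern p.
        expected = sum(probs[m] for m, pats in matches.items() if p in pats)
        result[p] = observed - expected
    return result

theta = {1: 0.5, 2: -0.2, 3: 0.1, 4: 0.0}          # made-up weights
matches = {"a": [1, 2], "b": [1, 3], "c": [2, 4]}
probs = policy(theta, matches)
psi_a = psi(theta, matches, "a")

# Pattern 1 is matched by a and b, so psi_a[1] = 1 - (P(a) + P(b)) = P(c).
# Pattern 3 is matched only by b, so psi_a[3] = 0 - P(b).
print(psi_a)
```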
It's also very important to be careful about signs and the colour to
play - it's easy to make a mistake and follow the gradient in the
wrong direction.
Is that any clearer?
-Dave