On Jun 28, 2011, at 9:39 PM, Imran Hendley wrote:
Hi, long-time lurker and occasional poster here,
Thank you for the paper. I hope you don't mind me asking a few very
basic questions, since I am having trouble understanding exactly
what you are doing.
Let's say we are using a linear classifier. Then our output (the
predicted move) should look like:
argmax_i (y[i]), where y[i] = w1[i] · m1 + w2[i] · m2 + b
Where each w[i] is a weight vector for location i on the board, the
m's are the (column) input vectors (which I assume are 1 at the move
location and zero elsewhere), and b is the bias term.
There is a separate bias for each move, so b in your formula should be
b[i].
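As a rough sketch of the formula with a per-move bias (not code from the paper): the board size `N`, the helper names `one_hot` and `predict`, and stacking the w1[i] and w2[i] vectors as rows of the matrices `W1` and `W2` are all assumptions made for illustration.

```python
import numpy as np

N = 9 * 9  # assumed: one output per point on a 9x9 board


def one_hot(move, n=N):
    """Input vector that is 1 at the move location and 0 elsewhere."""
    v = np.zeros(n)
    v[move] = 1.0
    return v


def predict(W1, W2, b, m1, m2):
    """Return (argmax_i y[i], y) where y[i] = w1[i] . m1 + w2[i] . m2 + b[i].

    Row i of W1 and W2 holds the weight vectors w1[i] and w2[i];
    b is a vector with one bias per move.
    """
    y = W1 @ m1 + W2 @ m2 + b
    return int(np.argmax(y)), y
```

With all weights at zero, the prediction is simply the move with the largest bias.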
To train our classifier online, we want to do something like: (1)
Generate a prediction for a training example. (2) Calculate the
error. (3) Update the feature weights. (4) Repeat.
If I understand, online training happens during the course of one
game, as we are playing. Moreover, we are using our classifier to
generate moves to select in the first phase of our simulation, as a
replacement for MCTS, and before playouts.
Correct.
Now this is where I have to start guessing the details. Are our
training examples playouts, and is our error function just 0 if the
playout wins, and 1 if it loses?
The "correct output" is 1 if the playout wins, 0 if it loses. The
error is the difference between the correct output and the actual
output.
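One way to read this is as a delta-rule update. The sketch below is only that: the learning rate `lr`, the function name `update`, and the choice to adjust only the output for the move that was actually played are assumptions, not details taken from the paper.

```python
import numpy as np


def update(W1, W2, b, m1, m2, move, won, lr=0.1):
    """Nudge the output for `move` toward 1 after a win, 0 after a loss.

    W1, W2 are (moves x inputs) weight matrices, b holds one bias per move,
    m1 and m2 are the one-hot input vectors. Arrays are modified in place.
    Returns the error (correct output minus actual output).
    """
    target = 1.0 if won else 0.0
    actual = W1[move] @ m1 + W2[move] @ m2 + b[move]
    error = target - actual
    # Assumed delta rule: step each parameter along its input times the error.
    W1[move] += lr * error * m1
    W2[move] += lr * error * m2
    b[move] += lr * error
    return error
```

Repeated wins through the same move shrink the error toward zero, so later updates for that move become smaller.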
And as we run more playouts, the classifier will update its weights
and select a different sequence of moves in the first phase of our
simulation (analogous to selecting different paths down the search
tree based on node scores in MCTS)? And when we use up our allotted
time for one turn we just return the next move (from the current
position) that our classifier predicts, based on its current weights?
We tried this, but the classifier fluctuates quite a bit. (This is, we
think, a desirable property to keep up exploration.) Instead, we
choose as the actual move the move through which the most playouts
were played.
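A minimal sketch of that selection rule, assuming the first move of every simulation from the current position has been recorded in a list (the name `first_moves` and the function `most_played_move` are hypothetical):

```python
from collections import Counter


def most_played_move(first_moves):
    """Return the move through which the most playouts were played.

    `first_moves` holds the first move of each simulation from the
    current position; ties go to the move that was recorded first.
    """
    counts = Counter(first_moves)
    move, _ = counts.most_common(1)[0]
    return move
```

Counting visits rather than reading off the classifier's current argmax keeps the final choice stable even while the weights are still fluctuating.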
The paper says we fix the number of moves we select with the
classifier before running playouts (unlike starting from the root
and expanding in MCTS). This is where things start getting really
fuzzy for me. Do we propagate the results of a playout back up this
sequence? i.e. if we get a win, do we perform updates of our
classifier for each two-move sequence in the full sequence?
Yes. The classifier therefore learns from the entire playout, not just
from moves generated by the playout. (This is vaguely analogous to
RAVE.)
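To illustrate, here is a sketch that turns one finished simulation into training examples, one for each move that has two predecessors. Several things are assumptions rather than details from the paper: the function name `training_examples`, the ordering of the two input moves, and in particular flipping the target for the opponent's moves so that each example is scored from the point of view of the player who made the move.

```python
def training_examples(sequence, black_won, black_moves_first=True):
    """Build (two_back, previous, move, target) tuples from a full move sequence.

    `sequence` lists every move of the simulation in order, covering both
    the classifier-selected moves and the playout moves.
    """
    examples = []
    for t in range(2, len(sequence)):
        # Players alternate, so parity of t tells us who made move t.
        mover_is_black = (t % 2 == 0) == black_moves_first
        # Assumed: target is 1 if the mover's side won the playout, else 0.
        target = 1.0 if mover_is_black == black_won else 0.0
        examples.append((sequence[t - 2], sequence[t - 1], sequence[t], target))
    return examples
```

Each tuple could then be fed to a single classifier update, so one playout yields many updates instead of one.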
I would really like to get to the deeper questions about
interpreting what is really going on, but I first need to make sure
I am on the right page here. Sincere apologies for the stupid
questions. I really hope my understanding didn't get derailed so
early on that most of my questions in this message are gibberish.
But I did want to show that I actually made a concerted effort to
understand the paper before asking what on earth it is all about!
No problem -- we look forward to any insights you can offer!
Peter Drake
http://www.lclark.edu/~drake/