Re: [Computer-go] A Linear Classifier Outperforms UCT on 9x9 Go

Peter Drake Wed, 29 Jun 2011 08:20:11 -0700

On Jun 28, 2011, at 9:39 PM, Imran Hendley wrote:

Hi, long-time lurker and occasional poster here,
Thank you for the paper. I hope you don't mind me asking a few very basic questions, since I am having trouble understanding exactly what you are doing.
Let's say we are using a linear classifier. Then our output (the predicted move) should look like:
argmax_i (y[i]), where y[i] = w1[i] · m1 + w2[i] · m2 + b
Where each w[i] is a weight vector for location i on the board, the m's are the (column) input vectors (which I assume are 1 at the move location and zero elsewhere), and b is the bias term.

There is a separate bias for each move, so b in your formula should be b[i].
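
A minimal sketch of the formula above with the per-move bias b[i], assuming a 9x9 board flattened to 81 points and one-hot inputs m1 and m2 as the question supposes. The function names and the `legal` argument are illustrative and not taken from the paper.

```python
N = 81  # points on a 9x9 board (assumption for this sketch)

def one_hot(move, n=N):
    """Input vector: 1 at the move location, 0 elsewhere."""
    v = [0.0] * n
    v[move] = 1.0
    return v

def scores(w1, w2, b, m1, m2):
    """y[i] = w1[i] . m1 + w2[i] . m2 + b[i] for every candidate move i."""
    return [
        sum(a * x for a, x in zip(w1[i], m1))
        + sum(a * x for a, x in zip(w2[i], m2))
        + b[i]
        for i in range(len(b))
    ]

def predict(w1, w2, b, m1, m2, legal):
    """argmax_i y[i], restricted to a list of legal move indices."""
    y = scores(w1, w2, b, m1, m2)
    return max(legal, key=lambda i: y[i])
```

Because m1 and m2 are one-hot, each dot product just picks out a single weight, so an implementation could replace the sums with two table lookups.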

To train our classifier online, we want to do something like: (1) Generate a prediction for a training example. (2) Calculate the error. (3) Update the feature weights. (4) Repeat.
If I understand, online training happens during the course of one game, as we are playing. Moreover, we are using our classifier to generate moves to select in the first phase of our simulation, as a replacement for MCTS, and before playouts.


Correct.

Now this is where I have to start guessing the details. Are our training examples playouts, and is our error function just 0 if the playout wins, and 1 if it loses?

The "correct output" is 1 if the playout wins, 0 if it loses. The error is the difference between the correct output and the actual output.
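
One way to picture a single training step, as a sketch only. The target and error follow the description above; the delta-rule weight update, the linear (unsquashed) output, the learning rate `lr`, and the choice of the two preceding moves as inputs are assumptions of this example, not details confirmed in this thread. Inputs are passed as board indices, since a one-hot dot product reduces to a lookup.

```python
def update(w1, w2, b, prev2, prev1, move, won, lr=0.1):
    """One online update for the output unit of `move`.

    prev1 is the move just before `move`, prev2 the one before that
    (assumed inputs). Returns the error before the update.
    """
    target = 1.0 if won else 0.0           # "correct output"
    output = w1[move][prev1] + w2[move][prev2] + b[move]
    error = target - output                # correct minus actual
    # Delta-rule step (assumption): move the output toward the target.
    w1[move][prev1] += lr * error
    w2[move][prev2] += lr * error
    b[move] += lr * error
    return error
```

Repeated wins through the same context push the output for that move toward 1, and repeated losses push it toward 0.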

And as we run more playouts, the classifier will update its weights and select a different sequence of moves in the first phase of our simulation (analogous to selecting different paths down the search tree based on node scores in MCTS)? And when we use up our allotted time for one turn we just return the next move (from the current position) that our classifier predicts, based on its current weights?

We tried this, but the classifier fluctuates quite a bit. (This is, we think, a desirable property to keep up exploration.) Instead, we choose as the actual move the move through which the most playouts were played.
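
The final choice described above can be sketched as a simple count over the first move of each simulation. The function name and the list-of-first-moves representation are hypothetical, and how ties are broken is not stated in this thread.

```python
from collections import Counter

def choose_move(first_moves):
    """Return the move through which the most playouts were played.

    first_moves: the first move of every simulation run this turn.
    Ties go to whichever move was seen first (an arbitrary choice here).
    """
    counts = Counter(first_moves)
    return counts.most_common(1)[0][0]
```

This keeps the played move stable even while the classifier's own top prediction fluctuates between simulations.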

The paper says we fix the number of moves we select with the classifier before running playouts (unlike starting from the root and expanding in MCTS). This is where things start getting really fuzzy for me. Do we propagate the results of a playout back up this sequence? i.e. if we get a win, do we perform updates of our classifier for each two-move sequence in the full sequence?

Yes. The classifier therefore learns from the entire playout, not just from moves generated by the playout. (This is vaguely analogous to RAVE.)
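
A rough sketch of propagating one result over the whole move sequence, updating once per move with its two predecessors as context. Several things here are assumptions made for illustration: Black moves first, the target is flipped for the side that lost, the update is a delta rule with learning rate `lr`, and the first two moves are skipped because they lack two predecessors. None of these details are confirmed in this thread.

```python
def train_on_playout(w1, w2, b, moves, black_won, lr=0.1):
    """Update the classifier for every move in a finished simulation.

    moves: board indices for the full sequence (classifier phase plus
    playout), Black first. Returns the number of updates performed.
    """
    updates = 0
    for t in range(2, len(moves)):
        prev2, prev1, move = moves[t - 2], moves[t - 1], moves[t]
        black_to_move = (t % 2 == 0)
        # 1 if the side that played this move went on to win, else 0.
        target = 1.0 if black_to_move == black_won else 0.0
        output = w1[move][prev1] + w2[move][prev2] + b[move]
        error = target - output
        w1[move][prev1] += lr * error
        w2[move][prev2] += lr * error
        b[move] += lr * error
        updates += 1
    return updates
```

Every move in the sequence contributes a training example, which is the sense in which the scheme resembles RAVE.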

I would really like to get to the deeper questions about interpreting what is really going on, but I first need to make sure I am on the right page here. Sincere apologies for the stupid questions. I really hope my understanding didn't get derailed so early on that most of my questions in this message are gibberish. But I did want to show that I actually made a concerted effort to understand the paper before asking what on earth it is all about!


No problem -- we look forward to any insights you can offer!

Peter Drake
http://www.lclark.edu/~drake/

