I have the impression that the value network is used to initialize the score of a node to, say, 70% as if out of N trials. Then the MCTS playouts are trials N+1, N+2, etc. Still asymptotically optimal, but if the value network is accurate then you get a big acceleration in accuracy, because the scores start from a higher point instead of wobbling unstably for a while.
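To make that concrete, here is a minimal sketch of what I am imagining (the class, the `virtual_trials` count, and its value of 10 are all my own assumptions, not anything from the paper):

    class Node:
        def __init__(self, value_net_estimate, virtual_trials=10):
            # Seed the statistics as if `virtual_trials` playouts had
            # already been run, winning at the value net's predicted rate.
            self.visits = virtual_trials
            self.wins = value_net_estimate * virtual_trials  # e.g. 0.70 * 10

        def winrate(self):
            return self.wins / self.visits

        def update(self, result):
            # Plain MCTS back-up: result is 1.0 for a win, 0.0 for a loss.
            self.visits += 1
            self.wins += result

    node = Node(value_net_estimate=0.70)
    print(node.winrate())  # 0.70 -- starts high instead of near 0.5
    node.update(0.0)       # trial N+1 is a loss...
    print(node.winrate())  # ~0.64 -- nudged down smoothly, not reset

With this seeding, an early string of losses pulls the estimate down gradually rather than slamming it straight to 0%.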
But then I didn't follow the back-up policy. That is, if you do a search, and the color to move loses, but the evaluation at the leaf node was winning by 70%, then what update is made to this node? In plain MCTS you only back up the W/L value, but if you are using a value network then it seems inconsistent not to use the 70% in some way (a sketch of what I mean is after the quoted message below). So I also have to go back and read the paper again...

-----Original Message-----
From: Computer-go [mailto:[email protected]] On Behalf Of Darren Cook
Sent: Sunday, March 13, 2016 2:20 PM
To: [email protected]
Subject: Re: [Computer-go] Game 4: a rare insight

> You are right, but from fig 2 of the paper you can see that the MC and
> value network should give similar results:
>
> 70% from the value network should be comparable to a 60-65% MC winrate
> from this paper, usually expected around move 140 in a "human expert
> game" (whatever that means in this figure :)

Thanks, that makes sense.

>>> Assuming that is an MCTS estimate of winning probability, that 70%
>>> sounds high (i.e. very confident);
>
>> That tweet says the 70% is from the value net, not from an MCTS
>> estimate.

I guess I need to go back and read the AlphaGo papers again; I thought it was still an MCTS program at the top level, and the value network was being used to influence the moves the tree explores. But from this, and some other comments I've seen, I have the feeling I've misunderstood.

Darren
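Here is the sort of back-up I had in mind above, i.e. mixing the leaf evaluation into the result instead of discarding it. This is pure guesswork until I reread the paper; the mixing weight `lam` (and the `Node.update` from my earlier sketch) are my own assumptions:

    def backup(path, value_net_leaf_eval, rollout_result, lam=0.5):
        # `path` is the list of nodes from the root down to the leaf.
        # Blend the value net's leaf estimate with the rollout outcome,
        # so a lost rollout from a 70%-winning leaf backs up 0.35
        # rather than a flat 0.0. (Sign flips between the two colours
        # are ignored here for simplicity.)
        leaf_value = (1 - lam) * value_net_leaf_eval + lam * rollout_result
        for node in path:
            node.update(leaf_value)

With lam = 1.0 this degenerates to plain MCTS, and with lam = 0.0 the rollouts are ignored entirely, so the weight controls exactly how much each of the two signals is trusted.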
