I have the impression that the value network is used to initialize the score of 
a node, as if it had already scored, say, 70% over N virtual trials. MCTS then 
supplies trials N+1, N+2, and so on. That is still asymptotically optimal, but 
if the value network is accurate you get a big acceleration in accuracy, 
because the scores start near a good estimate instead of wobbling unstably for 
a while.
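
Something like the following sketch is what I have in mind (hypothetical names, 
not AlphaGo's actual code): seed the node's statistics as if the value 
network's estimate had come from a few virtual trials, so the early Q values 
start near the prior:

    class Node:
        def __init__(self, value_net_estimate, virtual_visits=10):
            # Treat the value network's output as if it were the result
            # of `virtual_visits` pretend playouts.
            self.visits = virtual_visits
            self.wins = value_net_estimate * virtual_visits  # 0.70 * 10 = 7.0

        def backup(self, result):
            # result is 1.0 for a win, 0.0 for a loss, from one real playout.
            self.visits += 1
            self.wins += result

        @property
        def q(self):
            return self.wins / self.visits

    node = Node(value_net_estimate=0.70)
    node.backup(0.0)   # one losing playout only nudges the estimate
    print(node.q)      # ~0.64, not 0.0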

But then I didn't follow the back-up policy. That is, if you run a simulation 
and the colour to move loses, but the value network's evaluation at the leaf 
node gave a 70% chance of winning, what update is made to that node?

In plain MCTS you only back up the win/loss result. But if you are using a 
value network, it seems inconsistent not to use the 70% in some way.
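
If I recall the paper correctly, the answer is that the leaf evaluation blends 
the two signals, V(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L, with 
lambda = 0.5, and it is this blended value that gets backed up. A minimal 
sketch, rescaled to [0, 1] win probabilities to match the 70% above:

    LAM = 0.5  # mixing parameter lambda, as reported in the paper

    def leaf_value(value_net_output, rollout_result, lam=LAM):
        # value_net_output: v_theta(s_L), e.g. 0.70
        # rollout_result:   z_L, 1.0 for a win, 0.0 for a loss
        return (1.0 - lam) * value_net_output + lam * rollout_result

    # The case above: value net says 70%, but the playout is lost.
    print(leaf_value(0.70, 0.0))   # 0.35

So the losing playout does pull the node down, but only halfway; the 70% is 
not simply discarded.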

So I also have to go back to read the paper again...

-----Original Message-----
From: Computer-go [mailto:[email protected]] On Behalf Of 
Darren Cook
Sent: Sunday, March 13, 2016 2:20 PM
To: [email protected]
Subject: Re: [Computer-go] Game 4: a rare insight

> You are right, but from fig. 2 of the paper one can see that the MC and 
> value network should give similar results:
> 
> A 70% value-network estimate should be comparable to a 60-65% MC winrate 
> from that paper, usually expected around move 140 in a "human expert 
> game" (whatever that means in this figure :)

Thanks, that makes sense.

>>> Assuming that is an MCTS estimate of winning probability, that 70% 
>>> sounds high (i.e. very confident);
> 
>> That tweet says the 70% is from the value net, not from an MCTS estimate.

I guess I need to go back and read the AlphaGo papers again; I thought it was 
still an MCTS program at the top level, with the value network used to 
influence which moves the tree explores. But from this, and some other 
comments I've seen, I have the feeling I've misunderstood.

Darren

_______________________________________________
Computer-go mailing list
[email protected]
http://computer-go.org/mailman/listinfo/computer-go