A few more words...

*) Pushing this idea to the extreme, one might want to build a "Tree
Network" whose output somehow fits the whole Monte-Carlo search tree
(including all the win/lose counts etc.) for the board position. As we
know, a deep network can fit almost anything. The structure of the network
requires some thought, as we certainly shouldn't try to fit the whole tree
directly.
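
For this tree-network idea, here is a minimal sketch of what the training
target could look like: fit only a one-ply summary of the search tree
(per-move visit fractions and win rates at the root) rather than the tree
itself. The node API (.children, .visits, .wins) is hypothetical.

import numpy as np

def tree_to_targets(root, board=19):
    """Flatten the root of an MCTS tree into fixed-size training targets.

    root.children is assumed to map a move index (0 .. board*board, with
    pass as the last index) to a node carrying .visits and .wins.
    """
    n = board * board + 1
    visits = np.zeros(n)
    winrate = np.zeros(n)
    for move, child in root.children.items():
        visits[move] = child.visits
        winrate[move] = child.wins / max(child.visits, 1)
    # visit distribution and per-move win rate -- the two training targets
    return visits / max(visits.sum(), 1), winrate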

*) To improve the life-and-death knowledge of the network, it might help
to use a very aggressive opponent (whose policy is biased towards fighting
moves) in self-play. As another example, if your network has problems with
ladders or mirror go, it is probably better to make an opponent that is
fond of ladder / mirror-go moves and use the resulting MCTS results to
train your network (instead of patching your code to do a ladder search).
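
A rough sketch of the biasing step for such an opponent, assuming a policy
that returns per-move probabilities and a hypothetical is_fighting_move()
predicate (e.g. contact moves, ataris, ladder continuations):

import numpy as np

def biased_opponent_move(probs, moves, is_fighting_move, bias=4.0):
    """Reweight the policy towards 'fighting' moves before sampling a
    self-play move for the aggressive opponent. bias is a free knob."""
    w = np.array([bias if is_fighting_move(m) else 1.0 for m in moves])
    p = probs * w
    p /= p.sum()
    return moves[np.random.choice(len(moves), p=p)]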

*) Could we build a distributed training project, like Folding@home or
Bitcoin mining? Otherwise individuals and small groups won't stand a
chance against the large companies.

On 3/20/17, 03:48, "Computer-go on behalf of Bo Peng"
<computer-go-boun...@computer-go.org on behalf of b...@withablink.com> wrote:

>Training a policy network is simple, and I have found that a Residual
>Network with Batch Normalization works very well. Training a value network,
>however, is far more challenging: it is very easy to overfit unless one
>uses the final territory as an additional prediction target. Even then, it
>will have difficulty handling life-and-death, because we won't have the
>computing resources of Tencent...
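>
>A minimal sketch of the value-plus-territory setup in PyTorch, assuming a
>shared residual trunk (not shown); the head shapes and the 0.5 auxiliary
>weight are placeholders of mine:
>
>import torch
>import torch.nn as nn
>import torch.nn.functional as F
>
>class ValueWithTerritory(nn.Module):
>    # Scalar win/lose head plus a per-point final-territory head; the
>    # territory target acts as a regularizer against value overfitting.
>    def __init__(self, channels=256, board=19):
>        super().__init__()
>        self.value_head = nn.Sequential(
>            nn.Conv2d(channels, 1, 1), nn.BatchNorm2d(1), nn.ReLU(),
>            nn.Flatten(), nn.Linear(board * board, 256), nn.ReLU(),
>            nn.Linear(256, 1), nn.Tanh())
>        self.territory_head = nn.Conv2d(channels, 1, 1)
>
>    def forward(self, trunk_features):
>        return (self.value_head(trunk_features),
>                self.territory_head(trunk_features))
>
>def loss(value, territory, value_target, territory_target, aux_weight=0.5):
>    # territory_target: final ownership per point in [-1, 1], flattened
>    return F.mse_loss(value.squeeze(1), value_target) + aux_weight * \
>        F.mse_loss(torch.tanh(territory).flatten(1), territory_target)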
>
>A separate issue is that calling the value network only gives the winning
>ratio of a single board position. So if one wants to make moves directly
>with the value network, one has to call it on the board position after
>every possible move, which is much slower than calling the policy network
>(which needs just one call).
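>
>To make the call-count difference concrete (a sketch; value_net, policy_net
>and the position object with legal_moves() / play() are hypothetical):
>
>def move_by_value_net(pos, value_net):
>    # One forward pass per legal move: every child position is evaluated.
>    best, best_win = None, -1.0
>    for m in pos.legal_moves():
>        win = 1.0 - value_net(pos.play(m))  # child seen from the opponent's side
>        if win > best_win:
>            best, best_win = m, win
>    return best
>
>def move_by_policy_net(pos, policy_net):
>    # A single forward pass gives a probability for every move at once.
>    probs = policy_net(pos)                 # dict: move -> probability
>    return max(probs, key=probs.get)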
>
>Recently it occurred to me that training a "score network" may be a better
>choice than a policy / value network. The output of the score network is
>very simple: it is just the winning ratio of every possible move, the same
>as Fig 5.a in the Nature paper.
>
>( the pdf version of this document is at
>http://withablink.com/GoScoreNetwork.pdf )
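>
>A minimal sketch of such an output head in PyTorch, again on top of a
>shared trunk; the layer sizes are placeholders:
>
>import torch
>import torch.nn as nn
>
>class ScoreHead(nn.Module):
>    # One estimated winning ratio per possible move (board*board points
>    # plus pass), i.e. the quantity shown in Fig 5.a of the Nature paper.
>    def __init__(self, channels=256, board=19):
>        super().__init__()
>        self.conv = nn.Conv2d(channels, 2, 1)
>        self.bn = nn.BatchNorm2d(2)
>        self.fc = nn.Linear(2 * board * board, board * board + 1)
>
>    def forward(self, trunk_features):
>        x = torch.relu(self.bn(self.conv(trunk_features)))
>        return torch.sigmoid(self.fc(x.flatten(1)))  # winning ratios in (0, 1)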
>
>The score network has four merits:
>
>(1) It can directly replace both the policy and the value network.
>
>(2) We can do reinforcement learning on it directly, because we can train
>it to fit the MCTS result. This may be better than training with policy
>gradient (as in the Nature paper), because convergence to optimal play is
>guaranteed (since MCTS itself is guaranteed to converge to optimal play).
>
>(3) In fact, one can use it directly for UCT (MCTS without rollouts), and
>the self-improving process becomes even simpler: a single call gives
>hundreds of child nodes with winning ratios, and we can simply add them to
>our UCT tree (as if we had done the rollouts) and still use UCB and the
>usual selection-expansion-simulation-backpropagation loop (see the sketch
>after this list). One might still need some rollouts when the game is
>close to the end (to make sure the score is correct), and some TD(0) might
>help as well.
>
>(4) Although one can do (2) and (3) with the value network, it overfits
>easily because we are predicting only a single number. The score network
>is better in this respect.
>
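>As promised in (3), a sketch of seeding a UCT tree from one score-network
>call; the Node layout and the score_net interface (position -> dict of
>move -> winning ratio) are assumptions:
>
>import math
>
>class Node:
>    def __init__(self, prior_win=0.5):
>        self.children = {}           # move -> Node
>        self.visits = 0
>        self.wins = 0.0
>        self.prior_win = prior_win   # winning ratio from the score network
>
>def expand_with_score_net(node, pos, score_net):
>    # A single network call yields a winning ratio for every legal move;
>    # each child is seeded as if one rollout had already been played, so
>    # the usual UCB selection can proceed without real rollouts. (A full
>    # implementation would also back these values up along the path.)
>    for move, win in score_net(pos).items():
>        child = Node(prior_win=win)
>        child.visits, child.wins = 1, win
>        node.children[move] = child
>        node.visits += 1
>
>def select_ucb(node, c=1.4):
>    # Standard UCB1 over the seeded children.
>    return max(node.children.items(),
>               key=lambda mc: mc[1].wins / mc[1].visits
>                   + c * math.sqrt(math.log(node.visits) / mc[1].visits))
>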
>The training process will be like this:
>
>(1) Initial training. Use your value network / MCTS to compute the
>training data for the board positions in your SGFs.
>
>(2) Fine-tuning. It might then help to tune it so that it is more likely
>to give the correct move in your professional-game SGFs, i.e. making sure
>those moves maximize the winning ratio. In other words, we would train it
>as if it were a policy network. I believe this gives a better starting
>point for the self-improving stage.
>
>One possible method is this: if $\{p_i\}$ are the network outputs and $a$
>is the desired action, then we train $p_a$ towards $\max_i \{p_i\}$ (and
>probably also reduce the other $p_i$ so that some weighted sum of all the
>$\{p_i\}$ is preserved); a sketch of one such loss follows the list.
>
>(3) Self-improving. One can even generate board positions at random and
>train the network to fit the MCTS result. Correlation between board
>positions will then never be a problem.
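>
>As mentioned in (2) above, one possible loss for the fine-tuning step (my
>own reading of it, in PyTorch): push p_a up to max_i p_i with a hinge
>term, and keep the overall mean of the outputs close to that of the frozen
>pre-fine-tuning network; margin and keep_weight are free knobs:
>
>import torch
>
>def finetune_loss(p, p_ref, a, margin=0.0, keep_weight=0.1):
>    """p:     N x M winning ratios from the network being tuned
>       p_ref: the same positions evaluated by the frozen starting network
>       a:     N indices of the professional moves"""
>    p_a = p.gather(1, a.unsqueeze(1)).squeeze(1)       # p of the played move
>    hinge = torch.relu(p.max(dim=1).values.detach() - p_a + margin)
>    keep = (p.mean(dim=1) - p_ref.mean(dim=1)).pow(2)  # preserve the average
>    return (hinge + keep_weight * keep).mean()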
>
>Bo
>
>


_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
