A few more words:

*) Pushing this idea to the extreme, one might want to build a "Tree Network" whose output tries to somehow fit the whole Monte-Carlo search tree (including all the win/lose numbers etc.) for the board position. As we know, a deep network can fit anything. The structure of the network requires some thinking, as we certainly shouldn't directly fit the whole tree.

*) To improve the life-and-death knowledge of the network, it might help to make a very aggressive opponent (whose policy is biased towards fighting moves) for self-play; a rough sketch of such biasing follows after this list. As another example, if your network has problems with ladders / mirror go, it's probably better to make an opponent that is fond of ladder / mirror-go moves and use the resulting MCTS results to train your network (instead of patching your code to do a ladder search).

*) Could we build a distributed training project like Folding@home or Bitcoin mining? Otherwise individuals and small groups won't have any chance against the large companies.
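As an illustration of the biased-opponent idea in the second point, one way to do it is to upweight the probability of a chosen set of "aggressive" moves in the policy output before sampling the opponent's move during self-play. This is only a sketch under assumptions: bias_policy, aggressive_mask and the bias factor are made up for illustration, and how to detect "fighting" moves (contact plays, ataris, ladder continuations, mirror moves) is left open.

import numpy as np

def bias_policy(policy_probs, aggressive_mask, bias=3.0):
    # policy_probs    : probabilities over all legal moves (sums to 1)
    # aggressive_mask : boolean array, True for the moves we want the biased
    #                   opponent to favour (how to detect them is up to you)
    # bias            : multiplicative boost applied to those moves
    boosted = np.where(aggressive_mask, policy_probs * bias, policy_probs)
    return boosted / boosted.sum()

# Toy usage: 5 legal moves, moves 1 and 3 flagged as "fighting" moves.
p = np.array([0.40, 0.10, 0.30, 0.05, 0.15])
mask = np.array([False, True, False, True, False])
biased = bias_policy(p, mask)
move = np.random.choice(len(p), p=biased)   # sample the biased opponent's move

The same mechanism covers the ladder / mirror-go case: only the set of moves selected by the mask changes.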
On 3/20/17, 03:48, "Computer-go on behalf of Bo Peng" <computer-go-boun...@computer-go.org on behalf of b...@withablink.com> wrote:

>Training a policy network is simple, and I have found that a Residual Network with Batch Normalization works very well. However, training a value network is far more challenging, as I have found it very easy to overfit unless one uses the final territory as another prediction target. Even then, it will have difficulty handling life-and-death, because we won't have the computing resources of Tencent...
>
>A separate issue is that calling the value network only gives the winning ratio of one board position. So if one wants to make moves directly with the value network, one has to call it on the board positions after all possible moves, which is much slower than calling the policy network (which needs just one call).
>
>Recently it occurred to me that training a "score network" may be a better choice than a policy / value network. The output of the score network is very simple: it is just the winning ratio of every possible move, the same as Fig. 5a in the Nature paper.
>
>(The PDF version of this document is at http://withablink.com/GoScoreNetwork.pdf)
>
>The score network has four merits:
>
>(1) It can directly replace both the policy and the value network.
>
>(2) We can do reinforcement learning on it directly, because we can train it to fit the MCTS result. This may be better than training with the policy gradient (as in the Nature paper), because convergence to optimal play is guaranteed (since the convergence of MCTS to optimal play is guaranteed).
>
>(3) In fact, one can use it directly to do UCT (MCTS without rollouts), and the self-improving process will be even simpler: calling it once gives hundreds of child nodes with winning ratios, so we can simply add them to our UCT tree (as if we had done the rollouts) and still use UCB and the selection-expansion-simulation-backpropagation algorithm (see the sketch after this list). One might still need some rollouts when the game is close to the end (to make sure the score is correct). Some TD(0) might help as well.
>
>(4) Although one can do (2) and (3) with the value network, it overfits easily because we are predicting just one single number. The score network is better in this respect.
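To make point (3) concrete, here is a minimal sketch of a rollout-free UCT driven by the score network. Everything here is an assumption for illustration: score_net is a stub returning random numbers in place of a trained network, the "positions" are dummies, and real code would have to track legal moves, terminal nodes, and the end-game rollouts mentioned in (3).

import math
import random

def score_net(position):
    # Stub standing in for the trained score network: it should return the
    # predicted winning ratio, for the side to move, of every legal move.
    return {move: random.random() for move in range(5)}   # pretend 5 legal moves

class Node:
    def __init__(self, position, prior_winrate=0.5):
        self.position = position
        self.children = {}                # move -> Node
        self.visits = 1                   # the network estimate counts as one visit
        self.value_sum = prior_winrate    # win rate for the side to move at this node

    def q(self):
        return self.value_sum / self.visits

    def best_child(self, c=1.4):
        # UCB1 from the parent's point of view: a child's win rate belongs to the
        # opponent, so the parent's value of moving there is 1 - child.q().
        return max(self.children.values(),
                   key=lambda ch: (1.0 - ch.q())
                                  + c * math.sqrt(math.log(self.visits) / ch.visits))

    def expand(self):
        # One network call scores every move at once, as if we had already done
        # one rollout per child; store each child's value from its own side's view.
        for move, winrate in score_net(self.position).items():
            self.children[move] = Node((self.position, move), 1.0 - winrate)

def search(root, n_iterations=1000):
    root.expand()
    for _ in range(n_iterations):
        path, node = [root], root
        while node.children:              # selection
            node = node.best_child()
            path.append(node)
        node.expand()                     # expansion: no rollout needed, the
        value = node.q()                  # network estimate plays the rollout's role
        for n in reversed(path):          # backpropagation, flipping sides per ply
            n.visits += 1
            n.value_sum += value
            value = 1.0 - value
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

best_move = search(Node("empty board"))

The design point is simply that one network call populates all children with usable value estimates, so the simulation step of MCTS disappears.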
>The training process will be like this:
>
>(1) Initial training. Use your value network / MCTS to compute the training data for the board positions in your SGFs.
>
>(2) Fine-tuning. It might be helpful to then tune it so that it is more likely to give the correct move in your professional game SGFs, i.e. making sure those moves maximize the winning ratio. In other words, we will be training it as if it were a policy network. I believe this will give a better starting point for the self-improving stage.
>
>One possible method is like this: if $\{p_i\}$ are the network outputs and $a$ is the desired action, then we train $p_a$ to be $\max_i \{p_i\}$ (and probably also reduce the other $p_i$ so that some weighted sum of all the $\{p_i\}$ is preserved).
>
>(3) Self-improving. One can even randomly generate board positions and train the network to fit the MCTS result. The correlation of the board positions will hence never be a problem.
>
>Bo
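Finally, a concrete reading of the fine-tuning target in the quoted message (train $p_a$ to be $\max_i \{p_i\}$ while preserving a weighted sum of the $\{p_i\}$): a minimal sketch, assuming an unweighted sum and a proportional rescaling of the other moves, both of which are arbitrary choices for illustration rather than anything specified in the message.

import numpy as np

def finetune_target(p, a):
    # p : score-network outputs (winning ratio per candidate move)
    # a : index of the move actually played by the professional
    # Build a target where the played move gets the maximal score while the
    # (here unweighted) sum of the outputs is preserved; the surplus is taken
    # from the other moves in proportion to their current values.
    target = p.copy()
    target[a] = p.max()
    others = np.arange(len(p)) != a
    surplus = target.sum() - p.sum()
    if surplus > 0:
        target[others] -= surplus * p[others] / p[others].sum()
    return target

# Toy usage: four candidate moves, the professional played move 2.
p = np.array([0.60, 0.30, 0.45, 0.20])
t = finetune_target(p, a=2)
# Now t[2] equals p.max() and t.sum() matches p.sum() (up to floating point);
# the network is then trained towards t for this position.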