Training a policy network is simple, and I have found that a Residual Network with Batch Normalization works very well. Training a value network, however, is far more challenging: it overfits very easily unless one uses the final territory as an additional prediction target. Even then, it will have difficulty handling life-and-death, because we won't have the computing resources of Tencent...
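For concreteness, here is a minimal sketch of what such an auxiliary-target setup could look like: a small residual tower with both a scalar win-rate head and a per-point territory head. This is illustrative PyTorch only; the class names, plane counts and layer sizes are placeholders, not from any particular engine.

```python
# Sketch only: a value network regularized by an auxiliary territory head.
# All sizes (input planes, channels, number of blocks) are placeholders.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(x + y)

class ValueNetWithTerritory(nn.Module):
    def __init__(self, in_planes=17, ch=64, blocks=6, board=19):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.tower = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])
        # Main head: scalar win probability for the side to move.
        self.value_head = nn.Sequential(
            nn.Conv2d(ch, 1, 1), nn.Flatten(),
            nn.Linear(board * board, 64), nn.ReLU(inplace=True),
            nn.Linear(64, 1), nn.Sigmoid())
        # Auxiliary head: per-point final ownership, the extra prediction
        # target that fights overfitting of the scalar value.
        self.territory_head = nn.Conv2d(ch, 1, 1)

    def forward(self, x):
        h = self.tower(self.stem(x))
        return self.value_head(h), torch.sigmoid(self.territory_head(h))
```

The loss would then combine the win-rate term with a weighted territory term, so the scalar value head is regularized by the much denser territory signal.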
Another separate issue is that calling the value network only gives the winning ratio of a single board position. So if one wants to make moves directly with the value network, one has to call it on the board position after every possible move, which is much slower than calling the policy network (which needs just one call).

Recently it occurred to me that training a "score network" may be a better choice than a policy / value network. Its output is very simple: just the winning ratio of every possible move, as in Fig 5.a of the Nature paper. (The PDF version of this document is at http://withablink.com/GoScoreNetwork.pdf)

The score network has four merits:

(1) It can directly replace both the policy and the value network.

(2) We can do reinforcement learning on it directly, because we can train it to fit the MCTS result (a rough sketch of such a fitting loss is appended below). This may be better than training with policy gradients (as in the Nature paper), because MCTS is guaranteed to converge to optimal play, so the target we are fitting is guaranteed to converge to optimal play as well.

(3) In fact, one can use it directly to do UCT (MCTS without rollouts), and the self-improving process becomes even simpler: one call gives hundreds of child nodes with winning ratios, and we can simply add them to our UCT tree (as if we had done the rollouts) and still run the usual UCB selection-expansion-simulation-backpropagation algorithm (see the UCT sketch appended below). One might still need some rollouts when the game is close to the end, to make sure the score is correct, and some TD(0) might help as well.

(4) Although one can do (2) and (3) with the value network, it overfits easily because we are predicting just a single number. The score network is better in this respect.

The training process would be as follows:

(1) Initial training. Use your value network / MCTS to compute training targets for the board positions in your SGFs.

(2) Fine-tuning. It might be helpful to then tune the network so that it is more likely to give the correct move in your professional-game SGFs, i.e. making sure those moves maximize the predicted winning ratio. In other words, we would be training it as if it were a policy network; I believe this gives a better starting point for the self-improving stage. One possible method: if $\{p_i\}$ are the network outputs and $a$ is the desired action, train $p_a$ to be $\max_i \{p_i\}$ (and probably also reduce the other $p_i$ so that some weighted sum of all the $\{p_i\}$ is preserved). A sketch of this objective is appended below as well.

(3) Self-improving. One can even randomly generate board positions and train the network to fit the MCTS result. Correlation between board positions will then never be a problem.

Bo
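P.S. Below are some rough, untested sketches of the ideas above, written in PyTorch-flavoured Python; every class and function name is made up for illustration and none of this comes from an existing engine. First, the score head itself and a loss that fits its per-move win rates to MCTS results (merits (1)-(2) and training steps (1) and (3)), with a mask so that moves MCTS never visited contribute nothing:

```python
# Sketch only: a "score head" that outputs a winning ratio for every move
# (361 points + pass), trained to fit per-move MCTS win rates.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreHead(nn.Module):
    """Maps trunk features (B, ch, 19, 19) to one win rate per move."""
    def __init__(self, ch=64, board=19):
        super().__init__()
        self.conv = nn.Conv2d(ch, 2, 1)
        self.fc = nn.Linear(2 * board * board, board * board + 1)

    def forward(self, h):
        return torch.sigmoid(self.fc(self.conv(h).flatten(1)))  # (B, 362)

def score_loss(pred, mcts_winrate, visited_mask):
    """pred, mcts_winrate: (B, 362) in [0, 1]; visited_mask: 1 where MCTS
    actually searched the move, 0 for unvisited or illegal moves."""
    per_move = F.binary_cross_entropy(pred, mcts_winrate, reduction="none")
    return (per_move * visited_mask).sum() / visited_mask.sum().clamp(min=1)
```

In practice one would attach such a head to the same residual trunk as the policy network and store (position, per-move MCTS win rate, visit mask) triples as the training data.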
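Next, merit (3): using one score-network call in place of rollouts when expanding a UCT node. A minimal sketch, assuming a hypothetical `score_net(position)` that returns a dict from legal moves to predicted win rates:

```python
# Sketch only: seed every child of a UCT node from a single score-network
# call, as if each child had already received one rollout.
import math

class Node:
    def __init__(self, parent=None, move=None):
        self.parent, self.move = parent, move
        self.children = []
        self.visits = 0
        self.wins = 0.0

    def expand(self, position, score_net):
        # One network call gives a win rate for every legal move; each
        # child is created with one "virtual rollout" worth that much.
        for move, winrate in score_net(position).items():
            child = Node(parent=self, move=move)
            child.visits = 1
            child.wins = winrate
            self.children.append(child)

    def select_ucb(self, c=1.4):
        # Standard UCB1 over the seeded children.
        log_n = math.log(max(self.visits, 1))
        return max(self.children,
                   key=lambda ch: ch.wins / ch.visits
                                  + c * math.sqrt(log_n / ch.visits))
```

Backpropagation, flipping the win rate between the two players, and the endgame rollouts mentioned above are left out to keep the sketch short.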
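Finally, the fine-tuning objective from training step (2): raise the professional move's output $p_a$ to $\max_i \{p_i\}$ and scale the other outputs down so the sum over all moves is preserved (a uniform weighting is assumed here; `finetune_loss` is an illustrative name):

```python
# Sketch only: build a target where the expert move's win rate is raised
# to the current maximum and the rest are rescaled so the total sum of
# the outputs is unchanged, then regress the network onto that target.
import torch
import torch.nn.functional as F

def finetune_loss(pred, expert_move):
    """pred: (B, M) per-move win rates in [0, 1]; expert_move: (B,) indices."""
    batch = torch.arange(pred.size(0))
    target = pred.detach().clone()
    total = target.sum(dim=1, keepdim=True)           # sum to preserve
    best = target.max(dim=1).values                   # current max_i p_i
    target[batch, expert_move] = best                 # raise p_a to the max
    others = torch.ones_like(target)
    others[batch, expert_move] = 0.0                  # mask of non-expert moves
    other_sum = (target * others).sum(dim=1, keepdim=True)
    scale = (total - best.unsqueeze(1)).clamp(min=0.0) / other_sum.clamp(min=1e-8)
    target = target * (1.0 - others) + target * others * scale
    return F.mse_loss(pred, target)
```

Regressing onto this detached target only nudges the ranking toward the professional move; the overall level of the predicted win rates stays roughly where the MCTS-fitted training left it.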