Training a policy network is simple, and I have found that a Residual
Network with Batch Normalization works very well. However, training a
value network is far more challenging: I have found it very easy to
overfit, unless one uses the final territory as an additional prediction
target. Even then, it will have difficulty handling life-and-death,
because we won't have the computing resources of Tencent...

A separate issue is that calling the value network only gives the winning
ratio of one board position. So if one wants to make moves directly with
the value network, one has to call it on the board positions after all
possible moves, which is much slower than calling the policy network
(which needs just one call).
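
To make the cost difference concrete, here is a rough Python sketch. The
callables value_net, policy_net, legal_moves and play are hypothetical
placeholders for a real Go implementation, not from any library or paper:

    def pick_move_with_value_net(board, value_net, legal_moves, play):
        """One network call per legal move (up to ~361 calls on 19x19)."""
        best_move, best_winrate = None, -1.0
        for move in legal_moves(board):
            child = play(board, move)     # position after the move
            winrate = value_net(child)    # winning ratio of that one position
            if winrate > best_winrate:
                best_move, best_winrate = move, winrate
        return best_move

    def pick_move_with_policy_net(board, policy_net):
        """A single network call; policy_net returns {move: probability}."""
        probs = policy_net(board)
        return max(probs, key=probs.get)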

Recently it occurred to me that training a "score network" may be a better
choice than a policy / value network. The output of the score network is
very simple: it is just the winning ratio of every possible move, as in
Fig 5.a of the Nature paper.
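
As a purely illustrative sketch of such an output, assuming PyTorch, a
19x19 board and 19*19+1 moves (board points plus pass); the trunk below is
a placeholder and the number of input planes is arbitrary, only the head
matters:

    import torch
    import torch.nn as nn

    class ScoreNetwork(nn.Module):
        """Toy sketch: one winning ratio in [0, 1] per move. A real
        network would use a deep residual tower for the trunk."""
        def __init__(self, in_planes=17, channels=64):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Conv2d(in_planes, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(),
            )
            self.head = nn.Linear(channels * 19 * 19, 19 * 19 + 1)

        def forward(self, x):
            h = self.trunk(x).flatten(1)
            # Independent sigmoids: each move gets its own winning ratio,
            # unlike the softmax over moves in a policy network.
            return torch.sigmoid(self.head(h))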

(The PDF version of this document is at
http://withablink.com/GoScoreNetwork.pdf)

The score network has four merits:

(1) It can directly replace both the policy network and the value network.

(2) We can do reinforcement learning on it directly, because we can train
it to fit the MCTS result. This may be better than training with policy
gradients (as in the Nature paper), because convergence to optimal play is
guaranteed (since MCTS is guaranteed to converge to optimal play).

(3) In fact, one can use it directly for UCT (MCTS without rollouts), and
the self-improving process will be even simpler: a single call gives
hundreds of child nodes with winning ratios, and we can simply add them to
our UCT tree (as if we had done the rollouts) while still using UCB and
the selection-expansion-simulation-backpropagation algorithm (a rough
sketch follows this list). One might still need some rollouts when the
game is close to the end (to make sure the score is correct), and some
TD(0) might help as well.

(4) Although one can do (2) and (3) with the value network, it is easy to
overfit because we are predicting just a single number. The score network
is better in this respect.
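
Here is the rough sketch promised in point (3). It only shows expansion
and UCB selection; score_net (returning {move: winning ratio}) and play
are hypothetical placeholders, and backpropagation and perspective
handling follow the usual MCTS recipe:

    import math

    class Node:
        def __init__(self, board, parent=None):
            self.board = board
            self.parent = parent
            self.children = {}     # move -> Node
            self.visits = 0
            self.wins = 0.0        # accumulated winning ratio

    def expand_with_score_net(node, score_net, play):
        """One network call seeds every child with a winning ratio,
        as if each child had already received a single rollout."""
        for move, winrate in score_net(node.board).items():
            child = Node(play(node.board, move), parent=node)
            child.visits = 1
            child.wins = winrate
            node.children[move] = child
            node.visits += 1       # count the virtual rollouts in the parent

    def ucb_select(node, c=1.4):
        """Standard UCB1 over the seeded children."""
        return max(
            node.children.values(),
            key=lambda ch: ch.wins / ch.visits
                + c * math.sqrt(math.log(node.visits) / ch.visits),
        )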

The training process will be like this:

(1) Initial training. Use your value network / MCTS to compute the
training data for the board positions in your SGFs.

(2) Fine-tuning. It might then be helpful to tune it so that it is more
likely to give the correct move in your professional game SGFs, i.e.
making sure those moves maximize the winning ratio. In other words, we
will be training it as if it were a policy network. I believe this will
give a better starting point for the self-improving stage.

One possible method is this: if $\{p_i\}$ are the network outputs and $a$
is the desired action, then we train $p_a$ to be $\max_i \{p_i\}$ (and
probably also reduce the values of the other $p_i$ so that some weighted
sum of all the $\{p_i\}$ is preserved). A rough sketch of one such loss
follows these steps.

(3) Self-improving. One can even generate board positions at random and
train the network to fit the MCTS results. Correlation between the board
positions will hence never be a problem.
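
Finally, a sketch of one possible fine-tuning loss for step (2). This is
my own construction, not something from the Nature paper: it pushes the
professional move's output up toward the current maximum while keeping the
mean of all outputs (one simple choice of "weighted sum") roughly fixed.

    import torch

    def finetune_loss(p, a, p_ref, mean_weight=1.0):
        """p:     (batch, num_moves) current score-network outputs in [0, 1].
           a:     (batch,) index of the professional move in each position.
           p_ref: (batch, num_moves) outputs before this update, used only
                  as a fixed reference."""
        p_ref = p_ref.detach()
        target_max = p_ref.max(dim=1).values                 # level to reach
        p_a = p.gather(1, a.unsqueeze(1)).squeeze(1)          # pro move's output
        move_up = torch.relu(target_max - p_a)                # zero once p_a >= old max
        keep_mean = (p.mean(dim=1) - p_ref.mean(dim=1)) ** 2  # preserve overall level
        return (move_up + mean_weight * keep_mean).mean()

Whether the mean is the right quantity to preserve, and how strongly to
weight it, are of course open choices.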

Bo


