>... my value network was trained to tell me the game is balanced at the
>beginning...
:-)
The best training policy is to select positions that correct errors.
I used the policies below to train a backgammon NN. Together, they reduced the
expected loss of the network by 50% (cut the error rate in half).
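As a rough illustration of what "select positions that correct errors" could
look like in practice, here is a minimal sketch that over-samples positions
where the current network's prediction is furthest from the target. The
function name, the error threshold, and the error-proportional sampling are
all assumptions for illustration, not details from the post:

    import numpy as np

    def select_error_correcting(positions, targets, value_net,
                                batch_size, min_error=0.1):
        """Prefer positions where the current network is most wrong.

        positions : sequence of encoded positions
        targets   : array of target values (e.g. final game outcomes)
        value_net : callable returning the network's value estimate
        """
        preds = np.array([value_net(p) for p in positions])
        errors = np.abs(preds - np.asarray(targets))
        # Keep positions whose error exceeds the threshold, then sample
        # the training batch with probability proportional to that error.
        idx = np.flatnonzero(errors > min_error)
        if len(idx) == 0:
            return np.random.choice(len(positions), size=batch_size)
        probs = errors[idx] / errors[idx].sum()
        return np.random.choice(idx, size=min(batch_size, len(idx)),
                                replace=False, p=probs)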
Finally found the problem. In the end, it was as stupid as expected:
When I pick a game for batch creation, I randomly select a limited number of
moves inside the game. For the value network I use around 8-16 moves so as not
to overfit the data (I can't take just 1, or the I/O operations become the
bottleneck).
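For reference, a minimal sketch of that kind of per-game subsampling. The
8-16 range is from the post; the game-record accessors (position_after,
result) and everything else are just assumed for illustration:

    import random

    MIN_MOVES, MAX_MOVES = 8, 16   # moves sampled per game, as in the post

    def sample_positions(game):
        """Pick a handful of training positions from one game record.

        Sampling several moves per game amortises the cost of reading the
        game file; sampling only one would mean far more I/O per position.
        """
        k = random.randint(MIN_MOVES, MAX_MOVES)
        k = min(k, len(game.moves))            # short games have fewer moves
        picks = random.sample(range(len(game.moves)), k)
        # position_after(i) and result are assumed accessors on the record.
        return [(game.position_after(i), game.result) for i in picks]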