Hi, I couldn't get positive experimental results on Ray.
In Rn, the network structures for V and W are similar and share parameters; only the final convolutional layers differ. I trained Rn's network to minimize the MSE of V(s) + W(s). It uses only the KGS and GoGoD data sets, with no self-play from an RL policy. When only W(s) is trained, the network overfits, but training V(s) + W(s) at the same time prevents overfitting. However, I have no idea how to use V(s) or v(s) in MCTS. (A rough sketch of this setup follows the quoted thread below.)

Rn.3.0-4c plays with W(s) (winning rate):
http://www.yss-aya.com/19x19/cgos/cross/Rn.3.0-4c.html
3394 Elo

Rn.3.1-4c plays with V(s) (sum of ownership). It is a bit weaker, but the MCTS part is tuned for W(s) now, so something may be wrong:
http://www.yss-aya.com/cgos/19x19/cross/Rn.3.1-4c.html
3218 Elo

zakki

On Wed, Jan 11, 2017 at 19:49, Bo Peng <b...@withablink.com> wrote:
> Hi Remi,
>
> Thanks for sharing your experience.
>
> As I am writing this, it seems there could be a third method: the perfect
> value function shall have the minimax property in the obvious way. So we
> can train our value function to satisfy the minimax property as well. In
> fact, we can train it such that a shallow-level MCTS gives as close a
> result as a deeper-level MCTS. This can be regarded as some kind of
> bootstrapping.
>
> Wonder if you have tried this. Seems like it might be a natural idea...
>
> Bo
>
> On 1/11/17, 18:35, "Computer-go on behalf of Rémi Coulom"
> <computer-go-boun...@computer-go.org on behalf of remi.cou...@free.fr>
> wrote:
>
> > Hi,
> >
> > Thanks for sharing your idea.
> >
> > In my experience it is rarely efficient to train value functions from
> > very short-term data (i.e., the next move). TD(lambda), or training from
> > the final outcome of the game, is often better, because it uses a longer
> > horizon. But of course, it is difficult to tell without experiments
> > whether your idea would work or not. The advantage of your idea is that
> > you can collect a lot of training data more easily.
> >
> > Rémi
> >
> > ----- Original Message -----
> > From: "Bo Peng" <b...@withablink.com>
> > To: computer-go@computer-go.org
> > Sent: Tuesday, January 10, 2017 23:25:19
> > Subject: [Computer-go] Training the value network (a possibly more
> > efficient approach)
> >
> > Hi everyone. It occurs to me there might be a more efficient method to
> > train the value network directly (without using the policy network).
> >
> > You are welcome to check my method:
> > http://withablink.com/GoValueFunction.pdf
> >
> > Let me know if there are any silly mistakes :)
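For concreteness, here is a minimal PyTorch sketch of the shared-trunk V/W training described at the top of this message. The input planes, channel counts, and head shapes are illustrative guesses rather than Rn's actual code; W(s) is read as the scalar winning-rate head and V(s) as the per-point ownership head.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VWNetwork(nn.Module):
    # Shared convolutional trunk; only the final convolutional layers
    # differ between the two heads, as described above. The input
    # planes (4) and channel count (64) are made-up placeholders.
    def __init__(self, in_planes=4, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.w_conv = nn.Conv2d(channels, 1, 1)  # W(s): winning rate (scalar after pooling)
        self.v_conv = nn.Conv2d(channels, 1, 1)  # V(s): ownership of each board point

    def forward(self, x):
        h = self.trunk(x)
        w = torch.tanh(self.w_conv(h).mean(dim=(2, 3)))  # scalar in [-1, 1] per position
        v = torch.tanh(self.v_conv(h))                   # one value per board point
        return w, v

def joint_loss(w, v, w_target, v_target):
    # Minimize the MSE of both heads at once; the shared trunk then
    # receives gradients from both targets, which is what appears to
    # prevent the overfitting seen when W(s) is trained alone.
    return F.mse_loss(w, w_target) + F.mse_loss(v, v_target)

Training would then just apply joint_loss to minibatches of KGS/GoGoD positions labelled with the game result and the final ownership of each point.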
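Bo's "minimax property" suggestion above could be read, in its simplest one-ply form, as a consistency loss that regresses V(s) toward the negamax of the value over the legal child positions. The sketch below is only one possible reading of that idea; value_net, state_tensor, and child_tensors are hypothetical names.

import torch

def minimax_consistency_loss(value_net, state_tensor, child_tensors):
    # One-ply reading of the "minimax property": for the side to move in s,
    # V(s) should already equal the best negamax value over its children,
    # so regress V(s) toward that bootstrap target. The target is held
    # fixed (no gradient), as in TD-style bootstrapping.
    v_parent = value_net(state_tensor)
    with torch.no_grad():
        v_children = torch.stack([value_net(c) for c in child_tensors])
        target = (-v_children).max()
    return (v_parent - target).pow(2).mean()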