Hi,

I couldn't get positive experimental results on Ray.

Rn's networks for V and W have a similar structure and share parameters;
only the final convolutional layers differ.
I trained Rn's network to jointly minimize the MSE of V(s) and W(s).
It uses only the KGS and GoGoD data sets, with no self-play from the RL policy.
When only W(s) is trained, the network overfits, but training V(s) and W(s)
at the same time prevents overfitting.
But I have no idea how to use V(s) or v(s) in MCTS.
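
To make the joint training concrete, here is a minimal sketch of what I
mean. It is not Rn's actual code; the class and function names, layer
sizes, and input planes are placeholder assumptions, and the real trunk
is much deeper.

import torch
import torch.nn as nn

class ValueOwnershipNet(nn.Module):
    def __init__(self, in_planes=48, filters=64):
        super().__init__()
        # Shared trunk: V and W use the same parameters here.
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, filters, 3, padding=1), nn.ReLU(),
            nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU(),
        )
        # Only the final convolutional layers differ.
        self.w_head = nn.Conv2d(filters, 1, 1)  # -> scalar outcome W(s)
        self.v_head = nn.Conv2d(filters, 1, 1)  # -> per-point ownership V(s)

    def forward(self, x):
        h = self.trunk(x)
        w = torch.tanh(self.w_head(h).mean(dim=(1, 2, 3)))  # W(s) in [-1, 1]
        v = torch.tanh(self.v_head(h).squeeze(1))            # V(s), one value per point
        return w, v

def joint_loss(net, boards, outcome, ownership):
    # outcome: final game result in [-1, 1]; ownership: final owner of each point.
    # Training only the W term overfits; the sum of both MSE terms does not.
    w, v = net(boards)
    return (nn.functional.mse_loss(w, outcome)
            + nn.functional.mse_loss(v, ownership))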

Rn.3.0-4c plays with W(s) (winning rate):
http://www.yss-aya.com/19x19/cgos/cross/Rn.3.0-4c.html
3394 Elo

Rn.3.1-4c plays with V(s) (sum of ownership); it is a bit weaker.
(The MCTS part is currently tuned for W(s), so something may be wrong.)
http://www.yss-aya.com/cgos/19x19/cross/Rn.3.1-4c.html
3218 Elo
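
For reference, here is a minimal sketch of how the two evaluations differ
when used as an MCTS leaf value. This is only an illustration, not Rn's
actual code; treating W(s) as a tanh output, the komi handling, and the
scaling constant k are my assumptions.

import math

def leaf_value_from_w(w):
    # W(s) is a predicted outcome in [-1, 1] (assuming a tanh output);
    # map it to a winning rate in [0, 1].
    return (w + 1.0) / 2.0

def leaf_value_from_v(ownership, komi=7.5):
    # V(s) is per-point ownership in [-1, 1]. Summing it gives an expected
    # score lead, which is squashed into [0, 1] so it can be backed up
    # like a winning rate. k controls how sharply score maps to win rate.
    expected_lead = sum(ownership) - komi
    k = 0.5
    return 1.0 / (1.0 + math.exp(-k * expected_lead))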

zakki

On Wed, Jan 11, 2017 at 19:49, Bo Peng <b...@withablink.com> wrote:

> Hi Remi,
>
> Thanks for sharing your experience.
>
> As I am writing this, it seems there could be a third method: the perfect
> value function shall have the minimax property in the obvious way. So we
> can train our value function to satisfy the minimax property as well. In
> fact, we can train it such that a shallow-level MCTS gives a result as
> close as possible to that of a deeper-level MCTS. This can be regarded as
> a kind of bootstrapping.
>
> I wonder if you have tried this. It seems it might be a natural idea...
>
> Bo
>
> On 1/11/17, 18:35, "Computer-go on behalf of Rémi Coulom"
> <computer-go-boun...@computer-go.org on behalf of remi.cou...@free.fr>
> wrote:
>
> >Hi,
> >
> >Thanks for sharing your idea.
> >
> >In my experience it is rarely efficient to train value functions from
> >very short-term data (i.e., the next move). TD(lambda), or training from
> >the final outcome of the game, is often better, because it uses a longer
> >horizon. But of course, it is difficult to tell without experiments
> >whether your idea would work or not. The advantage of your idea is that
> >you can collect a lot of training data more easily.
> >
> >Rémi
> >
> >----- Original Message -----
> >From: "Bo Peng" <b...@withablink.com>
> >To: computer-go@computer-go.org
> >Sent: Tuesday, January 10, 2017, 23:25:19
> >Subject: [Computer-go] Training the value network (a possibly more
> >efficient approach)
> >
> >
> >Hi everyone. It occurs to me there might be a more efficient method to
> >train the value network directly (without using the policy network).
> >
> >
> >You are welcome to check my method:
> >http://withablink.com/GoValueFunction.pdf
> >
> >
> >Let me know if there are any silly mistakes :)
> >
>
>
