Hi,
1) Simultaneous policy and value function reinforcement learning by MCTS + TD(lambda)?

What makes a good policy network from a 'Policy & Value MCTS' (PV-MCTS) point of view, i.e. as in the AlphaGo implementation? Referring to the terminology and results of Silver's paper, a greedy policy using the RL policy network beat a greedy policy using the SL policy network, yet PV-MCTS performed better with the SL policy network than with the RL policy network. The authors hypothesized that this is "presumably because humans select a diverse beam of promising moves, whereas RL optimizes for the single best move". Tree search is, in practice, necessary to discover what cannot be 'seen' immediately by a value network at the root node but becomes 'clearer' later on, when the leaf nodes are evaluated. Thus, one quality of a policy function used to bias the search in an MCTS is a good balance between 'sharpness' (being selective) and 'open-mindedness' (giving a chance to low-prior moves that could turn out to be important, i.e. avoiding blind spots).

Silver's paper does not propose an RL method for improving the policy network used in the PV-MCTS beyond its initial SL training on human games; the RL policy network is only used to train the value network. Value-function RL using n-ply minimax search or TD approaches combined with tree search has been described for a long time (TD-Leaf, for example), but I have failed to find a paper proposing direct RL of the policy network from tree-search results, and more particularly from an MCTS.

Since the policy function is used to bias the selection phase of the MCTS, which after a while is dominated by the backed-up action values, I have the (naive?) feeling that a good policy function, for use in a PV-MCTS, should predict prior probabilities as close as possible to the action values obtained after some search budget (say a 10,000-node tree), i.e. it should predict the future action-value distribution (subject to a softmax conversion). Conversely, the action values obtained after some search budget could be used to train the policy function: they would be converted into revised prior probabilities by a softmax with an adequate temperature parameter and used as the target for the policy network. In the same cycle, the value function could be trained by a TD method, comparing the backed-up action value at the root against the value function applied to the root position (something not usually done in a standard MCTS).

I am not at all an expert in the field, not even a computer scientist. Could someone direct me to literature exploring this idea, or explaining why it does not work in practice?
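To make this concrete, here is a rough sketch of the targets I have in mind after a fixed-budget search. Everything below is an illustrative placeholder rather than a tested implementation; the function names, the temperature and the numbers in the example are all made up.

# Illustrative sketch only: turn the action values backed up by a fixed-budget
# MCTS into a softmax target for the policy network, and use the backed-up
# root value as a TD-style target for the value network.

import numpy as np

def policy_target_from_search(q_values, temperature=1.0):
    """Convert backed-up action values Q(root, a) into revised prior
    probabilities with a softmax at the chosen temperature."""
    z = np.asarray(q_values, dtype=np.float64) / temperature
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

def training_targets(root_q_values, root_backup_value, temperature=1.0):
    """One training example per searched position:
       - the policy network is pulled toward the softmax of the search Q values,
       - the value network is pulled toward the value backed up at the root."""
    pi_target = policy_target_from_search(root_q_values, temperature)
    v_target = root_backup_value
    return pi_target, v_target

# Example with made-up numbers: after a 10,000-node search the root has four
# legal moves with these backed-up action values and a backed-up root value of +0.12.
pi, v = training_targets([0.31, 0.25, -0.10, 0.02], 0.12, temperature=0.2)
print(pi, v)

In a full training cycle the policy network would presumably be fitted to pi_target by cross-entropy and the value function to v_target by squared error, but I only sketch the targets here.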
2) PV-MCTS with a policy network temperature gradient?

Another, unrelated (and also naive) question for MCTS & NN aficionados. A move given a very low prior probability by the policy network will not be explored at all, or too late in the search to become the most visited node. This can create blind spots (see Lee Sedol's 'God move' 78, which had a prior value of 1/100,000 according to Aja). Blind spots of the policy network in positions far from the root are probably less harmful than blind spots in the first-level child nodes.

I am wondering whether anyone has ever considered using a gradient of temperature in the softmax layer of the policy network, with the temperature parameter varying with depth in the tree, so that the search is broader in the first levels and becomes narrower in the deepest levels (ultimately turning the search into a rollout to the end of the game for the deepest nodes). The temperature and prior values of a given node would be revised as the game progresses and the depth of that node in the tree decreases. Only the last layer of the network would then need recalculation, and this could be done by the CPU rather than the GPU, as part of the MCTS management. But the price paid for this broadening of the upper part of the tree might be too high and detrimental to the overall strength of the MCTS. After all, 'God moves' are not that common ;-)
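For concreteness, this is roughly what I mean by a depth-dependent temperature: the policy logits for a node are computed once on the GPU, and only the final softmax is re-applied with a new temperature when the node's depth changes. The schedule and all names below are made up for illustration, not a tested implementation.

# Illustrative sketch only: re-temper cached policy logits as a node's depth changes.

import numpy as np

def temperature_for_depth(depth, t_root=2.0, t_min=0.5, decay=0.8):
    """Higher temperature (broader priors) near the root, lower (sharper) deeper down."""
    return max(t_min, t_root * (decay ** depth))

def priors_from_logits(logits, depth):
    """Softmax over cached logits with a depth-dependent temperature.
    Only this cheap final step is recomputed when the node's depth shrinks,
    so it can run on the CPU as part of the tree management."""
    t = temperature_for_depth(depth)
    z = np.asarray(logits, dtype=np.float64) / t
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

# Example: the same logits give broader priors at depth 0 than at depth 6.
logits = np.log([0.70, 0.20, 0.09, 0.01])
print(priors_from_logits(logits, depth=0))
print(priors_from_logits(logits, depth=6))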
Thanks,
Patrick

-------- Original message --------
From: computer-go-requ...@computer-go.org
Date: 11/01/2017 13:00 (GMT+01:00)
To: computer-go@computer-go.org
Subject: Computer-go Digest, Vol 84, Issue 24

Message: 1
Date: Wed, 11 Jan 2017 11:35:41 +0100 (CET)
From: Rémi Coulom <remi.cou...@free.fr>
Subject: Re: [Computer-go] Training the value network (a possibly more efficient approach)

Hi,

Thanks for sharing your idea.

In my experience it is rarely efficient to train value functions from very short-term data (i.e. the next move). TD(lambda), or training from the final outcome of the game, is often better, because it uses a longer horizon. But of course, it is difficult to tell without experiments whether your idea would work or not. The advantage of your idea is that you can collect a lot of training data more easily.

Rémi

----- Original message -----
From: "Bo Peng" <b...@withablink.com>
To: computer-go@computer-go.org
Sent: Tuesday 10 January 2017 23:25:19
Subject: [Computer-go] Training the value network (a possibly more efficient approach)

Hi everyone. It occurs to me there might be a more efficient method to train the value network directly (without using the policy network).

You are welcome to check my method: http://withablink.com/GoValueFunction.pdf

Let me know if there are any silly mistakes :)

Message: 2
Date: Wed, 11 Jan 2017 18:48:59 +0800
From: Bo Peng <b...@withablink.com>
Subject: Re: [Computer-go] Training the value network (a possibly more efficient approach)

Hi Rémi,

Thanks for sharing your experience. As I am writing this, it seems there could be a third method: the perfect value function should have the minimax property in the obvious way, so we can train our value function to satisfy the minimax property as well. In fact, we can train it such that a shallow-level MCTS gives as close a result as possible to a deeper-level MCTS. This can be regarded as some kind of bootstrapping. I wonder whether you have tried this; it seems like it might be a natural idea...

Bo