Hi,

1) Simultaneous policy and value function reinforcement learning by 
MCTS + TD(lambda)?

What makes a good policy network from a 'Policy & Value MCTS' (PV-MCTS) point 
of view (i.e. as in the AlphaGo implementation)?

Referring to the terminology and results of Silver's paper: a greedy policy 
using the RL policy network beat a greedy policy using the SL policy network, 
yet PV-MCTS performed better with the SL policy network than with the RL 
policy network. The authors hypothesized that this is "presumably because 
humans select a diverse beam of promising moves, whereas RL optimizes for the 
single best move".

Tree search is (in practice) necessary to discover what cannot be 'seen' 
immediately by a value network at the root node but becomes 'clearer' later 
on, when evaluating the leaf nodes. Thus, one quality of a policy function 
used to bias the search in an MCTS is a good balance between 'sharpness' 
(being selective) and 'open-mindedness' (giving a chance to some seemingly 
low-value moves which could turn out to be important, avoiding blind spots). 
Silver's paper does not propose an RL method for improving the policy network 
used in the PV-MCTS beyond its initial SL from human games. The RL policy 
network is only used to train the value network.

Value-function RL using n-ply minimax search or TD approaches combined with 
tree search has long been described (e.g. TD-Leaf), but I have failed to find 
a paper proposing direct RL of the policy network from tree-search results, 
more particularly from an MCTS.

Since the policy function is used to bias the selection phase of the MCTS, 
which after a while becomes dominated by the backed-up action values, I have 
the (naive?) feeling that a good policy function, for use in a PV-MCTS, should 
predict prior probabilities as close as possible to the action values obtained 
after some search budget (say a 10,000-node tree), i.e. predict the future 
action-value distribution (subject to a softmax conversion). Conversely, the 
action values obtained after some search budget could be used to train the 
policy function: they would be converted into revised prior probabilities, 
using a softmax with an adequate temperature parameter, and used as the target 
for the policy network.
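
To make this concrete, here is a rough Python sketch of what I have in mind 
(all function and variable names are made up by me, and masking barely-visited 
moves is just one possible choice):

import numpy as np

def policy_target_from_search(q_values, visit_counts, temperature=1.0, min_visits=1):
    # Convert the root action values of a finished search into a softmax
    # target distribution for the policy network. Moves that were barely
    # visited are masked out, since their Q estimates are too noisy.
    q = np.asarray(q_values, dtype=np.float64)
    n = np.asarray(visit_counts, dtype=np.float64)
    q = np.where(n >= min_visits, q, -np.inf)   # exp(-inf) -> probability 0
    logits = q / temperature
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()                          # revised prior probabilities

The policy network would then be trained with a cross-entropy loss between its 
output and this target; the temperature controls how sharp the target is.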

In the same cycle, the value function could be trained by a TD method, 
comparing the backed-up value at the root with the value function's estimate 
for the root position (the value network is usually not evaluated at the root 
in standard MCTS).
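
Something like this, as a minimal sketch (plain Python, hypothetical names; 
with a value network the TD error would of course drive a gradient step on the 
weights rather than being applied directly):

def td_update_value_at_root(v_root_estimate, root_backup_value, alpha=0.01):
    # One TD(0)-style step: move the value function's estimate for the
    # root position toward the value backed up by the search at that root.
    td_error = root_backup_value - v_root_estimate
    return v_root_estimate + alpha * td_error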

I'm not at all an expert in the field, not even a computer scientist. Could 
someone direct me to literature exploring this idea, or explaining why it 
doesn't work in practice?


2) PV-MCTS with a policy-network temperature gradient?

Another, unrelated (and also naive) question for MCTS & NN aficionados:

A move given a very low prior probability by the policy network will not be 
explored at all, or too late in the search for it to become the most visited 
node. This can create blind spots (see Lee Sedol's 'God move' 78, which had a 
prior of about 1/100,000 according to Aja). Blind spots (of the policy 
network) in positions far from the root node are probably less harmful than 
blind spots in the first-level child nodes.
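
As a rough illustration of how crippling a tiny prior is, assuming an 
AlphaGo-style PUCT exploration bonus of the form 
c_puct * P(a) * sqrt(N(s)) / (1 + N(s,a)) (the Q gap and constants below are 
invented for the example):

def parent_visits_needed(prior, q_gap, c_puct=1.0):
    # Parent visit count N(s) at which an unvisited move's exploration
    # bonus c_puct * prior * sqrt(N(s)) first matches a Q advantage of
    # q_gap held by the moves the search is already following.
    return (q_gap / (c_puct * prior)) ** 2

print(parent_visits_needed(prior=1e-5, q_gap=0.1))  # ~1e8 parent visits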

I'm wondering if someone has ever considered using a temperature gradient in 
the softmax layer of the policy network, with the temperature parameter 
varying with depth in the tree, so that the search is broader in the first 
levels and becomes narrower in the deepest levels (ultimately, it would turn 
the search into a rollout to the end of the game for the deepest nodes). The 
temperature and prior values for a given node would be revised as the game 
progresses and the depth of that node in the tree decreases. Only the last 
layer of the NN would need recalculation, and this could be done on the CPU 
rather than the GPU, as part of the MCTS management. But the price paid for 
this broadening of the tree in its upper part might be too high and 
detrimental to the overall MCTS strength. After all, God moves are not that 
common ;-)
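
Still, a minimal sketch of what I mean (Python; the temperature schedule and 
all names are invented), re-running only the final softmax over cached 
last-layer logits with a depth-dependent temperature:

import numpy as np

def priors_with_depth_temperature(last_layer_logits, depth,
                                  t_root=2.0, t_min=0.5, decay=0.9):
    # Broad search near the root (high temperature), sharper search deeper
    # in the tree (temperature decays toward t_min). Only the softmax is
    # recomputed, so it can run on the CPU inside the MCTS whenever a
    # node's depth relative to the current root changes.
    t = max(t_min, t_root * (decay ** depth))
    z = np.asarray(last_layer_logits, dtype=np.float64) / t
    z -= z.max()                    # numerical stability
    p = np.exp(z)
    return p / p.sum()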

Thanks,
Patrick

-------- Original message --------
From: computer-go-requ...@computer-go.org 
Date: 11/01/2017 13:00 (GMT+01:00) 
To: computer-go@computer-go.org 
Subject: Computer-go Digest, Vol 84, Issue 24 


Message: 1
Date: Wed, 11 Jan 2017 11:35:41 +0100 (CET)
From: Rémi Coulom <remi.cou...@free.fr>
To: computer-go@computer-go.org
Subject: Re: [Computer-go] Training the value network (a possibly more
        efficient approach)

Hi,

Thanks for sharing your idea.

In my experience it is rarely efficient to train value functions from very 
short term data (ie, next move). TD(lambda), or training from the final outcome 
of the game is often better, because it uses a longer horizon. But of course, 
it is difficult to tell without experiments whether your idea would work or 
not. The advantage of your ideas is that you can collect a lot of training data 
more easily.

Rémi

----- Original message -----
From: "Bo Peng" <b...@withablink.com>
To: computer-go@computer-go.org
Sent: Tuesday, 10 January 2017, 23:25:19
Subject: [Computer-go] Training the value network (a possibly more efficient approach)


Hi everyone. It occurs to me there might be a more efficient method to train 
the value network directly (without using the policy network). 


You are welcome to check my method: http://withablink.com/GoValueFunction.pdf 


Let me know if there is any silly mistakes :) 



------------------------------

Message: 2
Date: Wed, 11 Jan 2017 18:48:59 +0800
From: Bo Peng <b...@withablink.com>
To: <computer-go@computer-go.org>
Subject: Re: [Computer-go] Training the value network (a possibly more
        efficient approach)

Hi Remi,

Thanks for sharing your experience.

As I am writing this, it seems there could be a third method: the perfect
value function shall have the minimax property in the obvious way. So we
can train our value function to satisfy the minimax property as well. In
fact, we can train it such that a shallow-level MCTS gives as close a
result as a deeper-level MCTS. This can be regarded as some kind of
bootstrapping.
 
Wonder if you have tried this. Seems might be a natural idea...

Bo
