[Computer-go] Reinforcement learning of move predictor in MCTS

ChtiGo via Computer-go Fri, 10 Feb 2017 00:27:13 -0800

A question / thought on the move predictor used to bias search in MCTS:


The policy network used as the move recommendation function in MCTS, following 
the AlphaGo Nature paper, is optimized by SL to predict expert moves. This 
policy can then be optimized by RL to win games (in greedy play mode). However, 
an MCTS agent using this RL policy for move recommendation performs worse than 
an MCTS agent using the SL policy. This raises the question of how to go beyond 
move predictors learnt from human experts. 

  

Is there any reinforcement learning method to directly optimize a move 
recommendation function (i.e. towards the goal of making the corresponding MCTS 
agent stronger)? 
From an RL theoretical point of view, is it possible to define a target for the 
move recommendation function in an MCTS agent? This may also depend on how the 
prior probabilities are used to bias the search in MCTS. 

  

Quoting Silver et al. about their variant of PUCT used in the selection phase 
of AG: "this search control strategy initially prefers actions with high prior 
probability and low visit count, but asymptotically prefers actions with high 
action-value". Thus, would it be possible to use action-values estimated after 
some MCTS search budget (let's say 10000 nodes, to illustrate) as targets for 
optimizing the move recommendation function (e.g. through conversion by a 
softmax)? 
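
To make the idea concrete, here is a minimal sketch, not the AlphaGo implementation. It assumes the selection rule is argmax over Q(s,a) + u(s,a) with u(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)). The function names, the value of c_puct, the softmax temperature and all the numbers are illustrative assumptions.

```python
import math

def puct_scores(priors, visits, q_values, c_puct=5.0):
    """Selection score Q(s,a) + u(s,a) for each move (c_puct is an assumed value)."""
    total_visits = sum(visits)
    return [
        q + c_puct * p * math.sqrt(total_visits) / (1 + n)
        for p, n, q in zip(priors, visits, q_values)
    ]

def softmax_targets(q_values, temperature=0.1):
    """Turn searched action-values into a target distribution (temperature is hypothetical)."""
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    norm = sum(weights)
    return [w / norm for w in weights]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Made-up numbers for three candidate moves.
priors = [0.6, 0.3, 0.1]
q_after_search = [0.1, 0.6, 0.4]

# Few visits: the prior term dominates the score.
early = puct_scores(priors, [1, 1, 1], [0.0, 0.0, 0.0])
# Many visits: the exploration term shrinks and the action-value dominates.
late = puct_scores(priors, [5000, 5000, 5000], q_after_search)

# Candidate training target for the move predictor.
targets = softmax_targets(q_after_search)
print(argmax(early), argmax(late), targets)
```

With these numbers the early selection follows the prior and the late selection follows the action-value, which is the behaviour the quote describes. How sharp the target should be (the temperature) is an open choice here.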

  

I am aware this might be a naive approach with many pitfalls: 
- it remains to be proven that such a policy would make the whole MCTS perform 
better; moving away from human-learnt predictions might just weaken the MCTS 
agent; 
- reinforcement of the move predictor may not be a key issue for the strength 
of today's MCTS programs; 
- "asymptotically prefers actions with high action-value" might be just a 
misleading perspective, because of the word "asymptotic" ;-) 
- generating pairs of prior / post probabilities for SL would be very 
expensive in practice, making this totally intractable. 
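
For the last item, the prior / post pairs would presumably feed an ordinary supervised loss. A minimal sketch, assuming a softmax policy trained by cross-entropy against the search-derived distribution; the function names and the numbers are made up for illustration.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of the predicted move distribution against the target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def logit_gradient(target, predicted):
    """Gradient of that loss w.r.t. the logits of a softmax policy: predicted - target."""
    return [p - t for t, p in zip(target, predicted)]

# One hypothetical training pair for a position with three legal moves.
prior = [0.6, 0.3, 0.1]  # move predictor output before search
post = [0.1, 0.7, 0.2]   # distribution derived from the search results

loss = cross_entropy(post, prior)
grad = logit_gradient(post, prior)
print(loss, grad)
```

A gradient descent step on these logits would raise the probability of the second move and lower that of the first. Each such pair costs one full search, which is where the expense comes from.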

  

Patrick

_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
