Re: [Computer-go] Monte-Carlo Tree Search as Regularized Policy Optimization

Kensuke Matsuzaki Sun, 19 Jul 2020 17:45:02 -0700

I used stochastic sampling at internal nodes because of this passage:
> During the forward simulation phase of SEARCH, the action at each node x is 
> selected by sampling a ∼ π¯(·|x).
> As a result, the full imaginary trajectory is generated consistently 
> according to policy π¯.


> In this section, we establish our main claim namely that AlphaZero’s action 
> selection criteria can be interpreted as approximating the solution to a 
> regularized policy-optimization objective.

I think they are saying that UCT and PUCT are approximations of sampling
directly from π¯, but I haven't fully understood section 3.
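
For reference, here is a minimal Python sketch of what sampling from π¯ at a
node could look like, next to the deterministic PUCT rule of Eq. 1. It assumes
the paper's closed form is π¯(a) = λ_N · π_θ(a) / (α − q(a)), with
λ_N = c · sqrt(Σ_b n_b) / (|A| + Σ_b n_b) and α found by bisection so that π¯
sums to 1. The function names, the constant c_puct = 1.25 and the example
numbers are hypothetical, and this is not the code from the leela-zero branch.

```python
import math
import random


def lambda_n(visits, c_puct=1.25):
    """Regularization weight; shrinks as the node gets more visits (assumed form)."""
    total = sum(visits)
    return c_puct * math.sqrt(total) / (len(visits) + total)


def pi_bar(q, prior, visits, c_puct=1.25, iters=60):
    """Approximate pi_bar from (q, pi_theta, n_visits).

    Assumes every prior entry is strictly positive (e.g. a softmax output),
    so that alpha - q[a] stays positive during the bisection.
    """
    lam = lambda_n(visits, c_puct)
    if lam == 0.0:
        # No visits yet: q carries no information, so fall back to the prior.
        return list(prior)
    # alpha lies between these bounds; the sum of pi_bar decreases in alpha.
    lo = max(qa + lam * pa for qa, pa in zip(q, prior))
    hi = max(q) + lam
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(lam * pa / (mid - qa) for qa, pa in zip(q, prior))
        if total > 1.0:
            lo = mid  # alpha too small, probabilities sum to more than 1
        else:
            hi = mid
    probs = [lam * pa / (hi - qa) for qa, pa in zip(q, prior)]
    norm = sum(probs)  # remove the tiny residual bisection error
    return [p / norm for p in probs]


def sample_action(probs, rng=random):
    """Stochastic selection at an internal node: a ~ pi_bar(.|x)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def puct_select(q, prior, visits, c_puct=1.25):
    """Deterministic PUCT selection, shown only for comparison."""
    sqrt_total = math.sqrt(sum(visits))
    scores = [qa + c_puct * pa * sqrt_total / (1 + na)
              for qa, pa, na in zip(q, prior, visits)]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    q = [0.1, 0.5, 0.3]        # hypothetical value estimates
    prior = [0.5, 0.2, 0.3]    # hypothetical network policy pi_theta
    visits = [10, 5, 3]        # hypothetical visit counts
    probs = pi_bar(q, prior, visits)
    print("pi_bar:", probs)
    print("sampled action:", sample_action(probs))
    print("PUCT action:", puct_select(q, prior, visits))
```

In this sketch π¯ is recomputed from the current statistics on every visit, so
the ratio π¯(a)/π_θ(a) grows with q(a), and the pull toward the prior weakens
as λ_N shrinks with more visits.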

On Mon, Jul 20, 2020 at 2:51 AM Daniel <dsha...@gmail.com> wrote:
>
> @Kensuke I suppose all the proposed algorithms ACT, SEARCH and LEARN are 
> meant to be used during training, no?
> I think I understand ACT and LEARN but I am not sure about SEARCH for which 
> they say this:
>
> > During search, we propose to stochastically sample actions according to π¯ 
> > instead of the deterministic action selection rule of Eq. 1.
>
> This sounds much like the random selection done at the root with temperature, 
> but this time applied at internal nodes.
> Does it mean the pUCT formula is not used? Why does the selection have to be 
> stochastic now?
> On selection, you compute π_bar every time from (q, π_theta, n_visits) so I 
> suppose π_bar has everything it needs to balance exploration and exploitation.
>
>
> On Sun, Jul 19, 2020 at 8:10 AM David Wu <lightvec...@gmail.com> wrote:
>>
>> I imagine that at low visits at least, "ACT" behaves similarly to Leela 
>> Zero's "LCB" move selection, which also has the effect of sometimes 
>> selecting a move that is not the max-visits move, if its value estimate has 
>> recently been found to be sufficiently larger to balance the fact that it is 
>> lower prior and lower visits (at least, typically, this is why the move 
>> wouldn't have been the max visits move in the first place). It also scales 
>> in an interesting way with the empirically observed playout-by-playout
>> variance of moves, but I think by far the most important part is that it can
>> use a sufficiently confident high value to override max-visits.
>>
>> The gain from "LCB" in match play I recall is on the very very rough order 
>> of 100 Elo, although it could be less or more depending on match conditions 
>> and what neural net is used and other things. So for LZ at least, "ACT"-like 
>> behavior at low visits is not new.
>>
>>
>> On Sun, Jul 19, 2020 at 5:39 AM Kensuke Matsuzaki <knsk.m...@gmail.com> 
>> wrote:
>>>
>>> Hi,
>>>
>>> I couldn't improve leela zero's strength by implementing SEARCH and ACT.
>>> https://github.com/zakki/leela-zero/commits/regularized_policy
>>>
>>> On Fri, Jul 17, 2020 at 2:47 AM Rémi Coulom <remi.cou...@gmail.com> wrote:
>>> >
>>> > This looks very interesting.
>>> >
>>> > From a quick glance, it seems the improvement is mainly when the number 
>>> > of playouts is small. Also they don't test on the game of Go. Has anybody 
>>> > tried it?
>>> >
>>> > I will take a deeper look later.
>>> >
>>> > On Thu, Jul 16, 2020 at 9:49 AM Ray Tayek <rta...@ca.rr.com> wrote:
>>> >>
>>> >> https://old.reddit.com/r/MachineLearning/comments/hrzooh/r_montecarlo_tree_search_as_regularized_policy/
>>> >>
>>> >>
>>> >> --
>>> >> Honesty is a very expensive gift. So, don't expect it from cheap people 
>>> >> - Warren Buffett
>>> >> http://tayek.com/
>>> >>
>>> >> _______________________________________________
>>> >> Computer-go mailing list
>>> >> Computer-go@computer-go.org
>>> >> http://computer-go.org/mailman/listinfo/computer-go
>>> >
>>>
>>>
>>>
>>> --
>>> Kensuke Matsuzaki
>>
>



-- 
Kensuke Matsuzaki
