@Kensuke I suppose all the proposed algorithms, ACT, SEARCH and LEARN, are meant to be used during training, no? I think I understand ACT and LEARN, but I am not sure about SEARCH, for which they say this:
> During search, we propose to stochastically sample actions according to π̄ instead of the deterministic action selection rule of Eq. 1.

This sounds much like the random, temperature-based selection done at the root, but this time applied at internal nodes. Does it mean the pUCT formula (Eq. 1) is not used at all during selection? And why does the selection have to be stochastic now? At each selection step you compute π̄ from (q, π_θ, visit counts), so I would think π̄ already has everything it needs to balance exploration and exploitation. (A rough sketch of my reading of π̄ is at the end of this message.)

On Sun, Jul 19, 2020 at 8:10 AM David Wu <lightvec...@gmail.com> wrote:
> I imagine that at low visits at least, "ACT" behaves similarly to Leela
> Zero's "LCB" move selection, which also has the effect of sometimes
> selecting a move that is not the max-visits move, if its value estimate has
> recently been found to be sufficiently higher to compensate for its lower
> prior and lower visit count (at least, typically, this is why the move
> wouldn't have been the max-visits move in the first place). It also scales
> in an interesting way with the empirically observed playout-by-playout
> variance of moves, but I think by far the most important part is that it
> can use a sufficiently confident high value to override max-visits.
>
> The gain from "LCB" in match play, as I recall, is on the very rough order
> of 100 Elo, although it could be less or more depending on match
> conditions, what neural net is used, and other things. So for LZ at least,
> "ACT"-like behavior at low visits is not new.
>
> On Sun, Jul 19, 2020 at 5:39 AM Kensuke Matsuzaki <knsk.m...@gmail.com>
> wrote:
>>
>> Hi,
>>
>> I couldn't improve Leela Zero's strength by implementing SEARCH and ACT.
>> https://github.com/zakki/leela-zero/commits/regularized_policy
>>
>> On Fri, Jul 17, 2020 at 2:47 Rémi Coulom <remi.cou...@gmail.com> wrote:
>> >
>> > This looks very interesting.
>> >
>> > From a quick glance, it seems the improvement is mainly when the number
>> > of playouts is small. Also, they don't test on the game of Go. Has
>> > anybody tried it?
>> >
>> > I will take a deeper look later.
>> >
>> > On Thu, Jul 16, 2020 at 9:49 AM Ray Tayek <rta...@ca.rr.com> wrote:
>> >>
>> >> https://old.reddit.com/r/MachineLearning/comments/hrzooh/r_montecarlo_tree_search_as_regularized_policy/
>> >>
>> >> --
>> >> Honesty is a very expensive gift. So, don't expect it from cheap
>> >> people - Warren Buffett
>> >> http://tayek.com/
>>
>> --
>> Kensuke Matsuzaki
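To make my question concrete, here is a minimal Python sketch of how I understand π̄ and the SEARCH selection rule. This is not the authors' code; the names (pi_bar, c_puct) and the binary search for alpha are my own, following my reading of Section 3 of the paper:

import numpy as np

def pi_bar(q, prior, visits, c_puct=1.25, iters=50):
    # My reading: pi_bar = argmax_y ( q.y - lambda_N * KL(prior || y) ),
    # whose solution has the form pi_bar(a) = lambda_N * prior(a) / (alpha - q(a)),
    # with alpha chosen so that pi_bar sums to 1 (found by binary search).
    n_total = visits.sum()
    if n_total == 0:
        return prior.copy()  # no statistics yet, fall back to the prior
    lam = c_puct * np.sqrt(n_total) / (len(q) + n_total)  # lambda_N
    lo = np.max(q + lam * prior)  # lower bound on alpha
    hi = np.max(q) + lam          # upper bound on alpha
    for _ in range(iters):
        alpha = 0.5 * (lo + hi)
        pi = lam * prior / np.maximum(alpha - q, 1e-12)
        if pi.sum() > 1.0:
            lo = alpha
        else:
            hi = alpha
    return pi / pi.sum()

def select_child_search(q, prior, visits, rng):
    # SEARCH as I understand it: sample the child from pi_bar
    # instead of taking the deterministic pUCT argmax of Eq. 1.
    return int(rng.choice(len(q), p=pi_bar(q, prior, visits)))

def select_child_puct(q, prior, visits, c_puct=1.25):
    # Eq. 1 (pUCT) for comparison: deterministic argmax.
    u = c_puct * prior * np.sqrt(visits.sum()) / (1.0 + visits)
    return int(np.argmax(q + u))

(q, prior and visits are per-child arrays at one node; rng would be np.random.default_rng().) If that reading is right, then at internal nodes the pUCT argmax is simply replaced by sampling from π̄, and exploration comes from the λ_N weight on the prior rather than from the visit-count bonus, which is what prompted my question about why the sampling has to be stochastic.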
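And regarding the LCB selection David Wu describes above, my rough understanding as a sketch (this is not Leela Zero's actual code; min_visits and z are made-up parameters):

def select_move_lcb(mean_value, value_var, visits, min_visits=10, z=1.96):
    # Rank root moves by a lower confidence bound on their value instead of
    # by visit count, so a low-visit move with a confidently high value
    # can override the max-visits move.
    lcb = mean_value - z * np.sqrt(value_var / np.maximum(visits, 1))
    lcb = np.where(visits >= min_visits, lcb, -np.inf)  # require some support
    if not np.isfinite(lcb).any():
        return int(np.argmax(visits))  # fall back to max visits
    return int(np.argmax(lcb))

The playout-by-playout variance he mentions would enter through value_var here; the key effect seems to be the one he describes, that a confidently high value can beat raw visit count.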
_______________________________________________ Computer-go mailing list Computer-go@computer-go.org http://computer-go.org/mailman/listinfo/computer-go