DeepMind has published a number of papers on how to stabilize RL strategies
in a landscape of nontransitive cycles. See
https://papers.nips.cc/paper/2018/file/cdf1035c34ec380218a8cc9a43d438f9-Paper.pdf

I haven't fully digested the paper, but what I'm getting from it is that if
you want your evaluation to be less dependent on the particular population
of agents you're evaluating against, you should first compute a max-entropy
Nash equilibrium over that population, and then evaluate each agent against
the equilibrium distribution.

To give a concrete example from the paper, imagine the CRPSS - the Computer
Rock Paper Scissors Server. Imagine there are currently 4 bots connected: a
Rock-only bot, a Paper-only bot, and two Scissors-only bots. The max-entropy
Nash equilibrium is 1/3 Rock, 1/3 Paper, and 1/6 for each Scissors bot. So
the duplicated Scissors bots are naturally detected and their impact on the
rating distribution is negated. With CGOS's current evaluation scheme, the
Rock bot would appear to have a higher Elo score, because it has more
opportunities to beat up on the two Scissors bots.
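
To make that concrete, here is a rough sketch (my own construction, not code
from the paper) that computes the max-entropy Nash equilibrium for that
four-bot population with scipy, and compares each bot's average score against
the Nash mixture with its score against the uniform population:

import numpy as np
from scipy.optimize import minimize

# Antisymmetric payoff matrix, rows/cols = [Rock, Paper, Scissors#1, Scissors#2],
# entry [i][j] = average score of bot i against bot j (win +1, loss -1, draw 0).
A = np.array([
    [ 0, -1,  1,  1],   # Rock
    [ 1,  0, -1, -1],   # Paper
    [-1,  1,  0,  0],   # Scissors #1
    [-1,  1,  0,  0],   # Scissors #2
], dtype=float)
n = len(A)

def neg_entropy(p):
    q = np.clip(p, 1e-12, 1.0)
    return float(np.sum(q * np.log(q)))

# For a symmetric zero-sum game the value is 0, and p is a Nash (maximin)
# strategy iff no pure strategy scores more than 0 against it: A @ p <= 0.
# Among all such p, take the one with maximum entropy (minimum neg-entropy).
res = minimize(
    neg_entropy,
    x0=np.full(n, 1.0 / n),
    bounds=[(0.0, 1.0)] * n,
    constraints=[
        {"type": "eq",   "fun": lambda p: np.sum(p) - 1.0},
        {"type": "ineq", "fun": lambda p: -(A @ p)},   # elementwise A p <= 0
    ],
    method="SLSQP",
)
p_star = res.x
print("max-entropy Nash:", np.round(p_star, 3))       # ~ [0.333 0.333 0.167 0.167]
print("score vs uniform:", A.mean(axis=1))            # Rock looks best: +0.25
print("score vs Nash:   ", np.round(A @ p_star, 3))   # all ~0: duplicates negated

Against the uniform population Rock scores +0.25 per game and Paper -0.25,
which is exactly the distortion described above; against the Nash mixture
every bot scores 0, so the duplicate Scissors bots no longer inflate Rock's
rating.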

The paper also proposes a vector extension to Elo that can better predict
outcomes under these nontransitive cycles.
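
If I've understood that part right, the extension (which the paper calls
multidimensional Elo) gives each agent a small vector c_i on top of its
scalar rating r_i, and predicts P(i beats j) = sigmoid(r_i - r_j + c_i^T
Omega c_j), where the antisymmetric Omega term can carry cycles that a
single scalar cannot. A toy sketch with made-up numbers (my own
illustration, not the paper's code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def melo_win_prob(r_i, c_i, r_j, c_j):
    # cross term c_i^T Omega c_j with Omega = [[0, 1], [-1, 0]]
    cross = c_i[0] * c_j[1] - c_i[1] * c_j[0]
    return sigmoid(r_i - r_j + cross)

# Hand-picked illustrative parameters: equal scalar ratings, with the cycle
# carried entirely by c-vectors spaced 120 degrees apart.
r = {"rock": 0.0, "paper": 0.0, "scissors": 0.0}
c = {"rock":     np.array([1.0, 0.0]),
     "scissors": np.array([-0.5,  np.sqrt(3) / 2]),
     "paper":    np.array([-0.5, -np.sqrt(3) / 2])}

for winner, loser in [("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")]:
    p = melo_win_prob(r[winner], c[winner], r[loser], c[loser])
    print(f"P({winner} beats {loser}) = {p:.2f}")   # ~0.70 for each matchup

With identical scalar ratings, plain Elo predicts 50% for every matchup, so
the whole Rock > Scissors > Paper > Rock cycle is carried by the vector part.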

Given that what we have is (at a macro level) duplication of various bot
lineages, and (at a micro level) rock-paper-scissors relationships between
bots in sharp openings, this paper seems quite relevant.


On Sat, Jan 23, 2021 at 5:34 AM Darren Cook <dar...@dcook.org> wrote:

> > ladders, not just liberties. In that case, yes! If you outright tell the
> > neural net as an input whether each ladder works or not (doing a short
> > tactical search to determine this), or something equivalent to it, then
> > the net will definitely make use of that information, ...
>
> Each convolutional layer should spread the information across the board.
> I think alpha zero used 20 layers? So even 3x3 filters would tell you
> about the whole board - though the signal from the opposite corner of
> the board might end up a bit weak.
>
> I think we can assume it is doing that successfully, because otherwise
> we'd hear about it losing lots of games in ladders.
>
> > something the first version of AlphaGo did (before they tried to make it
> > "zero") and something that many other bots do as well. But Leela Zero and
> > ELF do not do this, because of attempting to remain "zero", ...
>
> I know that zero-ness was very important to DeepMind, but I thought the
> open-source dedicated Go bots that copied it did so because AlphaGo Zero
> was stronger than AlphaGo Master after 21-40 days of training. I.e., in
> the rarefied atmosphere of super-human play, that starter package of
> human expert knowledge was considered a weight around its neck.
>
> BTW, I agree that feeding the results of tactical search in would make
> stronger programs, all else being equal. But it is branching code, so
> much slower to parallelize.
>
> Darren
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
