On Fri, Mar 8, 2019 at 8:19 AM Darren Cook <dar...@dcook.org> wrote:
> > Blog post:
> > https://blog.janestreet.com/accelerating-self-play-learning-in-go/
> > Paper: https://arxiv.org/abs/1902.10565
>
> I read the paper, and really enjoyed it: lots of different ideas being
> tried. I was especially satisfied to see figure 12 and the big
> difference giving some go features made.

Thanks, glad you enjoyed the paper.
> Though it would be good to see figure 8 shown in terms of wall clock
> time, on equivalent hardware. How much extra computation do all the
> extra ideas add? (Maybe it is in the paper, and I missed it?)

I suspect Leela Zero would come off far *less* favorably if one tried to do such a comparison using its actual existing code rather than abstracting down to counting neural net evals, because as far as I know Leela Zero has no cross-game batching of neural net evaluations, which makes a huge difference in how efficiently you can use a strong GPU. Only in the last couple of months or so, based on what I've been seeing in chat and pull requests, has Leela Zero implemented within-search batching of neural net evals, and clients still only play one game at a time. (A rough sketch of what I mean by cross-game batching is below, after the cost rundown.)

But maybe this is a distraction from your actual question, which is: how much do these extra things slow the process down in computational time, given both equivalent hardware *and* an equivalently good architecture? Mostly they cost almost nothing, which is why the paper doesn't really focus on that question. The thing to keep in mind is that almost all of the cost is the GPU-bound evaluation of the convolutional layers in the neural net during self-play.

* Ownership and score distribution are not used during self-play (except optionally ownership at the root node for a minor heuristic), so they don't contribute to self-play cost. Even on the training side they are only a slight cost (at most a few percent), since they are just some computations at the output head of the neural net, needing vastly fewer floating-point ops than the convolutions in the main trunk of the net.

* Global pooling costs nothing, since in my implementation it does not add new convolutions, only re-purposes existing channels. It actually *reduces* the number of parameters in the model and (I believe) the nominal number of floating-point ops, since re-purposing some channels to be pooled reduces the number of channels feeding into the convolution of the next layer. This is offset by the cost of doing the pooling and the additional GPU calls, netting out to about zero cost. (See the second sketch below.)

* Multi-board-size masking is also very cheap if your GPU implementation fuses the mask with the adjacent batch-norm bias+scale operations. (Third sketch below.)

* Go-specific input features add about a 10% cost on my hardware when using the 10b128c net, due to a combination of the ladder search being not cheap and the extra IO you have to do to the GPU to communicate the features. It's presumably closer to 5% for 15b192c, and it continues to shrink for larger nets, as the CPU and IO cost becomes more and more irrelevant relative to the growing proportion of GPU work.

* Playout/visit cap oscillation is just a change of some root-level search parameters, and target pruning is just some cheap CPU postprocessing. Writing down the various additional targets somewhat expands the size of the training data on disk, but that is pretty cheap with a good SSD.

I think nothing else adds any cost.
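For concreteness, here is a rough illustrative sketch (plain Python, not the actual code of either program) of what I mean by cross-game batching; `net_fn` here is just a stand-in for whatever actually runs the net on the GPU:

    import queue
    import threading

    BATCH_SIZE = 16  # whatever batch size the GPU handles well

    class BatchedEvaluator:
        """Pools neural net eval requests from many concurrent self-play games
        so the GPU sees real batches instead of batch size 1."""

        def __init__(self, net_fn):
            self.net_fn = net_fn           # net_fn(list_of_positions) -> list_of_outputs
            self.requests = queue.Queue()  # pending (position, result_holder, done_event)

        def evaluate(self, position):
            """Called from a game/search thread; blocks until the batched result arrives."""
            done = threading.Event()
            holder = {}
            self.requests.put((position, holder, done))
            done.wait()
            return holder["out"]

        def serve_forever(self):
            """GPU worker loop: drain up to BATCH_SIZE pending requests, run them in one call."""
            while True:
                batch = [self.requests.get()]  # block for at least one request
                while len(batch) < BATCH_SIZE and not self.requests.empty():
                    batch.append(self.requests.get_nowait())
                outputs = self.net_fn([pos for pos, _, _ in batch])
                for (_, holder, done), out in zip(batch, outputs):
                    holder["out"] = out
                    done.set()

Each self-play game runs its own search and calls evaluate() at its leaves; without something pooling requests like this, every game ends up making its own batch-size-1 GPU calls.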
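To make the global pooling point concrete, a minimal numpy sketch of the general shape of the idea (simplified; among other things the real pooling also includes a max component and a board-size-scaled mean, and this is not the actual code):

    import numpy as np

    def global_pooling_bias(x, w_pool):
        """x: activations of shape (channels, board, board).
        The first c_pool channels are re-purposed: pooled over the whole board
        and mapped through a small matrix w_pool of shape (c_pool, c_regular)
        into per-channel biases added to the remaining channels. Only those
        remaining channels feed the next convolution, which is why the
        parameter count goes down rather than up."""
        c_pool = w_pool.shape[0]
        pooled_part, regular_part = x[:c_pool], x[c_pool:]
        pooled = pooled_part.mean(axis=(1, 2))         # (c_pool,)
        biases = w_pool.T @ pooled                     # (c_regular,)
        return regular_part + biases[:, None, None]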
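Similarly for masking, a minimal numpy sketch of what fusing the mask into the batch-norm scale+bias means (again illustrative, not the actual GPU kernel):

    import numpy as np

    def fused_bn_and_mask(x, mean, var, gamma, beta, mask, eps=1e-5):
        """x: (channels, max_board, max_board); mask: 0/1 over spatial positions,
        zero outside the actual board. Batch-norm at inference time is just a
        per-channel scale and bias, so the mask multiply folds into the same
        elementwise pass and costs essentially nothing extra."""
        scale = gamma / np.sqrt(var + eps)   # per-channel
        bias = beta - mean * scale           # per-channel
        return (x * scale[:, None, None] + bias[:, None, None]) * mask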
> > I found some other interesting results, too - for example contrary to
> > intuition built up from earlier-generation MCTS programs in Go,
> > putting significant weight on score maximization rather than only
> > win/loss seems to help.
>
> Score maximization in self-play means it is encouraged to play more
> aggressively/dangerously, by creating life/death problems on the board.
> A player of similar strength doesn't know how to exploit the weaknesses
> left behind. (One of the asymmetries of go?)

Note that testing games were between random pairs of players chosen with probability proportional to p(1-p), where p is the predicted win probability from the estimated Elos. Even for a 200 Elo difference, p = 0.76 and p(1-p) = 0.18, which is not that much smaller than the p(1-p) = 0.25 you get at p = 0.5. So quite a lot of the testing games were between players of fairly different strengths.
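If you want to check the arithmetic, assuming the standard logistic Elo model (which is what gives the p = 0.76 figure above):

    # Pairing weight p*(1-p) under the standard logistic Elo model.
    def win_prob(elo_diff):
        return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

    for diff in (0, 100, 200, 400):
        p = win_prob(diff)
        print(f"Elo diff {diff:>3}: p = {p:.2f}, pairing weight p*(1-p) = {p*(1-p):.2f}")
    # Elo diff   0: p = 0.50, pairing weight p*(1-p) = 0.25
    # Elo diff 100: p = 0.64, pairing weight p*(1-p) = 0.23
    # Elo diff 200: p = 0.76, pairing weight p*(1-p) = 0.18
    # Elo diff 400: p = 0.91, pairing weight p*(1-p) = 0.08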