Hi,

  when explaining AlphaGo Zero to a machine learning audience yesterday

(https://docs.google.com/presentation/d/1VIueYgFciGr9pxiGmoQyUQ088Ca4ouvEFDPoWpRO4oQ/view)

it occurred to me that using MCTS in this setup is actually such
a kludge!

  Originally, we used MCTS because the repeated simulations kept
improving the accuracy of the arm reward estimates.  MCTS bandit
policies assume stationary reward distributions, an assumption that is
violated every time we expand the tree, but that's an okay tradeoff if
all you feed into the tree are rewards from simple Bernoulli trials
(playout wins and losses).  Moreover, you could argue the evaluations
are somewhat monotonic with increasing node depth, since you are
basically just fixing a growing prefix of the MC simulation.
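
  To make the contrast concrete, here is a minimal Python sketch of that
classic bandit view (illustrative only, not code from any real engine;
the Arm class, ucb1() and the c constant are just placeholders): each
child is an arm, playouts return Bernoulli win/loss rewards, and UCB1
trades the empirical mean off against an exploration bonus that shrinks
as an arm gets re-sampled.

    import math
    import random

    class Arm:
        def __init__(self):
            self.wins = 0      # sum of Bernoulli playout rewards
            self.visits = 0    # number of simulations through this arm

    def ucb1(arm, parent_visits, c=1.4):
        if arm.visits == 0:
            return float("inf")           # sample every arm at least once
        mean = arm.wins / arm.visits      # empirical win rate
        bonus = c * math.sqrt(math.log(parent_visits) / arm.visits)
        return mean + bonus

    # toy usage: repeated binary simulations keep sharpening the estimates
    arms = [Arm() for _ in range(3)]
    true_p = [0.4, 0.5, 0.6]              # hidden playout win probabilities
    for t in range(1, 1001):
        i = max(range(3), key=lambda k: ucb1(arms[k], t))
        reward = 1 if random.random() < true_p[i] else 0
        arms[i].wins += reward
        arms[i].visits += 1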

  But now we expand a node on literally every simulation, possibly
breaking the stationarity in drastic ways.  There are no re-evaluations
that would improve your estimate of a leaf.  The input isn't binary but
a value estimate in a continuous space.  Suddenly the multi-armed bandit
analogy loses a lot of ground.
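
  For comparison, here is a rough sketch of how I read the AlphaGo Zero
variant (paraphrased from the paper; the names and the c_puct constant
are mine): each leaf gets a single continuous value-net estimate, that
one number is backed up along the path, and selection uses the
prior-weighted PUCT rule rather than UCB1 (no log term, no repeated
re-sampling of a leaf).

    import math

    class Node:
        def __init__(self, prior):
            self.prior = prior       # P(s, a) from the policy head
            self.value_sum = 0.0     # sum of backed-up value estimates
            self.visits = 0

        def q(self):
            return self.value_sum / self.visits if self.visits else 0.0

    def puct(child, parent_visits, c_puct=1.5):
        # continuous mean value plus a prior-scaled exploration bonus
        u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
        return child.q() + u

    def backup(path, v):
        # a single network evaluation propagates up; the leaf is never re-rolled
        for node in reversed(path):
            node.value_sum += v
            node.visits += 1
            v = -v               # flip sign at each ply (zero-sum game)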

  Therefore, can't we take the next step and do away with MCTS
altogether?  Is there a theoretical viewpoint from which it still makes
sense as the best policy improvement operator?
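
  (By "policy improvement operator" I mean the role the search plays in
the AlphaGo Zero training loop, as I understand the paper: its output is
just the root visit-count distribution sharpened by a temperature, and
the policy net is regressed toward that.  A tiny sketch of the target:)

    def improved_policy(visit_counts, tau=1.0):
        # pi'(a) proportional to N(a)^(1/tau); tau -> 0 approaches argmax
        powered = [n ** (1.0 / tau) for n in visit_counts]
        total = sum(powered)
        return [p / total for p in powered]

    # e.g. root visit counts [10, 70, 20] -> [0.1, 0.7, 0.2] at tau = 1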

  What would you say is the current state-of-the-art game tree search
for chess?  That's a very unfamiliar world for me; to be honest, all I
really know is MCTS...

-- 
                                        Petr Baudis, Rossum
        Run before you walk! Fly before you crawl! Keep moving forward!
        If we fail, I'd rather fail really hugely.  -- Moist von Lipwig
