Hi, when explaining AlphaGo Zero to a machine learning audience yesterday
(https://docs.google.com/presentation/d/1VIueYgFciGr9pxiGmoQyUQ088Ca4ouvEFDPoWpRO4oQ/view), it occurred to me that using MCTS in this setup is actually such a kludge!

Originally, we used MCTS because the repeated simulations kept improving the accuracy of the arm reward estimates. MCTS policies assume stationary distributions, which is violated every time we expand the tree, but that's an okay tradeoff when all you feed into the tree are rewards from simple Bernoulli trials. Moreover, you could argue the evaluations are somewhat monotonic with increasing node depth, since you are basically just fixing a growing prefix of the MC simulation.

But now we expand nodes literally all the time, breaking the stationarity in possibly drastic ways. There are no reevaluations that would improve your estimate, and the input isn't binary but an estimate in a continuous space. Suddenly the multi-armed bandit analogy loses a lot of ground. (I've sketched the contrast between the two regimes below.)

So can't we take the next step and do away with MCTS entirely? Is there a theoretical viewpoint from which it still makes sense as the best policy improvement operator?

And what would you say is the current state-of-the-art game tree search for chess? That's a very unfamiliar world for me; to be honest, all I really know is MCTS...

--
				Petr Baudis, Rossum
	Run before you walk! Fly before you crawl! Keep moving forward!
	If we fail, I'd rather fail really hugely.  -- Moist von Lipwig
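
P.S. To make the contrast concrete, here is a rough sketch of the "classic" MCTS regime I mean: one node expanded per simulation, UCB1 over the arms, and nothing but 0/1 playout outcomes fed back into the tree. The toy subtraction game and all names here are just my own illustration, not taken from any real engine.

# Sketch of classic MCTS: UCB1 selection over Bernoulli playout outcomes.
# Everything here (SubtractionGame, constants) is illustrative only.

import math
import random

class SubtractionGame:
    """Toy game: take 1 or 2 stones from a pile; whoever takes the last stone wins."""
    def __init__(self, stones=7, player=0):
        self.stones, self.player = stones, player
    def legal_moves(self):
        return [m for m in (1, 2) if m <= self.stones]
    def play(self, move):
        return SubtractionGame(self.stones - move, 1 - self.player)
    def is_over(self):
        return self.stones == 0
    def winner(self):
        return 1 - self.player   # the player who just moved took the last stone

class Node:
    def __init__(self, state, move=None, parent=None):
        self.state, self.move, self.parent = state, move, parent
        self.children = []
        self.wins = 0      # Bernoulli successes from the perspective of the player choosing this arm
        self.visits = 0

def ucb1(node, c=1.4):
    # Each re-visit of an arm refines a Bernoulli win-rate estimate.
    return (node.wins / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def playout(state):
    # Pure Monte Carlo rollout: the only signal fed into the tree is a 0/1 result.
    while not state.is_over():
        state = state.play(random.choice(state.legal_moves()))
    return state.winner()

def mcts(root_state, n_simulations=2000):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # 1. Selection: descend fully expanded nodes by UCB1.
        while node.children and all(ch.visits > 0 for ch in node.children):
            node = max(node.children, key=ucb1)
        # 2. Expansion: one node expanded per simulation, not per visit.
        if not node.children and not node.state.is_over():
            node.children = [Node(node.state.play(m), m, node)
                             for m in node.state.legal_moves()]
        if node.children:
            unvisited = [ch for ch in node.children if ch.visits == 0]
            node = random.choice(unvisited) if unvisited else max(node.children, key=ucb1)
        # 3. Simulation and 4. backup of a binary outcome.
        winner = playout(node.state)
        while node is not None:
            node.visits += 1
            if node.parent is not None and winner == node.parent.state.player:
                node.wins += 1
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).move

if __name__ == "__main__":
    print("suggested move:", mcts(SubtractionGame(stones=7)))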
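
And here, for contrast, is a sketch of the AlphaGo Zero-style loop: every leaf is expanded the moment it is first reached, there are no rollouts, and what gets backed up is a continuous value estimate. The PUCT formula is the one from the AlphaGo Zero paper; dummy_network is just a stand-in of mine (uniform priors, random value), and the toy game is repeated from the sketch above so this snippet runs on its own.

# Sketch of AlphaGo Zero-style search: expand on first visit, back up a
# continuous evaluator output.  dummy_network and the toy game are stand-ins.

import math
import random

class SubtractionGame:
    """Same toy game as above: take 1 or 2 stones; taking the last stone wins."""
    def __init__(self, stones=7, player=0):
        self.stones, self.player = stones, player
    def legal_moves(self):
        return [m for m in (1, 2) if m <= self.stones]
    def play(self, move):
        return SubtractionGame(self.stones - move, 1 - self.player)
    def is_over(self):
        return self.stones == 0
    def winner(self):
        return 1 - self.player   # the player who just moved took the last stone

class AZNode:
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the policy head
        self.children = {}        # move -> AZNode
        self.visit_count = 0
        self.value_sum = 0.0      # sum of backed-up continuous values
    def q(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def dummy_network(state):
    """Stand-in for (p, v) = f_theta(s): uniform priors, random value in [-1, 1]."""
    moves = state.legal_moves()
    priors = {m: 1.0 / len(moves) for m in moves} if moves else {}
    return priors, random.uniform(-1.0, 1.0)

def puct_select(node, c_puct=1.5):
    # argmax over a of  Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))
    total = math.sqrt(max(1, node.visit_count))
    return max(node.children.items(),
               key=lambda kv: kv[1].q() + c_puct * kv[1].prior * total / (1 + kv[1].visit_count))

def az_search(root_state, network=dummy_network, n_simulations=800):
    root = AZNode(prior=1.0)
    root.children = {m: AZNode(p) for m, p in network(root_state)[0].items()}
    for _ in range(n_simulations):
        node, state, path = root, root_state, []
        # Selection by PUCT until we reach a node with no children yet.
        while node.children:
            move, node = puct_select(node)
            state = state.play(move)
            path.append(node)
        # Expansion on the *first* visit to every leaf -- the constant
        # reshuffling of the distributions the bandit view assumes.
        if state.is_over():
            value = 1.0 if state.winner() == state.player else -1.0
        else:
            priors, value = network(state)       # continuous estimate, no rollout
            node.children = {m: AZNode(p) for m, p in priors.items()}
        # Backup of a continuous value, sign-flipped every ply; each node stores
        # values from the viewpoint of the player who moved into it.
        v = value
        for n in reversed(path):
            v = -v
            n.visit_count += 1
            n.value_sum += v
        root.visit_count += 1
    return max(root.children.items(), key=lambda kv: kv[1].visit_count)[0]

if __name__ == "__main__":
    print("suggested move:", az_search(SubtractionGame(stones=7)))

Note the difference: in the first sketch extra simulations mostly refine existing win-rate estimates, while in the second every simulation adds fresh children and a fresh evaluation -- which is exactly the non-stationarity I'm complaining about.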