I tried to find those papers once, too, and I also failed. It seems to
predate Internet publishing. Some key results were proved in the 1970s.
I think the following ideas sketch a proof:
- sample each arm infinitely often
- such that the fraction of effort going to sub-optimal arms
decreases to zero
Now look at the form of Hoeffding's inequality:
http://en.wikipedia.org/wiki/Hoeffding's_inequality. The UCB exploration
term is essentially the functional inverse of the right-hand side: fix a
failure probability, solve the tail bound for the deviation t, and you get
the confidence radius that UCB adds to the empirical mean. That is why it
works for bandit problems.
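
To make the inversion concrete, here is a minimal sketch in Python (the
function name is mine; the n_total**-4 failure probability follows Auer
et al.'s UCB1 analysis):

import math

def ucb1_index(mean_reward, n_arm, n_total):
    # Hoeffding: P(true mean > empirical mean + t) <= exp(-2 * n_arm * t**2).
    # Pick the failure probability to be n_total**-4 (as in UCB1) and
    # solve for t:
    #     exp(-2 * n_arm * t**2) = n_total**-4
    #     t = sqrt(2 * ln(n_total) / n_arm)
    # The result is exactly the UCB1 exploration term.
    return mean_reward + math.sqrt(2.0 * math.log(n_total) / n_arm)

Play each arm once, then always pull the arm with the largest index; the
confidence radius shrinks just fast enough that sub-optimal arms get
sampled only logarithmically often.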
To go from bandit theory to UCT, you have to show that recursive application
of the same process converges, which is the proof that you see in the Auer
paper.
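
For concreteness, here is a toy sketch of that recursion (my own
illustrative code, single-player and max-only; real UCT also alternates
min/max levels and handles terminal positions): each node runs UCB1 over
its children, and the reward a child reports is itself produced by
running UCB1 one level further down. Showing that the backed-up values
still concentrate despite this recursion is what the convergence proof
has to do.

import math

class Node:
    def __init__(self, children=None, payout=None):
        self.children = children or []  # internal node: list of Nodes
        self.payout = payout            # leaf: callable returning a reward
        self.visits = 0
        self.total = 0.0

    def ucb_child(self):
        # UCB1 over the children; visit each child once first.
        for c in self.children:
            if c.visits == 0:
                return c
        logn = math.log(self.visits)
        return max(self.children,
                   key=lambda c: c.total / c.visits
                                 + math.sqrt(2.0 * logn / c.visits))

    def simulate(self):
        # The recursive step: the "arm pull" at this level is a whole
        # bandit process at the next level down.
        self.visits += 1
        if self.payout is not None:
            reward = self.payout()
        else:
            reward = self.ucb_child().simulate()
        self.total += reward
        return reward

For example, with two leaf children whose payouts are Bernoulli(0.4) and
Bernoulli(0.6), repeated calls to root.simulate() concentrate the visits
on the better leaf.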
Note that there are solutions to the bandit problem that would not be
solutions for MCTS, because they do not converge recursively. E.g., you can
make an epsilon-greedy policy that works for a bandit problem ("works" ==
zero asymptotic regret), but would not converge to optimal in an MCTS
setting.
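
A sketch of such a policy (the c * n_arms / n schedule mirrors the
epsilon_n-greedy analysis in Auer et al. 2002; the constant and the code
are only illustrative): epsilon decays like 1/n, so every arm is still
pulled infinitely often while the fraction of exploratory pulls goes to
zero, which gives vanishing average regret on the bandit.

import random

def eps_greedy_bandit(pull, n_arms, n_steps, c=5.0):
    # pull(arm) -> reward in [0, 1].
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for n in range(1, n_steps + 1):
        eps = min(1.0, c * n_arms / n)  # decaying exploration rate
        if random.random() < eps:
            arm = random.randrange(n_arms)                   # explore
        else:
            arm = max(range(n_arms), key=means.__getitem__)  # exploit
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    return means, counts

Used at interior tree nodes, though, the exploitative pulls just follow
whatever currently looks best there, and nothing forces the averages that
back up through the tree toward the optimal value, which is the failure
mode described above.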
Hope this helps.
Brian
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Petr Baudis
Sent: Wednesday, October 26, 2011 8:51 AM
To: [email protected]
Subject: Re: [Computer-go] Multi-armed bandit problem theory
On Wed, Oct 26, 2011 at 01:56:03PM +0200, "Ingo Althöfer" wrote:
> Not a direct answer, but some bit of information:
> Bandit theory started in the early 1950s with Herbert Robbins
> (the same Robbins as in the 1985 paper). However, he did
> not prove the best possible bounds in that seminal paper.
Yes, I actually have a copy of that paper, but it doesn't seem to help
me better understand the later results.
Petr "Pasky" Baudis
_______________________________________________
Computer-go mailing list
[email protected]
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go