What I am actually proposing is collapsing the results of the playouts offline and then having a function that maps board positions to playout values without actually doing playouts. So I would use an MC player to generate a lot of training pairs of the form (position, score), where position would be some sort of vector representation of the board position and score would be a single scalar value corresponding to the value the Monte Carlo program decided, after many simulations, that the position had. So I would create something that learned a value function for positions. Then, once training was complete, this function could be evaluated very quickly and would hopefully give the same results that, say, 100k playouts would give. But since it was so fast, the program could use the extra time to do even more playouts. The difference would be that the new playouts would be biased by the value function: the probability of picking a move would be proportional to the value function evaluated on the resulting position. This way I would be bootstrapping a general global position evaluator and using it to improve Monte Carlo simulations.
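To make that concrete, here is a minimal sketch of the offline training step and the biased playout step. The engine interface (encode, legal_moves, play, random_position) and the Monte Carlo scorer mc_score are hypothetical names invented for illustration, not from any existing program, and scikit-learn's MLPRegressor just stands in for whatever function approximator one might actually use.

    # Sketch only: "engine" is any object exposing the hypothetical methods
    # named below; mc_score(pos, n) is assumed to return a scalar MC value.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def build_training_set(engine, mc_score, n_positions, n_playouts=100_000):
        """Collect (position vector, Monte Carlo score) pairs offline."""
        X, y = [], []
        for _ in range(n_positions):
            pos = engine.random_position()       # any source of positions
            X.append(engine.encode(pos))         # vector representation
            y.append(mc_score(pos, n_playouts))  # scalar MC/UCT value
        return np.array(X), np.array(y)

    def train_value_function(X, y):
        """Learn a fast static evaluator that approximates the MC score."""
        net = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=200)
        net.fit(X, y)
        return net

    def biased_playout_move(engine, pos, value_net):
        """Pick a move with probability proportional to the learned value
        of the resulting position (the biasing step described above)."""
        moves = engine.legal_moves(pos)
        children = np.array([engine.encode(engine.play(pos, m)) for m in moves])
        values = np.clip(value_net.predict(children), 1e-6, None)  # keep positive
        probs = values / values.sum()
        return moves[np.random.choice(len(moves), p=probs)]

The expensive part (build_training_set) happens entirely offline; at playout time only one network evaluation per candidate move is needed.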
Imagine if you had a Monte Carlo program that took almost no time to run. You could use it to do "heavy" playouts for another Monte Carlo program, making it even stronger. The reason this might be easier than learning from a database of professional games is that it is easy to generate scalar scores of positions with a Monte Carlo program. Presumably it is also easier to learn how to play like a mediocre Monte Carlo program than like a professional player.

- George

On 5/17/07, Zach Keatts <[EMAIL PROTECTED]> wrote:
What you would have after your training/evaluator phase is heuristic knowledge of possibly better Monte Carlo trees to consider. This would definitely cut down on the search space, but it could also eliminate a strong search path. I have been thinking along these same lines for some time. The problem then lies in deciding which trees are worth looking at initially. What about a database of professional games? Take the "winning" games as examples of strong searches that ended in a win. The problem is even more complex, because where in the "winning tree" do you tell Monte Carlo to start searching? Will you assign a higher probability to each move in those games (defining a known, probabilistically stronger predicted result)? That is one approach. The other approach is the purely simulated one, where you run simulations and gradually allow your probability function to evolve based on the results. Although this is a purer approach, I think the aforementioned strategy may yield some useful experimentation. The strong point is that it would take advantage of the human brain's innate pattern recognition and calculation skills. Since we have recorded games, we have plenty of examples of this "thought process". For a 9-dan winning game, those trees surely are worth investigating...

On 5/16/07, George Dahl <[EMAIL PROTECTED]> wrote:
> I find Monte-Carlo Go a fascinating avenue of research, but what pains me is that a huge number of simulations are performed each game and at the end of the game the results are thrown out. So what I was thinking is that perhaps the knowledge generated by the simulations could be collapsed in some way.
>
> Suppose that epsilon-greedy versions of a reasonably strong MC Go program were to play a very large number of games against themselves. By epsilon-greedy versions I mean that with probability epsilon a random move is played, and with probability 1 - epsilon the move the MC player would normally play is played. Each position in these games would be stored along with the Monte Carlo/UCT evaluation for that position's desirability. This would produce an arbitrarily large database of position/score pairs. At this point a general function approximator / learning algorithm (such as a neural network) could be trained to map positions to scores. If this was successful, it would produce something that could very quickly (even a large neural net evaluation or what have you would be much faster than doing a large number of MC playouts) map positions to scores. Obviously the scores would not be perfect, since the Monte Carlo program did not play anywhere near perfect Go. But this static evaluator could then be plugged back into the Monte Carlo player and used to bias the random playouts. Wouldn't it be useful to be able to quickly estimate the MC score without doing any playouts?
>
> Clearly this idea could be extended recursively with a lot of offline training. What makes this formulation more valuable is that, given enough time and effort, someone familiar with machine learning should be able to produce a learning architecture that can actually learn the MC scores. It would be a straightforward, if potentially quite difficult, supervised learning task with effectively unlimited data, since more could be generated at will. Such a learning architecture could be used in the manner I described above or thrown at the more general reinforcement learning problem.
>
> Does anyone have any thoughts on this idea?
> Does anyone know of it being tried before?
>
> - George
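The epsilon-greedy data generation described in the quoted message could look roughly like the sketch below, under the same caveat as before: the engine and mc_player interfaces (initial_position, is_terminal, legal_moves, play, encode, evaluate, best_move) are hypothetical names used only for illustration.

    # Sketch of epsilon-greedy self-play data collection; the engine and
    # mc_player objects are assumed, not real APIs.
    import random

    def epsilon_greedy_selfplay(engine, mc_player, epsilon=0.1, n_games=1000):
        """Play self-play games where each move is random with probability
        epsilon and the MC player's choice otherwise, recording
        (position vector, MC/UCT score) pairs along the way."""
        dataset = []
        for _ in range(n_games):
            pos = engine.initial_position()
            while not engine.is_terminal(pos):
                score = mc_player.evaluate(pos)          # MC/UCT value of pos
                dataset.append((engine.encode(pos), score))
                if random.random() < epsilon:
                    move = random.choice(engine.legal_moves(pos))
                else:
                    move = mc_player.best_move(pos)
                pos = engine.play(pos, move)
        return dataset

The resulting dataset is exactly the kind of position/score database the quoted message proposes feeding to a supervised learner.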
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/