On Thu, 2007-05-17 at 10:54 -0500, Zach Keatts wrote:
> What you would have after your training/evaluator phase is a heuristic
> knowledge of possibly better Monte Carlo trees to consider. This will
> definitely cut down on the search space, but could also alienate a
> strong search path. I have been thinking along these same lines for
> some time. The problem then lies in deciding which trees would be
> worth looking at initially. What about a database of professional
> games? Take the "winning" games as examples of strong searches that
> ended in a win. The problem is even more complex because where in the
> "winning tree" do you tell Monte Carlo to start searching? Will you
> assign a higher probability to each move in those games (defining a
> known probabilistically stronger predicted result)?
>
> That is one approach. The other approach is the purely simulated
> approach, where you run simulations and gradually allow your
> probability function to evolve based on the results. Although this is
> a purer approach, I think the aforementioned strategy may yield some
> useful experimentation. Its strong point is that it takes advantage of
> the human brain's innate pattern recognition and calculation skills.
> Since we have recorded games, we have plenty of examples of this
> "thought process". For a 9 dan's winning game, those trees surely are
> worth investigating...
I respectfully disagree with using human-played games for extracting pattern knowledge. I also don't believe the moves of a 9 dan player differ significantly from those of a 4 dan player in terms of practical knowledge that can be learned effectively. Rarely will their moves differ significantly in quality, and not in ways that any knowledge-extraction process will be able to gobble up to some huge advantage. It may "work" to some degree, but I don't believe it is the best approach.

The problem is that humans play idiosyncratically. We talked about this a few years ago on this list, but I have many years of experience in trying to glean knowledge from grandmasters for computer chess. The fact of the matter is they don't really represent what needs to be learned for a strong program. They often present a chicken-and-egg problem: they correctly avoid certain lines of play, and they of course know why, but you won't really get to learn why.

Lazarus recently took a pretty nice jump in strength by using a simple pattern-extraction process. I simply culled 3x3 patterns from its OWN games. I used a simple statistical approach, but I think the approach mentioned in the paper is probably much better. Nevertheless, the patterns I extracted came from Lazarus's own games. The search is modeled very similarly to Mogo as documented in the Mogo paper, with just a few differences. However, I used the patterns to find moves that could be safely "vetoed" from both the UCT search and the play-outs. They were basically moves (based on the 3x3 pattern around the move) that were almost never chosen by a version of Lazarus playing at "full strength."

I could have used human games, but humans could easily upset the pattern-extraction process. For instance, if a human was "horsing around" he might easily have caused some very low-urgency patterns (which is what I am looking for) to get eliminated from consideration. Since Lazarus at 9x9 is reasonably strong, I feel confident that the patterns have a certain "consistency" that cannot be duplicated by human-played games. It's just hard to fully trust those games, but Lazarus is a known entity.

One situation to be avoided is extracting patterns near the end of the game, when the result is a foregone conclusion. This is a combination of problems with playing to the bitter end and the UCT/MC scoring approach of maximizing win probability: if the position is already a win for either side, almost any move is playable. This can also be a problem in human games. You don't really know when a position is in cleanup mode and the specific patterns have less importance. Of course, in Japanese-rules games this is not as much of an issue, but I think it still affects the quality of the patterns.

So I only extract patterns from the root position when Lazarus believes there is still a lot of fight left in the game. If the score is too close to either extreme (win or loss) I don't extract patterns. In this way, I know the patterns are not arbitrary moves - there is still a lot of play left in the game and each move is still important. In human games I believe it is much more difficult to control variables like this.

The patterns Lazarus learns are applied to the next generation of the program. The new program is then used to generate further patterns. Even though Lazarus may be significantly stronger after an iteration of this, I don't feel that the quality of the patterns is amazingly different.
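For concreteness, here is a minimal sketch (in Python, and emphatically not Lazarus's actual code) of the kind of statistical 3x3-pattern culling described above: count how often each local pattern occurs as a legal move versus how often the full-strength engine actually plays it, skip positions where the game is already decided, and veto patterns that are essentially never chosen. The board encoding, helper names, and thresholds are illustrative assumptions.

```python
# Sketch only: cull 3x3 patterns from an engine's own games and build a
# "veto" set of patterns the full-strength engine almost never plays.

from collections import defaultdict

EMPTY, BLACK, WHITE, EDGE = 0, 1, 2, 3

def pattern_key(board, x, y, to_move, size):
    """Encode the 3x3 neighbourhood around (x, y) as a small integer key."""
    key = to_move
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            nx, ny = x + dx, y + dy
            c = board[ny][nx] if 0 <= nx < size and 0 <= ny < size else EDGE
            key = key * 4 + c
    return key

# seen[key]   = how often the pattern occurred as a legal move
# chosen[key] = how often the full-strength engine actually played it
seen = defaultdict(int)
chosen = defaultdict(int)

def record_position(board, legal_moves, played_move, to_move, size,
                    win_estimate, lo=0.15, hi=0.85):
    """Accumulate pattern statistics, but only while the game is still
    contested (the win estimate is not too close to 0 or 1)."""
    if not (lo < win_estimate < hi):
        return
    for (x, y) in legal_moves:
        k = pattern_key(board, x, y, to_move, size)
        seen[k] += 1
        if (x, y) == played_move:
            chosen[k] += 1

def veto_set(min_seen=200, max_rate=0.001):
    """Patterns that occurred often but were essentially never chosen."""
    return {k for k, n in seen.items()
            if n >= min_seen and chosen[k] / n <= max_rate}
```

The resulting veto set would then be consulted both when expanding the UCT tree and when picking moves in the play-outs, in the spirit described above.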
I can only hope that it is slightly less likely to consider an awful move due to a bad pattern. It's hard for me to imagine that I would get significantly better results extracting these patterns from human games, unless there is some common pattern that is really bad but that Lazarus thinks is good. This is really not very likely in my opinion, because Lazarus is going to discover the statistically bad patterns. There is a general rule of thumb that I try to apply to game-playing programs: the strength of the program is more about the bad moves than it is about the good moves.

I'm also interested in trying to "bootstrap" knowledge without using external sources. If a program can learn from its own successes and failures, which is the fundamental design principle of Monte Carlo methods anyway, then why can't it progressively learn from better versions of itself?

I have had great success with automated learning in computer chess. I created a chess program with a hand-tuned evaluation function (one that I tuned). Then I played hundreds of self-test games at a fairly deep level. Using a population-based learning algorithm (PBIL in this case), I tuned the 1-ply search to try to return the same score as the deeper search. I took a lot of pressure off the learning algorithm by only requiring it to make small adjustments to existing values. The tuning function tried to produce evaluation weights that were better than my weights.

One can imagine that if your initial program thought a queen was worthless, a deep search would often find that having the queen would win other material (that wasn't worthless). So even if the queen was valued at zero, a deep search would "instinctively" see that it had real value. (I proved this by assigning the queen a value of zero and doing deep searches with and without its physical presence; the deep search is much happier with the queen even though the evaluation function knows nothing about it.)

After the weights converged, I would run a long match against an external program in order to benchmark my program independently; the foreign program is in no way involved in the learning process. Since I am making only small adjustments to the weights, I can apply the new weights and reiterate the learning process, hopefully improving with each pass.

I was amazed at how much improvement this automated tuning effort produced. I had adjusted the level of the external program so that MY program would score about 40 percent of the games, but by the time I had made a few passes (where each pass took 2 or 3 days) I had to increase the level of the external program by a whole ply in order for my program not to dominate it in the benchmarking episodes. The improvement was so noticeable that it was immediately obvious from the quality of the moves when I observed the games manually.

This learning method is a kind of hybrid, closely related to Temporal Difference Learning but with a twist (using genetic-type algorithms). TDL learns by trying to solve the credit assignment problem and applying this to feature weights in an evaluation function. I could have started this by setting all weights to zero. When wins and losses were eventually detected near the end of the games, credit would quickly have been assigned to the evaluation terms that most successfully procured those wins. On subsequent iterations you would see these evaluation terms migrate towards values that gave better playing results.
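For readers unfamiliar with PBIL, here is a minimal sketch of the kind of tuning loop described above: each evaluation weight is allowed a small signed offset from its hand-tuned value, the offsets are encoded as bits, and fitness measures how closely a 1-ply search with the candidate weights reproduces stored deep-search scores. This is a simplified illustration under those assumptions, not the original chess code; eval_1ply and the sample format are hypothetical interfaces.

```python
# PBIL-style sketch: tune small offsets around hand-tuned evaluation weights
# so that a 1-ply search matches stored deep-search scores.

import random

BITS_PER_WEIGHT = 4          # offsets in [-8, +7] around the hand-tuned value
POP_SIZE = 50
LEARN_RATE = 0.1

def decode(bits, base_weights):
    """Turn a bit string into concrete evaluation weights."""
    weights = []
    for i, base in enumerate(base_weights):
        chunk = bits[i * BITS_PER_WEIGHT:(i + 1) * BITS_PER_WEIGHT]
        offset = int("".join(map(str, chunk)), 2) - 2 ** (BITS_PER_WEIGHT - 1)
        weights.append(base + offset)
    return weights

def fitness(weights, samples, eval_1ply):
    """Negative mean squared error between the 1-ply score and the stored
    deep-search score for each sampled position."""
    err = sum((eval_1ply(pos, weights) - deep_score) ** 2
              for pos, deep_score in samples)
    return -err / len(samples)

def pbil_tune(base_weights, samples, eval_1ply, generations=100):
    n_bits = BITS_PER_WEIGHT * len(base_weights)
    prob = [0.5] * n_bits                       # PBIL probability vector
    for _ in range(generations):
        pop = [[1 if random.random() < p else 0 for p in prob]
               for _ in range(POP_SIZE)]
        best = max(pop, key=lambda b: fitness(decode(b, base_weights),
                                              samples, eval_1ply))
        # Nudge the probability vector toward the best individual.
        prob = [(1 - LEARN_RATE) * p + LEARN_RATE * b
                for p, b in zip(prob, best)]
    return decode([1 if p > 0.5 else 0 for p in prob], base_weights)
```

In the process described above, the converged weights would be fed back into the engine and the whole self-play / tuning cycle repeated.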
I'm now wondering if Rémi's paper can be applied to this approach. In his "ELO" approach I would not be trying to rate a given pattern; instead I would be rating the value of a given weight assigned to a specific evaluation term. I think that's how it might work ... a lot to think about.

I would like to apply my approach to Go with learned patterns, but I'm not quite sure how to proceed on this one.

- Don

> On 5/16/07, George Dahl <[EMAIL PROTECTED]> wrote:
> I find Monte-Carlo Go a fascinating avenue of research, but what pains
> me is that a huge number of simulations are performed each game and at
> the end of the game the results are thrown out. So what I was thinking
> is that perhaps the knowledge generated by the simulations could be
> collapsed in some way.
>
> Suppose that epsilon-greedy versions of a reasonably strong MC Go
> program were to play a very large number of games against themselves.
> By epsilon-greedy versions I mean that with probability epsilon a
> random move is played and with probability 1 - epsilon the move the MC
> player would normally play is played. Each position in these games
> would be stored along with the Monte Carlo/UCT evaluation for that
> position's desirability. This would produce an arbitrarily large
> database of position/score pairs. At this point a general function
> approximator / learning algorithm (such as a neural network) could be
> trained to map positions to scores. If this was successful, it would
> produce something that could very quickly map positions to scores
> (even a large neural net evaluation or what have you would be much
> faster than doing a large number of MC playouts). Obviously the scores
> would not be perfect, since the Monte Carlo program did not play
> anywhere near perfect Go. But this static evaluator could then be
> plugged back into the Monte Carlo player and used to bias the random
> playouts. Wouldn't it be useful to be able to quickly estimate the MC
> score without doing any playouts?
>
> Clearly this idea could be extended recursively with a lot of offline
> training. What makes this formulation more valuable is that given
> enough time and effort, someone familiar with machine learning should
> be able to produce a learning architecture that can actually learn the
> MC scores. It would be a straightforward, if potentially quite
> difficult, supervised learning task with effectively unlimited data,
> since more could be generated at will. Such a learning architecture
> could be used in the manner I described above or thrown at the more
> general reinforcement learning problem.
>
> Does anyone have any thoughts on this idea? Does anyone know of it
> being tried before?
>
> - George

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
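The pipeline George proposes in the quoted message could be sketched roughly as follows. The engine hooks (mc_evaluate, mc_best_move, random_legal_move, features) are assumed interfaces into an existing MC/UCT program, not real APIs, and the linear fit is merely a stand-in for whatever function approximator one would actually use: play epsilon-greedy self-play games, record (position, MC score) pairs, then fit a fast approximator to those scores.

```python
# Sketch of epsilon-greedy data collection from an MC/UCT engine plus a
# simple regression fit; all engine methods are assumed interfaces.

import random

EPSILON = 0.1

def collect(engine, n_games):
    """Harvest (feature vector, MC score) pairs from epsilon-greedy self-play."""
    data = []
    for _ in range(n_games):
        pos = engine.initial_position()
        while not engine.game_over(pos):
            score = engine.mc_evaluate(pos)          # UCT/MC win estimate
            data.append((engine.features(pos), score))
            if random.random() < EPSILON:
                move = engine.random_legal_move(pos)
            else:
                move = engine.mc_best_move(pos)
            pos = engine.play(pos, move)
    return data

def fit_linear(data, epochs=20, lr=0.01):
    """Tiny stand-in for a general function approximator: a linear model
    trained with stochastic gradient descent on squared error."""
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            g = pred - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w

def fast_eval(w, x):
    """Cheap replacement for a full batch of playouts."""
    return sum(wi * xi for wi, xi in zip(w, x))
```

The fitted evaluator could then be used to bias the random play-outs, or queried wherever a quick estimate of the MC score is wanted, and the whole collect/fit cycle iterated as George suggests.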