On Thu, 2007-05-17 at 10:54 -0500, Zach Keatts wrote:
> What you would have after your training/evaluator phase is a heuristic
> knowledge of possibly better Monte Carlo trees to consider. This will
> definitely cut down on the search space, but could also alienate a
> strong search path. I have been thinking along these same lines for
> some time. The problem then lies in deciding which trees would be
> worth looking at initially. What about a database of professional
> games? Take the "winning" games as examples of strong searches that
> ended in a win. The problem is even more complex because where in the
> "winning tree" do you tell Monte Carlo to start searching? Will you
> assign a higher probability to each move in those games (defining a
> known probabilistically stronger predicted result)?
>
> That is one approach. The other approach is the purely simulated
> approach, where you run simulations and gradually allow your
> probability function to evolve based on the results. Although this is
> a purer approach, I think the aforementioned strategy may yield some
> useful experimentation. Its strong point is that it takes advantage of
> the human brain's innate pattern recognition and calculation skills.
> Since we have recorded games, we have plenty of examples of this
> "thought process". For a 9 dan's winning game, those trees surely are
> worth investigating...
I respectfully disagree with using human-played games for extracting pattern knowledge. I also don't believe the moves of a 9 dan player differ significantly from those of a 4 dan player in terms of practical knowledge that can be learned effectively. Rarely will their moves differ significantly in quality, and not in ways that any knowledge-extraction process will be able to gobble up to some huge advantage. It may "work" to some degree, but I don't believe it is the best approach.

The problem is that humans play idiosyncratically. We talked about this a few years ago on this list, but I have many years of experience in trying to glean knowledge from grandmasters for computer chess. The fact of the matter is they don't really represent what needs to be learned for a strong program. They often present a chicken-and-egg problem: they correctly avoid certain lines of play, and they of course know why, but you won't really get to learn why.

Lazarus recently took a pretty nice jump in strength by using a simple pattern-extraction process. I simply culled 3x3 patterns from its OWN games. I used a simple statistical approach, but I think the approach mentioned in the paper is probably much better. Nevertheless, the patterns I extracted came from Lazarus's own games. The search is modeled very similarly to Mogo as documented in the Mogo paper, with just a few differences. However, I used the patterns to find moves that could be safely "vetoed" from both the UCT search and the play-outs. They were basically moves (based on the 3x3 pattern around the move) that were almost never chosen by a version of Lazarus playing at "full strength."

I could have used human games, but humans could easily upset the pattern-extraction process. For instance, if a human was "horsing around" he might easily have caused some very low-urgency patterns (which is what I am looking for) to get eliminated from consideration. Since Lazarus at 9x9 is reasonably strong, I feel confident that the patterns have a certain "consistency" that cannot be duplicated by human-played games. It's just hard to fully trust those games, but Lazarus is a known entity.

One situation to be avoided is extracting patterns near the end of the game, when the result is a foregone conclusion. This is a combination of problems with playing to the bitter end and the UCT/MC scoring approach of maximizing win probability: if the position is already a win for either side, almost any move is playable. This can also be a problem in human games. You don't really know when a position is in cleanup mode and the specific patterns have less importance. Of course, in Japanese-rules games this is not as much of an issue, but I think it still affects the quality of the patterns.

So I only extract patterns from the root position when Lazarus believes there is still a lot of fight left in the game. If the score is too close to either extreme (win or loss) I don't extract patterns. In this way, I know the patterns are not arbitrary moves - there is still a lot of play left in the game and each move is still important. In human games I believe it is much more difficult to control variables like this.

The patterns Lazarus learns are applied to the next generation of the program. The new program is then used to generate further patterns. Even though Lazarus may be significantly stronger after an iteration of this, I don't feel that the quality of the patterns is amazingly different.
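For concreteness, here is a minimal sketch (in Python, and emphatically not Lazarus's actual code) of the kind of statistical 3x3-pattern culling described above: count how often each local pattern occurs as a legal move versus how often the full-strength engine actually plays it, skip positions where the game is already decided, and veto patterns that are essentially never chosen. The board encoding, helper names, and thresholds are illustrative assumptions.

```python
# Sketch only: cull 3x3 patterns from an engine's own games and build a
# "veto" set of patterns the full-strength engine almost never plays.

from collections import defaultdict

EMPTY, BLACK, WHITE, EDGE = 0, 1, 2, 3

def pattern_key(board, x, y, to_move, size):
    """Encode the 3x3 neighbourhood around (x, y) as a small integer key."""
    key = to_move
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            nx, ny = x + dx, y + dy
            c = board[ny][nx] if 0 <= nx < size and 0 <= ny < size else EDGE
            key = key * 4 + c
    return key

# seen[key]   = how often the pattern occurred as a legal move
# chosen[key] = how often the full-strength engine actually played it
seen = defaultdict(int)
chosen = defaultdict(int)

def record_position(board, legal_moves, played_move, to_move, size,
                    win_estimate, lo=0.15, hi=0.85):
    """Accumulate pattern statistics, but only while the game is still
    contested (the win estimate is not too close to 0 or 1)."""
    if not (lo < win_estimate < hi):
        return
    for (x, y) in legal_moves:
        k = pattern_key(board, x, y, to_move, size)
        seen[k] += 1
        if (x, y) == played_move:
            chosen[k] += 1

def veto_set(min_seen=200, max_rate=0.001):
    """Patterns that occurred often but were essentially never chosen."""
    return {k for k, n in seen.items()
            if n >= min_seen and chosen[k] / n <= max_rate}
```

The resulting veto set would then be consulted both when expanding the UCT tree and when picking moves in the play-outs, in the spirit described above.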
I can only hope that it is slightly less likely to consider an awful move due to a bad pattern. It's hard for me to imagine that I would get significantly better results extracting these patterns from human games, unless there is some common pattern that is really bad but that Lazarus thinks is good. This is really not very likely in my opinion, because Lazarus is going to discover the statistically bad patterns. There is a general rule of thumb that I try to apply to game-playing programs: the strength of the program is more about the bad moves than it is about the good moves.

I'm also interested in trying to "bootstrap" knowledge without using external sources. If a program can learn from its own successes and failures, which is the fundamental design principle of Monte Carlo methods anyway, then why can't it progressively learn from better versions of itself?

I have had great success with automated learning in computer chess. I created a chess program with a hand-tuned evaluation function (one that I tuned). Then I played hundreds of self-test games at a fairly deep level. Using a population-based learning algorithm (PBIL in this case), I tuned the 1-ply search to try to return the same score as the deeper search. I took a lot of pressure off the learning algorithm by only requiring it to make small adjustments to existing values. The tuning function tried to produce evaluation weights that were better than my weights.

One can imagine that if your initial program thought a queen was worthless, a deep search would often find that having the queen would win other material (that wasn't worthless). So even if the queen was valued at zero, a deep search would "instinctively" see that it had real value. (I proved this by assigning the queen a value of zero and doing deep searches with and without its physical presence; the deep search is much happier with the queen even though the evaluation function knows nothing about it.)

After the weights converged, I would run a long match against an external program in order to benchmark my program independently; the foreign program is in no way involved in the learning process. Since I am making only small adjustments to the weights, I can apply the new weights and reiterate the learning process, hopefully improving with each pass.

I was amazed at how much improvement this automated tuning effort produced. I had adjusted the level of the external program so that MY program would score about 40 percent of the games, but by the time I had made a few passes (where each pass took 2 or 3 days) I had to increase the level of the external program by a whole ply in order for my program not to dominate it in the benchmarking episodes. The improvement was so noticeable that it was immediately obvious from the quality of the moves when I observed the games manually.

This learning method is a kind of hybrid, closely related to Temporal Difference Learning but with a twist (using genetic-type algorithms). TDL learns by trying to solve the credit assignment problem and applying this to feature weights in an evaluation function. I could have started this by setting all weights to zero. When wins and losses were eventually detected near the end of the games, credit would quickly have been assigned to the evaluation terms that most successfully procured those wins. On subsequent iterations you would see these evaluation terms migrate towards values that gave better playing results.
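For readers unfamiliar with PBIL, here is a minimal sketch of the kind of tuning loop described above: each evaluation weight is allowed a small signed offset from its hand-tuned value, the offsets are encoded as bits, and fitness measures how closely a 1-ply search with the candidate weights reproduces stored deep-search scores. This is a simplified illustration under those assumptions, not the original chess code; eval_1ply and the sample format are hypothetical interfaces.

```python
# PBIL-style sketch: tune small offsets around hand-tuned evaluation weights
# so that a 1-ply search matches stored deep-search scores.

import random

BITS_PER_WEIGHT = 4          # offsets in [-8, +7] around the hand-tuned value
POP_SIZE = 50
LEARN_RATE = 0.1

def decode(bits, base_weights):
    """Turn a bit string into concrete evaluation weights."""
    weights = []
    for i, base in enumerate(base_weights):
        chunk = bits[i * BITS_PER_WEIGHT:(i + 1) * BITS_PER_WEIGHT]
        offset = int("".join(map(str, chunk)), 2) - 2 ** (BITS_PER_WEIGHT - 1)
        weights.append(base + offset)
    return weights

def fitness(weights, samples, eval_1ply):
    """Negative mean squared error between the 1-ply score and the stored
    deep-search score for each sampled position."""
    err = sum((eval_1ply(pos, weights) - deep_score) ** 2
              for pos, deep_score in samples)
    return -err / len(samples)

def pbil_tune(base_weights, samples, eval_1ply, generations=100):
    n_bits = BITS_PER_WEIGHT * len(base_weights)
    prob = [0.5] * n_bits                       # PBIL probability vector
    for _ in range(generations):
        pop = [[1 if random.random() < p else 0 for p in prob]
               for _ in range(POP_SIZE)]
        best = max(pop, key=lambda b: fitness(decode(b, base_weights),
                                              samples, eval_1ply))
        # Nudge the probability vector toward the best individual.
        prob = [(1 - LEARN_RATE) * p + LEARN_RATE * b
                for p, b in zip(prob, best)]
    return decode([1 if p > 0.5 else 0 for p in prob], base_weights)
```

In the process described above, the converged weights would be fed back into the engine and the whole self-play / tuning cycle repeated.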
I'm now wondering if Rémi's paper can be applied to this approach. In his "ELO" approach I would not be trying to rate a given pattern; instead I would be rating the value of a given weight assigned to a specific evaluation term. I think that's how it might work ... a lot to think about.

I would like to apply my approach to Go with learned patterns, but I'm not quite sure how to proceed on this one.

- Don

> On 5/16/07, George Dahl <[EMAIL PROTECTED]> wrote:
> I find Monte-Carlo Go a fascinating avenue of research, but what pains
> me is that a huge number of simulations are performed each game and at
> the end of the game the results are thrown out. So what I was thinking
> is that perhaps the knowledge generated by the simulations could be
> collapsed in some way.
>
> Suppose that epsilon-greedy versions of a reasonably strong MC Go
> program were to play a very large number of games against themselves.
> By epsilon-greedy versions I mean that with probability epsilon a
> random move is played and with probability 1 - epsilon the move the MC
> player would normally play is played. Each position in these games
> would be stored along with the Monte Carlo/UCT evaluation for that
> position's desirability. This would produce an arbitrarily large
> database of position/score pairs. At this point a general function
> approximator / learning algorithm (such as a neural network) could be
> trained to map positions to scores. If this was successful, it would
> produce something that could very quickly map positions to scores
> (even a large neural net evaluation or what have you would be much
> faster than doing a large number of MC playouts). Obviously the scores
> would not be perfect, since the Monte Carlo program did not play
> anywhere near perfect Go. But this static evaluator could then be
> plugged back into the Monte Carlo player and used to bias the random
> playouts. Wouldn't it be useful to be able to quickly estimate the MC
> score without doing any playouts?
>
> Clearly this idea could be extended recursively with a lot of offline
> training. What makes this formulation more valuable is that given
> enough time and effort, someone familiar with machine learning should
> be able to produce a learning architecture that can actually learn the
> MC scores. It would be a straightforward, if potentially quite
> difficult, supervised learning task with effectively unlimited data,
> since more could be generated at will. Such a learning architecture
> could be used in the manner I described above or thrown at the more
> general reinforcement learning problem.
>
> Does anyone have any thoughts on this idea? Does anyone know of it
> being tried before?
>
> - George

_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
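The pipeline George proposes in the quoted message could be sketched roughly as follows. The engine hooks (mc_evaluate, mc_best_move, random_legal_move, features) are assumed interfaces into an existing MC/UCT program, not real APIs, and the linear fit is merely a stand-in for whatever function approximator one would actually use: play epsilon-greedy self-play games, record (position, MC score) pairs, then fit a fast approximator to those scores.

```python
# Sketch of epsilon-greedy data collection from an MC/UCT engine plus a
# simple regression fit; all engine methods are assumed interfaces.

import random

EPSILON = 0.1

def collect(engine, n_games):
    """Harvest (feature vector, MC score) pairs from epsilon-greedy self-play."""
    data = []
    for _ in range(n_games):
        pos = engine.initial_position()
        while not engine.game_over(pos):
            score = engine.mc_evaluate(pos)          # UCT/MC win estimate
            data.append((engine.features(pos), score))
            if random.random() < EPSILON:
                move = engine.random_legal_move(pos)
            else:
                move = engine.mc_best_move(pos)
            pos = engine.play(pos, move)
    return data

def fit_linear(data, epochs=20, lr=0.01):
    """Tiny stand-in for a general function approximator: a linear model
    trained with stochastic gradient descent on squared error."""
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            g = pred - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w

def fast_eval(w, x):
    """Cheap replacement for a full batch of playouts."""
    return sum(wi * xi for wi, xi in zip(w, x))
```

The fitted evaluator could then be used to bias the random play-outs, or queried wherever a quick estimate of the MC score is wanted, and the whole collect/fit cycle iterated as George suggests.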