Why does this pose a problem? Presumably the Monte Carlo evaluator will give the same position a similar score, given enough time. That would just produce a duplicate training pattern, or two training patterns with identical input and slightly different output. I guess I don't quite understand what the issue would be.
- George

On 5/17/07, Daniel Burgos <[EMAIL PROTECTED]> wrote:
But it is very unlikely that a board position will be repeated between games. I don't see how you would use the training pairs in the new games.

2007/5/17, George Dahl <[EMAIL PROTECTED]>:
>
> What I am actually proposing is collapsing the results of the playouts
> offline and then having a function that maps board positions to
> playout values without actually doing playouts. So I would use an MC
> player to generate a lot of training pairs of the form (position,
> score), where position would be some sort of vector representation of
> the board position and score would be a single scalar value
> corresponding to the value the Monte Carlo program, after many
> simulations, decided the position had. So I would create something
> that learned a value function for positions. Then, once training was
> complete, this function could be evaluated very quickly and hopefully
> give the same results that, say, 100k playouts would give. But since
> it was so fast, the program could use the extra time to do more
> playouts. The difference would be that the new playouts would be
> biased by the value function: the probability of picking a move would
> be proportional to the value function evaluated on the resulting
> position. This way I would be bootstrapping a general global position
> evaluator and using it to improve Monte Carlo simulations.
>
> Imagine if you had a Monte Carlo program that took almost no time to
> run. You could use it to do "heavy" playouts for another Monte Carlo
> program to make it even stronger.
>
> The reason this might be easier than learning from a database of
> professional games is that it is easy to generate scalar scores of
> positions with a Monte Carlo program. Presumably it is also easier to
> learn how to play like a mediocre Monte Carlo program than like a
> professional player.
>
> - George
>
> On 5/17/07, Zach Keatts <[EMAIL PROTECTED]> wrote:
> > What you would have after your training/evaluator phase is heuristic
> > knowledge of possibly better Monte Carlo trees to consider. This
> > will definitely cut down on the search space, but it could also
> > eliminate a strong search path. I have been thinking along these
> > same lines for some time. The problem then lies in deciding which
> > trees would be worth looking at initially. What about a database of
> > professional games? Take the "winning" games as examples of strong
> > searches that ended in a win. The problem is even more complex,
> > because where in the "winning tree" do you tell Monte Carlo to start
> > searching? Will you assign a higher probability to each move in
> > those games (defining a known, probabilistically stronger predicted
> > result)?
> >
> > That is one approach. The other is the purely simulated approach,
> > where you run simulations and gradually allow your probability
> > function to evolve based on the results. Although this is a purer
> > approach, I think the aforementioned strategy may yield some useful
> > experimentation. Its strong point is that it takes advantage of the
> > human brain's innate pattern recognition and calculation skills.
> > Since we have recorded games, we have plenty of examples of this
> > "thought process". For a 9-dan winning game, those trees are surely
> > worth investigating...
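The biasing step George describes amounts to sampling playout moves in proportion to the learned evaluation of the positions they lead to. A minimal sketch in Python, assuming hypothetical value_net and apply_move hooks (neither name comes from the thread):

import random

def biased_playout_move(position, legal_moves, value_net, apply_move):
    # Score each candidate move by the learned value of the resulting
    # position; clamp at a small positive floor so every legal move
    # keeps a nonzero chance of being sampled.
    scores = [max(value_net(apply_move(position, m)), 1e-6)
              for m in legal_moves]
    # Roulette-wheel selection: probability proportional to score.
    r = random.uniform(0.0, sum(scores))
    acc = 0.0
    for move, score in zip(legal_moves, scores):
        acc += score
        if r <= acc:
            return move
    return legal_moves[-1]  # guard against floating-point rounding

In practice one might sharpen or flatten the distribution (e.g. a softmax with a temperature), but proportional sampling is the scheme as George states it.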
> >
> > On 5/16/07, George Dahl <[EMAIL PROTECTED]> wrote:
> > > I find Monte-Carlo Go a fascinating avenue of research, but what
> > > pains me is that a huge number of simulations are performed each
> > > game and at the end of the game the results are thrown out. So what
> > > I was thinking is that perhaps the knowledge generated by the
> > > simulations could be collapsed in some way.
> > >
> > > Suppose that epsilon-greedy versions of a reasonably strong MC Go
> > > program were to play a very large number of games against
> > > themselves. By epsilon-greedy versions I mean that with probability
> > > epsilon a random move is played, and with probability 1 - epsilon
> > > the move the MC player would normally play is played. Each position
> > > in these games would be stored along with the Monte Carlo/UCT
> > > evaluation of that position's desirability. This would produce an
> > > arbitrarily large database of position/score pairs. At this point a
> > > general function approximator / learning algorithm (such as a
> > > neural network) could be trained to map positions to scores. If
> > > this were successful, it would produce something that could map
> > > positions to scores very quickly (even a large neural net
> > > evaluation, or what have you, would be much faster than doing a
> > > large number of MC playouts). Obviously the scores would not be
> > > perfect, since the Monte Carlo program did not play anywhere near
> > > perfect Go. But this static evaluator could then be plugged back
> > > into the Monte Carlo player and used to bias the random playouts.
> > > Wouldn't it be useful to be able to quickly estimate the MC score
> > > without doing any playouts?
> > >
> > > Clearly this idea could be extended recursively with a lot of
> > > offline training. What makes this formulation more valuable is
> > > that, given enough time and effort, someone familiar with machine
> > > learning should be able to produce a learning architecture that can
> > > actually learn the MC scores. It would be a straightforward, if
> > > potentially quite difficult, supervised learning task with
> > > effectively unlimited data, since more could be generated at will.
> > > Such a learning architecture could be used in the manner I
> > > described above or thrown at the more general reinforcement
> > > learning problem.
> > >
> > > Does anyone have any thoughts on this idea? Does anyone know of it
> > > being tried before?
> > >
> > > - George
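The data-generation loop George sketches is simple to state concretely. A minimal sketch, again in Python, assuming a hypothetical game interface (new_game, legal_moves, play, over, encode) and hypothetical mc_best_move / mc_evaluate functions standing in for the MC/UCT player; none of these names come from the thread:

import random

def generate_pairs(new_game, mc_best_move, mc_evaluate,
                   num_games, epsilon=0.1):
    """Epsilon-greedy self-play: record (encoded position, MC score)
    pairs for later supervised training."""
    pairs = []
    for _ in range(num_games):
        game = new_game()
        while not game.over():
            # Store the position together with its MC/UCT evaluation.
            pairs.append((game.encode(), mc_evaluate(game)))
            # With probability epsilon play a random legal move,
            # otherwise the move the MC player would normally play.
            if random.random() < epsilon:
                move = random.choice(game.legal_moves())
            else:
                move = mc_best_move(game)
            game.play(move)
    return pairs

Any regression learner (a neural network, as George suggests, or anything else) could then be fit to the resulting pairs; and as George notes above, repeated positions with slightly different scores are harmless, since the learner effectively averages them.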
_______________________________________________
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/