>
>
> Just to clarify: I was not saying that Mogo's policy consisted
> *solely* of looking for patterns around the last move. Merely that
> it does not look for patterns around *every* point, which other
> playout policies (e.g., CrazyStone, if I understand Remi's papers
> correctly) appear to do.
>> . Pebbles has a "Mogo" playout design, where you check
>> for patterns only around the last move (or two).
>>
>
>In MoGo, it's not only around the last move (at least with some probability
>and when there are empty spaces on the board); this is the "fill board"
>modification.
Just to clarify: I was not saying that Mogo's policy consisted *solely*
of looking for patterns around the last move. Merely that it does not
look for patterns around *every* point, which other playout policies
(e.g., CrazyStone, if I understand Remi's papers correctly) appear to do.
> . Pebbles has a "Mogo" playout design, where you check
> for patterns only around the last move (or two).
>
In MoGo, it's not only around the last move (at least with some probability
and when there are empty spaces on the board); this is the "fill board"
modification.
(this provides a big impr
>Is anyone (besides the authors) doing research based on this?
Well, Pebbles does apply reinforcement learning (RL) to improve
its playout policy. But not in the manner described in that paper.
There are practical obstacles to directly applying that paper.
To directly apply that paper, you must h
A web search turned up a 2 page and an 8 page version. I read the
short one. I agree that it's promising work that requires some follow-
up research.
Now that you've read it so many times, what excites you about it? Can
you envision a way to scale it to larger patterns and boards on modern
In future papers they should avoid using a strong authority like Fuego for the
training and instead force it to learn from a naive uniform random playout
policy (with 100x or 1000x more playouts), and then build on that with an
iterative approach (as was suggested in the paper).
I also had anothe
I admit I had trouble understanding the details of the paper. What I
think is the biggest problem for applying this to bigger (up to 19x19)
games is that you somehow need access to the "true" value of a move,
i.e. whether it is a win or a loss. On the 5x5 board they used, this might be
approximated
After about the 5th reading, I'm concluding that this is an excellent paper.
Is anyone (besides the authors) doing research based on this? There is a lot
to do.
David Silver wrote:
Hi everyone,
Please find attached my ICML paper with Gerry Tesauro on automatically
learning a simulation policy for Monte-Carlo Go.
Has anyone tried this algorithm improvement on bigger boards and can give us
a result?
Link to original message:
http://computer-go.org/pipermail/computer-go/2009-April/018159.html
Thanks,
ibd
> > So maybe I could get 600 more Elo points
> > with your method. And even more on 19x19.
> > I notice
Hi,
We used alpha=0.1. There may well be a better setting of alpha, but
this appeared to work nicely in our experiments.
-Dave
On 3-May-09, at 2:01 AM, elife wrote:
Hi Dave,
In your experiments what's the constant value alpha you set?
Thanks.
2009/5/1 David Silver :
Yes, in our experiments they were just constant numbers M=N=100.
Hi Dave,
In your experiments what's the constant value alpha you set?
Thanks.
2009/5/1 David Silver :
> Yes, in our experiments they were just constant numbers M=N=100.
Hi Yamato,
If M and N are the same, is there any reason to run M simulations and
N simulations separately? What happens if you combine them and
calculate
V and g in a single loop?
I think it gives the wrong answer to do it in a single loop. Note that
the simulation outcomes z are used
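For concreteness, a toy illustration of why one combined loop can go wrong
(nothing here is from the paper or from RLGO; the "simulation" is a fake that
just makes the outcome z correlated with the feature term psi, as it would be
in real playouts): when V and g are computed from the same simulations their
noise is correlated, so the product (V* - V) * g picks up a bias that two
separate batches avoid.

    import random

    random.seed(1)

    def sim():
        # one fake "simulation": psi stands in for the summed grad-log-policy
        # term, z for the game outcome; they are correlated on purpose
        psi = random.gauss(0.0, 1.0)
        z = 1.0 if psi > 0.0 else 0.0            # E[z] = 0.5, E[z * psi] > 0
        return z, psi

    V_star, N, trials = 0.5, 10, 100_000         # V_star equals E[z] here, so an
    one_loop = two_loops = 0.0                   # unbiased update averages to ~0

    for _ in range(trials):
        batch = [sim() for _ in range(N)]
        V = sum(z for z, _ in batch) / N
        g_same = sum(z * psi for z, psi in batch) / N
        one_loop += (V_star - V) * g_same        # V and g from the SAME batch

        g_indep = sum(z * psi for z, psi in (sim() for _ in range(N))) / N
        two_loops += (V_star - V) * g_indep      # g from a separate batch

    print(one_loop / trials)     # clearly negative: the combined loop is biased
    print(two_loops / trials)    # close to zero: separate batches are unbiased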
David Silver wrote:
>Yes, in our experiments they were just constant numbers M=N=100.
If M and N are the same, is there any reason to run M simulations and
N simulations separately? What happens if you combine them and calculate
V and g in a single loop?
>Okay, let's continue the example above
IMO other people's equations/code/ideas/papers always seem smarter
than your own. The stuff you understand and do yourself just seems
like common sense, and the stuff you don't always has a mystical air
of complexity, at least until you understand it too :-)
On 30-Apr-09, at 1:59 PM, Michae
Hi Yamato,
Thanks for the detailed explanation.
M, N and alpha are constant numbers, right? What did you set them to?
You're welcome!
Yes, in our experiments they were just constant numbers M=N=100.
The feature vector is the set of patterns you use, with value 1 if a
pattern is matched and 0 otherwise.
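In case a concrete sketch helps, this is one way to picture that setup. The
Pattern class, the state representation and the softmax form of the playout
policy below are assumptions for illustration, not code from the paper or
from RLGO.

    import math

    class Pattern:
        """Placeholder local pattern; `matches` is a stub for illustration."""
        def __init__(self, key):
            self.key = key
        def matches(self, state, move):
            return state.get((move, self.key), False)

    def psi(state, move, patterns):
        """Binary features: 1 if pattern i matches around `move`, else 0."""
        return [1.0 if p.matches(state, move) else 0.0 for p in patterns]

    def playout_policy(state, legal_moves, theta, patterns):
        """Assumed softmax in theta . psi(s, a) over the legal moves."""
        scores = [sum(t * f for t, f in zip(theta, psi(state, m, patterns)))
                  for m in legal_moves]
        mx = max(scores)              # subtract the max for numerical stability
        exps = [math.exp(s - mx) for s in scores]
        total = sum(exps)
        return {m: e / total for m, e in zip(legal_moves, exps)}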
I wish I was smart :(
David Silver wrote:
Hi Remi,
I understood this. What I find strange is that using -1/1 should be
equivalent to using 0/1, but your algorithm behaves differently: it
ignores lost games with 0/1, and uses them with -1/1.
Imagine you add a big constant to z. One million, say.
Hi Remi,
I understood this. What I find strange is that using -1/1 should be
equivalent to using 0/1, but your algorithm behaves differently: it
ignores lost games with 0/1, and uses them with -1/1.
Imagine you add a big constant to z. One million, say. This does not
change the problem. Y
David Silver wrote:
Sorry, I should have made it clear that this assumes that we are
treating black wins as z=1 and white wins as z=0.
In this special case, the gradient is the average of games in which
black won.
But yes, more generally you need to include games won by both sides.
The algori
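A toy check of the point under discussion (made-up numbers, and the estimator
is reduced to its bare likelihood-ratio form g ~ (1/N) sum z_i psi_i, dropping
everything else in the paper): adding a constant to z changes the sample
estimate by a multiple of mean(psi), a term whose expectation is zero but
whose finite-sample value is not.

    import random

    random.seed(0)
    N = 10_000
    psi = [random.gauss(0.0, 1.0) for _ in range(N)]   # stand-in for grad log pi
    z = [1.0 if random.random() < (0.6 if p > 0 else 0.4) else 0.0 for p in psi]

    g_01 = sum(zi * pi for zi, pi in zip(z, psi)) / N  # 0/1: lost games drop out
    c = 1_000_000.0                                    # "add a big constant to z"
    g_shift = sum((zi + c) * pi for zi, pi in zip(z, psi)) / N

    mean_psi = sum(psi) / N
    print(g_01)              # the signal
    print(g_shift - g_01)    # equals c * mean(psi): of order c / sqrt(N), huge
    print(c * mean_psi)      # here, even though its expectation is exactly zero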
Hi Remi,
This is strange: you do not take lost playouts into consideration.
I believe there is a problem with your estimation of the gradient.
Suppose for instance that you count z = +1 for a win, and z = -1 for
a loss. Then you would take lost playouts into consideration. This
makes me a
Rémi Coulom wrote:
The fundamental problem here may be that your estimate of the gradient
is biased by the playout policy. You should probably sample X(s)
uniformly at random to have an unbiased estimator. Maybe this can be
fixed with importance sampling, and then you may get a formula that is
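For readers who have not met it, the generic importance-sampling reweighting
looks like the sketch below; X(s), the sampler q and the statistic f are
stand-ins only, since the message is cut off and it is not clear exactly
which quantity is being estimated.

    import random

    random.seed(0)
    K = 10
    items = list(range(K))
    q = [(i + 1) / 55.0 for i in items]    # non-uniform sampling distribution

    def f(x):
        return float(x * x)                # some statistic of the sampled item

    n = 100_000
    samples = random.choices(items, weights=q, k=n)

    naive = sum(f(x) for x in samples) / n                        # biased by q
    weighted = sum(f(x) * (1.0 / K) / q[x] for x in samples) / n  # reweighted

    print(sum(f(x) for x in items) / K)    # true uniform average: 28.5
    print(naive)                           # biased toward the items q favours
    print(weighted)                        # close to 28.5 again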
David Silver wrote:
2. Run another N simulations, average the value of psi(s,a) over
all positions and moves in games that black won (call this g)
This is strange: you do not take lost playouts into consideration.
I believe there is a problem with your estimation of the gradient.
Suppo
David Silver wrote:
>A: Estimate value V* of every position in a training set, using deep
>rollouts.
>
>B: Repeat, for each position in the training set
> 1. Run M simulations, estimate value of position (call this V)
> 2. Run another N simulations, average the value of psi(s,a) over
>    all positions and moves in games that black won (call this g)
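As a reading check, the two quoted steps wired together might look like the
sketch below. The playout is a stub, theta and psi are scalars only to keep it
short, and the final update line is a guess at how V*, V and g are combined
(it is not in the quoted text); M=N=100 and alpha=0.1 are the values mentioned
elsewhere in the thread.

    import random

    def playout(position, theta):
        # stub: a real playout plays a full game with the theta-weighted policy
        # and returns the outcome plus the summed pattern features psi(s,a)
        psi_sum = random.gauss(0.0, 1.0)
        z = 1.0 if random.random() < 0.5 else 0.0   # 1 = black win, 0 = white win
        return z, psi_sum

    def balancing_step(position, V_star, theta, M=100, N=100, alpha=0.1):
        # Step 1: estimate the position's value V from M simulations.
        V = sum(playout(position, theta)[0] for _ in range(M)) / M
        # Step 2: from another N simulations, average psi over games black won
        # (z is 0/1, so multiplying by z keeps exactly the games black won).
        g = sum(z * psi
                for z, psi in (playout(position, theta) for _ in range(N))) / N
        # Guessed update: nudge theta so the simulated value tracks the deep
        # rollout target V_star; in the real algorithm theta, psi and g are
        # vectors, not scalars.
        return theta + alpha * (V_star - V) * g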
Hi Remi,
What komi did you use for 5x5 and 6x6 ?
I used 7.5 komi for both board sizes.
I find it strange that you get only 70 Elo points from supervised
learning over uniform random. Don't you have any feature for atari
extension ? This one alone should improve strength immensely (extend
stri
David Silver wrote:
Hi Michael,
But one thing confuses me: You are using the value from Fuego's 10k
simulations as an approximation of the actual value of the position.
But isn't the actual
value of the position either a win or a loss? On such small boards,
can't you assume that Fuego is ab
Hi Yamato,
Could you give us the source code which you used? Your algorithm is
too complicated, so it would be very helpful if possible.
Actually I think the source code would be much harder to understand!
It is written inside RLGO, and makes use of a substantial existing
framework that w
But I'm only trying to make a point, not pin the analogy down perfectly.
Naturally the stronger the player, the more likely his moves will conform to
the level of the top players.
The basic principle is that the longer the contest, the more opportunities a
strong player has to demonstrate his supe
David Silver wrote:
>> because the previous approaches were not optimized for such small
>> boards.
>
>I'm not sure what you mean here? The supervised learning and
>reinforcement learning approaches that we compared against are both
>trained on the small boards, i.e. the pattern weights are
From: steve uurtamo
To: computer-go
Sent: Tuesday, April 28, 2009 5:09:20 PM
Subject: Re: [computer-go] Monte-Carlo Simulation Balancing
also, i'm not sure that a lot of most amate
also, i'm not sure that a lot of most amateurs' moves are very
good. the spectrum of bad moves is wide, it's just that it takes
someone many stones stronger to severely punish small differences
between good and nearly-good moves. among players of relatively
similar strength, these differences wil
A simplistic model that helps explain this is golf. On a single hole, even
a casual golfer has a realistic chance of out-golfing Tiger Woods. Tiger
occasionally shoots a 1 over par on some hole and even weak amateurs
occasionally par or even birdie a hole. It's not going to happen a lot,
but
I noticed that, in general, changes in the playout policy have a much
bigger impact on larger boards than on smaller boards.
Rémi
I think rating differences are amplified on larger boards. This is easy to
see if you think about it this way:
Somehow a 19x19 board is like 4 9x9 boards. Let u
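For concreteness, a toy version of this argument (made-up numbers, and the
"4 boards" decomposition is only an analogy): model each sub-contest margin as
Gaussian with a small positive mean for the stronger player, and compare one
contest against the sum of four.

    from math import erf, sqrt, log10

    def norm_cdf(x):
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def elo(p):
        return 400.0 * log10(p / (1.0 - p))

    mu = 0.2                               # per-contest edge, margin ~ N(mu, 1)
    p_small = norm_cdf(mu)                 # win prob on one small board
    p_big = norm_cdf(4 * mu / sqrt(4.0))   # sum of 4 margins is N(4*mu, 4)

    print(p_small, elo(p_small))   # about 0.58, roughly  56 Elo
    print(p_big, elo(p_big))       # about 0.66, roughly 112 Elo: same skill
                                   # gap, about twice the Elo difference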
David Silver wrote:
I don't think the 200+ Elo improvement is so impressive
I agree that it would be much more impressive to report positive
results on larger boards. But perhaps it is already interesting that
tuning the balance of the simulation policy can make a big difference
on small boa
Hi Yamato,
I like your idea, but why do you use only 5x5 and 6x6 Go?
1. Our second algorithm, two-ply simulation balancing, requires a
training set of two-ply rollouts. Rolling out every position from a
complete two-ply search is very expensive on larger board sizes, so we
would probably
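Rough arithmetic for why this blows up (back-of-the-envelope counting only,
and rollouts_per_leaf is a made-up budget, not a number from the paper):

    rollouts_per_leaf = 1000      # hypothetical deep-rollout budget per leaf
    for size in (5, 6, 9, 19):
        points = size * size
        leaves = points * (points - 1)   # roughly: every move times every reply
        print(size, leaves, leaves * rollouts_per_leaf)
    # 5x5   ->     600 leaves, about 6.0e5 deep rollouts per training position
    # 19x19 ->  129960 leaves, about 1.3e8 deep rollouts per training position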
David Silver wrote:
>Please find attached my ICML paper with Gerry Tesauro on automatically
>learning a simulation policy for Monte-Carlo Go. Our preliminary
>results show a 200+ Elo improvement over previous approaches, although
>our experiments were restricted to simple Monte-Carlo search w
Hi Michael,
But one thing confuses me: You are using the value from Fuego's 10k
simulations as an approximation of the actual value of the
position. But isn't the actual
value of the position either a win or a loss? On such small boards,
can't you assume that Fuego is able to correctly de
My favorite part:
"One natural idea is to use the learned simulation policy in Monte-Carlo search, and
generate new deep search values, in an iterative cycle."
But one thing confuses me: You are using the value from Fuego's 10k simulations as an approximation of the actual value of the position.
Hi Remi,
If I understand correctly, your method makes your program 250 Elo
points
stronger than my pattern-learning algorithm on 5x5 and 6x6, by just
learning better weights.
Yes, although this is just in a very simple MC setting.
Also we did not compare directly to the algorithm you used
David Silver wrote:
Hi everyone,
Please find attached my ICML paper with Gerry Tesauro on automatically
learning a simulation policy for Monte-Carlo Go. Our preliminary
results show a 200+ Elo improvement over previous approaches, although
our experiments were restricted to simple Monte-Carlo
Finally! I guess you can add this technique to your list, Lukasz.
David Silver wrote:
Hi everyone,
Please find attached my ICML paper with Gerry Tesauro on automatically
learning a simulation policy for Monte-Carlo Go. Our preliminary results
show a 200+ Elo improvement over previous approa