Hi Hiroshi,

> Now I'm making 13x13 selfplay games like in the AlphaGo paper.
> 1. Make a position by sampling moves from the Policy(SL)
> probabilities, starting from the initial position. 2. Play one move
> uniformly at random from the available moves. 3. Play the remaining
> moves with Policy(RL) to the end. (2) usually means it plays a very
> bad move. Maybe it is for making completely different positions? I
> don't understand why this (2) is needed.
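
(Just so we are talking about the same procedure, here is how I read
your three steps, as a minimal Python sketch. Game, sl_policy,
rl_policy and the random cutoff u are my placeholders/assumptions,
not code or names from the paper.)

import random

def make_selfplay_game(sl_policy, rl_policy, max_moves=169):
    # Game, sl_policy.sample_move, game.legal_moves etc. are
    # hypothetical stand-ins for your own board and network code.
    game = Game(size=13)
    # Assumption: the length of the Policy(SL) prefix is drawn at
    # random, so the single random move can land anywhere in the game.
    u = random.randint(1, max_moves)
    # 1. Build a position by sampling moves 1..u-1 from Policy(SL).
    for _ in range(u - 1):
        if game.is_over():
            return game
        game.play(sl_policy.sample_move(game))
    # 2. Play move u uniformly at random from the available moves.
    if not game.is_over():
        game.play(random.choice(game.legal_moves()))
    # 3. Play the remaining moves with Policy(RL) to the end.
    while not game.is_over():
        game.play(rl_policy.sample_move(game))
    return game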

I did not read the AlphaGo paper this way.

I read it as using the RL policy the "usual" way (I would say that
means something like sampling from the net probabilities among the
best 5 moves or so),

but randomizing the opponent uniformly, meaning the opponent's
network weights are taken from an earlier step of the reinforcement
learning run.

Meaning, e.g., step 10000 playing against step 7645 in the
reinforcement history?

Or did I understand you wrong?
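
In code, my reading would be roughly this (again only a sketch:
Game, move_probabilities, move_number and rl_checkpoints are
placeholder names, and the top-5 cutoff is just my guess):

import random

def sample_top_k_move(policy, game, k=5):
    # Keep the k most probable moves and sample among them,
    # weighted by the net probabilities.
    probs = policy.move_probabilities(game)  # {move: probability}
    top = sorted(probs.items(), key=lambda mp: mp[1], reverse=True)[:k]
    moves, weights = zip(*top)
    return random.choices(moves, weights=weights)[0]

def play_rl_game(current_policy, rl_checkpoints):
    # The opponent's weights are drawn uniformly from the
    # reinforcement history, e.g. step 10000 against step 7645.
    opponent = random.choice(rl_checkpoints)
    game = Game(size=13)
    players = [current_policy, opponent]
    while not game.is_over():
        policy = players[game.move_number() % 2]
        game.play(sample_top_k_move(policy, game))
    return game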


Detlef