Hi Hiroshi,
> Now I'm making 13x13 selfplay games like the AlphaGo paper.
> 1. Make a position by Policy(SL) probability from the initial position.
> 2. Play a move uniformly at random from the available moves.
> 3. Play the remaining moves with Policy(RL) to the end.
> (2) means it usually plays a very bad move. Maybe that is because it
> creates a completely different position? I don't understand why this
> (2) is needed.

I did not read the AlphaGo paper like this. I read it as using the RL
policy in the "usual" way (by which I mean something like randomizing
over the net probabilities of the best 5 moves or so), but randomizing
the opponent uniformly, meaning the opponent's net weights are taken
from an earlier step of the reinforcement learning. So e.g. step 10000
playing against step 7645 in the reinforcement history? Or did I
understand you wrong?

Detlef
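P.S. A rough Python sketch of the reading I describe above, in case it
helps. This is only my interpretation, not the paper's actual
procedure, and policy_probs, initial_position, play, game_over and the
rl_history checkpoint list are placeholder names, not real code:

import random

def sample_top5(net, position):
    # The "usual" randomization: restrict to the 5 moves the net rates
    # highest and sample among them by their renormalized probabilities.
    probs = policy_probs(net, position)   # placeholder: move -> probability
    top5 = sorted(probs, key=probs.get, reverse=True)[:5]
    return random.choices(top5, weights=[probs[m] for m in top5], k=1)[0]

def selfplay_game(current_net, rl_history):
    # The opponent is randomized uniformly over earlier RL checkpoints,
    # e.g. step 10000 playing against step 7645.
    opponent = random.choice(rl_history)
    position = initial_position()          # placeholder helpers below
    to_move = current_net
    while not game_over(position):
        position = play(position, sample_top5(to_move, position))
        to_move = opponent if to_move is current_net else current_net
    return position

The point would be that the move-level randomization stays close to the
net's own preferences, while the diversity between games comes from
pairing the current net against uniformly chosen earlier checkpoints,
rather than from injecting a uniformly random move into each game.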