You are absolutely right; I was thinking in terms of the RL policy network training and assumed everything referred to that, sorry.
On 21.11.2016 at 15:22, Gian-Carlo Pascutto wrote:
> On 20-11-16 11:16, Detlef Schmicker wrote:
>> Hi Hiroshi,
>>
>>> Now I'm making 13x13 selfplay games like in the AlphaGo paper.
>>> 1. Make a position by sampling from the Policy(SL) probabilities,
>>> starting from the initial position.
>>> 2. Play one move uniformly at random from the available moves.
>>> 3. Play the remaining moves with Policy(RL) to the end.
>>> Step (2) usually means playing a very bad move. Maybe that is because
>>> it creates a completely different position? I don't understand why
>>> this step (2) is needed.
>>
>> I did not read the AlphaGo paper like this.
>>
>> I read it as using the RL policy the "usual" way (I would say that
>> means something like randomizing with the net probabilities over the
>> best 5 moves or so),
>>
>> but randomizing the opponent uniformly, meaning the network weights of
>> the opponent are taken from an earlier step in the reinforcement
>> learning.
>>
>> Meaning e.g.
>>
>> step 10000 playing against step 7645 in the reinforcement history?
>>
>> Or did I understand you wrong?
>
> You are confusing the Policy Network RL procedure with the Value Network
> data production.
>
> For the Value Network the procedure is indeed as described, with one
> move at time U being uniformly sampled from {1, ..., 361} until it is
> legal. I think that's because we're not interested (only) in playing
> good moves, but also in analyzing positions that are as diverse as
> possible, to learn whether they're won or lost. Throwing in one totally
> random move vastly increases the diversity and the number of odd
> positions the network sees, while still not leading to totally
> nonsensical positions.
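
For reference, here is roughly how I now understand the Value Network data generation, as a minimal Python sketch. The "engine" object and its methods (new_game, legal_moves, play, is_over, score, copy), as well as policy_sl / policy_rl as callables returning per-intersection probabilities, are placeholders I made up for illustration, not anything from the paper or a real library; the detail that only the single position after the random move is kept per game is my reading of the paper, so take it with a grain of salt.

import random

def sample_from_policy(probs, legal):
    """Sample one legal move in proportion to the policy probabilities."""
    moves = list(legal)
    weights = [probs[m] for m in moves]
    return random.choices(moves, weights=weights, k=1)[0]

def generate_value_example(engine, policy_sl, policy_rl,
                           board_moves=13 * 13, max_u=200):
    """Produce one (position, outcome) pair as in steps 1-3 above."""
    game = engine.new_game()
    U = random.randint(1, max_u)          # time step of the random move

    # 1. Moves 1..U-1: sample from the SL policy distribution.
    for _ in range(U - 1):
        if engine.is_over(game):
            return None                   # game ended early; discard
        move = sample_from_policy(policy_sl(game), engine.legal_moves(game))
        engine.play(game, move)

    # 2. Move U: draw uniformly from all intersections until it is legal.
    legal = set(engine.legal_moves(game))
    while True:
        move = random.randrange(board_moves)
        if move in legal:
            engine.play(game, move)
            break

    # Keep only this one position per game (my reading of the paper), so
    # the value-net training examples stay decorrelated.
    position = engine.copy(game)

    # 3. Remaining moves: play with the RL policy to the end, then score.
    while not engine.is_over(game):
        move = sample_from_policy(policy_rl(game), engine.legal_moves(game))
        engine.play(game, move)

    return position, engine.score(game)   # game outcome, e.g. +1 / -1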