You are absolutely right; I was thinking in RL-policy-network mode and
assumed everything referred to that, sorry.

On 21.11.2016 at 15:22, Gian-Carlo Pascutto wrote:
> On 20-11-16 11:16, Detlef Schmicker wrote:
>> Hi Hiroshi,
>>
>>> Now I'm making 13x13 selfplay games like in the AlphaGo paper: 1. make
>>> a position by Policy(SL) probabilities from the initial position.
>>> 2. play a move uniformly at random from the available moves. 3. play
>>> the remaining moves by Policy(RL) to the end. Step (2) usually means a
>>> very bad move gets played. Maybe it is because it makes a completely
>>> different position? I don't understand why this step (2) is needed.
>>
>> I did not read the AlphaGo paper that way.
>>
>> I read it as using the RL policy in the "usual" way (by which I would
>> mean something like randomizing over the net's probabilities for the
>> best 5 moves or so),
>>
>> but randomizing the opponent uniformly, meaning the opponent's network
>> weights are taken from an earlier step of the reinforcement learning.
>>
>> Meaning e.g.
>>
>> step 10000 playing against step 7645 in the reinforcement history?
>>
>> Or did I understand you wrong?
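
(To make explicit what I had in mind: a minimal Python sketch of that kind
of opponent randomization. snapshot(), play_game() and reinforce() are only
placeholders here, not code from any actual engine.)

import random

def rl_selfplay(current_policy, snapshot, play_game, reinforce,
                num_steps=10000, checkpoint_every=500):
    # Pool of frozen earlier versions of the policy network.
    opponent_pool = [snapshot(current_policy)]
    for step in range(num_steps):
        # Pick the opponent uniformly from the pool of earlier iterations,
        # e.g. step 10000 playing against step 7645.
        opponent = random.choice(opponent_pool)
        outcome = play_game(current_policy, opponent)  # both sides sample from their nets
        reinforce(current_policy, outcome)             # policy-gradient update on the result
        if (step + 1) % checkpoint_every == 0:
            opponent_pool.append(snapshot(current_policy))
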
> 
> You are confusing the Policy Network RL procedure with the Value Network
> data production.
> 
> For the Value Network the procedure is indeed as described, with the
> move at time U being uniformly sampled from {1,361} until it is legal. I
> think that's because we're not (only) interested in playing good moves,
> but also in analyzing as diverse a set of positions as possible, to
> learn whether they're won or lost. Throwing in one totally random move
> vastly increases the diversity and the number of odd positions the
> network sees, while still not leading to totally nonsensical positions.
> 
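
Thanks, that clears it up. For my own understanding, the Value Network data
generation you describe would then look roughly like the sketch below (my
reading of it in Python; the game/policy objects, the sampling helpers and
the max_moves cutoff are placeholders, not code from any actual engine).

import random

def make_value_net_example(sl_policy, rl_policy, new_game,
                           board_points=361, max_moves=450):
    # Produce one (position, outcome) training pair for the value network.
    game = new_game()
    U = random.randint(1, max_moves)        # time step of the single random move
    # Moves 1 .. U-1: sample from the SL policy probabilities.
    for t in range(1, U):
        if game.is_over():                  # playout ended before move U:
            return None                     # discard this game and try another
        game.play(sl_policy.sample_move(game))
    # Move U: uniform over the board points (169 on 13x13), redrawn until legal.
    while True:
        move = random.randrange(board_points)
        if game.is_legal(move):
            game.play(move)
            break
    position = game.current_position()      # the position that gets labelled
    # Moves U+1 .. end: play out with the RL policy and label with the winner.
    while not game.is_over():
        game.play(rl_policy.sample_move(game))
    return position, game.winner()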