Why don't you make a dataset of the raw board positions, along with code to convert them to Clark and Storkey planes? The data will be smaller, people can verify against Clark and Storkey, and they will have the raw data to make their own choices about preprocessing for network inputs.

David
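A minimal sketch of what that could look like (assuming plain NumPy, one byte per intersection for the raw positions, and placeholder planes rather than the actual Clark and Storkey feature set):

    # Raw positions: one byte per intersection, 0 = empty, 1 = black,
    # 2 = white, i.e. 361 bytes per position. to_planes() is a
    # placeholder converter that people can swap out for the exact
    # Clark and Storkey planes (not reproduced here).
    import numpy as np

    BOARD_SIZE = 19

    def to_planes(raw, to_move):
        """Expand one raw position into simple binary feature planes.

        raw:     np.uint8 array of shape (361,), values 0/1/2.
        to_move: 1 if black is to move, 2 if white is to move.
        """
        board = raw.reshape(BOARD_SIZE, BOARD_SIZE)
        own = (board == to_move).astype(np.uint8)
        opponent = ((board != 0) & (board != to_move)).astype(np.uint8)
        empty = (board == 0).astype(np.uint8)
        return np.stack([own, opponent, empty])  # shape (3, 19, 19)

The stored format stays small and paper-agnostic; the plane construction lives in a few lines of code that anyone can inspect or replace.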
> -----Original Message-----
> From: Computer-go [mailto:computer-go-boun...@computer-go.org] On Behalf Of Hugh Perkins
> Sent: Sunday, January 11, 2015 12:24 AM
> To: computer-go
> Subject: [Computer-go] Datasets for CNN training?
>
> Thinking about datasets for CNN training, of which I currently lack one :-P
> Hence I've been using MNIST, partly because MNIST results are widely known:
> if I train with a couple of layers and get 12% accuracy, obviously I know
> I have to fix something :-P
>
> But now my network consistently gets up into the 97-98%s on MNIST, even
> with just a layer or two, and speed is ok-ish, so I probably want to start
> training against 19x19 boards instead of 28x28. The optimization is
> different. On my laptop, an OpenCL workgroup can hold a 19x19 board, with
> one thread per intersection, but 28x28 threads would exceed the workgroup
> size. Unless I loop, or break into two workgroups, or something else
> equally buggy, slow, and high-maintenance :-P
>
> So I could crop the MNIST images down to 19x19, but whoever heard of
> training on 19x19 MNIST images?
>
> So, possibly time to start hitting actual Go boards. Many other datasets
> are available in a standardized, generic format, ready to feed into any
> machine learning algorithm, for example those provided at the libsvm
> website http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ , or MNIST,
> yann.lecun.com/exdb/mnist/ . The Go datasets are not (yet) available in any
> kind of standard format, so I'm thinking maybe it could be useful to
> provide one? But there are three challenges:
>
> 1. What data to store? Clark and Storkey planes? Raw boards? Maddison et
> al. planes? Something else? For now, my answer is: something corresponding
> to an actual existing paper, and Clark and Storkey's network has the
> advantage of costing less than 2000 USD to train, so that's my answer to
> 'what data to store?'
> 2. Copyright. GoGoD is apparently (a) copyrighted as a collection, and (b)
> compiled by hand, by painstakingly going through each game, move by move,
> and entering it into the computer one move at a time. Probably not really
> likely that one could publish this, even preprocessed, as a standard
> dataset? However, the good news is that the KGS dataset seems publicly
> available, and big, so maybe just use that?
> 3. Size... this is where I don't have an answer yet.
> - 8 million states, where each state is 8 planes * 361 locations, is about
> 20GB :-P
> - the raw SGFs only take about 3KB per game, for a total of about 80MB,
> but they need a lot of preprocessing, and feeding each game through in
> order might not be the best sequence for effective learning?
> - current idea: encode the column of plane values at each intersection as
> a single byte? Clark and Storkey only have 8 planes, so this should be
> easy enough :-) (see the packing sketch a few paragraphs down)
> - which would be 2.6GB instead
> - but that is still kind of large to put on my web hosting :-P
>
> I suppose a compromise could be needed, which would also somewhat solve
> problem number 1: just provide a tool, e.g. in Python, or C, or Cython,
> which takes the KGS downloads, and possibly the GoGoD download, and
> transforms them into a 2.6GB dataset, ready for training, and possibly
> pre-shuffled?
>
> But this would be quite non-standard, although not unheard of; e.g. for
> ImageNet there is a devkit:
> http://image-net.org/challenges/LSVRC/2011/index#devkit
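A small sketch of the one-byte-per-intersection packing mentioned in point 3 (assuming NumPy and exactly 8 binary planes; the function names here are made up):

    # With 8 binary planes, the 8 plane values at each of the 361
    # intersections fit into a single uint8, so one position packs
    # into 361 bytes instead of 8 * 361.
    import numpy as np

    def pack_planes(planes):
        """planes: uint8 array of shape (8, 19, 19) with values 0/1."""
        flat = planes.reshape(8, -1).astype(np.uint16)            # (8, 361)
        weights = (1 << np.arange(8, dtype=np.uint16))[:, None]   # one bit per plane
        return (flat * weights).sum(axis=0).astype(np.uint8)      # (361,)

    def unpack_planes(packed):
        """packed: uint8 array of shape (361,); returns (8, 19, 19)."""
        shifts = np.arange(8, dtype=np.uint16)[:, None]
        bits = (packed[None, :].astype(np.uint16) >> shifts) & 1
        return bits.astype(np.uint8).reshape(8, 19, 19)

Anything beyond 8 planes would need more than one byte per intersection, so this particular trick is tied to a Clark and Storkey-sized feature set.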
> Maybe I will create a github project, like 'kgs-dataset-preprocessor'?
> It could work something like:
>
>     python kgs-dataset-preprocessor.py [targetdirectory]
>
> Results:
> - the datasets are downloaded from http://u-go.net/gamerecords/
> - decompressed
> - loaded one at a time, and processed into a ~2.5GB datafile, in sequence
> (clients can handle shuffling themselves, I suppose?)
>
> Thoughts?
>
> Hugh
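For what it's worth, a rough skeleton of how such a script could be laid out (hypothetical: the output filename, the SGF parsing, and the plane encoding are all stubs to fill in, and downloading from u-go.net is left out entirely):

    # Hypothetical outline of kgs-dataset-preprocessor.py: walk a
    # directory of already-downloaded SGF files and append each
    # processed game to a single binary data file.
    import argparse
    import pathlib

    def process_sgf(path, out_file):
        # Stub: parse the SGF, replay the moves, encode each position
        # into the chosen plane format, and append the bytes to out_file.
        pass

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("targetdirectory",
                            help="directory holding the unpacked SGF files")
        args = parser.parse_args()

        target = pathlib.Path(args.targetdirectory)
        with open(target / "kgs-dataset.dat", "wb") as out_file:  # made-up name
            for sgf in sorted(target.rglob("*.sgf")):
                process_sgf(sgf, out_file)

    if __name__ == "__main__":
        main()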