Made a start here: https://github.com/hughperkins/kgsgo-dataset-preprocessor
- downloads the html page with the list of download zip urls from kgs
- downloads the zip files, based on the html page
- unzips the zip files
- loads each sgf file in turn
- uses gomill to parse the sgf file, and checks that it is 19x19 with no handicap (sketched below)
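The gomill check is straightforward; something like the following sketch, using gomill's Sgf_game API (the function name and surrounding file handling here are illustrative, not the repo's actual code):

    # Sketch of the per-file filter, using gomill's sgf module.
    from gomill import sgf

    def load_if_usable(path):
        """Parse an sgf file; return the game if it is 19x19 with no
        handicap, otherwise None."""
        with open(path) as f:
            game = sgf.Sgf_game.from_string(f.read())
        if game.get_size() != 19:
            return None
        if game.get_handicap() is not None:  # None means an even game
            return None
        return game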
... and on the other hand, created some classes to handle the mechanics of a Go game:
- GoBoard: represents a go board; can apply moves, handles captures, detects ko, and contains GoStrings
- GoString: a string of contiguous pieces of the same color; also holds a full list of all its liberties
- Bag2d: a double-indexed bag of 2d locations (sketched below):
  - given any location, knows whether it is in the bag or not, in O(1)
  - can iterate the locations, O(1) per location iterated
  - can erase a location in O(1)

... so now just need to link these together, and pump out the binary data file.
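For concreteness, Bag2d amounts to roughly the following (method names here are illustrative; the real class may differ): a dense list for iteration, plus a dict from location to list position, with swap-with-last removal for O(1) erase:

    # Sketch of the Bag2d idea: a double-indexed bag of 2d locations.
    class Bag2d(object):
        def __init__(self):
            self.locations = []   # dense list of (row, col), for iteration
            self.index = {}       # (row, col) -> position in self.locations

        def __contains__(self, location):
            return location in self.index    # O(1) membership test

        def __iter__(self):
            return iter(self.locations)      # O(1) per location iterated

        def __len__(self):
            return len(self.locations)

        def add(self, location):
            if location not in self.index:
                self.index[location] = len(self.locations)
                self.locations.append(location)

        def erase(self, location):
            # O(1): move the last element into the erased slot, then pop
            position = self.index.pop(location)
            last = self.locations.pop()
            if position < len(self.locations):
                self.locations[position] = last
                self.index[last] = position

GoString's liberty list is the natural client: merging strings needs fast membership tests, and captures need fast erase.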
On 1/11/15, Hugh Perkins <hughperk...@gmail.com> wrote:
> Thinking about datasets for CNN training, of which I currently lack one
> :-P Hence I've been using MNIST, but also because MNIST results are
> widely known: if I train with a couple of layers and get 12% accuracy,
> obviously I know I have to fix something :-P
>
> But now my network consistently gets up into the 97-98% range for mnist,
> even with just a layer or two, and speed is ok-ish, so I probably want
> to start running training against 19x19 boards instead of 28x28. The
> optimization is different: on my laptop, an OpenCL workgroup can hold a
> 19x19 board with one thread per intersection, but 28x28 threads would
> exceed the workgroup size, unless I loop, or break into two workgroups,
> or something else equally buggy, slow, and high-maintenance :-P
>
> So I could crop the mnist boards down to 19x19, but whoever heard of
> training on 19x19 mnist boards?
>
> So, possibly time to start hitting actual Go boards. Many other datasets
> are available in a standardized, generic format, ready to feed into any
> machine learning algorithm: for example, those provided at the libsvm
> website http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ , or
> mnist, yann.lecun.com/exdb/mnist/ . The go datasets are not (yet)
> available in any kind of standard format, so I'm thinking maybe it could
> be useful to provide that? But there are three challenges:
>
> 1. What data to store? Clark and Storkey planes? Raw boards? Maddison et
> al planes? Something else? For now my answer is: something corresponding
> to an actual existing paper, and Clark and Storkey's network has the
> advantage of costing less than 2000usd to train, so that's my answer to
> 'what data to store?'
> 2. Copyright. gogod is apparently (a) copyrighted as a collection, and
> (b) compiled by hand, by painstakingly going through each game, move by
> move, and entering it into the computer one move at a time. Probably not
> really likely that one could publish this, even preprocessed, as a
> standard dataset? However, the good news is that the kgs dataset seems
> publicly available, and big, so maybe just use that?
> 3. Size..... this is where I don't have an answer yet.
> - 8 million states, where each state is 8 planes * 361 locations, is
> about 20GB :-P
> - the raw sgfs only take 3KB per game, for a total of about 80MB, but
> they need a lot of preprocessing, and if one were to feed each game
> through, in order, that might not be the best sequence for effective
> learning?
> - current idea: encode one column through the planes as a single byte?
> For Clark and Storkey there are only 8 planes, so this should be easy
> enough :-)
> - which would be 2.6GB instead
> - but still kind of large to put on my web hosting :-P
>
> I suppose a compromise could be needed, which would also somewhat solve
> problem number 1: just provide a tool, eg in Python, or C, or Cython,
> which takes the kgs downloads, and possibly the gogod download, and
> transforms them into a 2.6GB dataset, ready for training, and possibly
> pre-shuffled?
>
> This would be quite non-standard, although not unheard of; eg for
> imagenet there is a devkit
> http://image-net.org/challenges/LSVRC/2011/index#devkit
>
> Maybe I will create a github project, like 'kgs-dataset-preprocessor'?
> It could work something like:
>
> python kgs-dataset-preprocessor.py [targetdirectory]
>
> Results:
> - the datasets are downloaded from http://u-go.net/gamerecords/
> - decompressed
> - loaded one at a time, and processed into a 2.5GB datafile, in sequence
> (clients can handle shuffling themselves, I suppose?)
>
> Thoughts?
>
> Hugh
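To make the 'one byte per column through the planes' idea above concrete: with Clark and Storkey's 8 binary planes, bit i of each byte can hold plane i's value at that intersection. A rough numpy sketch, my own illustration rather than the tool's actual format:

    # Sketch: pack 8 binary feature planes into one byte per intersection.
    import numpy as np

    def pack_planes(planes):
        """planes: uint8 array of shape (8, 19, 19), values 0 or 1.
        Returns a (19, 19) uint8 array, one byte per intersection."""
        packed = np.zeros((19, 19), dtype=np.uint8)
        for i in range(8):
            packed |= (planes[i] & 1) << i   # bit i <- plane i
        return packed

    def unpack_planes(packed):
        """Inverse: recover the 8 binary planes from the packed bytes."""
        return np.array([(packed >> i) & 1 for i in range(8)])

That stores one byte instead of eight per intersection, which is where the roughly-eightfold shrink to the ~2.6GB figure quoted above comes from.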