Made a start here: https://github.com/hughperkins/kgsgo-dataset-preprocessor
- downloads the HTML page with the list of download zip URLs from KGS
- downloads the zip files, based on that HTML page
- unzips the zip files
- loads each SGF file in turn
- uses gomill to parse the SGF file, and checks that it is 19x19 with no handicap
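The 19x19/no-handicap check in the last step could be sketched, in simplified form, without gomill, by reading the SZ (size) and HA (handicap) root properties directly; this regex version is a hedged stand-in for the actual gomill-based code, not what the repository does:

```python
import re

def is_19x19_no_handicap(sgf_text):
    """Check an SGF game record: 19x19 board and no handicap stones.

    Simplified stand-in for a full SGF parse: reads the SZ (size) and
    HA (handicap) root properties with a regex.
    """
    size = re.search(r"SZ\[(\d+)\]", sgf_text)
    handicap = re.search(r"HA\[(\d+)\]", sgf_text)
    if size is None or int(size.group(1)) != 19:
        return False
    # No HA property, or HA[0], means an even game
    return handicap is None or int(handicap.group(1)) == 0

print(is_19x19_no_handicap("(;GM[1]SZ[19];B[pd])"))       # True
print(is_19x19_no_handicap("(;GM[1]SZ[19]HA[2];B[pd])"))  # False
```

A real filter would also want to skip malformed files and games that end in resignation after only a few moves, but that is policy, not parsing.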

... and on the other hand created some classes to handle the mechanics
of a Go game:
- GoBoard: represents a Go board; can apply moves, handles captures,
detects ko, contains GoStrings
- GoString: a string of contiguous stones of the same color; also
holds a full list of all its liberties
- Bag2d: a double-indexed bag of 2d locations:
   - given any location, can tell whether it is in the bag, in O(1)
   - can iterate the locations, O(1) per location iterated
   - can erase a location in O(1)
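A minimal sketch of the Bag2d idea (a dense list for iteration plus a dict from location to list index, with swap-with-last erase); the class and method names here are illustrative, not necessarily those in the repository:

```python
class Bag2d:
    """Double-indexed bag of 2d locations.

    A flat list of (row, col) pairs gives O(1)-per-item iteration;
    a dict from location to its index in the list gives O(1)
    membership tests and O(1) erase (swap the last element into
    the erased slot).
    """
    def __init__(self):
        self.locations = []   # dense list of (row, col) pairs
        self.index = {}       # (row, col) -> position in self.locations

    def add(self, loc):
        if loc not in self.index:
            self.index[loc] = len(self.locations)
            self.locations.append(loc)

    def __contains__(self, loc):          # O(1) membership test
        return loc in self.index

    def __iter__(self):                   # O(1) per location iterated
        return iter(self.locations)

    def erase(self, loc):                 # O(1) erase
        i = self.index.pop(loc)
        last = self.locations.pop()
        if last != loc:                   # move last element into the hole
            self.locations[i] = last
            self.index[last] = i

bag = Bag2d()
bag.add((3, 4)); bag.add((5, 6)); bag.add((0, 0))
bag.erase((3, 4))
print((3, 4) in bag)       # False
print(sorted(bag))         # [(0, 0), (5, 6)]
```

This is the classic trick for tracking liberty sets: capture/atari code needs fast "is this point a liberty?" tests and fast removal, and the swap-with-last erase keeps the list dense without shifting elements.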

... so now I just need to link these together, and pump out the binary data file
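The packing step for that binary data file could follow the one-byte-per-column idea from the quoted email below: for each of the 361 intersections, fold the 8 binary plane values into the bits of a single byte. This is a sketch under that assumption, not the repository's actual format:

```python
def pack_planes(planes):
    """Pack 8 binary 19x19 feature planes into one byte per intersection.

    `planes` is a list of 8 boards, each a 19x19 list of 0/1 values.
    Plane p contributes bit p of the output byte, giving 361 bytes per
    board state instead of 8 * 361: roughly the 8x size reduction
    mentioned below (~2.6GB rather than ~20GB for ~8 million states).
    """
    packed = bytearray()
    for row in range(19):
        for col in range(19):
            byte = 0
            for p in range(8):
                byte |= planes[p][row][col] << p
            packed.append(byte)
    return bytes(packed)

def unpack_planes(packed):
    """Inverse of pack_planes: recover the 8 binary planes."""
    planes = [[[0] * 19 for _ in range(19)] for _ in range(8)]
    for i, byte in enumerate(packed):
        row, col = divmod(i, 19)
        for p in range(8):
            planes[p][row][col] = (byte >> p) & 1
    return planes

# Round trip on a tiny example: plane 0 all ones, planes 1-7 all zeros
planes = [[[1 if p == 0 else 0] * 19 for _ in range(19)] for p in range(8)]
packed = pack_planes(planes)
assert len(packed) == 361               # one byte per intersection
assert unpack_planes(packed) == planes  # lossless round trip
```

The training loader can unpack a byte per intersection on the fly, so the 8x storage saving costs only a few bit operations at read time.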


On 1/11/15, Hugh Perkins <hughperk...@gmail.com> wrote:
> Thinking about datasets for CNN training, of which I currently lack
> one :-P  Hence I've been using MNIST, partly because MNIST results
> are widely known: if I train with a couple of layers and get 12%
> accuracy, I know immediately that something needs fixing :-P
>
> But now, my network consistently gets into the 97-98% range on
> MNIST, even with just a layer or two, and speed is ok-ish, so I
> probably want to start training against 19x19 boards instead of
> 28x28.  The optimization is different.  On my laptop, an OpenCL
> workgroup can hold a 19x19 board, with one thread per intersection,
> but 28x28 threads would exceed the workgroup size.  Unless I loop,
> or break into two workgroups, or something else equally buggy, slow,
> and high-maintenance :-P
>
> So, I could crop the mnist boards down to 19x19, but whoever heard of
> training on 19x19 mnist boards?
>
> So, possibly time to start hitting actual Go boards.  Many other
> datasets are available in a standardized generic format, ready to feed
> into any machine learning algorithm.  For example, those provided at
> libsvm website http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
> , or MNIST, yann.lecun.com/exdb/mnist/ .  The Go datasets are not
> (yet) available in any kind of standard format, so I'm thinking:
> maybe it would be useful to provide one?  But there are three
> challenges:
>
> 1. what data to store?  Clark and Storkey planes?  Raw boards?
> Maddison et al planes?  Something else?  For now, my answer is:
> something corresponding to an actual existing paper; Clark and
> Storkey's network has the advantage of costing less than 2000 USD
> to train, so that's what I'd store.
> 2. copyright.  GoGoD is apparently (a) copyrighted as a collection,
> and (b) compiled by hand, as a result of painstakingly going through
> each game and entering it into the computer one move at a time.  It
> is probably not really feasible to publish it, even preprocessed, as
> a standard dataset.  However, the good news is that the KGS dataset
> seems publicly available, and big, so maybe just use that?
> 3. size ..... this is where I don't have an answer yet.
>     - 8 million states, where each state is 8 planes * 361 locations
> = ~20GB :-P
>     - the raw SGFs only take 3KB per game, for a total of about
> 80MB, but they need a lot of preprocessing, and if one were to feed
> each game through, in order, it might not be the best sequence for
> effective learning?
>     - current idea: encode one column through the planes as a single
> byte?  Clark and Storkey only have 8 planes, so this should be easy
> enough :-)
>     - which would be about 2.6GB instead
>     - but still kind of large, to put on my web hosting :-P
>
> I suppose a compromise might be needed, which would also partly
> solve problem 1: just provide a tool, e.g. in Python, C, or Cython,
> which takes the KGS downloads, and possibly the GoGoD download, and
> transforms them into a 2.6GB dataset, ready for training, and
> possibly pre-shuffled?
>
> But this would be quite non-standard, although that is not unheard
> of; e.g. for ImageNet there is a devkit:
> http://image-net.org/challenges/LSVRC/2011/index#devkit
>
> Maybe I will create a github project, something like
> 'kgs-dataset-preprocessor'?  It could work something like this:
>
>    python kgs-dataset-preprocessor.py [targetdirectory]
>
> Results:
> - the datasets are downloaded from http://u-go.net/gamerecords/
> - decompressed
> - loaded one at a time, and processed into a ~2.6GB datafile, in
> sequence (clients can handle shuffling themselves, I suppose?)
>
> Thoughts?
>
> Hugh
>
_______________________________________________
Computer-go mailing list
Computer-go@computer-go.org
http://computer-go.org/mailman/listinfo/computer-go
