Thinking about datasets for CNN training, of which I currently lack one
:-P  Hence I've been using MNIST, partly because MNIST results are
widely known: if I train with a couple of layers and get 12% accuracy,
I obviously know I have to fix something :-P

But now my network consistently gets up into the 97-98% range on MNIST,
even with just a layer or two, speed is ok-ish, and I probably want
to start running training against 19x19 boards instead of 28x28.  The
OpenCL optimization is different for the two sizes: on my laptop, a
single workgroup can hold a 19x19 board, with one thread per
intersection (361 threads), but 28x28 = 784 threads would exceed the
maximum workgroup size.  Unless I loop, or break it into two
workgroups, or something else equally buggy, slow, and
high-maintenance :-P
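
(Just to be concrete about the limit: here's a minimal sketch, assuming
pyopencl is installed, of checking the max workgroup size against the
two board sizes.  Device selection is whatever create_some_context()
picks on your machine.)

    # Minimal sketch: compare per-intersection thread counts against the
    # device's maximum workgroup size.  Assumes pyopencl; device selection
    # is left to create_some_context().
    import pyopencl as cl

    ctx = cl.create_some_context()
    device = ctx.devices[0]
    max_wg = device.max_work_group_size

    for size in (19, 28):
        threads = size * size  # one thread per intersection / pixel
        fits = "fits" if threads <= max_wg else "does NOT fit"
        print("%dx%d = %d threads, max workgroup %d: %s"
              % (size, size, threads, max_wg, fits))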

So, I could crop the MNIST images down to 19x19, but whoever heard of
training on 19x19 MNIST images?

So, it's possibly time to start hitting actual Go boards.  Many other
datasets are available in a standardized, generic format, ready to feed
into any machine learning algorithm, for example those provided at the
libsvm website http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
, or MNIST, yann.lecun.com/exdb/mnist/ .  The Go datasets are not
(yet) available in any kind of standard format, so I'm thinking maybe
it could be useful to make them so?  But there are three challenges:

1. What data to store?  Clark and Storkey planes?  Raw boards?  Maddison
et al. planes?  Something else?  For now my answer is: something
corresponding to an actual existing paper, and Clark and Storkey's
network has the advantage of costing less than 2000 USD to train, so
that's my answer to 'what data to store?'
2. Copyright.  GoGoD is apparently (a) copyrighted as a collection, and
(b) compiled by hand, by painstakingly going through each game, move by
move, and entering it into the computer one move at a time.  So it's
probably not really likely that one could redistribute it, even
preprocessed, as a standard dataset?  However, the good news is that
the KGS dataset seems to be publicly available, and big, so maybe just
use that?
3. Size ..... this is where I don't have an answer yet.
    - 8 million states, where each state is 8 planes * 361 locations, is on the order of 20GB :-P
    - the raw SGFs only take about 3KB per game, for a total of about 80MB, but they need a lot of preprocessing, and if one were to feed each game through in order, that might not be the best sequence for effective learning?
    - current idea: encode one column through the planes, i.e. the 8 bits for each intersection, as a single byte (see the sketch after this list)?  Since Clark and Storkey only use 8 planes, this should be easy enough :-)
    - which would be about 2.6GB instead
    - but that's still kind of large to put on my web hosting :-P
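
To make the one-byte-per-intersection idea concrete, here's a rough
numpy sketch.  The planes themselves are just random placeholders, not
the actual Clark and Storkey features; only the packing and the size
arithmetic are the point.

    # Rough sketch: pack 8 binary feature planes into one byte per intersection.
    # The planes are random placeholders, NOT the real Clark & Storkey features.
    import numpy as np

    BOARD_SIZE = 19

    def pack_planes(planes):
        # planes: (8, 19, 19) array of 0/1 -> (19, 19) uint8, one byte per point
        packed = np.zeros((BOARD_SIZE, BOARD_SIZE), dtype=np.uint8)
        for bit, plane in enumerate(planes):
            packed |= (plane.astype(np.uint8) & 1) << bit
        return packed

    def unpack_planes(packed):
        # inverse: (19, 19) uint8 -> (8, 19, 19) array of 0/1
        bits = np.arange(8, dtype=np.uint8)
        return (packed[None, :, :] >> bits[:, None, None]) & 1

    planes = (np.random.rand(8, BOARD_SIZE, BOARD_SIZE) > 0.5).astype(np.uint8)
    packed = pack_planes(planes)
    assert np.array_equal(unpack_planes(packed), planes)
    print("8M states, packed:", 8e6 * BOARD_SIZE * BOARD_SIZE / 1e9, "GB")

The print comes out at about 2.9GB (a bit under 2.7 GiB), so the same
ballpark as the 2.6GB figure above.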

I suppose a compromise could be needed, which would also partly solve
problem number 1: just provide a tool, e.g. in Python, or C, or
Cython, which takes the KGS downloads, and possibly the GoGoD
download, and transforms them into a ~2.6GB dataset, ready for
training, and possibly pre-shuffled?

But this would be quite non-standard, although it's not unheard of:
for ImageNet, there is a devkit,
http://image-net.org/challenges/LSVRC/2011/index#devkit

Maybe I will create a github project, something like
'kgs-dataset-preprocessor'?  It could work something like:

   python kgs-dataset-preprocessor.py [targetdirectory]

Results:
- the datasets are downloaded from http://u-go.net/gamerecords/
- decompressed
- loaded one at a time, and processed into a ~2.5GB datafile, in
sequence (clients can handle shuffling themselves, I suppose?)
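
As a very rough sketch of what that processing loop might look like
(assuming the archives from u-go.net have already been downloaded into
the target directory; the SGF "parsing" here is a crude regex, and the
actual plane encoding and writing of the datafile is left out):

    # Rough sketch of the preprocessor loop.  Assumes the .tar.gz / .tar.bz2
    # archives from http://u-go.net/gamerecords/ are already in target_dir.
    # The actual plane encoding and writing of the datafile is omitted.
    import glob
    import os
    import re
    import sys
    import tarfile

    MOVE_RE = re.compile(rb";[BW]\[([a-s]{2})?\]")  # naive: ignores setup stones etc.

    def count_moves(sgf_bytes):
        # count B/W moves in one raw SGF; empty brackets (passes) are skipped
        return sum(1 for m in MOVE_RE.finditer(sgf_bytes) if m.group(1))

    def process_archives(target_dir):
        games = moves = 0
        for pattern in ("*.tar.gz", "*.tar.bz2"):
            for archive in sorted(glob.glob(os.path.join(target_dir, pattern))):
                with tarfile.open(archive, "r:*") as tar:
                    for member in tar.getmembers():
                        if not member.name.endswith(".sgf"):
                            continue
                        sgf = tar.extractfile(member).read()
                        games += 1
                        moves += count_moves(sgf)
                        # this is where each position would get packed into its
                        # 8 planes (one byte per intersection) and appended to
                        # the output datafile
        print(games, "games,", moves, "moves")

    if __name__ == "__main__":
        process_archives(sys.argv[1])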

Thoughts?

Hugh