Hi.

2020-01-20 3:08 UTC+01:00, CT <chentao...@qq.com>:
> Hi,
>
> In my picture search project, I need a clustering algorithm to narrow
> down the dataset, to accelerate the search over millions of pictures.
> We started with Python + PyTorch + KMeans; as the data grew from
> thousands to millions, the KMeans clustering became slower and
> slower (seconds, then minutes). We then found that MiniBatchKMeans
> could finish the clustering in 1-2 seconds on millions of data points.
Impressive improvement.

> Meanwhile we still faced Python's insufficient concurrency, so we
> switched to Kotlin on the JVM.
> But there was no MiniBatchKMeans implementation on the JVM yet, so I
> wrote one in Kotlin, referring to (Python) sklearn's MiniBatchKMeans
> and Apache Commons Math (Deeplearning4j was also considered, but it is
> too slow because of ND4J's design).
>
> I'd like to contribute it to Apache Commons Math,

Thanks!

> and I wrote a Java version:
> https://github.com/chentao106/commons-math/tree/feature-MiniBatchKMeans

Some remarks:
* I didn't get why the "KMeansPlusPlusCentroidInitializer" class does
  not call the existing "KMeansPlusPlusClusterer".  The code seems
  duplicated: since there is a case for reuse, the currently "private"
  centroid initialization code should be factored out.
* In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize
  that a computation is performed (as opposed to selecting from an
  existing data structure).
* Not convinced that there should be so many constructors (in most
  cases, it makes no sense to pick default values that are likely to be
  heavily application-dependent).
* Input should generally be validated: e.g. the maximum number of
  iterations should not be changed unwittingly; rather, an exception
  should be raised if the user passed a negative value.

> From my test (Kotlin version), it is very fast, but gives slightly
> different results from KMeans++ in most cases, and sometimes a big
> difference (maybe affected by the randomness of the mini batch):

It would be nice to illustrate (not just with a picture, but in a table
with entries averaged over several runs) the differences in results
between the implementations, using various setups (number of clusters,
stopping criterion, etc.).

> Some bad cases:
>
> It is even worse when I use RandomSource.create(RandomSource.MT_64, 0)
> for the random generator ┐(´-`)┌.

"MT_64" is probably not the best default.
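To make the validation remark above concrete, here is a rough sketch of
eager argument checking in a parameter holder (class and field names are
hypothetical, not the actual API; Commons Math also has dedicated
exception types such as NotStrictlyPositiveException that a real patch
would likely use instead of IllegalArgumentException):

```java
// Hypothetical sketch: reject invalid arguments eagerly instead of
// silently substituting a default value.
public class MiniBatchKMeansParams {
    private final int k;
    private final int maxIterations;
    private final int batchSize;

    public MiniBatchKMeansParams(int k, int maxIterations, int batchSize) {
        if (k <= 0) {
            throw new IllegalArgumentException("number of clusters must be positive: " + k);
        }
        if (maxIterations <= 0) {
            // Raise instead of quietly clamping or falling back to a default.
            throw new IllegalArgumentException("maxIterations must be positive: " + maxIterations);
        }
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive: " + batchSize);
        }
        this.k = k;
        this.maxIterations = maxIterations;
        this.batchSize = batchSize;
    }

    public int getK() { return k; }
    public int getMaxIterations() { return maxIterations; }
    public int getBatchSize() { return batchSize; }
}
```

This way a negative value passed by mistake fails fast at construction
time rather than producing a silently different clustering.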
And this is one of the parameters for which there should not be a
default, IMO.

[Note: there are spurious characters in your message (see e.g. the
paragraph quoted just above) that make it difficult to read.]

Best regards,
Gilles

> My brief understanding of MiniBatchKMeans:
> Use a subset of the points to initialize the cluster centers, then use
> a random mini batch in each training iteration.
> It can finish in a few seconds when clustering millions of data
> points, and shows only small differences from KMeans.
>
> More information about MiniBatchKMeans:
> https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
> https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org
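P.S. For readers skimming the thread, the mini-batch scheme summarized
in the quoted message can be sketched roughly as below (after Sculley's
paper linked above). This is an illustration only: the class name, the
plain-array API, and the random initialization (where the real code
would use k-means++) are all assumptions, not the proposed Commons Math
implementation.

```java
import java.util.Random;

// Sketch of mini-batch k-means: initialise centers from randomly
// sampled points, then refine them with random mini-batches using a
// per-center learning rate of 1/count (Sculley, 2010).
public class MiniBatchKMeansSketch {

    public static double[][] cluster(double[][] points, int k,
                                     int batchSize, int iterations, long seed) {
        Random rng = new Random(seed);
        int dim = points[0].length;
        // 1. Initialise centers from randomly chosen points
        //    (a real implementation would use k-means++ here).
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[rng.nextInt(points.length)].clone();
        }
        long[] counts = new long[k];
        // 2. Refine with random mini-batches.
        for (int it = 0; it < iterations; it++) {
            for (int b = 0; b < batchSize; b++) {
                double[] x = points[rng.nextInt(points.length)];
                int nearest = nearestCenter(centers, x);
                counts[nearest]++;
                // Per-center learning rate: the center is a running
                // mean of all points ever assigned to it.
                double eta = 1.0 / counts[nearest];
                for (int d = 0; d < dim; d++) {
                    centers[nearest][d] += eta * (x[d] - centers[nearest][d]);
                }
            }
        }
        return centers;
    }

    private static int nearestCenter(double[][] centers, double[] x) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0;
            for (int d = 0; d < x.length; d++) {
                double diff = centers[c][d] - x[d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```

Because each iteration touches only batchSize points instead of the
whole dataset, the cost per iteration is independent of the number of
points, which is where the seconds-vs-minutes difference reported above
comes from.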