Hi,

Sorry for the late reply; it was delayed by the Spring Festival and SARI.
> Some remarks:
>
> * I didn't get why the "KMeansPlusPlusCentroidInitializer" class
>   does not call the existing "KMeansPlusPlusClusterer".
>   Code seems duplicated: As there is a case for reuse, the currently
>   "private" centroid initialization code should be factored out.

This is an alpha version, meant to discuss the "MiniBatchKMeansClusterer"
algorithm. Once the centroid initialization is extracted from
"KMeansPlusPlusClusterer", that class is no longer really a
"KMeansPlusPlusClusterer" but a plain "KMeansClusterer"; that is a
significant change, so I did not refactor it yet. (A rough sketch of the
extracted interface is at the bottom of this mail.)

> * In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize
>   that a computation is performed (as opposed to selecting from an
>   existing data structure).

The code is extracted from "KMeansPlusPlusClusterer.centroidOf"; should it
keep the name "centroidOf"? The subclasses "RandomCentroidInitializer" and
"KMeansPlusPlusCentroidInitializer" indicate which algorithm is used.

> * Not convinced that there should be so many constructors (in most
>   cases, it makes no sense to pick default values that are likely to
>   be heavily application-dependent).

I can add more constructors, though I would actually prefer a builder class
over constructors; a builder, however, does not match the historical code
style. (See the builder sketch at the bottom of this mail.)

> * Input should generally be validated: e.g. the maximum number of
>   iterations should not be changed unwittingly; rather, an exception
>   should be raised if the user passed a negative value.

Thanks for the advice; I will improve that.

> Could be nice to illustrate (not just with a picture, but in a table
> with entries averaged over several runs) the differences in result
> between the implementations, using various setups (number of
> clusters, stopping criterion, etc.).

I will add more tests, including benchmarks. Generating the various kinds of
test data is a challenge for me; could anybody supply the test data used for
this comparison?
http://commons.apache.org/proper/commons-math/userguide/ml.html
(The mini-batch update sketch at the bottom of this mail shows where the
randomness that causes the differences comes in.)

> "MT_64" is probably not the best default. And this is one of the
> parameters for which there should not be a default IMO.

I will do more tests.

> [Note: there are spurious characters in your message (see e.g. the
> paragraph quoted just above) that make it difficult to read.]

I had formatted my mail properly in my mail client; it may have been changed
by the mailing list service. I will try different mail clients. It would be
helpful if you could tell me which mail client works well with the ML.

> Best regards,
> Gilles
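To make the points above more concrete, here are a few small sketches. All
signatures, package names, imports and defaults below are only illustrative,
not the final API. First, roughly what I mean by the initializer abstraction
(the interface and class names are the ones from my branch):

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import org.apache.commons.math4.ml.clustering.CentroidCluster;
    import org.apache.commons.math4.ml.clustering.Clusterable;
    import org.apache.commons.rng.UniformRandomProvider;
    import org.apache.commons.rng.sampling.ListSampler;

    /** Strategy for computing the initial cluster centers. */
    public interface CentroidInitializer {
        /** Compute k initial centers from the given points. */
        <T extends Clusterable> List<CentroidCluster<T>> chooseCentroids(
                Collection<T> points, int k);
    }

    /** Picks k distinct points at random as the initial centers. */
    class RandomCentroidInitializer implements CentroidInitializer {
        private final UniformRandomProvider random;

        RandomCentroidInitializer(UniformRandomProvider random) {
            this.random = random;
        }

        @Override
        public <T extends Clusterable> List<CentroidCluster<T>> chooseCentroids(
                Collection<T> points, int k) {
            final List<CentroidCluster<T>> centers = new ArrayList<>(k);
            // Sample k distinct points and use each one as an initial center.
            for (T p : ListSampler.sample(random, new ArrayList<>(points), k)) {
                centers.add(new CentroidCluster<>(p));
            }
            return centers;
        }
    }

    // A "KMeansPlusPlusCentroidInitializer" would implement the same
    // interface, reusing the k-means++ seeding code that is currently
    // private in KMeansPlusPlusClusterer.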
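Second, what I have in mind by "builder instead of many constructors"; this
class is purely hypothetical (it does not exist in my branch), but it would
also give a natural place for the input validation you mention:

    import org.apache.commons.math4.ml.distance.DistanceMeasure;
    import org.apache.commons.math4.ml.distance.EuclideanDistance;
    import org.apache.commons.rng.UniformRandomProvider;

    /** Hypothetical builder; names and defaults are illustrative only. */
    public final class MiniBatchKMeansBuilder {
        private final int k;
        private final int batchSize;
        private int maxIterations = 100;                   // placeholder default
        private DistanceMeasure measure = new EuclideanDistance();
        private UniformRandomProvider random;              // no default: caller must choose

        public MiniBatchKMeansBuilder(int k, int batchSize) {
            if (k <= 0 || batchSize <= 0) {
                throw new IllegalArgumentException("k and batchSize must be positive");
            }
            this.k = k;
            this.batchSize = batchSize;
        }

        /** Rejects invalid values instead of silently falling back to a default. */
        public MiniBatchKMeansBuilder maxIterations(int maxIterations) {
            if (maxIterations <= 0) {
                throw new IllegalArgumentException(
                        "maxIterations must be positive: " + maxIterations);
            }
            this.maxIterations = maxIterations;
            return this;
        }

        public MiniBatchKMeansBuilder randomGenerator(UniformRandomProvider random) {
            this.random = random;
            return this;
        }

        // build() would check that "random" was set and assemble the
        // clusterer from the validated values.
    }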
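And third, a condensed sketch of the mini-batch iteration, to show where the
randomness that causes the result differences comes from. This is a
simplification of the idea from the Sculley paper, not the actual code in my
branch:

    import java.util.List;
    import org.apache.commons.rng.UniformRandomProvider;
    import org.apache.commons.rng.sampling.ListSampler;

    final class MiniBatchStep {
        /** One training pass: update the centers from a random mini-batch. */
        static void update(double[][] centers, long[] counts, List<double[]> data,
                           int batchSize, UniformRandomProvider random) {
            // Each pass only sees a random sample of the data ...
            for (double[] x : ListSampler.sample(random, data, batchSize)) {
                final int c = nearest(centers, x);
                counts[c]++;
                final double eta = 1.0 / counts[c];        // per-center learning rate
                for (int d = 0; d < x.length; d++) {
                    // ... and moves the closest center a small step towards the sample.
                    centers[c][d] += eta * (x[d] - centers[c][d]);
                }
            }
        }

        /** Index of the center closest (squared Euclidean distance) to x. */
        private static int nearest(double[][] centers, double[] x) {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centers.length; c++) {
                double dist = 0;
                for (int d = 0; d < x.length; d++) {
                    final double diff = centers[c][d] - x[d];
                    dist += diff * diff;
                }
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            return best;
        }
    }

Because the batch is re-sampled on every pass, different generators or seeds
can lead to noticeably different final centers, which is consistent with the
"bad case" mentioned in the quoted message below.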
------------------ Original ------------------
From: "Gilles Sadowski" <gillese...@gmail.com>
Date: Jan 20, 2020
To: "Commons Developers List" <dev@commons.apache.org>
Subject: Re: [math] New feature MiniBatchKMeansClusterer

Hi.

2020-01-20 3:08 UTC+01:00, CT <chentao...@qq.com>:
> Hi, in my picture search project I need a clustering algorithm to narrow
> the dataset, to accelerate the search over millions of pictures.
> First we used python+pytorch+kmeans; as the data grew from thousands to
> millions, the KMeans clustering became slower and slower (seconds to
> minutes). Then we found that MiniBatchKMeans could finish the clustering
> in 1~2 seconds on millions of data points.

Impressive improvement.

> Meanwhile we still faced the insufficient concurrency of python, so we
> switched to Kotlin on the JVM.
> But there was no MiniBatchKMeans algorithm on the JVM yet, so I wrote one
> in Kotlin, referring to the (python) sklearn MiniBatchKMeans and Apache
> Commons Math (Deeplearning4j was also considered, but it is too slow
> because of ND4j's design).
>
> I'd like to contribute it to Apache Commons Math,

Thanks!

> and I wrote a java version:
> https://github.com/chentao106/commons-math/tree/feature-MiniBatchKMeans

Some remarks:
* I didn't get why the "KMeansPlusPlusCentroidInitializer" class
  does not call the existing "KMeansPlusPlusClusterer".
  Code seems duplicated: As there is a case for reuse, the currently
  "private" centroid initialization code should be factored out.
* In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize
  that a computation is performed (as opposed to selecting from an
  existing data structure).
* Not convinced that there should be so many constructors (in most
  cases, it makes no sense to pick default values that are likely to
  be heavily application-dependent).
* Input should generally be validated: e.g. the maximum number of
  iterations should not be changed unwittingly; rather, an exception
  should be raised if the user passed a negative value.

> From my test (Kotlin version), it is very fast but gives slightly
> different results from KMeans++ in most cases, and sometimes the
> difference is big (it may be affected by the randomness of the mini
> batch):

Could be nice to illustrate (not just with a picture, but in a table
with entries averaged over several runs) the differences in result
between the implementations, using various setups (number of
clusters, stopping criterion, etc.).

> Some bad case:
> It is even worse when I use RandomSource.create(RandomSource.MT_64, 0)
> for the random generator ┐(?-`)┌.

"MT_64" is probably not the best default. And this is one of the
parameters for which there should not be a default IMO.

[Note: there are spurious characters in your message (see e.g. the
paragraph quoted just above) that make it difficult to read.]

Best regards,
Gilles

> My brief understanding of MiniBatchKMeans:
> Use a subset of the points to initialize the cluster centers, and random
> mini batches in the training iterations.
> It can finish in a few seconds when clustering millions of data points,
> and the results differ little from KMeans.
>
> More information about MiniBatchKMeans:
> https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
> https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org