Hi,

Sorry for the late reply; it was delayed by the Spring Festival and SARI.
> Some remarks:
>
> * I didn't get why the "KMeansPlusPlusCentroidInitializer" class
>   does not call the existing "KMeansPlusPlusClusterer".
>   Code seems duplicated: As there is a case for reuse, the currently
>   "private" centroid initialization code should be factored out.

This is an alpha version, meant to discuss the "MiniBatchKMeansClusterer"
algorithm. Once the centroid initialization is extracted from
"KMeansPlusPlusClusterer", that class is no longer really a
"KMeansPlusPlusClusterer" but a plain "KMeansClusterer"; that is a
significant change, so I did not refactor it yet. (A rough sketch of the
extracted interface is at the bottom of this mail.)

> * In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize
>   that a computation is performed (as opposed to selecting from an
>   existing data structure).

The code is extracted from "KMeansPlusPlusClusterer.centroidOf"; should it
keep the name "centroidOf"? The subclasses "RandomCentroidInitializer" and
"KMeansPlusPlusCentroidInitializer" indicate which algorithm is used.

> * Not convinced that there should be so many constructors (in most
>   cases, it makes no sense to pick default values that are likely to
>   be heavily application-dependent).

I can add more constructors, though I would actually prefer a builder class
over constructors; a builder, however, does not match the historical code
style. (See the builder sketch at the bottom of this mail.)

> * Input should generally be validated: e.g. the maximum number of
>   iterations should not be changed unwittingly; rather, an exception
>   should be raised if the user passed a negative value.

Thanks for the advice; I will improve that.

> Could be nice to illustrate (not just with a picture, but in a table
> with entries averaged over several runs) the differences in result
> between the implementations, using various setups (number of
> clusters, stopping criterion, etc.).

I will add more tests, including benchmarks. Generating the various kinds of
test data is a challenge for me; could anybody supply the test data used for
this comparison?
http://commons.apache.org/proper/commons-math/userguide/ml.html
(The mini-batch update sketch at the bottom of this mail shows where the
randomness that causes the differences comes in.)

> "MT_64" is probably not the best default. And this is one of the
> parameters for which there should not be a default IMO.

I will do more tests.

> [Note: there are spurious characters in your message (see e.g. the
> paragraph quoted just above) that make it difficult to read.]

I had formatted my mail properly in my mail client; it may have been changed
by the mailing list service. I will try different mail clients. It would be
helpful if you could tell me which mail client works well with the ML.

> Best regards,
> Gilles
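To make the points above more concrete, here are a few small sketches. All
signatures, package names, imports and defaults below are only illustrative,
not the final API. First, roughly what I mean by the initializer abstraction
(the interface and class names are the ones from my branch):

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import org.apache.commons.math4.ml.clustering.CentroidCluster;
    import org.apache.commons.math4.ml.clustering.Clusterable;
    import org.apache.commons.rng.UniformRandomProvider;
    import org.apache.commons.rng.sampling.ListSampler;

    /** Strategy for computing the initial cluster centers. */
    public interface CentroidInitializer {
        /** Compute k initial centers from the given points. */
        <T extends Clusterable> List<CentroidCluster<T>> chooseCentroids(
                Collection<T> points, int k);
    }

    /** Picks k distinct points at random as the initial centers. */
    class RandomCentroidInitializer implements CentroidInitializer {
        private final UniformRandomProvider random;

        RandomCentroidInitializer(UniformRandomProvider random) {
            this.random = random;
        }

        @Override
        public <T extends Clusterable> List<CentroidCluster<T>> chooseCentroids(
                Collection<T> points, int k) {
            final List<CentroidCluster<T>> centers = new ArrayList<>(k);
            // Sample k distinct points and use each one as an initial center.
            for (T p : ListSampler.sample(random, new ArrayList<>(points), k)) {
                centers.add(new CentroidCluster<>(p));
            }
            return centers;
        }
    }

    // A "KMeansPlusPlusCentroidInitializer" would implement the same
    // interface, reusing the k-means++ seeding code that is currently
    // private in KMeansPlusPlusClusterer.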
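Second, what I have in mind by "builder instead of many constructors"; this
class is purely hypothetical (it does not exist in my branch), but it would
also give a natural place for the input validation you mention:

    import org.apache.commons.math4.ml.distance.DistanceMeasure;
    import org.apache.commons.math4.ml.distance.EuclideanDistance;
    import org.apache.commons.rng.UniformRandomProvider;

    /** Hypothetical builder; names and defaults are illustrative only. */
    public final class MiniBatchKMeansBuilder {
        private final int k;
        private final int batchSize;
        private int maxIterations = 100;                   // placeholder default
        private DistanceMeasure measure = new EuclideanDistance();
        private UniformRandomProvider random;              // no default: caller must choose

        public MiniBatchKMeansBuilder(int k, int batchSize) {
            if (k <= 0 || batchSize <= 0) {
                throw new IllegalArgumentException("k and batchSize must be positive");
            }
            this.k = k;
            this.batchSize = batchSize;
        }

        /** Rejects invalid values instead of silently falling back to a default. */
        public MiniBatchKMeansBuilder maxIterations(int maxIterations) {
            if (maxIterations <= 0) {
                throw new IllegalArgumentException(
                        "maxIterations must be positive: " + maxIterations);
            }
            this.maxIterations = maxIterations;
            return this;
        }

        public MiniBatchKMeansBuilder randomGenerator(UniformRandomProvider random) {
            this.random = random;
            return this;
        }

        // build() would check that "random" was set and assemble the
        // clusterer from the validated values.
    }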
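And third, a condensed sketch of the mini-batch iteration, to show where the
randomness that causes the result differences comes from. This is a
simplification of the idea from the Sculley paper, not the actual code in my
branch:

    import java.util.List;
    import org.apache.commons.rng.UniformRandomProvider;
    import org.apache.commons.rng.sampling.ListSampler;

    final class MiniBatchStep {
        /** One training pass: update the centers from a random mini-batch. */
        static void update(double[][] centers, long[] counts, List<double[]> data,
                           int batchSize, UniformRandomProvider random) {
            // Each pass only sees a random sample of the data ...
            for (double[] x : ListSampler.sample(random, data, batchSize)) {
                final int c = nearest(centers, x);
                counts[c]++;
                final double eta = 1.0 / counts[c];        // per-center learning rate
                for (int d = 0; d < x.length; d++) {
                    // ... and moves the closest center a small step towards the sample.
                    centers[c][d] += eta * (x[d] - centers[c][d]);
                }
            }
        }

        /** Index of the center closest (squared Euclidean distance) to x. */
        private static int nearest(double[][] centers, double[] x) {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centers.length; c++) {
                double dist = 0;
                for (int d = 0; d < x.length; d++) {
                    final double diff = centers[c][d] - x[d];
                    dist += diff * diff;
                }
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            return best;
        }
    }

Because the batch is re-sampled on every pass, different generators or seeds
can lead to noticeably different final centers, which is consistent with the
"bad case" mentioned in the quoted message below.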
------------------ Original ------------------
From: "Gilles Sadowski" <gillese...@gmail.com>
Date: Jan 20, 2020
To: "Commons Developers List" <dev@commons.apache.org>
Subject: Re: [math] New feature MiniBatchKMeansClusterer

Hi.

2020-01-20 3:08 UTC+01:00, CT <chentao...@qq.com>:
> Hi, in my picture search project I need a clustering algorithm to narrow
> the dataset, to accelerate the search over millions of pictures.
> First we used python+pytorch+kmeans; as the data grew from thousands to
> millions, the KMeans clustering became slower and slower (seconds to
> minutes). Then we found that MiniBatchKMeans could finish the clustering
> in 1~2 seconds on millions of data points.

Impressive improvement.

> Meanwhile we still faced the insufficient concurrency of python, so we
> switched to Kotlin on the JVM.
> But there was no MiniBatchKMeans algorithm on the JVM yet, so I wrote one
> in Kotlin, referring to the (python) sklearn MiniBatchKMeans and Apache
> Commons Math (Deeplearning4j was also considered, but it is too slow
> because of ND4j's design).
>
> I'd like to contribute it to Apache Commons Math,

Thanks!

> and I wrote a java version:
> https://github.com/chentao106/commons-math/tree/feature-MiniBatchKMeans

Some remarks:
* I didn't get why the "KMeansPlusPlusCentroidInitializer" class
  does not call the existing "KMeansPlusPlusClusterer".
  Code seems duplicated: As there is a case for reuse, the currently
  "private" centroid initialization code should be factored out.
* In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize
  that a computation is performed (as opposed to selecting from an
  existing data structure).
* Not convinced that there should be so many constructors (in most
  cases, it makes no sense to pick default values that are likely to
  be heavily application-dependent).
* Input should generally be validated: e.g. the maximum number of
  iterations should not be changed unwittingly; rather, an exception
  should be raised if the user passed a negative value.

> From my test (Kotlin version), it is very fast but gives slightly
> different results from KMeans++ in most cases, and sometimes the
> difference is big (it may be affected by the randomness of the mini
> batch):

Could be nice to illustrate (not just with a picture, but in a table
with entries averaged over several runs) the differences in result
between the implementations, using various setups (number of
clusters, stopping criterion, etc.).

> Some bad case:
> It is even worse when I use RandomSource.create(RandomSource.MT_64, 0)
> for the random generator ┐(?-`)┌.

"MT_64" is probably not the best default. And this is one of the
parameters for which there should not be a default IMO.

[Note: there are spurious characters in your message (see e.g. the
paragraph quoted just above) that make it difficult to read.]

Best regards,
Gilles

> My brief understanding of MiniBatchKMeans:
> Use a subset of the points to initialize the cluster centers, and random
> mini batches in the training iterations.
> It can finish in a few seconds when clustering millions of data points,
> and the results differ little from KMeans.
>
> More information about MiniBatchKMeans:
> https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
> https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org