Hi.

2020-01-20 3:08 UTC+01:00, CT <chentao...@qq.com>:
> Hi,
>
> In my picture search project, I need a clustering algorithm to narrow
> down the dataset, to accelerate the search over millions of pictures.
> We started with Python + PyTorch + KMeans; as the data grew from
> thousands to millions, the KMeans clustering became slower and
> slower (seconds, then minutes). We then found that MiniBatchKMeans
> could finish the clustering in 1-2 seconds on millions of data points.
Impressive improvement.

> Meanwhile we still faced Python's insufficient concurrency, so we
> switched to Kotlin on the JVM.
> But there was no MiniBatchKMeans implementation on the JVM yet, so I
> wrote one in Kotlin, referring to (Python) sklearn's MiniBatchKMeans
> and Apache Commons Math (Deeplearning4j was also considered, but it is
> too slow because of ND4J's design).
>
> I'd like to contribute it to Apache Commons Math,

Thanks!

> and I wrote a Java version:
> https://github.com/chentao106/commons-math/tree/feature-MiniBatchKMeans

Some remarks:
* I didn't get why the "KMeansPlusPlusCentroidInitializer" class does
  not call the existing "KMeansPlusPlusClusterer".  The code seems
  duplicated: since there is a case for reuse, the currently "private"
  centroid initialization code should be factored out.
* In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize
  that a computation is performed (as opposed to selecting from an
  existing data structure).
* Not convinced that there should be so many constructors (in most
  cases, it makes no sense to pick default values that are likely to be
  heavily application-dependent).
* Input should generally be validated: e.g. the maximum number of
  iterations should not be changed unwittingly; rather, an exception
  should be raised if the user passed a negative value.

> From my test (Kotlin version), it is very fast, but gives slightly
> different results from KMeans++ in most cases, and sometimes a big
> difference (maybe affected by the randomness of the mini batch):

It would be nice to illustrate (not just with a picture, but in a table
with entries averaged over several runs) the differences in results
between the implementations, using various setups (number of clusters,
stopping criterion, etc.).

> Some bad cases:
>
> It is even worse when I use RandomSource.create(RandomSource.MT_64, 0)
> for the random generator ┐(´-`)┌.

"MT_64" is probably not the best default.
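To make the validation remark above concrete, here is a rough sketch of
eager argument checking in a parameter holder (class and field names are
hypothetical, not the actual API; Commons Math also has dedicated
exception types such as NotStrictlyPositiveException that a real patch
would likely use instead of IllegalArgumentException):

```java
// Hypothetical sketch: reject invalid arguments eagerly instead of
// silently substituting a default value.
public class MiniBatchKMeansParams {
    private final int k;
    private final int maxIterations;
    private final int batchSize;

    public MiniBatchKMeansParams(int k, int maxIterations, int batchSize) {
        if (k <= 0) {
            throw new IllegalArgumentException("number of clusters must be positive: " + k);
        }
        if (maxIterations <= 0) {
            // Raise instead of quietly clamping or falling back to a default.
            throw new IllegalArgumentException("maxIterations must be positive: " + maxIterations);
        }
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive: " + batchSize);
        }
        this.k = k;
        this.maxIterations = maxIterations;
        this.batchSize = batchSize;
    }

    public int getK() { return k; }
    public int getMaxIterations() { return maxIterations; }
    public int getBatchSize() { return batchSize; }
}
```

This way a negative value passed by mistake fails fast at construction
time rather than producing a silently different clustering.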
And this is one of the parameters for which there should not be a
default, IMO.

[Note: there are spurious characters in your message (see e.g. the
paragraph quoted just above) that make it difficult to read.]

Best regards,
Gilles

> My brief understanding of MiniBatchKMeans:
> Use a subset of the points to initialize the cluster centers, then use
> a random mini batch in each training iteration.
> It can finish in a few seconds when clustering millions of data
> points, and shows only small differences from KMeans.
>
> More information about MiniBatchKMeans:
> https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
> https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org
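P.S. For readers skimming the thread, the mini-batch scheme summarized
in the quoted message can be sketched roughly as below (after Sculley's
paper linked above). This is an illustration only: the class name, the
plain-array API, and the random initialization (where the real code
would use k-means++) are all assumptions, not the proposed Commons Math
implementation.

```java
import java.util.Random;

// Sketch of mini-batch k-means: initialise centers from randomly
// sampled points, then refine them with random mini-batches using a
// per-center learning rate of 1/count (Sculley, 2010).
public class MiniBatchKMeansSketch {

    public static double[][] cluster(double[][] points, int k,
                                     int batchSize, int iterations, long seed) {
        Random rng = new Random(seed);
        int dim = points[0].length;
        // 1. Initialise centers from randomly chosen points
        //    (a real implementation would use k-means++ here).
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[rng.nextInt(points.length)].clone();
        }
        long[] counts = new long[k];
        // 2. Refine with random mini-batches.
        for (int it = 0; it < iterations; it++) {
            for (int b = 0; b < batchSize; b++) {
                double[] x = points[rng.nextInt(points.length)];
                int nearest = nearestCenter(centers, x);
                counts[nearest]++;
                // Per-center learning rate: the center is a running
                // mean of all points ever assigned to it.
                double eta = 1.0 / counts[nearest];
                for (int d = 0; d < dim; d++) {
                    centers[nearest][d] += eta * (x[d] - centers[nearest][d]);
                }
            }
        }
        return centers;
    }

    private static int nearestCenter(double[][] centers, double[] x) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0;
            for (int d = 0; d < x.length; d++) {
                double diff = centers[c][d] - x[d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```

Because each iteration touches only batchSize points instead of the
whole dataset, the cost per iteration is independent of the number of
points, which is where the seconds-vs-minutes difference reported above
comes from.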