On Wed, Jun 1, 2016 at 5:46 PM, Gilles <gil...@harfang.homelinux.org> wrote:
> On Wed, 1 Jun 2016 17:24:47 +0300, Artem Barger wrote:
>> On Tue, May 31, 2016 at 4:04 PM, Artem Barger <ar...@bargr.net> wrote:
>>> Hi,
>>>
>>> The current implementation of k-means within the CM framework inherently
>>> uses the algorithm published by Arthur, David, and Sergei Vassilvitskii,
>>> "k-means++: The advantages of careful seeding." Proceedings of the
>>> eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for
>>> Industrial and Applied Mathematics, 2007. However, other algorithms for
>>> initial seeding are available, for instance:
>>>
>>> 1. Random initialization (each center picked uniformly at random).
>>> 2. Canopy: https://en.wikipedia.org/wiki/Canopy_clustering_algorithm
>>> 3. Bicriteria: Feldman, Dan, et al. "Bi-criteria linear-time
>>> approximations for generalized k-mean/median/center." Proceedings of the
>>> twenty-third annual symposium on Computational geometry. ACM, 2007.
>>>
>>> While I understand that k-means++ is the preferable option, the others
>>> could also be used for testing, trials, and evaluation.
>>>
>>> I'd like to propose separating the seeding logic from the clustering
>>> logic to increase the flexibility of k-means clustering. I would be glad
>>> to hear your comments, pros/cons, or objections...
>>
>> I've found "Scalable KMeans", or k-means|| as it is called in
>> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, which
>> provides a parallelizable seeding procedure. I guess this might serve as
>> an additional +1 vote for separating the seeding from Lloyd's iterations
>> in the current implementation of k-means.
>
> I guess that, around here, you are the expert about these algorithms...

Thanks for giving me credit, although I'm still not an expert. :) I'd say I
"have reasonable knowledge"...

> So go ahead and (re)write the code as you see fit, while still taking into
> account that the code should be self-documenting as much as possible.
> And OO (since this is Java).

Will do my best.
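For the record, here is a minimal sketch of what the proposed separation might look like. The names (SeedingStrategy, RandomSeeding) are hypothetical illustrations, not the actual Commons Math API, and the example uses plain double[] points instead of Clusterable to keep it self-contained:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Hypothetical interface isolating the seeding step from Lloyd's
 * iterations. Any seeder (k-means++, Canopy, Bicriteria, k-means||)
 * would implement this and be passed to the clusterer.
 */
interface SeedingStrategy {
    /** Picks k initial cluster centers from the given points. */
    List<double[]> chooseInitialCenters(List<double[]> points, int k);
}

/** Simplest strategy from the list above: uniform random initialization. */
class RandomSeeding implements SeedingStrategy {
    private final Random rng;

    RandomSeeding(Random rng) {
        this.rng = rng;
    }

    @Override
    public List<double[]> chooseInitialCenters(List<double[]> points, int k) {
        // Copy so that removals do not modify the caller's list.
        List<double[]> remaining = new ArrayList<>(points);
        List<double[]> centers = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            // Remove the chosen point so no center is picked twice.
            centers.add(remaining.remove(rng.nextInt(remaining.size())));
        }
        return centers;
    }
}
```

The clusterer would then take a SeedingStrategy in its constructor, so an alternative seeder could be swapped in for trials without touching the Lloyd's-iteration code at all.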
> If you are up for a major refactoring (e.g. for sparse data), I'd suggest
> doing it in a new package, so that we can easily compare the old and new
> code (e.g. run the tests).

I'm working on it; it is not as trivial as I initially thought. I have
described several concerns in other threads, one of them being, for
example, the usage of the generic parameter T and some assumptions
enforced by the Clusterable interface.

> And if you contemplate parallelization, I wonder whether the issue of
> switching to Java 8 might not have to be resolved first.

I'm absolutely positive about switching to Java 8, as I can see many
benefits, both API- and performance-wise. What is the proper process for
moving to Java 8?

Best,
Artem Barger.