On Wed, Jun 1, 2016 at 5:46 PM, Gilles <gil...@harfang.homelinux.org> wrote:
> On Wed, 1 Jun 2016 17:24:47 +0300, Artem Barger wrote:
>> On Tue, May 31, 2016 at 4:04 PM, Artem Barger <ar...@bargr.net> wrote:
>>> Hi,
>>>
>>> The current implementation of k-means within the CM framework inherently
>>> uses the algorithm published by Arthur, David, and Sergei Vassilvitskii,
>>> "k-means++: The advantages of careful seeding." Proceedings of the
>>> eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for
>>> Industrial and Applied Mathematics, 2007. However, other algorithms for
>>> initial seeding are available, for instance:
>>>
>>> 1. Random initialization (each center picked uniformly at random).
>>> 2. Canopy: https://en.wikipedia.org/wiki/Canopy_clustering_algorithm
>>> 3. Bicriteria: Feldman, Dan, et al. "Bi-criteria linear-time
>>> approximations for generalized k-mean/median/center." Proceedings of the
>>> twenty-third annual symposium on Computational geometry. ACM, 2007.
>>>
>>> While I understand that k-means++ is the preferable option, the others
>>> could also be used for testing, trials, and evaluation.
>>>
>>> I'd like to propose separating the seeding logic from the clustering
>>> logic to increase the flexibility of k-means clustering. I would be glad
>>> to hear your comments, pros/cons, or objections...
>>
>> I've found "Scalable KMeans", or k-means|| as it is called in
>> http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf, which
>> provides a parallelizable seeding procedure. I guess this might serve as
>> an additional +1 vote for separating the seeding from Lloyd's iterations
>> in the current implementation of k-means.
>
> I guess that, around here, you are the expert about these algorithms...

Thanks for giving me credit, although I'm still not an expert. :) I'd say I
"have reasonable knowledge"...

> So go ahead and (re)write the code as you see fit, while still taking into
> account that the code should be self-documenting as much as possible.
> And OO (since this is Java).

Will do my best.
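For the record, here is a minimal sketch of what the proposed separation might look like. The names (SeedingStrategy, RandomSeeding) are hypothetical illustrations, not the actual Commons Math API, and the example uses plain double[] points instead of Clusterable to keep it self-contained:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Hypothetical interface isolating the seeding step from Lloyd's
 * iterations. Any seeder (k-means++, Canopy, Bicriteria, k-means||)
 * would implement this and be passed to the clusterer.
 */
interface SeedingStrategy {
    /** Picks k initial cluster centers from the given points. */
    List<double[]> chooseInitialCenters(List<double[]> points, int k);
}

/** Simplest strategy from the list above: uniform random initialization. */
class RandomSeeding implements SeedingStrategy {
    private final Random rng;

    RandomSeeding(Random rng) {
        this.rng = rng;
    }

    @Override
    public List<double[]> chooseInitialCenters(List<double[]> points, int k) {
        // Copy so that removals do not modify the caller's list.
        List<double[]> remaining = new ArrayList<>(points);
        List<double[]> centers = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            // Remove the chosen point so no center is picked twice.
            centers.add(remaining.remove(rng.nextInt(remaining.size())));
        }
        return centers;
    }
}
```

The clusterer would then take a SeedingStrategy in its constructor, so an alternative seeder could be swapped in for trials without touching the Lloyd's-iteration code at all.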
> If you are up for a major refactoring (e.g. for sparse data), I'd suggest
> doing it in a new package, so that we can easily compare the old and new
> code (e.g. run the tests).

I'm working on it; it is not as trivial as I initially thought. I have
described several concerns in other threads, one of them being, for
example, the usage of the generic parameter T and some assumptions
enforced by the Clusterable interface.

> And if you contemplate parallelization, I wonder whether the issue of
> switching to Java 8 might not have to be resolved first.

I'm absolutely positive about switching to Java 8, as I can see many
benefits, both API- and performance-wise. What is the proper process for
moving to Java 8?

Best,
Artem Barger.