Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Christoph Brücke
Hi Ankur, thank you for answering. But my problem is not that I'm stuck in a local extremum, but rather the reproducibility of kmeans. What I'm trying to achieve is: when the input data and all the parameters stay the same, especially the seed, I want to get the exact same results. Even though the

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Yu Zhang
I agree with what Ankur said. The kmeans seeding program ('takeSample' method) runs in parallel, so each partition draws its sampling points from its local data, which means the result is not 'partition agnostic'. The seeding method is based on the k-means|| algorithm of Bahmani et al., which gives approximation gu
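To illustrate the point about partition-local sampling, here is a minimal sketch in plain Python. It is a toy model, not Spark's actual 'takeSample' code: the function name `sample_per_partition`, the contiguous split, and the `seed + index` rule for per-partition generators are all assumptions made for illustration. It only shows that a fixed seed gives a repeatable sample for a fixed partitioning, while the same seed under a different partitioning will usually give a different sample.

```python
import random

def sample_per_partition(data, num_partitions, seed, fraction):
    """Toy model (hypothetical, not Spark's implementation): split data into
    contiguous partitions and Bernoulli-sample each partition with its own
    generator, seeded from the global seed and the partition index."""
    size = -(-len(data) // num_partitions)  # ceiling division
    sampled = []
    for index in range(num_partitions):
        partition = data[index * size:(index + 1) * size]
        rng = random.Random(seed + index)  # assumed per-partition seed rule
        sampled.extend(x for x in partition if rng.random() < fraction)
    return sampled

data = list(range(1000))

# Same data, same seed, same partitioning: the sample is repeatable.
two_a = sample_per_partition(data, 2, seed=42, fraction=0.1)
two_b = sample_per_partition(data, 2, seed=42, fraction=0.1)

# Same data, same seed, different partitioning: elements meet different
# random draws, so the sample generally differs.
four = sample_per_partition(data, 4, seed=42, fraction=0.1)

print(two_a == two_b)
print(two_a == four)
```

If the initial centers are chosen from such a sample, any difference in the sample can carry through to the final clustering, which would match the behaviour described in this thread.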

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Ankur Srivastava
Hi Christoph, I am not an expert in ML and have not used Spark KMeans, but your problem seems to be an issue of local minimum vs. global minimum. You should run K-means multiple times with random starting points and also try multiple values of K (unless you are already sure). Hope this helps.
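The restart suggestion above can be sketched as follows. This is a small, self-contained 1-D Lloyd-style k-means in plain Python, not Spark's KMeans; the names `kmeans_1d` and `best_of_restarts`, the fixed iteration count, and the sample points are all hypothetical choices for illustration. Each run is fully determined by its seed, and the restart helper simply keeps the run with the lowest within-cluster cost.

```python
import random

def kmeans_1d(points, k, seed, iterations=20):
    """Run a basic k-means on 1-D points; the seed picks the initial centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random starting points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center.
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if empty.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    cost = sum(min((p - c) ** 2 for c in centers) for p in points)
    return sorted(centers), cost

def best_of_restarts(points, k, seeds):
    """Run k-means once per seed and keep the lowest-cost result."""
    return min((kmeans_1d(points, k, s) for s in seeds), key=lambda r: r[1])

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2]
best_centers, best_cost = best_of_restarts(points, 3, seeds=range(10))
print(best_centers, best_cost)
```

In this single-process sketch, repeating a run with the same seed and the same input order gives the same centers; restarts address the local-minimum question, while the partition-dependence discussed elsewhere in this thread is a separate issue.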

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Christoph Bruecke
Hi Anastasios, thanks for the reply but caching doesn’t seem to change anything. After further investigation it really seems that the RDD#takeSample method is the cause of the non-reproducibility. Is this considered a bug and should I open an Issue for that? BTW: my example script contains a l

Re: KMeans Clustering is not Reproducible

2017-05-22 Thread Anastasios Zouzias
Hi Christoph, Take a look at this, you might end up having a similar case: http://www.spark.tc/using-sparks-cache-for-correctness-not-just-performance/ If this is not the case, then I agree with you that kmeans should be partitioning agnostic (although I haven't checked the code yet). Best, Anasta

KMeans Clustering is not Reproducible

2017-05-22 Thread Christoph Bruecke
Hi, I’m trying to figure out how to use KMeans in order to achieve reproducible results. I have found that running the same kmeans instance on the same data with different partitioning will produce different clusterings. Given a simple KMeans run with fixed seed returns different results on th