Hi Ankur,
thank you for answering. But my problem is not that I'm stuck in a local
extremum, but rather the reproducibility of KMeans. What I'm trying to
achieve is: when the input data and all the parameters stay the same,
especially the seed, I want to get the exact same results. Even though the
I agree with what Ankur said. The KMeans seeding routine (the 'takeSample'
method) runs in parallel, so each partition draws its sample points from
its local data, which makes the result depend on the partitioning (i.e.
it is not partition-agnostic). The seeding method is based on the
k-means|| algorithm of Bahmani et al., which gives approximation gu
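To make the mechanism concrete, here is a toy sketch (plain Python, not Spark's actual implementation; the function names `local_candidates` and `take_sample` are hypothetical) of why per-partition sampling with a fixed seed still depends on how the data is split: each partition contributes candidates drawn from its local slice, so the candidate pool itself changes with the partitioning.

```python
import random

def local_candidates(partitions, seed, k=2):
    """Each partition draws up to k candidates from its *local* data,
    seeded deterministically by the base seed plus its partition index.
    A simplified model of a distributed sample, not Spark's code."""
    pool = []
    for i, part in enumerate(partitions):
        rng = random.Random(seed + i)  # per-partition deterministic RNG
        pool.extend(rng.sample(part, min(k, len(part))))
    return pool

def take_sample(partitions, seed, k=2):
    # Driver-side final draw of k elements from the pooled candidates.
    pool = local_candidates(partitions, seed, k)
    return sorted(random.Random(seed).sample(pool, k))

data = list(range(100))
two_parts  = [data[:50], data[50:]]
four_parts = [data[i:i + 25] for i in range(0, 100, 25)]

# Same seed, same data -- but the candidate pool has 4 elements under
# two partitions and 8 under four, so the final draw generally differs.
print(take_sample(two_parts, seed=7))
print(take_sample(four_parts, seed=7))
```

With a fixed partitioning the draw is fully reproducible; only when the number or contents of partitions change does the sample move, which matches the behaviour observed in this thread.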
Hi Christoph,
I am not an expert in ML and have not used Spark KMeans, but your problem
seems to be an issue of local minimum vs. global minimum. You should run
k-means multiple times with random starting points, and also try multiple
values of K (unless you are already sure of it).
Hope this helps.
Hi Anastasios,
thanks for the reply, but caching doesn’t seem to change anything.
After further investigation it really seems that the RDD#takeSample method is
the cause of the non-reproducibility.
Is this considered a bug and should I open an Issue for that?
BTW: my example script contains a l
Hi Christoph,
Take a look at this, you might end up having a similar case:
http://www.spark.tc/using-sparks-cache-for-correctness-not-just-performance/
If this is not the case, then I agree with you that KMeans should be
partition-agnostic (although I haven't checked the code yet).
Best,
Anastasios
Hi,
I’m trying to figure out how to use KMeans in order to achieve reproducible
results. I have found that running the same KMeans instance on the same data,
but with different partitioning, produces different clusterings.
Given a simple KMeans run with fixed seed returns different results on th
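A useful sanity check for the discussion above: the k-means iterations themselves are fully deterministic, so any run-to-run variation must come from the initialization (the seeding/sampling step). Here is a hypothetical toy Lloyd's-algorithm implementation in plain Python (1-D points for brevity, not Spark's code) showing that identical data plus identical initial centers always yield identical final centers.

```python
def kmeans(points, centers, iters=20):
    """Toy 1-D Lloyd iterations: deterministic given data and initial centers."""
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        # Recompute each center as its cluster mean (keep the old center
        # if a cluster ends up empty).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
# Two runs with the same initial centers converge to the same result:
print(kmeans(data, [0.0, 5.0]))
print(kmeans(data, [0.0, 5.0]))
```

So if the seeding step were partition-agnostic, a fixed seed would give reproducible clusterings regardless of partitioning; the observed differences point at the sampling, not the iterations.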