Hi Stuti,
I think you're right. The epsilon parameter is indeed used as a threshold
for deciding when KMeans has converged. If you look at line 201 of MLlib's
KMeans.scala:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L201
you can see how epsilon is used there to decide whether a center has moved
far enough for its run to still count as unconverged.
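Roughly, the test there boils down to something like the sketch below. This
is only a simplified illustration with placeholder names, not the actual
MLlib source; comparing the squared distance to epsilon squared is the same
as comparing the Euclidean distance to epsilon, just without the square root.

    // Simplified sketch, not the real MLlib code: has this centroid moved
    // more than epsilon (in Euclidean distance) since the previous iteration?
    def stillMoving(oldCenter: Array[Double],
                    newCenter: Array[Double],
                    epsilon: Double): Boolean = {
      val squaredDistance =
        oldCenter.zip(newCenter).map { case (a, b) => (a - b) * (a - b) }.sum
      // squaredDistance > epsilon^2  is equivalent to  distance > epsilon
      squaredDistance > epsilon * epsilon
    }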
Stuti,
- The two numbers are used in different contexts, but they finally end up on
the two sides of an && operator.
- A parallel K-Means run consists of multiple iterations, each of which
moves the centroids around. A centroid is deemed stabilized when the
Euclidean distance between its successive positions drops below epsilon
(see the sketch below).
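To make the && point concrete, here is a stripped-down single-run sketch.
It is not the MLlib implementation, only a toy illustration that assumes
dense points as Array[Double] and a plain Euclidean distance helper.

    // Toy single-run K-Means sketch (not the actual MLlib code). Note how
    // the iteration budget and the convergence test end up on the two
    // sides of a single && condition.
    object OneRunSketch {
      def distance(a: Array[Double], b: Array[Double]): Double =
        math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

      def run(data: Seq[Array[Double]],
              initialCenters: Seq[Array[Double]],
              maxIterations: Int,
              epsilon: Double): Seq[Array[Double]] = {
        var centers   = initialCenters
        var iteration = 0
        var converged = false
        while (iteration < maxIterations && !converged) {
          // assign every point to its closest center, then recompute each
          // center as the mean of its assigned points
          val newCenters = centers.indices.map { j =>
            val members = data.filter { p =>
              centers.indices.minBy(i => distance(p, centers(i))) == j
            }
            if (members.isEmpty) centers(j)
            else members.transpose.map(_.sum / members.size).toArray
          }
          // the run is converged only when every centroid moved at most epsilon
          converged = centers.zip(newCenters).forall { case (o, n) =>
            distance(o, n) <= epsilon
          }
          centers = newCenters
          iteration += 1
        }
        centers
      }
    }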
Stuti,
I'm answering your questions in order:
1. From MLlib
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L159
you can see that clustering stops when we have reached *maxIterations* or
there are no more *activeRuns*.
KMeans is executed as several runs in parallel, and a run leaves
*activeRuns* once it has converged (see the sketch below).
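A toy sketch of that outer loop, with the per-run work elided and the
epsilon test replaced by a random draw. This only illustrates the control
flow, not the MLlib code; the run count and maxIterations are made up.

    import scala.util.Random

    // Toy sketch (not the actual MLlib code): several runs iterate
    // together, and a run leaves the active set as soon as it converges.
    // Iteration stops at maxIterations or when no runs remain active.
    final case class Run(id: Int)

    val maxIterations = 20                           // made-up value
    var activeRuns    = Seq(Run(0), Run(1), Run(2))  // made-up run count
    var iteration     = 0
    while (iteration < maxIterations && activeRuns.nonEmpty) {
      // in the real code, this is where each active run updates its centers
      // and checks whether they all moved less than epsilon
      activeRuns = activeRuns.filter(_ => Random.nextDouble() > 0.3)
      iteration += 1
    }
    println(s"stopped after $iteration iterations, ${activeRuns.size} still active")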
In Spark's KMeans, if no cluster center moves more than epsilon in
Euclidean distance from the previous iteration, the algorithm finishes; no
further iterations are performed. For Mahout, you need to check the
documentation or the code to see what epsilon means there. -Xiangrui
It is running k-means many times, independently, from different random
starting points in order to pick the best clustering. Convergence ends
one run, not all of them.
Yes, epsilon should be the same as the "convergence threshold" elsewhere.
You can set epsilon if you instantiate KMeans directly.
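Something along these lines should work (a minimal sketch assuming an
existing SparkContext named sc; the data and parameter values are made up):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Made-up toy data; in practice this is your real RDD[Vector].
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    val model = new KMeans()
      .setK(2)
      .setMaxIterations(20)
      .setEpsilon(1e-6)   // tighter convergence threshold than the default
      .run(points)

    println(model.clusterCenters.mkString(", "))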
Hi All,
Any ideas on this ??
Thanks
Stuti Awasthi
From: Stuti Awasthi
Sent: Wednesday, May 14, 2014 6:20 PM
To: user@spark.apache.org
Subject: Understanding epsilon in KMeans
Hi All,
I wanted to understand the functionality of epsilon in KMeans in Spark MLlib.
As per documentation: epsilon is the distance threshold within which we
consider the centers to have converged; if all centers move less than this
Euclidean distance, we stop iterating one run.
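For concreteness, this is the behaviour I am trying to confirm, with
made-up numbers and assuming the default epsilon of 1e-4:

    // Made-up numbers to illustrate my understanding of the threshold.
    val epsilon   = 1e-4                      // the default, as far as I can tell
    val oldCenter = Array(1.0, 2.0)
    val newCenter = Array(1.00003, 2.00004)   // where the center lands next iteration
    val movement  = math.sqrt(
      oldCenter.zip(newCenter).map { case (a, b) => (a - b) * (a - b) }.sum)
    // movement is roughly 5.0e-5 < epsilon, so this center counts as
    // converged; once every center moves less than epsilon, the run stops.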