Re: extremely slow k-means version

2014-04-19 Thread ticup
Thanks a lot for the explanation, Matei. As a matter of fact, I was just reading up on narrow and wide dependencies in the paper and saw that groupByKey is indeed a wide dependency, which, as you explained, is the problem. Maybe it wouldn't be a bad thing to have a section in the docs on the wi

Re: extremely slow k-means version

2014-04-19 Thread Matei Zaharia
The problem is that groupByKey means “bring all the points with this same key to the same JVM”. Your input is a Seq[Point], so you have to have all the points there. This means that a) all points will be sent across the network in a cluster, which is slow (and Spark goes through this sending cod
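
To make the cost described above concrete, here is a rough sketch in plain Python (no Spark involved) that counts how many records would have to leave each partition under two strategies. The toy data, the function names, and the two-partition layout are all made up for illustration. The second strategy, pre-aggregating a (sum, count) per key inside each partition before anything is sent, mirrors the map-side combining that Spark's reduceByKey performs; it is shown here as one common alternative, not as the fix settled on in this thread.

```python
from collections import defaultdict

# Hypothetical toy data: two partitions of (cluster_id, (x, y)) pairs.
partitions = [
    [(0, (1.0, 2.0)), (1, (5.0, 6.0)), (0, (3.0, 4.0))],
    [(1, (7.0, 8.0)), (0, (5.0, 6.0))],
]


def group_then_average(parts):
    """groupByKey style: every single point is shipped to its key's owner."""
    shuffled = 0
    groups = defaultdict(list)
    for part in parts:
        for key, point in part:
            groups[key].append(point)  # the full point crosses the "network"
            shuffled += 1
    centers = {
        key: tuple(sum(coord) / len(points) for coord in zip(*points))
        for key, points in groups.items()
    }
    return centers, shuffled


def combine_then_average(parts):
    """Pre-aggregation style: each partition sends one (sum, count) per key."""
    shuffled = 0
    totals = {}
    for part in parts:
        local = {}  # per-partition partial sums, built before any sending
        for key, (x, y) in part:
            sx, sy, n = local.get(key, (0.0, 0.0, 0))
            local[key] = (sx + x, sy + y, n + 1)
        for key, (sx, sy, n) in local.items():
            tx, ty, tn = totals.get(key, (0.0, 0.0, 0))
            totals[key] = (tx + sx, ty + sy, tn + n)
            shuffled += 1  # only the small partial result is sent
    centers = {key: (sx / n, sy / n) for key, (sx, sy, n) in totals.items()}
    return centers, shuffled


grouped_centers, grouped_sent = group_then_average(partitions)
combined_centers, combined_sent = combine_then_average(partitions)
print(grouped_centers, grouped_sent)
print(combined_centers, combined_sent)
```

Both functions produce the same centers. With grouping, the number of records sent grows with the number of points; with pre-aggregation it is bounded by (number of partitions) x (number of clusters), which is the difference that matters once the data set gets large.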