How many partitions do you use for your data? If the default is 1, you probably need to ask for more partitions manually.
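As a rough sketch (the input path, vector-parsing logic, and parameter values below are placeholders, not your actual code), repartitioning before training would look something like:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// `sc` is your SparkContext; the path and parsing are hypothetical.
val data = sc.textFile("hdfs://...")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .repartition(sc.defaultParallelism * 4) // spread tasks across all executor cores
  .cache()                                // KMeans makes many passes over the data

val model = KMeans.train(data, k = 10, maxIterations = 20)
```

With only one partition, every stage runs as a single task on a single core, which matches what you're seeing in the Web UI.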
Also, I'd check that your executors aren't thrashing close to the GC limit. That can make things very slow.

On Fri, Jul 11, 2014 at 9:53 AM, durin <m...@simon-schaefer.net> wrote:
> Hi,
>
> I'm trying to use org.apache.spark.mllib.clustering.KMeans to do some basic
> clustering with Strings.
>
> My code works great when I use a five-figure amount of training elements.
> However, with for example 2 million elements, it gets extremely slow. A
> single stage may take up to 30 minutes.
>
> From the Web UI, I can see that it does these three things repeatedly:
>
>
> All of these tasks only use one executor, and on that executor only one
> core. And I can see a scheduler delay of about 25 seconds.
>
> I tried to use broadcast variables to speed this up, but maybe I'm using it
> wrong. The relevant code (where it gets slow) is this:
>
>
>
>
> What could I do to use more executors, and generally speed this up?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-for-large-training-data-tp9407.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.