Re: Only master is really busy at KMeans training

2014-08-26 Thread durin
Right now, I have issues even at a far earlier point. I'm fetching data from a registerd table via var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT 2000").map(_.head.toString).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) //persisted because it's used again lat

Re: Only master is really busy at KMeans training

2014-08-26 Thread Xiangrui Meng
How many partitions now? Btw, which Spark version are you using? I checked your code and I don't understand why you want to broadcast vectors2, which is an RDD. var vectors2 = vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) var broadcastVector = sc.bro

Re: Only master is really busy at KMeans training

2014-08-25 Thread durin
With a lower number of partitions, I keep losing executors during collect at KMeans.scala:283 The error message is "ExecutorLostFailure (executor lost)". The program recovers by automatically repartitioning the whole dataset (126G), which takes very long and seems to only delay the inevitable

Re: Only master is really busy at KMeans training

2014-08-19 Thread Xiangrui Meng
There are only 5 worker nodes. So please try to reduce the number of partitions to the number of available CPU cores. 1000 partitions are too bigger, because the driver needs to collect to task result from each partition. -Xiangrui On Tue, Aug 19, 2014 at 1:41 PM, durin wrote: > When trying to us