Right now, I have issues even at a far earlier point.
I'm fetching data from a registered table via

var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT 2000")
  .map(_.head.toString)
  .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) // persisted because it's used again later
How many partitions now? Btw, which Spark version are you using? I
checked your code and I don't understand why you want to broadcast
vectors2, which is an RDD.
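For reference, both questions can be checked directly in the shell with the standard RDD/SparkContext API (a minimal sketch; texts is just the RDD defined above, and the same call works on any RDD):

println(texts.partitions.length) // current number of partitions of the RDD
println(sc.version)              // Spark version of the running context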
var vectors2 = vectors.repartition(1000)
  .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
var broadcastVector = sc.broadcast(vectors2)
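Broadcast variables are meant for local values that should be shipped to every executor once; an RDD is already distributed, so broadcasting it is not useful. A minimal sketch of the usual pattern, assuming the data is small enough to fit in driver memory (otherwise drop the broadcast entirely):

val localVectors = vectors2.collect()            // materialize the RDD on the driver
val broadcastVector = sc.broadcast(localVectors) // ship the local array to every executor once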
With a lower number of partitions, I keep losing executors during
collect at KMeans.scala:283.
The error message is "ExecutorLostFailure (executor lost)".
The program recovers by automatically repartitioning the whole dataset
(126G), which takes a very long time and seems to only delay the
inevitable.
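For context, a hypothetical version of the training call that leads to that collect; k = 100 and maxIterations = 20 are assumed placeholders, not the actual settings:

import org.apache.spark.mllib.clustering.KMeans
// assumes vectors2 is an RDD[org.apache.spark.mllib.linalg.Vector]
val model = KMeans.train(vectors2, 100, 20) // k = 100, maxIterations = 20 (placeholders)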
There are only 5 worker nodes. So please try to reduce the number of
partitions to the number of available CPU cores. 1000 partitions is
too many, because the driver needs to collect a task result from
each partition. -Xiangrui
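A minimal sketch of that suggestion, assuming 5 workers with 8 cores each (40 is a placeholder; substitute the real core count):

// coalesce avoids a full shuffle when only reducing the partition count
var vectors2 = vectors.coalesce(40)
  .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)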
On Tue, Aug 19, 2014 at 1:41 PM, durin wrote:
> When trying to us