Yes, I have tried repartitioning. I repartitioned to the number of cores in my cluster. Not helping... I also repartitioned to the number of centroids (the k value). Not helping...
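One way to sanity-check whether the repartition actually spread the records (a sketch only; `vectors` is assumed to stand for the cached RDD[Vector] that is passed to KMeans.train, which the thread does not name):

    // Verify the partition count, and count records per partition;
    // if most partitions are empty, one executor ends up doing all the work.
    println(s"partitions: ${vectors.partitions.length}")
    vectors
      .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
      .collect()
      .foreach { case (i, n) => println(s"partition $i: $n records") }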
On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley <jos...@databricks.com> wrote:

> Can you try specifying the number of partitions when you load the data,
> so that it equals the number of executors? If your ETL changes the number
> of partitions, you can also repartition before calling KMeans.
>
> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen <davidshe...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a large data set, and I expect to get 5000 clusters.
>>
>> I load the raw data and convert it into DenseVectors; then I repartition
>> and cache; finally I pass the RDD[Vector] to KMeans.train().
>>
>> The job is now running and the data is loaded. But according to the
>> Spark UI, all the data is loaded onto one executor. I checked that
>> executor: its CPU workload is very low, and I think it is using only 1
>> of its 8 cores. The other 3 executors are idle.
>>
>> Did I miss something? Is it possible to distribute the workload to all
>> 4 executors?
>>
>> Thanks,
>> David
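Put into code, Joseph's suggestion would look something like the sketch below. It assumes Spark 1.x MLlib, the 4-executor / 8-core cluster described in the thread, and a text input of space-separated doubles (the thread never says what the raw data looks like); the HDFS path and app name are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val sc = new SparkContext(new SparkConf().setAppName("kmeans-5000"))

    // 4 executors x 8 cores, per the thread.
    val numPartitions = 4 * 8

    // Ask for a partition count at load time (minPartitions is only a lower bound)...
    val raw = sc.textFile("hdfs:///path/to/data", minPartitions = numPartitions)

    // ...parse each line into a dense vector (space-separated doubles assumed)...
    val vectors = raw
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .repartition(numPartitions) // ...and force a shuffle in case ETL collapsed partitions
      .cache()

    // k = 5000 as in the original post; 20 iterations is an arbitrary choice here.
    val model = KMeans.train(vectors, 5000, 20)
    println(s"trained ${model.clusterCenters.length} centers")

The explicit repartition after the map is the key step: even if the file loads into enough partitions, a later transformation can collapse them, so forcing the shuffle right before caching and training is the safer pattern.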