Yes, I have tried repartitioning.

I tried repartitioning to the total number of cores in my cluster. It did not help.
I also tried repartitioning to the number of centroids (the k value). It did not help either.
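
For reference, this is roughly what I did (a minimal sketch; "vectors" stands
in for my actual RDD[Vector], and the 4 executors x 8 cores layout is the one
described in my first email below):

    // attempt 1: repartition to the total number of cores in the cluster
    val byCores = vectors.repartition(4 * 8).cache()

    // attempt 2: repartition to the number of centroids (k = 5000)
    val byK = vectors.repartition(5000).cache()

Neither changed how the work was distributed.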


On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley <jos...@databricks.com>
wrote:

> Can you try specifying the number of partitions when you load the data, so
> that it equals the number of executors?  If your ETL changes the number of
> partitions, you can also repartition before calling KMeans.
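>
> For example, something like this (a rough sketch; the path and the executor
> count of 4 are placeholders for your setup):
>
>     // load with an explicit partition count matching the number of executors
>     val raw = sc.textFile("hdfs://...", minPartitions = 4)
>
>     // or repartition your RDD[Vector] after ETL, just before KMeans.train
>     val repartitioned = vectors.repartition(4).cache()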
>
>
> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen <davidshe...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a large data set, and I expect to get 5000 clusters.
>>
>> I load the raw data and convert it into DenseVectors; then I repartition
>> and cache the RDD; finally I pass the RDD[Vector] to KMeans.train().
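>>
>> Roughly like this (a simplified sketch; the path, the CSV parsing, and the
>> partition count are placeholders for my real job):
>>
>>     import org.apache.spark.mllib.clustering.KMeans
>>     import org.apache.spark.mllib.linalg.Vectors
>>
>>     // load, parse each line into a dense vector, repartition, and cache
>>     val vectors = sc.textFile("hdfs://...")
>>       .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
>>       .repartition(32)
>>       .cache()
>>
>>     val model = KMeans.train(vectors, 5000, 20)  // k = 5000, 20 iterations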
>>
>> Now the job is running and the data is loaded. But according to the Spark
>> UI, all of the data sits on a single executor. I checked that executor: its
>> CPU load is very low, and I think it is using only 1 of its 8 cores. The
>> other 3 executors are idle.
>>
>> Did I miss something? Is it possible to distribute the workload across all
>> 4 executors?
>>
>>
>> Thanks,
>> David
>>
>>
>
