I'm mostly using the example code, see here:
http://paste.openstack.org/show/211966/
The data has 799305 dimensions and is space-separated.
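
In case the paste link dies, the gist of what I'm running is roughly the
following (a sketch only; the path, K, and iteration count are placeholders,
not the exact values from the paste):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansRun {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansRun"))
    // One sample per line, 799305 space-separated feature values.
    val parsed = sc.textFile("hdfs:///path/to/data.txt") // placeholder path
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()
    // k and maxIterations below are placeholders.
    val model = KMeans.train(parsed, 1000, 20)
    println(s"Cost: ${model.computeCost(parsed)}")
  }
}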

Please note that the issues I'm seeing are, in my opinion, caused by the
Scala implementation, since they also occur when using the Python wrappers.
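
As a possible workaround for the LocalKMeans problem described below, I'm
going to try switching the initialization mode from the default k-means|| to
random, which as far as I can tell skips the kMeansPlusPlus pass on the
driver (untested, just a sketch):

import org.apache.spark.mllib.clustering.KMeans

// Assumption: random init avoids the k-means|| step, and with it the
// LocalKMeans / kMeansPlusPlus run on the driver.
val model = new KMeans()
  .setK(1000)           // placeholder
  .setMaxIterations(20) // placeholder
  .setInitializationMode(KMeans.RANDOM)
  .run(parsed)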



On Wed, Apr 29, 2015 at 8:00 PM, Jeetendra Gangele <gangele...@gmail.com>
wrote:

> How are you passing the feature vectors to KMeans?
> Are they in 2-D space or a 1-D array?
>
> Did you try using Streaming KMeans?
>
> Will you be able to paste the code here?
>
> On 29 April 2015 at 17:23, Sam Stoelinga <sammiest...@gmail.com> wrote:
>
>> Hi Sparkers,
>>
>> I am trying to run MLlib KMeans on a large dataset (50+ GB) with a large
>> K, but I've encountered the following issues:
>>
>>
>>    - The Spark driver runs out of memory and dies because collect gets
>>    called as part of KMeans, which loads all the data back into the
>>    driver's memory.
>>    - At the end a LocalKMeans class runs KMeansPlusPlus on the Spark
>>    driver. Why isn't this distributed? It spends a long time here, and it
>>    has the same problem as point 1: it requires loading the data onto the
>>    driver. Also, while LocalKMeans is running on the driver I'm seeing
>>    lots of:
>>    15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus
>>    initialization ran out of distinct points for centers. Using duplicate
>>    point for center k = 222
>>    - Has the behaviour above been like this in previous releases? I
>>    remember running KMeans before without too many problems.
>>
>> Looking forward to hearing you point out my stupidity or provide
>> work-arounds that could make Spark KMeans work well on large datasets.
>>
>> Regards,
>> Sam Stoelinga
>>
