Guys, great feedback, thanks for pointing out my stupidity :D Rows and columns got mixed up, hence the weird results I was seeing. Please ignore my previous issues; I'll reformat my data first.
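
For reference, this is roughly what the parsing should look like once the data is laid out correctly (one point per line, space-separated feature values, so rows = points and columns = dimensions). This is just a sketch along the lines of the standard MLlib example, not the actual paste; the path, SparkContext (sc from spark-shell) and k value are placeholders:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // One line per data point, space-separated feature values
    // (rows = points, columns = the ~799k dimensions).
    val data = sc.textFile("hdfs:///path/to/features.txt")
    val parsed = data
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // With this many dimensions, Vectors.sparse is probably a better fit
    // if most values are zero.
    val k = 500            // placeholder
    val maxIterations = 20
    val model = KMeans.train(parsed, k, maxIterations)
    println("WSSSE = " + model.computeCost(parsed))
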
On Wed, Apr 29, 2015 at 8:47 PM, Sam Stoelinga <sammiest...@gmail.com> wrote:

> I'm mostly using example code, see here:
> http://paste.openstack.org/show/211966/
> The data has 799305 dimensions and is separated by spaces.
>
> Please note that the issues I'm seeing are, in my opinion, in the Scala
> implementation, since they also happen when using the Python wrappers.
>
>
> On Wed, Apr 29, 2015 at 8:00 PM, Jeetendra Gangele <gangele...@gmail.com>
> wrote:
>
>> How are you passing the feature vectors to KMeans?
>> Is it a 2-D space or a 1-D array?
>>
>> Did you try using Streaming KMeans?
>>
>> Will you be able to paste your code here?
>>
>> On 29 April 2015 at 17:23, Sam Stoelinga <sammiest...@gmail.com> wrote:
>>
>>> Hi Sparkers,
>>>
>>> I am trying to run MLlib KMeans on a large dataset (50+ GB of data)
>>> with a large K, but I've encountered the following issues:
>>>
>>> - The Spark driver runs out of memory and dies because collect gets
>>>   called as part of KMeans, which loads all the data back into the
>>>   driver's memory.
>>> - At the end there is a LocalKMeans class which runs KMeansPlusPlus
>>>   on the Spark driver. Why isn't this distributed? It spends a long
>>>   time there, and it has the same problem as the first point: it
>>>   requires loading the data onto the driver.
>>>   While LocalKMeans is running on the driver I also see lots of:
>>>   15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus
>>>   initialization ran out of distinct points for centers. Using duplicate
>>>   point for center k = 222
>>> - Has this behaviour been the same in previous releases? I remember
>>>   running KMeans before without too many problems.
>>>
>>> Looking forward to hearing you point out my stupidity or provide
>>> workarounds that could make Spark KMeans work well on large datasets.
>>>
>>> Regards,
>>> Sam Stoelinga
>>
>>
>
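
P.S. For anyone else who hits the driver-side LocalKMeans bottleneck described above: if I understand the MLlib code correctly, it is part of the default "k-means||" initialization, so switching to random initialization should avoid it. A rough sketch assuming the public MLlib builder API (parsed is the RDD[Vector] from the earlier snippet, the k value is a placeholder):

    import org.apache.spark.mllib.clustering.KMeans

    // Random initialization skips the k-means|| step whose final center
    // selection (LocalKMeans / kMeansPlusPlus) runs on the driver and
    // produced the "ran out of distinct points" warnings.
    val model = new KMeans()
      .setK(500)                               // placeholder
      .setMaxIterations(20)
      .setInitializationMode(KMeans.RANDOM)    // instead of default "k-means||"
      .run(parsed)
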