Guys, great feedback, thanks for pointing out my stupidity :D Rows and columns got mixed up, hence the weird results I was seeing. Please ignore my previous issues; I'll reformat my data first.
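
For reference, this is roughly what the parsing should look like once the data is laid out correctly (one point per line, space-separated feature values, so rows = points and columns = dimensions). This is just a sketch along the lines of the standard MLlib example, not the actual paste; the path, SparkContext (sc from spark-shell) and k value are placeholders:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // One line per data point, space-separated feature values
    // (rows = points, columns = the ~799k dimensions).
    val data = sc.textFile("hdfs:///path/to/features.txt")
    val parsed = data
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // With this many dimensions, Vectors.sparse is probably a better fit
    // if most values are zero.
    val k = 500            // placeholder
    val maxIterations = 20
    val model = KMeans.train(parsed, k, maxIterations)
    println("WSSSE = " + model.computeCost(parsed))
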
On Wed, Apr 29, 2015 at 8:47 PM, Sam Stoelinga <sammiest...@gmail.com> wrote:

> I'm mostly using example code, see here:
> http://paste.openstack.org/show/211966/
> The data has 799305 dimensions and is separated by spaces.
>
> Please note that the issues I'm seeing are, in my opinion, in the Scala
> implementation, since they also happen when using the Python wrappers.
>
>
> On Wed, Apr 29, 2015 at 8:00 PM, Jeetendra Gangele <gangele...@gmail.com>
> wrote:
>
>> How are you passing the feature vectors to KMeans?
>> Is it a 2-D space or a 1-D array?
>>
>> Did you try using Streaming KMeans?
>>
>> Will you be able to paste your code here?
>>
>> On 29 April 2015 at 17:23, Sam Stoelinga <sammiest...@gmail.com> wrote:
>>
>>> Hi Sparkers,
>>>
>>> I am trying to run MLlib KMeans on a large dataset (50+ GB of data)
>>> with a large K, but I've encountered the following issues:
>>>
>>> - The Spark driver runs out of memory and dies because collect gets
>>>   called as part of KMeans, which loads all the data back into the
>>>   driver's memory.
>>> - At the end there is a LocalKMeans class which runs KMeansPlusPlus
>>>   on the Spark driver. Why isn't this distributed? It spends a long
>>>   time there, and it has the same problem as the first point: it
>>>   requires loading the data onto the driver.
>>>   While LocalKMeans is running on the driver I also see lots of:
>>>   15/04/29 08:42:25 WARN clustering.LocalKMeans: kMeansPlusPlus
>>>   initialization ran out of distinct points for centers. Using duplicate
>>>   point for center k = 222
>>> - Has this behaviour been the same in previous releases? I remember
>>>   running KMeans before without too many problems.
>>>
>>> Looking forward to hearing you point out my stupidity or provide
>>> workarounds that could make Spark KMeans work well on large datasets.
>>>
>>> Regards,
>>> Sam Stoelinga
>>
>>
>
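
P.S. For anyone else who hits the driver-side LocalKMeans bottleneck described above: if I understand the MLlib code correctly, it is part of the default "k-means||" initialization, so switching to random initialization should avoid it. A rough sketch assuming the public MLlib builder API (parsed is the RDD[Vector] from the earlier snippet, the k value is a placeholder):

    import org.apache.spark.mllib.clustering.KMeans

    // Random initialization skips the k-means|| step whose final center
    // selection (LocalKMeans / kMeansPlusPlus) runs on the driver and
    // produced the "ran out of distinct points" warnings.
    val model = new KMeans()
      .setK(500)                               // placeholder
      .setMaxIterations(20)
      .setInitializationMode(KMeans.RANDOM)    // instead of default "k-means||"
      .run(parsed)
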