The new streaming k-means should be able to handle that data pretty efficiently. My guess is that on a single 16-core machine it should be able to complete the clustering in 10 minutes or so. That is extrapolation and thus could be wildly off, of course.
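To make the idea above concrete, here is a minimal single-pass sketch of the streaming k-means family of algorithms. This is an illustrative toy, not the Mahout implementation: the function name, the cutoff schedule, and the `10 * target_k` sketch-size bound are all choices made up for this example. The one-pass structure (each point either updates its nearest weighted centroid or spawns a new one with probability proportional to its distance) is the core trick that makes the approach fast on large data.

```python
import random

def streaming_kmeans(points, target_k, seed=0):
    """One-pass streaming k-means sketch (illustrative, not Mahout's code).

    Maintains a small set of weighted centroids. Each incoming point either
    updates its nearest centroid or, with probability proportional to its
    squared distance, becomes a new centroid. The resulting weighted
    centroids can then be re-clustered down to target_k with any batch
    k-means.
    """
    rng = random.Random(seed)
    centroids = []          # list of (vector, weight)
    distance_cutoff = 1e-3  # raised as the sketch grows

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for p in points:
        if not centroids:
            centroids.append((list(p), 1.0))
            continue
        i, d = min(((i, sq_dist(p, c)) for i, (c, _) in enumerate(centroids)),
                   key=lambda t: t[1])
        if rng.random() < min(1.0, d / distance_cutoff):
            # Far points spawn a new centroid.
            centroids.append((list(p), 1.0))
        else:
            # Near points fold into the running weighted mean.
            c, w = centroids[i]
            centroids[i] = ([(ci * w + pi) / (w + 1) for ci, pi in zip(c, p)],
                            w + 1)
        # If the sketch grows too large, coarsen by raising the cutoff.
        if len(centroids) > 10 * target_k:
            distance_cutoff *= 1.5
    return centroids
```

Because only a small centroid sketch is kept in memory and each point is touched once, the pass parallelizes well across cores, which is where the 16-core estimate comes from.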
You definitely mean sparse.  30 M / 20 M = 1.5 non-zero features per row.
That may be a problem.  Or it might make the clustering fairly trivial.

Dan, that code isn't checked into trunk yet, I think.  Can you comment on
where working code can be found on github?

On Sat, Mar 9, 2013 at 6:36 AM, Colum Foley <[email protected]> wrote:

> I have approximately 20 million items and a feature vector of approx 30
> million in length, very sparse.
>
> Would you have any suggestions for other clustering algorithms I should
> look at?
>
> Thanks,
> Colum
>
> On 8 Mar 2013, at 22:51, Ted Dunning <[email protected]> wrote:
>
> > You are beginning to exit the realm of reasonable applicability for
> > normal k-means algorithms here.
> >
> > How much data do you have?
> >
> > On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <[email protected]> wrote:
> >
> >> Hi All,
> >>
> >> When I run KMeans clustering on a cluster, I notice that when I have
> >> "large" values for k (i.e. approx >1000) I get loads of Hadoop write
> >> errors:
> >>
> >> INFO hdfs.DFSClient: Exception in createBlockOutputStream
> >> java.net.SocketTimeoutException: 69000 millis timeout while waiting
> >> for channel to be ready for read. ch : java.nio.channels.SocketChannel
> >>
> >> This continues indefinitely, and lots of part-0xxxxx files of around
> >> 30 kB are produced.
> >>
> >> If I reduce the value of k it runs fine. Furthermore, if I run it in
> >> local mode with high values of k it runs fine.
> >>
> >> The command I am using is as follows:
> >>
> >> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
> >> --clusters tmp -dm
> >> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
> >> 1.0 -x 20 -cl -k 10000
> >>
> >> I am running Mahout 0.7.
> >>
> >> Are there some performance parameters I need to tune for Mahout when
> >> dealing with large volumes of data?
> >>
> >> Thanks,
> >> Colum
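The sparsity point in the reply above matters for cost: with roughly 1.5 non-zeros per row, distance computations should scale with the non-zero count, not with the 30-million-dimensional vector length. A minimal sketch of that idea, representing each sparse vector as an `{index: value}` dict (Mahout has its own sparse vector classes for this; the dict representation here is purely illustrative):

```python
def sparse_sq_euclidean(a, b):
    """Squared Euclidean distance between two sparse vectors stored as
    {index: value} dicts. Cost is O(nnz(a) + nnz(b)), independent of the
    nominal vector dimensionality."""
    total = 0.0
    for i, v in a.items():
        total += (v - b.get(i, 0.0)) ** 2
    for i, v in b.items():
        if i not in a:
            total += v * v
    return total
```

With ~1.5 non-zeros per row, each pairwise distance is a handful of arithmetic operations, which is why such extreme sparsity can make the clustering either trivial (rows barely overlap) or degenerate, as the reply suggests.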
