SVD techniques probably won't actually help that much given your current
sparsity.  There are two issues:

First, your data is already quite small.  SVD will only make it larger,
because the average number of non-zero elements per row will increase
dramatically.

Second, given your sparsity, SVD will have very little to work with.  Very
sparse data elements are inherently nearly orthogonal.

I think you need to find more features so that your average number of
non-zeros goes up.
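The near-orthogonality point can be sketched numerically. This is a hypothetical illustration, not Mahout code: with roughly 30 million possible features and only about 1.5 non-zeros per row (the numbers from this thread), two random rows almost never share a feature, so their cosine similarity is almost always exactly zero.

```python
import math
import random

DIM = 30_000_000   # feature-vector length from the thread (assumed)
NNZ = 2            # roughly the ~1.5 non-zeros per row computed below

def sparse_row(rng):
    # Represent a binary sparse row as the set of its non-zero indices.
    return frozenset(rng.sample(range(DIM), NNZ))

def cosine(a, b):
    # Cosine similarity of two binary vectors stored as index sets:
    # dot product is the overlap size, norms are sqrt of the set sizes.
    return len(a & b) / math.sqrt(len(a) * len(b))

rng = random.Random(42)
rows = [sparse_row(rng) for _ in range(100)]
sims = [cosine(rows[i], rows[j])
        for i in range(len(rows)) for j in range(i + 1, len(rows))]

# The mean pairwise similarity is essentially zero: the rows are
# nearly orthogonal, which is why SVD finds so little shared structure.
print(f"mean cosine over {len(sims)} pairs: {sum(sims) / len(sims):.6f}")
```

With these numbers the expected overlap between any two rows is on the order of 1e-7, so the printed mean is essentially zero.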

On Sat, Mar 9, 2013 at 12:53 PM, Colum Foley <[email protected]> wrote:

> Thanks a lot Ted. I think there's some preprocessing I can do to remove
> some outliers, which may reduce my matrix size considerably. I'll also
> check out some SVD techniques.
> On 9 Mar 2013 17:16, "Ted Dunning" <[email protected]> wrote:
>
> > The new streaming k-means should be able to handle that data pretty
> > efficiently.  My guess is that on a single 16-core machine it should be
> > able to complete the clustering in 10 minutes or so.  That is
> > extrapolation and thus could be wildly off, of course.
> >
> > You definitely mean sparse.  30 M / 20 M = 1.5 non-zero features per row.
> >  That may be a problem.  Or it might make the clustering fairly trivial.
> >
> > Dan,
> >
> > That code isn't checked into trunk yet, I think.  Can you comment on
> > where working code can be found on github?
> >
> > On Sat, Mar 9, 2013 at 6:36 AM, Colum Foley <[email protected]>
> > wrote:
> >
> > > I have approximately 20 million items and a feature vector
> > > approximately 30 million in length, very sparse.
> > >
> > > Would you have any suggestions for other clustering algorithms I should
> > > look at ?
> > >
> > > Thanks,
> > > Colum
> > >
> > > On 8 Mar 2013, at 22:51, Ted Dunning <[email protected]> wrote:
> > >
> > > > You are beginning to exit the realm of reasonable applicability for
> > > > normal k-means algorithms here.
> > > >
> > > > How much data do you have?
> > > >
> > > > On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <[email protected]>
> > > > wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> When I run KMeans clustering on a cluster, I notice that when I have
> > > >> "large" values for k (i.e. approx. >1000) I get loads of Hadoop write
> > > >> errors:
> > > >>
> > > >> INFO hdfs.DFSClient: Exception in createBlockOutputStream
> > > >> java.net.SocketTimeoutException: 69000 millis timeout while waiting
> > > >> for channel to be ready for read. ch : java.nio.channels.SocketChannel
> > > >>
> > > >> This continues indefinitely, and lots of part-0xxxxx files are
> > > >> produced, each around 30 KB in size.
> > > >>
> > > >> If I reduce the value of k, it runs fine. Furthermore, if I run it
> > > >> in local mode with high values of k, it runs fine.
> > > >>
> > > >> The command I am using is as follows:
> > > >>
> > > >> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
> > > >> --clusters tmp -dm
> > > >> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
> > > >> -cd 1.0 -x 20 -cl -k 10000
> > > >>
> > > >> I am running mahout 0.7.
> > > >>
> > > >> Are there some performance parameters I need to tune for mahout when
> > > >> dealing with large volumes of data?
> > > >>
> > > >> Thanks,
> > > >> Colum
> > > >>
> > >
> >
>
