The new streaming k-means should be able to handle that data pretty efficiently. My guess is that on a single 16-core machine it should be able to complete the clustering in 10 minutes or so. That is extrapolation and thus could be wildly off, of course.
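To make the idea above concrete, here is a minimal single-pass sketch of the streaming k-means family of algorithms. This is an illustrative toy, not the Mahout implementation: the function name, the cutoff schedule, and the `10 * target_k` sketch-size bound are all choices made up for this example. The one-pass structure (each point either updates its nearest weighted centroid or spawns a new one with probability proportional to its distance) is the core trick that makes the approach fast on large data.

```python
import random

def streaming_kmeans(points, target_k, seed=0):
    """One-pass streaming k-means sketch (illustrative, not Mahout's code).

    Maintains a small set of weighted centroids. Each incoming point either
    updates its nearest centroid or, with probability proportional to its
    squared distance, becomes a new centroid. The resulting weighted
    centroids can then be re-clustered down to target_k with any batch
    k-means.
    """
    rng = random.Random(seed)
    centroids = []          # list of (vector, weight)
    distance_cutoff = 1e-3  # raised as the sketch grows

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for p in points:
        if not centroids:
            centroids.append((list(p), 1.0))
            continue
        i, d = min(((i, sq_dist(p, c)) for i, (c, _) in enumerate(centroids)),
                   key=lambda t: t[1])
        if rng.random() < min(1.0, d / distance_cutoff):
            # Far points spawn a new centroid.
            centroids.append((list(p), 1.0))
        else:
            # Near points fold into the running weighted mean.
            c, w = centroids[i]
            centroids[i] = ([(ci * w + pi) / (w + 1) for ci, pi in zip(c, p)],
                            w + 1)
        # If the sketch grows too large, coarsen by raising the cutoff.
        if len(centroids) > 10 * target_k:
            distance_cutoff *= 1.5
    return centroids
```

Because only a small centroid sketch is kept in memory and each point is touched once, the pass parallelizes well across cores, which is where the 16-core estimate comes from.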
You definitely mean sparse.  30 M / 20 M = 1.5 non-zero features per row.
That may be a problem.  Or it might make the clustering fairly trivial.

Dan, that code isn't checked into trunk yet, I think.  Can you comment on
where working code can be found on github?

On Sat, Mar 9, 2013 at 6:36 AM, Colum Foley <[email protected]> wrote:

> I have approximately 20 million items and a feature vector of approx 30
> million in length, very sparse.
>
> Would you have any suggestions for other clustering algorithms I should
> look at?
>
> Thanks,
> Colum
>
> On 8 Mar 2013, at 22:51, Ted Dunning <[email protected]> wrote:
>
> > You are beginning to exit the realm of reasonable applicability for
> > normal k-means algorithms here.
> >
> > How much data do you have?
> >
> > On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <[email protected]> wrote:
> >
> >> Hi All,
> >>
> >> When I run KMeans clustering on a cluster, I notice that when I have
> >> "large" values for k (i.e. approx >1000) I get loads of Hadoop write
> >> errors:
> >>
> >> INFO hdfs.DFSClient: Exception in createBlockOutputStream
> >> java.net.SocketTimeoutException: 69000 millis timeout while waiting
> >> for channel to be ready for read. ch : java.nio.channels.SocketChannel
> >>
> >> This continues indefinitely, and lots of part-0xxxxx files of around
> >> 30 kB are produced.
> >>
> >> If I reduce the value of k it runs fine. Furthermore, if I run it in
> >> local mode with high values of k it runs fine.
> >>
> >> The command I am using is as follows:
> >>
> >> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
> >> --clusters tmp -dm
> >> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
> >> 1.0 -x 20 -cl -k 10000
> >>
> >> I am running Mahout 0.7.
> >>
> >> Are there some performance parameters I need to tune for Mahout when
> >> dealing with large volumes of data?
> >>
> >> Thanks,
> >> Colum
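The sparsity point in the reply above matters for cost: with roughly 1.5 non-zeros per row, distance computations should scale with the non-zero count, not with the 30-million-dimensional vector length. A minimal sketch of that idea, representing each sparse vector as an `{index: value}` dict (Mahout has its own sparse vector classes for this; the dict representation here is purely illustrative):

```python
def sparse_sq_euclidean(a, b):
    """Squared Euclidean distance between two sparse vectors stored as
    {index: value} dicts. Cost is O(nnz(a) + nnz(b)), independent of the
    nominal vector dimensionality."""
    total = 0.0
    for i, v in a.items():
        total += (v - b.get(i, 0.0)) ** 2
    for i, v in b.items():
        if i not in a:
            total += v * v
    return total
```

With ~1.5 non-zeros per row, each pairwise distance is a handful of arithmetic operations, which is why such extreme sparsity can make the clustering either trivial (rows barely overlap) or degenerate, as the reply suggests.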
