SVD techniques probably won't actually help that much given your current sparsity. There are two issues:
First, your data is already quite small. SVD will only make it larger, because the average number of non-zero elements will increase dramatically. Second, given your sparsity, SVD will have very little to work with. Very sparse data elements are inherently nearly orthogonal.

I think you need to find more features so that your average number of non-zeros goes up.

On Sat, Mar 9, 2013 at 12:53 PM, Colum Foley <[email protected]> wrote:

> Thanks a lot, Ted. I think there's some preprocessing I can do to remove
> some outliers, which may reduce my matrix size considerably. I'll also
> check out some SVD techniques.
>
> On 9 Mar 2013 17:16, "Ted Dunning" <[email protected]> wrote:
>
> > The new streaming k-means should be able to handle that data pretty
> > efficiently. My guess is that on a single 16-core machine it should be
> > able to complete the clustering in 10 minutes or so. That is
> > extrapolation and thus could be wildly off, of course.
> >
> > You definitely mean sparse. 30 M / 20 M = 1.5 non-zero features per row.
> > That may be a problem. Or it might make the clustering fairly trivial.
> >
> > Dan,
> >
> > That code isn't checked into trunk yet, I think. Can you comment on
> > where working code can be found on github?
> >
> > On Sat, Mar 9, 2013 at 6:36 AM, Colum Foley <[email protected]> wrote:
> >
> > > I have approximately 20 million items and a feature vector of approx
> > > 30 million in length, very sparse.
> > >
> > > Would you have any suggestions for other clustering algorithms I
> > > should look at?
> > >
> > > Thanks,
> > > Colum
> > >
> > > On 8 Mar 2013, at 22:51, Ted Dunning <[email protected]> wrote:
> > >
> > > > You are beginning to exit the realm of reasonable applicability for
> > > > normal k-means algorithms here.
> > > >
> > > > How much data do you have?
> > > >
> > > > On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <[email protected]> wrote:
> > > >
> > > >> Hi All,
> > > >>
> > > >> When I run KMeans clustering on a cluster, I notice that when I have
> > > >> "large" values for k (i.e. approx. >1000) I get loads of Hadoop
> > > >> write errors:
> > > >>
> > > >> INFO hdfs.DFSClient: Exception in createBlockOutputStream
> > > >> java.net.SocketTimeoutException: 69000 millis timeout while waiting
> > > >> for channel to be ready for read. ch : java.nio.channels.SocketChannel
> > > >>
> > > >> This continues indefinitely, and lots of part-0xxxxx files of around
> > > >> 30 KB are produced.
> > > >>
> > > >> If I reduce the value of k it runs fine. Furthermore, if I run it in
> > > >> local mode with high values of k it runs fine.
> > > >>
> > > >> The command I am using is as follows:
> > > >>
> > > >> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
> > > >> --clusters tmp -dm
> > > >> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
> > > >> 1.0 -x 20 -cl -k 10000
> > > >>
> > > >> I am running Mahout 0.7.
> > > >>
> > > >> Are there some performance parameters I need to tune for Mahout when
> > > >> dealing with large volumes of data?
> > > >>
> > > >> Thanks,
> > > >> Colum
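
[Editor's illustration, not part of the original thread.] The near-orthogonality point in the top reply can be checked numerically: with only ~1.5-2 non-zero features per row in a ~30-million-dimensional space, two random rows almost never share a non-zero index, so their dot product (and hence cosine similarity) is almost always exactly zero. The sketch below uses hypothetical parameters matching the thread (dim = 30M, 2 non-zeros per vector):

```python
import random

def sparse_vector(dim, nnz, rng):
    # Represent a sparse binary vector as the set of its non-zero indices.
    return set(rng.sample(range(dim), nnz))

def dot(a, b):
    # Dot product of two sparse binary vectors = size of the index overlap.
    return len(a & b)

rng = random.Random(42)
dim, nnz, trials = 30_000_000, 2, 10_000  # roughly the scale in the thread

# Count random pairs that share even one non-zero feature.
overlaps = sum(
    dot(sparse_vector(dim, nnz, rng), sparse_vector(dim, nnz, rng)) > 0
    for _ in range(trials)
)
print(f"pairs with any shared feature: {overlaps}/{trials}")
```

The expected overlap rate per pair is roughly nnz²/dim ≈ 1.3e-7, so essentially every pair is exactly orthogonal, which is why SVD (and distance-based clustering) has so little signal to exploit here.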

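[Editor's illustration, not part of the original thread.] The "streaming k-means" mentioned above is Mahout's own implementation; as a generic sketch of the online idea it refers to (not Mahout's actual algorithm), each arriving point pulls its nearest centroid toward it with a decaying step size, so clustering completes in a single pass:

```python
import random

def online_kmeans(points, k, rng):
    """Generic one-pass online k-means sketch: the nearest centroid
    moves toward each incoming point with step size 1/count, which
    makes each centroid the running mean of the points assigned to it."""
    centroids = [list(p) for p in rng.sample(points, k)]
    counts = [1] * k
    for p in points:
        # Find the nearest centroid by squared Euclidean distance.
        j = min(range(k),
                key=lambda i: sum((c - x) ** 2 for c, x in zip(centroids[i], p)))
        counts[j] += 1
        eta = 1.0 / counts[j]
        centroids[j] = [c + eta * (x - c) for c, x in zip(centroids[j], p)]
    return centroids

rng = random.Random(0)
# Two well-separated 2-D blobs; expect one centroid near each.
pts = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(200)] + \
      [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(200)]
rng.shuffle(pts)
cs = online_kmeans(pts, 2, rng)
print(cs)
```

The single-pass, constant-memory structure is what makes this family of algorithms attractive at the 20-million-item scale discussed in the thread, compared with Lloyd-style k-means that re-scans the full dataset every iteration.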