I have approximately 20 million items, each with a feature vector approximately 30 million in length, very sparse.
Would you have any suggestions for other clustering algorithms I should look at?

Thanks,
Colum

On 8 Mar 2013, at 22:51, Ted Dunning <[email protected]> wrote:

> You are beginning to exit the realm of reasonable applicability for normal
> k-means algorithms here.
>
> How much data do you have?
>
> On Fri, Mar 8, 2013 at 9:46 AM, Colum Foley <[email protected]> wrote:
>
>> Hi All,
>>
>> When I run KMeans clustering on a cluster, I notice that when I have
>> "large" values for k (i.e. approx >1000) I get loads of Hadoop write
>> errors:
>>
>> INFO hdfs.DFSClient: Exception in createBlockOutputStream
>> java.net.SocketTimeoutException: 69000 millis timeout while waiting
>> for channel to be ready for read. ch : java.nio.channels.SocketChannel
>>
>> This continues indefinitely, and lots of part-0xxxxx files of around
>> 30 KB each are produced.
>>
>> If I reduce the value of k it runs fine. Furthermore, if I run it in
>> local mode with high values of k it runs fine.
>>
>> The command I am using is as follows:
>>
>> mahout kmeans -i FeatureVectorsMahoutFormat -o ClusterResults
>> --clusters tmp -dm
>> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd
>> 1.0 -x 20 -cl -k 10000
>>
>> I am running Mahout 0.7.
>>
>> Are there some performance parameters I need to tune for Mahout when
>> dealing with large volumes of data?
>>
>> Thanks,
>> Colum
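(Editor's note: Ted's warning about "exiting the realm of reasonable applicability" can be made concrete with a back-of-envelope calculation. Even when the input vectors are sparse, k-means centroids are means of many points and therefore tend to become dense, so at k = 10,000 and 30-million-dimensional features the centroid state alone is enormous. This is a rough illustrative sketch, not output from the thread; the exact memory layout in Mahout will differ.)

```python
# Back-of-envelope: dense centroid storage for k-means at the scale
# described in this thread. Centroids are per-cluster means, so they
# typically densify even when the input vectors are very sparse.
k = 10_000          # clusters requested (-k 10000 in the mahout command)
d = 30_000_000      # feature dimensionality (~30 million, per the post)
bytes_per_double = 8

centroid_bytes = k * d * bytes_per_double
print(f"dense centroids: {centroid_bytes / 1024**4:.1f} TiB")
# ~2.2 TiB just for centroid state, before any per-point assignment work.
```

This is why large-k, high-dimensional jobs that run "fine" locally on a sample can overwhelm a Hadoop cluster's shuffle and HDFS write paths at full scale, and why sketch-based approaches (e.g. streaming k-means) are usually suggested instead.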
