1) Use Mahout 0.9. It may differ slightly from the version used in MIA, but it also includes many bug fixes.
2) k is set to 20; check the log for "--numClusters=[20]".
3) Going from memory (which could be failing me), you either give it initial clusters or you don't. Giving it a path tells it to use the clusters there (this used to be used with Canopy, now deprecated). Try leaving the reuters-initial-clusters path unset.
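A rough sketch of point 2, in case it helps: the "Command line arguments" line that the job logs can be split into name/value pairs to confirm what the driver actually received. `parse_job_args` is a hypothetical helper name of mine, not part of Mahout, and the regex only assumes the `--name=[value]` / `--name=null` layout seen in the log quoted below.

```python
import re


def parse_job_args(log_line):
    """Parse a 'Command line arguments: {...}' log line into a dict.

    Hypothetical helper (not a Mahout API). Assumes each entry is written
    as --name=[value] or --name=null, as in the quoted log.
    """
    match = re.search(r"Command line arguments: \{(.*)\}", log_line)
    if match is None:
        return {}
    args = {}
    pattern = r"--(\w+)=(?:\[([^\]]*)\]|(\w+))"
    for name, bracketed, bare in re.findall(pattern, match.group(1)):
        # A bare 'null' means the flag was given without a value.
        args[name] = None if bare == "null" else (bare or bracketed)
    return args


# The line below is copied from the log in the original message.
LOG_LINE = (
    "14/11/12 15:23:57 INFO common.AbstractJob: Command line arguments: "
    "{--clustering=null, --clusters=[/user/hdfs/Vectors/reuters-initial-clusters/], "
    "--convergenceDelta=[1.0], "
    "--distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], "
    "--endPhase=[2147483647], --input=[/user/hdfs/Vectors/reuters-vectors/tfidf-vectors/], "
    "--maxIter=[20], --method=[mapreduce], --numClusters=[20], "
    "--output=[/user/hdfs/Vectors/reuters-kmeans-clusters/], --startPhase=[0], --tempDir=[temp]}"
)

args = parse_job_args(LOG_LINE)
print("numClusters (-k):", args["numClusters"])
print("clusters (-c):", args["clusters"])
```

If `numClusters` shows up here, the -k flag was parsed, so the failure is more likely in how the initial clusters path is read than in the flag being ignored.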

On Nov 11, 2014, at 8:53 PM, Sean Farrell <[email protected]> wrote:

Hi all,

I'm working through the Kmeans clustering example in 'Mahout in Action' and I've run into an issue with randomly generating the initial cluster centroids. According to MIA (and the examples on the Mahout web page), if you set the -k flag then the algorithm will use a random seed generator to produce initial cluster centroids for however many clusters you select (i.e. the number after -k).

However, I'm getting an IllegalStateException saying that no clusters were found in my directory path and that I should check my -c argument (which sets the path for the initial cluster centroids sequence file). Reading through the output prior to the error, it seems as though the -k flag is not being recognised.

A search through the mailing list archive shows that this is not a new problem, but I can't find a solution posted anywhere (other than one case where upgrading from v0.7 to v0.8 fixed it). Does anyone know if this has been solved?

Here are the commands I am using:

> mahout kmeans -i /user/hdfs/Vectors/reuters-vectors/tfidf-vectors/ -c /user/hdfs/Vectors/reuters-initial-clusters/ -o /user/hdfs/Vectors/reuters-kmeans-clusters/ -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -k 20 -x 20 -cl

And here is the output:

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/mahout/mahout-examples-0.9-cdh5.2.0-job.jar
14/11/12 15:23:56 WARN driver.MahoutDriver: No kmeans.props found on classpath, will use command-line arguments only
14/11/12 15:23:57 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=[/user/hdfs/Vectors/reuters-initial-clusters/], --convergenceDelta=[1.0], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/user/hdfs/Vectors/reuters-vectors/tfidf-vectors/], --maxIter=[20], --method=[mapreduce], --numClusters=[20], --output=[/user/hdfs/Vectors/reuters-kmeans-clusters/], --startPhase=[0], --tempDir=[temp]}
14/11/12 15:23:57 INFO common.HadoopUtil: Deleting /user/hdfs/Vectors/reuters-initial-clusters
14/11/12 15:23:58 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
14/11/12 15:23:58 INFO compress.CodecPool: Got brand-new compressor [.deflate]
14/11/12 15:23:58 INFO kmeans.RandomSeedGenerator: Wrote 20 Klusters to /user/hdfs/Vectors/reuters-initial-clusters/part-randomSeed
14/11/12 15:23:58 INFO kmeans.KMeansDriver: Input: /user/hdfs/Vectors/reuters-vectors/tfidf-vectors Clusters In: /user/hdfs/Vectors/reuters-initial-clusters/part-randomSeed Out: /user/hdfs/Vectors/reuters-kmeans-clusters
14/11/12 15:23:58 INFO kmeans.KMeansDriver: convergence: 1.0 max Iterations: 20
14/11/12 15:23:58 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
Exception in thread "main" java.lang.IllegalStateException: No input clusters found in /user/hdfs/Vectors/reuters-initial-clusters/part-randomSeed. Check your -c argument.
    at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:206)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:140)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:103)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:47)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:153)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
