Hi Pat,

Thanks for the response. It turns out that the reason Mahout was crashing
had nothing to do with the initial clusters. I was using seqdirectory to
convert the Reuters text files into sequence file format and then seq2sparse
to vectorise them. Something was going wrong in these initial steps (I think
perhaps due to permissions issues when writing to HDFS): although they
appeared to run without error, the output files they produced were empty.
This later caused kmeans to crash, because it was trying to randomly
generate initial cluster centroids from vectorised data in empty files.
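
For anyone who hits the same error, a check along these lines before running
kmeans might catch the problem earlier. It is only a rough sketch:
check_nonempty is a hypothetical helper name of my own, and it assumes that
`hadoop fs -du -s` prints the total size in bytes as its first column.

```shell
# Hypothetical helper (not part of Mahout or Hadoop): report whether an
# HDFS path holds any data at all.
check_nonempty() {
  path="$1"
  # Assumption: the first column of `hadoop fs -du -s` is the size in bytes.
  size=$(hadoop fs -du -s "$path" | awk '{print $1}')
  if [ -z "$size" ] || [ "$size" -eq 0 ]; then
    echo "EMPTY: $path" >&2
    return 1
  fi
  echo "OK: $path ($size bytes)"
}
```

In my case it would be called as
`check_nonempty /user/hdfs/Vectors/reuters-vectors/tfidf-vectors/`. A
non-zero size does not guarantee that the files contain records, since a
sequence file with only a header is not zero bytes, so it may also be worth
looking at the contents with `mahout seqdumper -i <path>`.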

Cheers,

Sean

On Tue, Nov 18, 2014 at 11:24 AM, Pat Ferrel <[email protected]> wrote:

> 1) Use Mahout 0.9. There may be some slight differences from the version
> in MIA but there are also many bug fixes.
> 2) k is set to 20; check the log: "--numClusters=[20]"
> 3) Going from memory (which could be failing me), you either give it
> initial clusters or not. Giving it a path tells it to use the clusters
> there (this used to be used with Canopy, now deprecated). Try leaving the
> reuters-initial-clusters path unset.
>
> On Nov 11, 2014, at 8:53 PM, Sean Farrell <[email protected]> wrote:
>
> Hi all,
>
> I'm working through the Kmeans clustering example in 'Mahout in Action' and
> I've run into an issue regarding randomly generating the initial cluster
> centroids. According to MIA (and the examples on the Mahout web page) if
> you set the -k flag then the algorithm will use a random seed generator to
> produce initial cluster centroids for however many clusters you select
> (i.e. the number after -k). However, I'm getting an illegal state exception
> error saying that no clusters are found in my directory path and that I
> should check my -c argument (which sets the path for the initial cluster
> centroids sequence file). Reading through the output prior to the error it
> seems as though the -k flag is not being recognised.
>
> A search through the mailing list archive finds that this is not a new
> problem, but I can't find a solution posted anywhere (other than one case
> where upgrading from v0.7 to v0.8 fixed it). Does anyone know if this has
> been solved?
>
> Here are the commands I am using:
>
> mahout kmeans -i /user/hdfs/Vectors/reuters-vectors/tfidf-vectors/ \
>   -c /user/hdfs/Vectors/reuters-initial-clusters/ \
>   -o /user/hdfs/Vectors/reuters-kmeans-clusters/ \
>   -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
>   -cd 1.0 -k 20 -x 20 -cl
>
>
> And here is the output:
>
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
> MAHOUT-JOB: /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/mahout/mahout-examples-0.9-cdh5.2.0-job.jar
> 14/11/12 15:23:56 WARN driver.MahoutDriver: No kmeans.props found on classpath, will use command-line arguments only
> 14/11/12 15:23:57 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=[/user/hdfs/Vectors/reuters-initial-clusters/], --convergenceDelta=[1.0], --distanceMeasure=[org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure], --endPhase=[2147483647], --input=[/user/hdfs/Vectors/reuters-vectors/tfidf-vectors/], --maxIter=[20], --method=[mapreduce], --numClusters=[20], --output=[/user/hdfs/Vectors/reuters-kmeans-clusters/], --startPhase=[0], --tempDir=[temp]}
> 14/11/12 15:23:57 INFO common.HadoopUtil: Deleting /user/hdfs/Vectors/reuters-initial-clusters
> 14/11/12 15:23:58 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 14/11/12 15:23:58 INFO compress.CodecPool: Got brand-new compressor [.deflate]
> 14/11/12 15:23:58 INFO kmeans.RandomSeedGenerator: Wrote 20 Klusters to /user/hdfs/Vectors/reuters-initial-clusters/part-randomSeed
> 14/11/12 15:23:58 INFO kmeans.KMeansDriver: Input: /user/hdfs/Vectors/reuters-vectors/tfidf-vectors Clusters In: /user/hdfs/Vectors/reuters-initial-clusters/part-randomSeed Out: /user/hdfs/Vectors/reuters-kmeans-clusters
> 14/11/12 15:23:58 INFO kmeans.KMeansDriver: convergence: 1.0 max Iterations: 20
> 14/11/12 15:23:58 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
> Exception in thread "main" java.lang.IllegalStateException: No input clusters found in /user/hdfs/Vectors/reuters-initial-clusters/part-randomSeed. Check your -c argument.
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:206)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:140)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:103)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:47)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
>        at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:145)
>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:153)
>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
>
>
