Hi Shweta,

I think I can help with this. I always specify the --namedVector option when generating term vectors (seq2sparse), like this:

    $MAHOUT_HOME/bin/mahout seq2sparse --namedVector -i MyJob/MyJob-seqfile/ -o MyJob/MyJob-namedVector -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n

and then run k-means on this named-vector input, like this:

    $MAHOUT_HOME/bin/mahout kmeans -i MyJob/MyJob-namedVector/tfidf-vectors/ -c MyJob/MyJob-initial-namedVector-clusters -o MyJob/MyJob-kmeans-namedVector-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.01 -k 12 -x 20 -cl

Then dump the result to a text file:

    $MAHOUT_HOME/bin/mahout clusterdump --pointsDir MyJob/MyJob-kmeans-namedVector-clusters/clusteredPoints -dt sequencefile -d MyJob/MyJob-namedVector/dictionary.file-* -i MyJob/MyJob-kmeans-namedVector-clusters/clusters-8-final -o /home/hadoop/MyJob/MyJob-kmeans-namedVector-clusterdump01.txt -b 100 -n 20

(clusters-8-final is the directory from my run; use the clusters-N-final directory that your own k-means run produced.)

You should then see all the cluster information, such as the cluster ID, the number of documents in the cluster, the IDs of the documents in that cluster, the top terms, and so on.

*Note that this example is from Mahout 0.7.

Try it. Good luck.

Y.Mandai

2014-12-08 14:39 GMT+09:00 shweta agrawal <[email protected]>:

> Hello,
> I am new to mahout. I am working on mahout clustering to detect topic. I
> have done mahout kmeans clustering and i got the top terms of cluster also,
> but i want the document id of the clusters. How to get which document is in
> which cluster?
>
> Thanks and Regards
> Shweta
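
If you only need the document-to-cluster mapping, the clusterdump text file can be post-processed with a few lines of awk. This is only a rough sketch. It assumes the Mahout 0.7 text layout, where each cluster starts with a header line like "VL-3{n=2 ...}" and each point appears as an indented line like "1.0: /doc1.txt = [term:weight, ...]". The function name doc_clusters and the sample data are made up for illustration; check the layout against your own dump file before relying on it.

```shell
#!/bin/sh
# doc_clusters FILE: print "cluster-id<TAB>document-name" for every clustered point.
# ASSUMPTION: FILE follows the clusterdump text layout described above (Mahout 0.7);
# other versions may format the header or point lines differently.
doc_clusters() {
  awk '
    # Cluster header, e.g. "VL-3{n=2 c=[...] r=[...]}" (a leading ":" is tolerated).
    /^:?[VC]L-[0-9]+[{]/ {
      cluster = $0
      sub(/^:/, "", cluster)      # drop the optional leading colon
      sub(/[{].*/, "", cluster)   # keep only the id, e.g. "VL-3"
      next
    }
    # Point line, e.g. "<TAB>1.0: /doc1.txt = [foo:0.5]".
    /^[ \t]+[0-9.]+: .* = \[/ {
      name = $0
      sub(/^[ \t]+[0-9.]+: /, "", name)   # strip the indent and the weight
      sub(/ = \[.*/, "", name)            # strip the vector contents
      print cluster "\t" name
    }
  ' "$1"
}

# Demo on a tiny hand-made sample in the assumed layout (not real Mahout output).
sample=$(mktemp)
{
  printf 'VL-3{n=2 c=[foo:0.500] r=[]}\n'
  printf '\tTop Terms: \n'
  printf '\t\tfoo => 0.5\n'
  printf '\tWeight : [props - optional]:  Point:\n'
  printf '\t1.0: /doc1.txt = [foo:0.500]\n'
  printf '\t1.0: /doc2.txt = [foo:0.500]\n'
  printf 'VL-7{n=1 c=[bar:0.300] r=[]}\n'
  printf '\tWeight : [props - optional]:  Point:\n'
  printf '\t1.0: /doc3.txt = [bar:0.300]\n'
} > "$sample"
doc_clusters "$sample"
rm -f "$sample"
```

On the real output you would call it as: doc_clusters /home/hadoop/MyJob/MyJob-kmeans-namedVector-clusterdump01.txt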
