seq2sparse uses Lucene's standard tokenization to generate the TF-IDF vectors, so it treats its input as text. But since your data is in CSV format (going by the example you provided below), you should use Mahout's CSVVectorIterator to generate the vectors instead.
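
To illustrate what the conversion needs to produce, here is a minimal sketch in plain Python. It is not Mahout's API: the function name csv_to_vectors and the sample rows are assumptions made up for the illustration. The point is only that each CSV row should become one dense numeric vector, rather than a bag of text tokens.

```python
import csv
import io


def csv_to_vectors(text, has_header=True):
    """Parse CSV text into a list of dense numeric vectors (lists of floats).

    Hypothetical helper for illustration only; it is not part of Mahout.
    """
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]  # drop blank lines
    if has_header:
        rows = rows[1:]  # the header row holds column names, not data
    return [[float(cell) for cell in row] for row in rows]


# Sample modelled on the data in the question (assumed to have five columns).
sample = """header1, header2, header3, header4, header5
0,0,0,0,0
1,3,2,4,5
3,2,4,5,6
"""

vectors = csv_to_vectors(sample)
for vector in vectors:
    print(vector)
```

In Mahout the equivalent step would write each row out as a vector in a sequence file, which is what kmeans expects as input.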
See http://stackoverflow.com/questions/13663567/mahout-csv-to-vector-and-running-the-program

Once you have generated the vectors, you also need to specify the -cl option to the kmeans CLI to generate the clusters.

Also, you don't have to generate the centroids upfront (unless it's something specific to your use case); kmeans will pick k random centroids during execution if you pass the -k option.

________________________________
From: P Kal <[email protected]>
To: [email protected]
Sent: Friday, September 6, 2013 2:05 PM
Subject: Kmeans - clustering help

I'm trying to do a kmeans clustering on only numeric data.

This is how my data looks:

header1, header2, header3, header4, header5
0,0,0,0,0
1,3,2,4,5
3,2,4,5,6
.
.
.
about 3000 rows

As the cluster centroids I created another file:

(0,0,0,0,0)
(1,2,3,4,5)

My understanding is that we'd have to change these text files to sequence files and then generate sparse vectors from these sequence files for kmeans clustering.

I've used seqdirectory followed by seq2sparse, and at the end I have two folders, one for input and one for centroids. The input folder has the dirs generated by seq2sparse on the input sequence file. Similarly, the centroids folder has the dirs generated by seq2sparse on the centroids sequence file.

The command I use to run kmeans:

mahout kmeans --input input/tfidf-vectors --output output -c centroids/tfidf-vectors --maxIter 20

and I get this error:

No input clusters found in centroids/tfidf-vectors Check your -c argument.

The sequence files have data, but the files generated by seq2sparse do not have any contents.

Can someone please help? BTW, all this is on HDFS and not in local mode.
