seq2sparse uses Lucene's standard tokenization to generate the tf-idf vectors,
which is meant for text. But since your data is numeric and in CSV format (from
the example you provided below), you should be using Mahout's CSVVectorIterator
to generate the vectors.

See 
http://stackoverflow.com/questions/13663567/mahout-csv-to-vector-and-running-the-program
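
To make the idea concrete, here is a minimal Python sketch of what a
CSV-to-vector step does conceptually. This is not Mahout code, and the helper
name csv_to_vectors is hypothetical. It drops the header row and parses every
remaining row into a list of floats, which is the kind of numeric vector
k-means expects; a text tokenizer would instead treat the digits as terms.

```python
import csv
import io

def csv_to_vectors(text, has_header=True):
    """Parse numeric CSV text into a list of float vectors (hypothetical helper)."""
    rows = csv.reader(io.StringIO(text))
    if has_header:
        next(rows, None)  # drop the header line
    vectors = []
    for row in rows:
        if not row:  # skip blank lines
            continue
        vectors.append([float(cell) for cell in row])
    return vectors

# Sample rows taken from the question below.
sample = (
    "header1,header2,header3,header4,header5\n"
    "0,0,0,0,0\n"
    "1,3,2,4,5\n"
    "3,2,4,5,6\n"
)
vectors = csv_to_vectors(sample)
print(vectors)
```

In Mahout itself the parsed values would still have to be wrapped in vector
objects and written to a sequence file before kmeans can read them.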

Once you have generated the vectors, you also need to pass the -cl option to
the kmeans CLI to generate the clusters.
Also, you don't have to generate the centroids up front (unless that is
something specific to your use case); kmeans can pick k random centroids
during execution.
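
As a rough illustration of random seeding, the Python sketch below picks k
distinct input vectors as the starting centroids. It only shows the idea; the
function name pick_initial_centroids is hypothetical and Mahout's own seeding
may differ in detail. On the Mahout command line this behaviour is, I believe,
requested with the -k option.

```python
import random

def pick_initial_centroids(vectors, k, seed=None):
    """Return k distinct input vectors to use as starting centroids (hypothetical helper)."""
    if k > len(vectors):
        raise ValueError("k cannot exceed the number of input vectors")
    rng = random.Random(seed)  # a fixed seed makes the choice repeatable
    return [list(v) for v in rng.sample(vectors, k)]

# Rows taken from the question below.
data = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 3.0, 2.0, 4.0, 5.0],
    [3.0, 2.0, 4.0, 5.0, 6.0],
]
centroids = pick_initial_centroids(data, k=2, seed=42)
print(centroids)
```

Because the centroids are drawn from the input itself, they always have the
same dimensionality as the data, which a hand-written centroid file does not
guarantee.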





________________________________
 From: P Kal <[email protected]>
To: [email protected] 
Sent: Friday, September 6, 2013 2:05 PM
Subject: Kmeans - clustering help
 

I'm trying to do kmeans clustering on numeric-only data.

This is how my data looks
header1, header2, header3, header4, header5
0,0,0,0,0
1,3,2,4,5
3,2,4,5,6
.
.
.

about 3000 rows

As the cluster centroids I created another file
(0,0,0,0,0)
(1,2,3,4,5)

My understanding is that we have to convert these text files to sequence
files and then generate sparse vectors from those sequence files for kmeans
clustering.

I've used seqdirectory followed by seq2sparse,
and at the end I have two folders, one for input and one for centroids.

The input folder has the dirs generated by seq2sparse on the input sequence
file. Similarly, the centroids folder has the dirs generated by seq2sparse on
the centroids sequence file.
The command I use to run kmeans:

mahout kmeans --input input/tfidf-vectors --output output -c
centroids/tfidf-vectors --maxIter 20
and I get this error:

No input clusters found in centroids/tfidf-vectors Check your -c argument.

The sequence files have data, but the files generated by seq2sparse do not
have any contents.
Can someone please help?

BTW, all of this is on HDFS, not in local mode.
