I am using Mahout version 0.7.

I used the Complementary Naive Bayes classifier to classify basic spam/ham 
messages as follows:

Copy easy_ham and spam directories into 20news-all:
 cp -R easy_ham/ spam/ 20news-all/

Copy 20news-all to HDFS:
hadoop fs -put 20news-all 20news-all

Prepare the data by converting it to sequence files and then to TF-IDF vectors:
 mahout seqdirectory -i 20news-all -o 20news-seq
 mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf

Split the data into training and test sets, with 80% used for training and 
20% for testing:
mahout split -i 20news-vectors/tfidf-vectors \
  --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors \
  --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential

Build the model:
mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c

You can test the model against the training set:
mahout testnb -i 20news-train-vectors -m model -l labelindex -ow \
  -o 20news-testing-train -c

Now test against the test set:
mahout testnb -i 20news-test-vectors -m model -l labelindex -ow \
  -o 20news-testing-test -c


This all works fine, and the confusion matrix output shows good results.

Now suppose I have a single new message in a file called message.txt. How 
would I pass it to the trained model to see whether it is classified as spam 
or ham?
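
To make the question concrete, here is a minimal sketch of what classifying one 
new message involves conceptually: tokenize it, keep only the terms seen during 
training, and score it against each label's complement weights. This is plain 
Python, not Mahout's API. The toy training data and the names tokenize, 
train_cnb and classify are hypothetical, and it uses raw term counts rather 
than the TF-IDF weighting applied by seq2sparse above. In Mahout the new 
message would presumably also have to be vectorized with the dictionary and 
document-frequency data produced during training.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the easy_ham/ and spam/ directories.
TRAIN_DOCS = [
    ("spam", "win cash prize now"),
    ("spam", "cheap pills win money"),
    ("ham", "meeting agenda for monday"),
    ("ham", "please review the patch"),
]


def tokenize(text):
    """Lowercase and split on whitespace (far simpler than a real analyzer)."""
    return text.lower().split()


def train_cnb(docs, alpha=1.0):
    """Return (vocab, weights) for a toy complement naive Bayes model.

    weights[label][term] is the smoothed log-probability of the term in all
    documents that do NOT carry that label (the "complement" of the label).
    """
    counts = {}
    for label, text in docs:
        counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = sorted({term for c in counts.values() for term in c})

    weights = {}
    for label in counts:
        complement = Counter()
        for other, c in counts.items():
            if other != label:
                complement.update(c)
        total = sum(complement.values()) + alpha * len(vocab)
        weights[label] = {
            term: math.log((complement[term] + alpha) / total) for term in vocab
        }
    return vocab, weights


def classify(text, vocab, weights):
    """Score one message; the label whose complement fits worst wins."""
    known = set(vocab)
    # Terms never seen in training are dropped, as a fixed dictionary would do.
    tf = Counter(t for t in tokenize(text) if t in known)
    scores = {
        label: -sum(n * w[term] for term, n in tf.items())
        for label, w in weights.items()
    }
    return max(scores, key=scores.get), scores


vocab, weights = train_cnb(TRAIN_DOCS)
label, scores = classify("win a cash prize", vocab, weights)
print(label, scores)
```

For a file such as message.txt, the text passed to classify would simply be 
the file's contents, for example open("message.txt").read().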

