To use Naive Bayes you need a SequenceFile<Text, VectorWritable> whose keys are formatted like "label/label", for some reason. I checked the sources to be sure, and it parses the key looking for a '/'.
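
To make the key parsing concrete, here is a minimal sketch in plain Java. It is not the Mahout source: the class name LabelKeySketch, the method labelOf, and the rule "the label is the second '/'-separated token" are my assumptions, and the keys "/sports/doc1.txt" and "/input0.txt" are hypothetical examples of what a file-name-based key might look like.

```java
import java.util.regex.Pattern;

/** Sketch only: imitates '/'-based key parsing; not copied from Mahout. */
final class LabelKeySketch {
    private static final Pattern SLASH = Pattern.compile("/");

    /**
     * Returns the second '/'-separated token of the key.
     * Assumption: that token is what ends up being used as the label.
     */
    static String labelOf(String key) {
        String[] parts = SLASH.split(key);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Key has no usable '/' separator: " + key);
        }
        return parts[1];
    }

    public static void main(String[] args) {
        // Key in the "label/label" form: the label is recovered as-is.
        System.out.println(labelOf("label/label"));
        // Hypothetical key for a file stored under a per-label directory.
        System.out.println(labelOf("/sports/doc1.txt"));
        // Hypothetical key for a lone file: the file name itself becomes the "label".
        System.out.println(labelOf("/input0.txt"));
    }
}
```

Under these assumptions, a key built only from a file name yields that file name as the label, and a file name is not a label the trained model knows about.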

When you used seqdirectory, it told Naive Bayes to classify the content of each file (e.g. file1.txt) with the label corresponding to its name (here, file1.txt). So when you tried testing with input0.txt, it failed because input0.txt was not a valid label.

I designed a MapReduce Java job that transforms a CSV with numeric values into a proper SequenceFile. If you want, you can take the source and update it to suit your needs:
https://github.com/kmoulart/hadoop_mahout_utils

Good luck.

Kévin Moulart

2014-03-18 20:13 GMT+01:00 Frank Scholten <[email protected]>:

> Hi Tharindu,
>
> If I understand correctly seqdirectory creates labels based on the file
> name but this is not what you want. What do you want the labels to be?
>
> Cheers,
>
> Frank
>
>
> On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira <[email protected]> wrote:
>
> > Hi everyone,
> > I'm developing an application where I need to train a Naive Bayes
> > classification model and use this model to classify new entities (in this
> > case text files based on their content).
> >
> > I observed that seqdirectory command always adds the file/directory name as
> > the "key" field for each document which will be used as the label in
> > classification jobs.
> > This makes sense when I need to train a model and create the labelindex
> > since I have organized my training data according to their labels in
> > separate directories.
> >
> > Now I'm trying to use this model and infer the best label for an unknown
> > document.
> > My requirement is to ask Mahout to read my new file and output the
> > predicted category by looking at the labelindex and the tfidf vector of the
> > new content.
> > I tried creating vectors from the new content (seqdirectory and
> > seq2sparse), and then using this vector to run testnb command. But
> > unfortunately seqdirectory command adds file names as labels which does
> > not make sense in classification.
> >
> > The following error message will further demonstrate this behavior.
> > input0.txt is the file name of my new document.
> >
> > [main] ERROR com.me.classifier.mahout.MahoutClassifier - Error while classifying documents
> > java.lang.IllegalArgumentException: Label not found: input0.txt
> >     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
> >     at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
> >     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
> >     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
> >     at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
> >     at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:70)
> >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.analyzeResults(TestNaiveBayesDriver.java:160)
> >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:125)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.main(TestNaiveBayesDriver.java:66)
> >
> > So how can I achieve what I'm trying to do here?
> >
> > Thanks,
> >
> > --
> > M.P. Tharindu Rusira Kumara
> >
> > Department of Computer Science and Engineering,
> > University of Moratuwa,
> > Sri Lanka.
> > +94757033733
> > www.tharindu-rusira.blogspot.com
