Continuous values are used now, in addition to a large set of boolean flags. I think I could convert the continuous values into bucketed values that could be represented as additional flags. If so, would the format need to be ...

id1 flaga flagb
id2 flagb flagc
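A minimal sketch of the bucketing idea, in Python. It only shows how a continuous value could become one extra flag token per row. The bucket boundaries, the `sepal_length` feature name, and the `_b<N>` token naming are all hypothetical, and whether trainnb accepts this "id flag flag" layout is still the open question above.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries per continuous feature (assumed, not from the thread).
BUCKET_EDGES = {
    "sepal_length": [5.0, 6.0, 7.0],
}


def bucket_flag(name, value, edges):
    """Return a token such as 'sepal_length_b1' naming the bucket that holds value.

    Bucket 0 is everything below the first edge; a value equal to an edge
    falls into the bucket above it.
    """
    return f"{name}_b{bisect_right(edges, value)}"


def to_token_line(row_id, flags, continuous):
    """Build 'id flag flag ...' from the set boolean flags plus bucketed continuous values."""
    # Keep only the flags that are actually set, in a stable order.
    tokens = [name for name, is_set in sorted(flags.items()) if is_set]
    # Each continuous feature contributes exactly one bucket token.
    tokens += [
        bucket_flag(name, value, BUCKET_EDGES[name])
        for name, value in sorted(continuous.items())
    ]
    return " ".join([row_id] + tokens)


if __name__ == "__main__":
    print(to_token_line(
        "id1",
        {"flaga": True, "flagb": True, "flagc": False},
        {"sepal_length": 5.8},
    ))
```

Equal-width or quantile boundaries would both work here; the point is only that each continuous feature turns into a single sparse token alongside the existing flags.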
Also, I'm working towards an example that starts from feature vectors rather than text documents, and that can be turned over to a data science group. Naive Bayes is what is used now, with data extracted via Hive and loaded into R. As a start, I'm trying to come up with an example that replicates that data flow, using data in Hive and Mahout for processing.

On Wed, Aug 7, 2013 at 6:29 PM, Ted Dunning <[email protected]> wrote:

> By non-text, do you mean continuous values? Or sparse sets of tokens?
>
> The general idea for Naive Bayes is that it requires input consisting of
> sparse sets of tokens.
>
>
> On Wed, Aug 7, 2013 at 2:00 PM, John Meagher <[email protected]> wrote:
>
>> I'm just starting work with Mahout and I'm struggling getting an
>> example of a non-text based Naive Bayes classifier up and running.
>> The input will be feature vectors generated outside of Mahout. As a
>> test I'm using arff files (anything else CSV-ish will work). I've
>> been able to convert things into vectors in a few different ways, but
>> can't figure out what is needed to get the trainnb command to work.
>>
>> Does the label index need to be generated through some manual process
>> or something other than the arff.vector or trainnb command?
>>
>> Is there a specific format needed for the input arff files? Specific
>> columns in a specific order?
>>
>>
>> Here's what I've tried so far in both 0.7 from CDH4 and 0.8 direct from Apache:
>>
>> $ wget http://repository.seasr.org/Datasets/UCI/arff/iris.arff
>> $ mahout arff.vector --input iris.arff --output iris.model --dictOut iris.labels
>>
>> This works and seems to be right so far
>>
>> This is the command I think I need to train the Naive Bayes model. It
>> fails when creating the label index with the exception below.
>>
>> $ mahout trainnb -i iris.model/ -o iris.training -el -li iris.training.labels
>> ...
>> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>>     at org.apache.mahout.classifier.naivebayes.BayesUtils.writeLabelIndex(BayesUtils.java:123)
>>     at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.createLabelIndex(TrainNaiveBayesJob.java:180)
>>     at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:94)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> ...
>>
>>
>> Thanks for the help,
>> John
>>
