Hi Andrew,

many thanks for the final clarification! I have one last question, probably the most obvious one, but I must have missed the answer somewhere. All the examples end with testing the classifier and displaying the confusion matrix. So the state is: we have a trained and tested model, and now we would like to actually use it to classify unseen, unknown data. How to prepare the input in general (vectorize it, etc.) is clear. What is not clear to me at the moment is how to call the trained model with new vectorized data as input. Or maybe even the vectorization itself is unclear, because we probably need the dictionary used by the model to produce valid vectors. And what about terms which were not in the training set?
Is there any documentation regarding this aspect?

Thx
Jakub

On 1 December 2014 at 21:12, Andrew Palumbo <[email protected]> wrote:

> > However the sequence of steps as described in Mahout Cookbook seems to me
> > incorrect as:
>
> this is entirely possible, that book may be out of date. The end to end
> instructions on the website for the 20 newsgroups example are up to date
> though. As is the example script.
>
> You don't want to merge all of the files into one directory, rather to
> merge the training and testing sets in 20news-bydate while maintaining
> their directory structure.
>
> > After data set download and extraction data are merged via command:
> > *cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all*
> >
> > Which essentially copies files to a single location -> 20news-all folder
>
> this should not copy all of the *files* individually into the 20news-all
> folder, rather the directories containing the files:
>
> $ ls 20news-all/
> alt.atheism rec.autos sci.space
> comp.graphics rec.motorcycles soc.religion.christian
> {...}
>
> > *./mahout seqdirectory -i ${WORK_DIR}/20news-all -o
> > ${WORK_DIR}/20news-seq*
> > Converts to a hadoop sequence directory from 20news-all dir - where all
> > files were copied and effectively the classification into folders was
> > lost.
> > We can peek inside a created seq file via hadoop fs -text
> > $WORK_DIR/20news-seq/chunck-0 | more which prints the following result:
> >
> > */67399* From:xxx
> > Subject: Re: Imake-TeX: looking for beta testers
> > Organization: CS Department, Dortmund University, Germany
> > Lines: 59
> > Distribution: world
> > NNTP-Posting-Host: tommy.informatik.uni-dortmund.de
> > In article <xxxxx>,
> > yyy writes:
> > |> As I announced at the X Technical Conference in January, I would like
> > |> to
> > |> make Imake-TeX, the Imake support for using the TeX typesetting system,
> > |> publically available. Currently Imake-TeX is in beta test here at the
> > |> computer science department of Dortmund University, and I am looking
> > ...
> >
> > To my understanding - number after slash in bold represents a key of
> > sequence file, right?
>
> Correct though it should read something like:
>
> /comp.graphics/67399 {...}
>
> where comp.graphics is the category as well as the directory that it was
> read in from.
>
> > Then seq2sparse is performed:
> >
> > ./mahout seq2sparse -i ${WORK_DIR}/20news-seq
> > -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
> >
> > *Conclusions which I would like to verify:*
> > - sequence of steps as described is incorrect - particularly conversion to
> > sequence file as the key doesn't contain folder name describing the
> > category of training data, or am I still missing something in here?
>
> yes- it looks like you are copying the individual files rather than the
> directories into 20news-all
>
> > - mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o
> > ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow
> > What are the exact mechanics when label extraction is performed e.g.
> > /category/docID as a key is resolved just to category ???
>
> yes
>
> > Is the last part after the slash dropped every time to get the
> > category?? Or is it possible to define the strategy somewhere?
>
> The hard-coded convention as of Mahout 0.9 is to extract the label as the
> first string after the key is split on "/". This makes category
> organization by directory and sequence file conversion with seqdirectory
> straightforward. The new scala DSL Naive Bayes which is currently in
> development will allow the user more flexibility in extracting the label.
>
> The label extraction process can be found here:
>
> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/training/IndexInstancesMapper.java
>
> and could be modified if need be.
>
> > Thanks
> > Jakub
> >
> > On 1 December 2014 at 17:43, Andrew Palumbo <[email protected]> wrote:
> >
> > > Hi Jakub,
> > >
> > > The step that you are missing is `$mahout seqdir ...`. In this step each
> > > file in each directory (where the directory is the Category) is converted
> > > into a sequence file of form <Text,Text> where the Text key is
> > > /Category/doc_id.
> > >
> > > `$mahout seq2sparse ...` vectorizes the output of `$mahout seqdir ...`
> > > into a sequence file of form <Text, VectorWritable> leaving the Keys
> > > unchanged.
> > >
> > > `$mahout trainnb ... -el ...` then extracts the label from the Keys of the
> > > training data, i.e. the "Category" from /Category/doc_id.
> > >
> > > please see
> > > http://mahout.apache.org/users/classification/twenty-newsgroups.html
> > > and http://mahout.apache.org/users/classification/bayesian.html
> > > for more information.
> > >
> > > > Date: Mon, 1 Dec 2014 17:09:55 +0100
> > > > Subject: Insights to Naive Bayes classifier example - 20news groups
> > > > From: [email protected]
> > > > To: [email protected]
> > > >
> > > > Hello Mahout experts,
> > > >
> > > > I am trying to follow some examples provided with Mahout and some
> > > > features are not clear to me. It would be great if someone could
> > > > clarify a bit more.
> > > >
> > > > To prepare the data (train and test) the following sequence of steps is
> > > > performed (taken from mahout cookbook):
> > > >
> > > > All input is merged into single dir:
> > > > *cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all*
> > > >
> > > > Converted to hadoop sequence file and then vectorized:
> > > > *./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o
> > > > ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf*
> > > >
> > > > Divided to test and train data:
> > > > *./mahout split*
> > > > *-i ${WORK_DIR}/20news-vectors/tfidf-vectors*
> > > > *--trainingOutput ${WORK_DIR}/20news-train-vectors*
> > > > *--testOutput ${WORK_DIR}/20news-test-vectors*
> > > > *--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential*
> > > >
> > > > Model is trained:
> > > > *./mahout trainnb*
> > > > *-i ${WORK_DIR}/20news-train-vectors -el*
> > > > *-o ${WORK_DIR}/model*
> > > > *-li ${WORK_DIR}/labelindex*
> > > > *-ow*
> > > >
> > > > What I am missing here and that is subject of my question is: Where is
> > > > the category assigned to the testing data to train the categorization?
> > > > What I would expect is that there will be a vector which says that this
> > > > document belongs to a particular category. This seems to me to have been
> > > > erased by the first step where we mixed all the data to create our
> > > > corpus. I would still expect that this information will be retained
> > > > somewhere. Instead the messages look as follows:
> > > >
> > > > From: [email protected] (YEO YEK CHONG)
> > > > Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
> > > > Organization: Oklahoma State University
> > > > Lines: 7
> > > >
> > > > From article <[email protected]>, by Steve Frampton <[email protected]>:
> > > > > I was wondering, is the "Kermit" package (the actual package, not a
> > > >
> > > > Yes! In the usual ftp sites.
> > > >
> > > > Yek CHong
> > > >
> > > > There is no notion of which group this text belongs to. What the heck!
> > > >
> > > > Could someone please clarify a bit what's going on, as when
> > > > cross-validation is performed the confusion matrix takes those
> > > > categories into consideration.
> > > >
> > > > Thanks a lot for helping me out
> > > > Jakub
> >
> > --
> > Jakub Stransky
> > cz.linkedin.com/in/jakubstransky

--
Jakub Stransky
cz.linkedin.com/in/jakubstransky
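
A note on the key convention discussed in this thread: the sketch below reproduces the directory-preserving merge on a toy tree and shows what the first path component of each key would be. It is an illustration only. The scratch directory, the two sample documents and the `find`/`cut` emulation of the keys are assumptions made for this example; it does not run Mahout, and the real keys are produced by `seqdirectory`.

```shell
#!/bin/sh
# Toy reproduction, NOT the real data set: two categories, one document each.
set -eu
WORK_DIR=$(mktemp -d)   # hypothetical scratch directory for this sketch
mkdir -p "$WORK_DIR/20news-bydate/20news-bydate-train/comp.graphics" \
         "$WORK_DIR/20news-bydate/20news-bydate-test/sci.space"
echo "doc" > "$WORK_DIR/20news-bydate/20news-bydate-train/comp.graphics/67399"
echo "doc" > "$WORK_DIR/20news-bydate/20news-bydate-test/sci.space/60000"

# Copy the category directories (two levels below 20news-bydate),
# not the individual files, so the category names survive the merge.
mkdir "$WORK_DIR/20news-all"
cp -R "$WORK_DIR"/20news-bydate/*/* "$WORK_DIR/20news-all"

# Emulate keys of the form /Category/doc_id from the merged tree
# (assumption: this mirrors the key layout described in the thread).
KEYS=$(cd "$WORK_DIR/20news-all" && find . -type f | sed 's|^\.||' | sort)

# The label is the first string after splitting the key on "/".
LABELS=$(printf '%s\n' "$KEYS" | cut -d/ -f2 | sort -u)

printf '%s\n' "$KEYS"
printf '%s\n' "$LABELS"
```

If the files themselves had been copied into 20news-all, the emulated keys would look like /67399 and the same split would yield the document id instead of the category, which matches the symptom described above.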
