Re: Insights to Naive Bayes classifier example - 20news groups

Jakub Stransky Tue, 02 Dec 2014 05:07:34 -0800

Hi Andrew,

Many thanks for the final clarification! Now I have one last question,
probably the most obvious one, but I must have missed the answer somewhere.
All the examples end by testing the classifier and displaying the confusion
matrix. So the state is:
We have a trained and tested model, and now we would like to use it to
classify unseen, unknown data, i.e. actually use the classifier. It is
clear how to prepare the input (vectorize etc.). What is not clear to me at
the moment is how to call the trained model with new vectorized data as
input. Or maybe even the vectorization itself is unclear, because we
probably need the dictionary used by the model to produce valid vectors.
And what about terms which were not in the training set, etc.?


Is there any documentation regarding this aspect?

Thx
Jakub
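
To make the dictionary concern above concrete, here is a minimal Python
sketch. It is not Mahout code: the `vectorize` helper and the three-term
dictionary are hypothetical, and silently dropping terms that were not in
the training set is only one possible choice, assumed here for
illustration. It produces raw term counts only; tf-idf weighting would
additionally need the document frequencies from training.

```python
from collections import Counter


def vectorize(tokens, dictionary):
    """Map tokens to a sparse {index: count} vector using a fixed dictionary.

    Assumption: terms missing from the training dictionary are dropped.
    """
    counts = Counter(t for t in tokens if t in dictionary)
    return {dictionary[t]: c for t, c in counts.items()}


# Hypothetical term -> index mapping built while vectorizing the training set.
dictionary = {"imake": 0, "tex": 1, "beta": 2}

# "kermit" was never seen in training, so it does not appear in the vector.
vec = vectorize(["imake", "tex", "tex", "kermit"], dictionary)
print(vec)  # {0: 1, 1: 2}
```

The point of the sketch is only that new documents have to be mapped
through the same term-to-index dictionary as the training data, otherwise
the vector indices would not line up with what the model learned.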



On 1 December 2014 at 21:12, Andrew Palumbo <[email protected]> wrote:

>
>
>
> > However the sequence of steps as described in Mahout Cookbook seems to me
> > incorrect as:
>
> this is entirely possible, that book may be out of date. The end to end
> instructions on the website for the 20 newsgroups example is up to date
> though.  As is the example script.
>
> You don't want to merge all of the files into one directory, rather to
> merge the training and testing sets in 20news-bydate while maintaining
> their directory structure.
>
> > After data set download and extraction data are merged via command:
> > *cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all*
> >
> > Which essentially copies files to a single location -> 20news-all folder
>
> this should not copy all of the *files* individually into the 20news-all
> folder rather the directories containing the files:
>
>     $ ls 20news-all/
>     alt.atheism               rec.autos           sci.space
>     comp.graphics             rec.motorcycles     soc.religion.christian
>     {...}
>
> > *./mahout seqdirectory  -i ${WORK_DIR}/20news-all  -o
> > ${WORK_DIR}/20news-seq*
> > Converts to a hadoop sequence directory from 20news-all dir - where all
> > files were copied and effectively the classification into folders was
> > lost.
> > We can peek inside a created seq file via hadoop fs -text
> > $WORK_DIR/20news-seq/chunk-0 | more which prints following result:
> >
> > */67399* From:xxx
> > Subject: Re: Imake-TeX: looking for beta testers
> > Organization: CS Department, Dortmund University, Germany
> > Lines: 59
> > Distribution: world
> > NNTP-Posting-Host: tommy.informatik.uni-dortmund.de
> > In article <xxxxx>,
> > yyy writes:
> > |> As I announced at the X Technical Conference in January, I would like
> > |> to make Imake-TeX, the Imake support for using the TeX typesetting
> > |> system, publically available. Currently Imake-TeX is in beta test here
> > |> at the computer science department of Dortmund University, and I am
> > |> looking
> > ...
> >
> > To my understanding - number after slash in bold represents a key of
> > sequence file, right?
>
> Correct though it should read something like:
>
>     /comp.graphics/67399 {...}
>
> where comp.graphics is the category as well as the directory that it was
> read in from.
>
> > Then seq2sparse is performed:
> >
> > ./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o
> > ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
> >
> >
> > *Conclusions which I would like to verify:*
> > - sequence of steps as described is incorrect - particularly conversion
> > to sequence file as the key doesn't contain folder name describing the
> > category of training data, or am I still missing something in here?
>
> yes- it looks like you are copying the individual files rather than the
> directories into 20news-all
>
> >
> > - mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o
> > ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow
> >   What are the exact mechanics when label extraction is performed e.g.
> > /category/docID as a key is resolved just to category ???
>
> yes
>
> > Is the last part after the slash always dropped to obtain the category?
> > Or is it possible to define the strategy somewhere?
>
> The hard-coded convention as of Mahout 0.9 is to extract the label as the
> first string after the key is split on "/".  This makes category
> organization by directory and sequence file conversion with seqdirectory
> straightforward.  The new scala DSL Naive Bayes which is currently in
> development will allow the user more flexibility in extracting the label.
>
> The label extraction process can be found here:
>
> https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/classifier/naivebayes/training/IndexInstancesMapper.java
>
> and could be modified if need be.
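
As a rough illustration of the convention described above (label = first
string after splitting the key on "/"), here is a minimal Python sketch.
It is not the code from IndexInstancesMapper; `extract_label` is a
hypothetical helper, and skipping the empty piece produced by the leading
"/" is an assumption about how the split is meant to be read.

```python
def extract_label(key: str) -> str:
    """Return the category from a key of the form /Category/doc_id."""
    # Drop empty pieces so the leading "/" does not yield an empty label.
    parts = [p for p in key.split("/") if p]
    if not parts:
        raise ValueError(f"no label in key: {key!r}")
    return parts[0]


# Key produced when the category directories are preserved.
print(extract_label("/comp.graphics/67399"))  # comp.graphics

# Key produced when files are copied flat: the doc id becomes the "label".
print(extract_label("/67399"))  # 67399
```

The second call shows why the flat copy is a problem under this
convention: with no directory component in the key, the document id is
taken as the category.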
>
> >
> > Thanks
> > Jakub
> >
> >
> > On 1 December 2014 at 17:43, Andrew Palumbo <[email protected]> wrote:
> >
> > > Hi Jakub,
> > >
> > > The step that you are missing is `$mahout seqdir ...`. In this step
> > > each file in each directory (where the directory is the Category) is
> > > converted into a sequence file of form <Text,Text> where the Text key
> > > is /Category/doc_id.
> > >
> > > `$mahout seq2sparse ...` vectorizes the output of `$mahout seqdir ...`
> > > into a sequence file of form <Text, VectorWritable> leaving the Keys
> > > unchanged.
> > >
> > > `$mahout trainnb ... -el ...` then extracts the label from the Keys of
> > > the training data, i.e. the "Category" from /Category/doc_id.
> > >
> > > please see
> > > http://mahout.apache.org/users/classification/twenty-newsgroups.html
> > > and http://mahout.apache.org/users/classification/bayesian.html
> > > for more information.
> > >
> > > > Date: Mon, 1 Dec 2014 17:09:55 +0100
> > > > Subject: Insights to Naive Bayes classifier example - 20news groups
> > > > From: [email protected]
> > > > To: [email protected]
> > > >
> > > > Hello Mahout experts,
> > > >
> > > > I am trying to follow some examples provided with Mahout and some
> > > > features are not clear to me. It would be great if someone could
> > > > clarify a bit more.
> > > >
> > > > To prepare the data (train and test) the following sequence of steps
> > > > is performed (taken from mahout cookbook):
> > > >
> > > > All input is merged into single dir:
> > > > *cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all*
> > > >
> > > > Converted to hadoop sequence file and then vectorized:
> > > > *./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o
> > > > ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf*
> > > >
> > > > Divided into test and train data:
> > > > *./mahout split*
> > > > *-i ${WORK_DIR}/20news-vectors/tfidf-vectors*
> > > > *--trainingOutput ${WORK_DIR}/20news-train-vectors*
> > > > *--testOutput ${WORK_DIR}/20news-test-vectors*
> > > > *--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential*
> > > >
> > > > Model is trained:
> > > > *./mahout trainnb*
> > > > *-i ${WORK_DIR}/20news-train-vectors -el*
> > > > *-o ${WORK_DIR}/model*
> > > > *-li ${WORK_DIR}/labelindex*
> > > > *-ow*
> > > >
> > > >
> > > > What I am missing here, and what is the subject of my question, is:
> > > > Where is the category assigned to the testing data to train the
> > > > categorization? What I would expect is that there will be a vector
> > > > which says that this document belongs to a particular category. This
> > > > seems to me to have been erased by the first step, where we mixed all
> > > > the data to create our corpus. I would still expect that this
> > > > information will be retained somewhere. Instead the messages look as
> > > > follows:
> > > >
> > > > From: [email protected] (YEO YEK CHONG)
> > > > Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
> > > > Organization: Oklahoma State University
> > > > Lines: 7
> > > >
> > > > From article <[email protected]>, by Steve Frampton <
> > > > [email protected]>:
> > > > > I was wondering, is the "Kermit" package (the actual package, not a
> > > >
> > > > Yes!  In the usual ftp sites.
> > > >
> > > > Yek CHong
> > > >
> > > >
> > > > There is no notion of which group this text belongs to. What the
> > > > heck!
> > > >
> > > > Could someone please clarify a bit what's going on, as when
> > > > cross-validation is performed the confusion matrix takes those
> > > > categories into consideration.
> > > > Thanks a lot for helping me out
> > > > Jakub
> > >
> > >
> >
> >
> >
> > --
> > Jakub Stransky
> > cz.linkedin.com/in/jakubstransky
>
>



-- 
Jakub Stransky
cz.linkedin.com/in/jakubstransky