Hello Mahout experts,
I am trying to follow some examples provided with Mahout and some features
are not clear to me. It would be great if someone could clarify a bit more.
To prepare the data (train and test), the following sequence of steps is
performed (taken from the Mahout cookbook):
All input is merged into a single directory:
*cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all*
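To see what this merge leaves on disk, here is a minimal Python sketch that mimics the copy on a tiny made-up tree. It assumes the archive unpacks to ${WORK_DIR}/20news-bydate/<split>/<category>/<doc-id>, which is my assumption about the layout; the document IDs and file contents are placeholders, and `merge_splits` is a hypothetical helper, not part of Mahout.

```python
import shutil
import tempfile
from pathlib import Path


def merge_splits(work_dir: Path) -> Path:
    """Mimic `cp -R 20news-bydate*/*/* 20news-all` for the assumed layout."""
    merged = work_dir / "20news-all"
    merged.mkdir(exist_ok=True)
    # Same glob as the shell command: three path levels below work_dir.
    for src in sorted(work_dir.glob("20news-bydate*/*/*")):
        if src.is_dir():
            # Merge train and test copies of the same category directory.
            shutil.copytree(src, merged / src.name, dirs_exist_ok=True)
        else:
            shutil.copy2(src, merged / src.name)
    return merged


# Hypothetical miniature of the dataset (IDs and contents are made up).
work_dir = Path(tempfile.mkdtemp())
for split, category, doc_id in [
    ("20news-bydate-train", "comp.os.ms-windows.misc", "9001"),
    ("20news-bydate-test", "comp.os.ms-windows.misc", "9002"),
    ("20news-bydate-train", "sci.space", "6001"),
]:
    doc = work_dir / "20news-bydate" / split / category / doc_id
    doc.parent.mkdir(parents=True, exist_ok=True)
    doc.write_text("Subject: placeholder\n\nbody\n")

merged = merge_splits(work_dir)
layout = sorted(p.relative_to(merged).as_posix() for p in merged.rglob("*") if p.is_file())
print(layout)
```

If the layout assumption holds, only the train/test level is flattened, and the category would survive as a directory name rather than anywhere inside the message text.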
Converted to Hadoop sequence files and then vectorized:
*./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors
-lnorm -nv -wt tfidf*
Divided into test and training data:
*./mahout split*
*-i ${WORK_DIR}/20news-vectors/tfidf-vectors*
*--trainingOutput ${WORK_DIR}/20news-train-vectors*
*--testOutput ${WORK_DIR}/20news-test-vectors*
*--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential*
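As I read it, --randomSelectionPct 40 holds out roughly 40% of the vectors as test data. The Python sketch below only illustrates that idea; `random_split`, the seed, and the document IDs are hypothetical, and Mahout's own split may well pick records differently (for example, per record with a given probability rather than an exact count).

```python
import random


def random_split(keys, test_pct, seed=42):
    """Shuffle the keys and hold out test_pct percent of them as test data."""
    rng = random.Random(seed)
    shuffled = list(keys)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_pct / 100)
    # First n_test shuffled items become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


keys = [f"doc-{i}" for i in range(10)]  # made-up document IDs
train, test = random_split(keys, test_pct=40)
print(len(train), len(test))
```

Nothing in such a split looks at categories, so whatever carries the label has to travel with each record.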
Model is trained:
*./mahout trainnb*
*-i ${WORK_DIR}/20news-train-vectors -el*
*-o ${WORK_DIR}/model*
*-li ${WORK_DIR}/labelindex*
*-ow*
What I am missing here, and what my question is about, is this: where is the
category assigned to the data used to train the categorization? What I
would expect is a vector that says which category each document
belongs to. It seems to me that this has been erased by the
first step, where we mixed all the data to create our corpus. I would still
expect this information to be retained somewhere. Instead, the
messages look as follows:
From: [email protected] (YEO YEK CHONG)
Subject: Re: Is "Kermit" available for Windows 3.0/3.1?
Organization: Oklahoma State University
Lines: 7
>From article <[email protected]>, by Steve Frampton <
[email protected]>:
> I was wondering, is the "Kermit" package (the actual package, not a
Yes! In the usual ftp sites.
Yek CHong
There is no indication of which group this text belongs to. What the heck?
Could someone please clarify what is going on? When cross-validation
is performed, the confusion matrix does take those categories into account.
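My working guess, which I would like someone to confirm, is that the label travels in the key of each record rather than in the message body. The Python sketch below assumes keys of the form "/<category>/<doc-id>"; that key shape, the helper names, and the sample predictions are all my assumptions, not something I have verified against Mahout's code.

```python
from collections import Counter


def label_from_key(key: str) -> str:
    """Take the category from a key assumed to look like '/<category>/<doc-id>'."""
    return key.split("/")[1]


def confusion_matrix(keys, predictions):
    """Count (true label, predicted label) pairs."""
    return Counter((label_from_key(k), p) for k, p in zip(keys, predictions))


# Made-up keys and predictions, for illustration only.
keys = ["/sci.space/6001", "/sci.space/6002", "/comp.os.ms-windows.misc/9001"]
predicted = ["sci.space", "comp.os.ms-windows.misc", "comp.os.ms-windows.misc"]

matrix = confusion_matrix(keys, predicted)
for (true_label, predicted_label), count in sorted(matrix.items()):
    print(true_label, "->", predicted_label, count)
```

If that is how it works, it would explain how the confusion matrix can know the true category even though the message text never mentions it.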
Thanks a lot for helping me out
Jakub