> > (2) I am wondering if I use the Wikipedia dataset as the input to K-means
> > clustering (thus no need to label the data), then I can get a relatively
> > large dataset, and both K-means and NB use the SequenceFile format.
Thinking of this again, you could run seqwiki with the -all option set, and pass the output of that to seq2sparse.

> > and then run this sequence file through seq2sparse and kmeans as is done
> > in the cluster-reuters.sh example (starting at line 109):
> > https://github.com/andrewpalumbo/mahout/blob/master/examples/bin/cluster-reuters.sh
> >
> > It seems that I would just need to bypass the label data part and go
> > directly to the vectorization. I am not sure if it is feasible?
> >
> > Thanks a lot!
> >
> > Wei
> >
> > From: Andrew Palumbo <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Date: 08/21/2014 02:28 PM
> > Subject: RE: any pointer to run wikipedia bayes example
> >
> > Hello,
> >
> > Yes, if you work off of the current trunk, you can use the classify-wiki.sh
> > example. There is currently no documentation on the Mahout site for this.
> >
> > You can run this script to build and test an NB classifier for option (1),
> > 10 arbitrary countries, or option (2), 2 countries (United States and
> > United Kingdom).
> >
> > By default the script is set to run on a medium sized Wikipedia XML dump.
> > To run on the full set you'll have to change the download by commenting
> > out line 78 and uncommenting line 80 [1]. *Be sure to clean your work
> > directory when changing datasets - option (3).*
> >
> > The step-by-step process for creating a Naive Bayes classifier for the
> > Wikipedia XML dump is very similar to creating the 20 Newsgroups
> > classifier. The only difference is that instead of running $mahout
> > seqdirectory on the unzipped 20 Newsgroups file, you'll run $mahout
> > seqwiki on the unzipped Wikipedia XML dump.
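The unsupervised route suggested at the top of the thread (seqwiki with -all, then seq2sparse, then kmeans, following the pattern of cluster-reuters.sh) might be sketched roughly as below. The HDFS paths, the choice of k, and the distance measure are illustrative assumptions, not values from the thread:

```shell
# Sketch only: K-means over a Wikipedia dump, no labels needed.
# All paths and numeric parameters here are assumptions.

# 1) Convert the XML dump to a <Text,Text> sequence file; with -all set,
#    unmatched documents are kept under an "unknown" category, so every
#    document survives into the output.
mahout seqwiki -i wikipedia/xml-dump -o wikipedia/seqfiles \
  -c country10.txt -all

# 2) Vectorize to TF-IDF, as cluster-reuters.sh does for the Reuters corpus.
mahout seq2sparse -i wikipedia/seqfiles -o wikipedia/vectors -ow -nv

# 3) Cluster the TF-IDF vectors; k=20 and cosine distance are arbitrary
#    starting points, not recommendations from the thread.
mahout kmeans \
  -i wikipedia/vectors/tfidf-vectors \
  -c wikipedia/kmeans-centroids \
  -o wikipedia/kmeans-clusters \
  -k 20 -x 10 -ow -cl \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure
```

This mirrors the cluster-reuters.sh flow with seqwiki substituted for seqdirectory, which is the swap Andrew describes above.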
> >
> > $ mahout seqwiki invokes WikipediaToSequenceFile.java, which accepts a
> > text file of categories [2] and starts an MR job to parse each document
> > in the XML file. This process will seek to extract documents whose
> > category (exactly, if the exactMatchOnly option is set) matches a line in
> > the category file. If no match is found and the -all option is set, the
> > document will be dumped into an "unknown" category. The documents will
> > then be written out as a <Text,Text> sequence file of the form
> > (K: /category/document_title, V: document).
> >
> > There are 3 different example category files available in the
> > /examples/src/test/resources directory: country.txt, country10.txt and
> > country2.txt.
> >
> > The CLI options for seqwiki are as follows:
> >
> >   -input          (-i)    input pathname String
> >   -output         (-o)    the output pathname String
> >   -categories     (-c)    the file containing the Wikipedia categories
> >   -exactMatchOnly (-e)    if set, the Wikipedia category must match
> >                           exactly instead of simply containing the
> >                           category string
> >   -all            (-all)  if set, select all categories
> >
> > From there you just need to run seq2sparse, split, trainnb and testnb as
> > in the example script.
> >
> > Especially for the binary classification problem you should have better
> > results using 3- or 4-grams and a low maxDF cutoff like 30.
> >
> > [1] https://github.com/apache/mahout/blob/master/examples/bin/classify-wiki.sh
> > [2] https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt
> >
> > Subject: Re: any pointer to run wikipedia bayes example
> > To: [email protected]
> > From: [email protected]
> > Date: Wed, 20 Aug 2014 09:50:42 -0400
> >
> > hi,
> >
> > After a bit more searching, I found
> > https://issues.apache.org/jira/browse/MAHOUT-1527
> >
> > The version of Mahout that I have been working on is Mahout 0.9 (from
> > http://mahout.apache.org/general/downloads.html), which I downloaded in
> > April.
> >
> > Albeit the latest stable release, it doesn't include the patch mentioned
> > in https://issues.apache.org/jira/browse/MAHOUT-1527
> >
> > Then I realized that had I cloned the latest Mahout, I would have gotten
> > a script called classify-wiki.sh, and could probably start from there.
> >
> > Sorry for the spam!
> >
> > Thanks,
> >
> > Wei
> >
> > From: Wei Zhang/Watson/IBM@IBMUS
> > To: [email protected]
> > Date: 08/19/2014 06:18 PM
> > Subject: any pointer to run wikipedia bayes example
> >
> > Hi,
> >
> > I have been able to run the Bayes 20 Newsgroups example provided at the
> > Mahout website.
> >
> > I am interested in running the Wikipedia Bayes example, as it is a much
> > larger dataset.
> >
> > From several googling attempts, I figured it is a bit different workflow
> > than running the 20 Newsgroups example -- e.g., I would need to provide a
> > categories.txt file, invoke WikipediaXmlSplitter, call
> > wikipediaDataSetCreator, etc.
> >
> > I am wondering, is there a document somewhere that describes the process
> > of running the Wikipedia Bayes example?
> > https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html seems to no
> > longer work.
> >
> > Greatly appreciated!
> >
> > Wei
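For reference, pulling together the classification steps from Andrew's reply above (seqwiki, then seq2sparse, split, trainnb, testnb) gives roughly the following sketch. The 4-gram and maxDF-30 settings follow the tip in the thread; the working-directory layout, split percentage, and remaining flag values are illustrative assumptions patterned on classify-wiki.sh, not prescribed values:

```shell
# Sketch of the NB pipeline described in the thread (paths are assumptions).
WORK=wikipedia

# 1) XML dump -> <Text,Text> sequence file keyed by /category/document_title.
mahout seqwiki -i $WORK/xml-dump -o $WORK/seqfiles -c country10.txt

# 2) TF-IDF vectors; 4-grams and a low maxDF cutoff (~30), which the thread
#    suggests helps especially on the binary (2-country) problem.
mahout seq2sparse -i $WORK/seqfiles -o $WORK/vectors \
  -ng 4 -x 30 -lnorm -nv -ow -wt tfidf

# 3) Hold out a test set (20% here is an arbitrary choice).
mahout split -i $WORK/vectors/tfidf-vectors \
  --trainingOutput $WORK/train-vectors \
  --testOutput $WORK/test-vectors \
  --randomSelectionPct 20 -ow -seq -xm sequential

# 4) Train the Naive Bayes model, then score it on the held-out vectors.
mahout trainnb -i $WORK/train-vectors -o $WORK/model \
  -li $WORK/labelindex -ow
mahout testnb -i $WORK/test-vectors -m $WORK/model \
  -l $WORK/labelindex -o $WORK/test-results -ow
```

This is the same shape as the 20 Newsgroups walkthrough, with seqwiki in place of seqdirectory, which is the only difference Andrew calls out.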
