Hello,
Yes, if you work off of the current trunk, you can use the classify-wiki.sh
example. There is currently no documentation on the Mahout site for this.
You can run this script to build and test an NB classifier for option (1) 10
arbitrary countries or option (2) 2 countries (United States and United Kingdom).
By default the script is set to run on a medium-sized Wikipedia XML dump. To
run on the full set you'll have to change the download by commenting out line
78 and uncommenting line 80 [1]. *Be sure to clean your work directory
(option 3) when changing datasets.*
The step-by-step process for creating a Naive Bayes classifier for the
Wikipedia XML dump is very similar to creating the 20 Newsgroups
classifier. The only difference is that instead of running $ mahout
seqdirectory on the unzipped 20 Newsgroups file, you'll run $ mahout seqwiki on
the unzipped Wikipedia XML dump.
$ mahout seqwiki invokes WikipediaToSequenceFile.java, which accepts a text file
of categories [2] and starts an MR job to parse each document in the XML
file. This process seeks to extract documents with a category that matches a
line in the category file (exactly, if the exactMatchOnly option is set;
otherwise it is enough for the category to contain the line). If no match is
found and the -all option is set, the document will be dumped into an
"unknown" category.
The documents will then be written out as a <Text,Text> sequence file of the
form (K: /category/document_title , V: document) .
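To make the matching rule and the key layout concrete, here is a small Python sketch of the behaviour described above. It is only an illustration under my own assumptions, not the actual code in WikipediaToSequenceFile.java: the function names, the spelling of the "unknown" label, and the sample categories are all made up for the example.

```python
from typing import Iterable, Optional, Tuple

UNKNOWN = "unknown"  # assumed label for unmatched documents


def match_category(doc_categories: Iterable[str],
                   wanted: Iterable[str],
                   exact_match_only: bool = False,
                   select_all: bool = False) -> Optional[str]:
    """Return the matching category-file line, "unknown", or None (skip)."""
    wanted = list(wanted)
    for doc_cat in doc_categories:
        for target in wanted:
            if exact_match_only:
                # -e: the document category must equal the line exactly
                if doc_cat == target:
                    return target
            elif target in doc_cat:
                # default: the document category only has to contain the line
                return target
    # No match: keep the document only when -all is set
    return UNKNOWN if select_all else None


def to_record(title: str, text: str,
              category: Optional[str]) -> Optional[Tuple[str, str]]:
    """Build the (key, value) pair; the key is /category/document_title."""
    if category is None:
        return None  # document is not written out
    return (f"/{category}/{title}", text)


# Hypothetical category file with two lines, and one hypothetical document
wanted = ["united states", "united kingdom"]
cat = match_category(["cities in the united kingdom"], wanted)
print(to_record("London", "<page>...</page>", cat))
# prints ('/united kingdom/London', '<page>...</page>')
```

With exact_match_only=True the same document would be skipped, or labelled "unknown" if select_all is also set.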
There are 3 different example category files available in the
/examples/src/test/resources directory: country.txt, country10.txt and
country2.txt.
The CLI options for seqwiki are as follows:
  -input (-i)           the input pathname (String)
  -output (-o)          the output pathname (String)
  -categories (-c)      the file containing the Wikipedia categories
  -exactMatchOnly (-e)  if set, the Wikipedia category must match exactly
                        instead of simply containing the category string
  -all (-all)           if set, select all categories
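As a rough illustration of how these options fit together, the following Python sketch assembles a seqwiki command line using only the short flags listed above. The helper name and the input/output paths are hypothetical; the category file is one of the example files mentioned earlier, given relative to a Mahout checkout. It only prints the command and does not run Mahout.

```python
import shlex
from typing import List


def seqwiki_command(xml_dump: str, out_dir: str, categories_file: str,
                    exact_match_only: bool = False,
                    select_all: bool = False) -> List[str]:
    """Assemble the argument list for `mahout seqwiki` (hypothetical helper)."""
    cmd = ["mahout", "seqwiki",
           "-i", xml_dump,          # unzipped Wikipedia XML dump
           "-o", out_dir,           # where the sequence files are written
           "-c", categories_file]   # text file with one category per line
    if exact_match_only:
        cmd.append("-e")
    if select_all:
        cmd.append("-all")
    return cmd


# The /tmp paths below are placeholders, not the ones used by classify-wiki.sh
cmd = seqwiki_command(
    "/tmp/wiki/enwiki-pages-articles.xml",
    "/tmp/wiki/wikipediainput",
    "examples/src/test/resources/country2.txt",
    exact_match_only=True,
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually launch it you could pass the list to subprocess.run(cmd, check=True), assuming the mahout launcher is on your PATH.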
From there you just need to run seq2sparse, split, trainnb and testnb, as in
the example script.
Especially for the binary classification problem, you should get better results
using 3- or 4-grams and a low maxDF cutoff like 30.
[1] https://github.com/apache/mahout/blob/master/examples/bin/classify-wiki.sh
[2] https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt
Subject: Re: any pointer to run wikipedia bayes example
To: [email protected]
From: [email protected]
Date: Wed, 20 Aug 2014 09:50:42 -0400
hi,
After doing a bit more searching, I found
https://issues.apache.org/jira/browse/MAHOUT-1527
The version of Mahout that I have been working on is Mahout 0.9 (from
http://mahout.apache.org/general/downloads.html), which I downloaded in April.
Although it is the latest stable release, it doesn't include the patch mentioned in
https://issues.apache.org/jira/browse/MAHOUT-1527
Then I realized that had I cloned the latest Mahout, I would have gotten a
script called classify-wiki.sh, and could probably start from there.
Sorry for the spam!
Thanks,
Wei
From: Wei Zhang/Watson/IBM@IBMUS
To: [email protected]
Date: 08/19/2014 06:18 PM
Subject: any pointer to run wikipedia bayes example
Hi,
I have been able to run the bayesian network 20news group example provided
at Mahout website.
I am interested in running the Wikipedia bayes example, as it is a much
larger dataset.
From several googling attempts, I figured it is a somewhat different workflow
than running the 20news group example -- e.g., I would need to provide a
categories.txt file, invoke WikipediaXmlSplitter, call
wikipediaDataSetCreator, etc.
I am wondering whether there is a document somewhere that describes the
process of running the Wikipedia bayes example?
https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html no longer seems
to work.
Greatly appreciated!
Wei