Hello Andrew,

I tried NB on the medium-size Wikipedia dump (~1GB of data after
decompression, roughly 1/50 of the full Wikipedia size) with two categories
(US/UK), and I examined the tf-idf vectors that were generated.

I have two questions:
(1) It seems that only 11683 data points (i.e., documents) are generated,
although each data point has a relatively high dimension. 10K data points
is not very exciting; even if I multiply that by 50 (to the full extent of
the Wikipedia dataset), the number of data points is still not particularly
large.

I suspect that many of the documents are not categorized as either US or
UK, and are thus not included in the training set. On a 20-node cluster (8
cores per node, albeit quite old at 5 years), it took 45 minutes to
label/vectorize the dataset, but only 3 minutes to train the NB.
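
As a rough back-of-the-envelope check of the figures above (the variable
names are only for illustration, and the linear scaling by 50 is an
assumption, not something I have measured):

```python
# Figures quoted in this message.
docs_medium = 11683     # labeled documents produced from the medium dump
scale_to_full = 50      # medium dump is roughly 1/50 of the full Wikipedia
vectorize_min = 45      # minutes spent on label/vectorize
train_min = 3           # minutes spent on NB training

# Assumption: the number of labeled documents grows linearly with dump size.
docs_full_estimate = docs_medium * scale_to_full

# How many times longer the preparation step takes than training.
prep_to_train_ratio = vectorize_min / train_min

print(docs_full_estimate)    # 584150
print(prep_to_train_ratio)   # 15.0
```

So even the full dump would give well under a million labeled documents
under this assumption, and training is a small fraction of the run time.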

I am wondering whether there is a way to get a larger dataset that can
stress the NB training (instead of the label/vectorization part), either by
providing a more inclusive category file or by choosing another dataset?

With a more inclusive category file, I can potentially get a larger
dataset, but I don't know how to handle the case where a document has two
labels in that category file.

(2) I am wondering whether I could use the Wikipedia dataset as the input
to K-means clustering instead. There would be no need to label the data, so
I could get a relatively large dataset, and both K-means and NB use the
SequenceFile format.

It seems that I would just need to bypass the labeling step and go
directly to the vectorization, but I am not sure whether that is feasible.

Thanks a lot!

Wei



From:   Andrew Palumbo <[email protected]>
To:     "[email protected]" <[email protected]>
Date:   08/21/2014 02:28 PM
Subject:        RE: any pointer to run wikipedia bayes example



Hello,

Yes, if you work off of the current trunk, you can use the classify-wiki.sh
example.  There is currently no documentation on the Mahout site for this.

You can run this script to build and test an NB classifier for either
option (1), 10 arbitrary countries, or option (2), 2 countries (United
States and United Kingdom).

By default the script is set to run on a medium-sized Wikipedia XML dump.
To run on the full set you'll have to change the download by commenting out
line 78 and uncommenting line 80 [1].  *Be sure to clean your work
directory when changing datasets (option (3)).*


The step-by-step process for creating a Naive Bayes classifier for the
Wikipedia XML dump is very similar to creating the 20 Newsgroups
classifier.  The only difference is that instead of running $ mahout
seqdirectory on the unzipped 20 Newsgroups file, you'll run $ mahout seqwiki
on the unzipped Wikipedia XML dump.

$ mahout seqwiki invokes WikipediaToSequenceFile.java, which accepts a text
file of categories [2] and starts an MR job to parse each document in the
XML file.  This process will seek to extract documents with a category
which matches a line in the category file (exactly, if the exactMatchOnly
option is set).  If no match is found and the -all option is set, the
document will be dumped into an "unknown" category.
The documents will then be written out as a <Text,Text> sequence file of
the form (K: /category/document_title , V: document) .
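
The matching rule described above can be sketched roughly as follows. This
is only an illustration in Python, not the actual WikipediaToSequenceFile
code: the function names are hypothetical, strings are compared as-is
(case handling is not covered here), and returning the first match for a
document with several matching categories is an assumption.

```python
def assign_category(doc_categories, wanted, exact_match_only=False,
                    include_all=False):
    """Return the label for a document, or None if it should be skipped.

    doc_categories: category strings found in the Wikipedia document
    wanted:         lines of the category file
    """
    for doc_cat in doc_categories:
        for target in wanted:
            if exact_match_only:
                matched = doc_cat == target   # exactMatchOnly behaviour
            else:
                matched = target in doc_cat   # "contains" behaviour
            if matched:
                return target                 # assumption: first match wins
    # No match: keep the document only if the -all behaviour is requested.
    return "unknown" if include_all else None


def make_key(category, title):
    # Key layout as described above: /category/document_title
    return f"/{category}/{title}"


wanted = ["United States", "United Kingdom"]
doc_cats = ["Cities in the United States", "Ports"]

label = assign_category(doc_cats, wanted)
print(make_key(label, "Boston"))   # /United States/Boston
```

With exact_match_only=True the same document matches nothing and is
skipped, unless include_all=True sends it to "unknown".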

There are 3 different example category files available in
the /examples/src/test/resources directory:  country.txt, country10.txt and
country2.txt.

The CLI options for seqwiki are as follows:

    -input          (-i)    input pathname String
    -output         (-o)    the output pathname String
    -categories     (-c)    the file containing the Wikipedia categories
    -exactMatchOnly (-e)    if set, then the Wikipedia category must match
                            exactly instead of simply containing the
                            category string
    -all            (-all)  if set, select all categories

From there you just need to run seq2sparse, split, trainnb, testnb as in
the example script.

Especially for the binary classification problem, you should get better
results using 3- or 4-grams and a low maxDF cutoff like 30.
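
To make those two settings concrete, here is a small Python sketch of word
n-gram generation and document-frequency pruning. It is not Mahout's
seq2sparse code. It assumes that n-grams of every length from 1 up to the
chosen size are kept, and that the maxDF value is a percentage of documents
(so 30 means "drop terms that occur in more than 30% of the documents").
The helper names are hypothetical.

```python
from collections import Counter


def word_ngrams(tokens, max_n):
    """Return all word n-grams of length 1..max_n, joined by spaces."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def prune_by_max_df(docs_terms, max_df_percent):
    """Drop terms that occur in more than max_df_percent of the documents."""
    df = Counter()
    for terms in docs_terms:
        df.update(set(terms))   # count each term once per document
    limit = len(docs_terms) * max_df_percent / 100.0
    return [[t for t in terms if df[t] <= limit] for terms in docs_terms]


docs = [
    "the united states congress".split(),
    "the united kingdom parliament".split(),
    "the house of commons".split(),
    "a river in wales".split(),
]
docs_terms = [word_ngrams(d, 3) for d in docs]
pruned = prune_by_max_df(docs_terms, 30)

# Very common terms such as "the" are gone; distinctive n-grams remain.
print("the" in pruned[0], "united states" in pruned[0])
```

The idea is that terms shared by both classes carry little signal, while
longer n-grams such as "united states" separate the two categories.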

[1]
https://github.com/apache/mahout/blob/master/examples/bin/classify-wiki.sh
[2]
https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt



Subject: Re: any pointer to run wikipedia bayes example
To: [email protected]
From: [email protected]
Date: Wed, 20 Aug 2014 09:50:42 -0400


Hi,

After doing a bit more searching, I found
https://issues.apache.org/jira/browse/MAHOUT-1527

The version of Mahout that I have been working on is Mahout 0.9 (from
http://mahout.apache.org/general/downloads.html), which I downloaded in
April.

Although it is the latest stable release, it doesn't include the patch
mentioned in https://issues.apache.org/jira/browse/MAHOUT-1527

Then I realized that had I cloned the latest Mahout, I would have gotten
the classify-wiki.sh script, and could probably start from there.

Sorry for the spam!

Thanks,

Wei






From:    Wei Zhang/Watson/IBM@IBMUS
To:      [email protected]
Date:    08/19/2014 06:18 PM
Subject: any pointer to run wikipedia bayes example

Hi,

I have been able to run the Bayesian network 20 Newsgroups example provided
on the Mahout website.

I am interested in running the Wikipedia Bayes example, as it is a much
larger dataset.
From several Google searches, I gather that the workflow is a bit different
from running the 20 Newsgroups example; e.g., I would need to provide a
categories.txt file, invoke WikipediaXmlSplitter, call
wikipediaDataSetCreator, etc.

I am wondering whether there is a document somewhere that describes the
process of running the Wikipedia Bayes example?
https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html no longer
seems to work.

Greatly appreciated!

Wei
