Subject: RE: any pointer to run wikipedia bayes example
To: [email protected]
From: [email protected]
Date: Tue, 26 Aug 2014 18:12:52 -0400


Hello Andrew,



I have given NB a try on the medium-sized Wikipedia dump (~1GB of data after 
decompression, roughly 1/50 of the full Wikipedia size) with two categories 
(US/UK), and I examined the tf-idf vectors that were generated.



I have two questions:

(1) It seems there are only 11,683 data points (i.e., documents) generated, 
although each data point has relatively high dimension. 10K data points are not 
very exciting; even if I multiply that by 50 (to the full extent of the 
Wikipedia dataset), the data points are still not particularly many.




I suspect that many of the documents are not categorized as either US or UK and 
thus are not included in the training set. On a 20-node cluster (8 cores each, 
albeit a fairly old one, about 5 years old), it took 45 minutes to 
label/vectorize the dataset, but only 3 minutes to train NB.






If you used option (2) from the classify-wiki.sh script, seq2sparse will 
vectorize the data using 4-grams, which takes much longer and gives you a much 
larger feature set. Option (1) uses bigrams.
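
For reference, here is a rough sketch of the two vectorization calls; the paths 
are placeholders and the exact flags used by classify-wiki.sh may differ slightly:

    # Option (1): bigrams
    mahout seq2sparse -i ${WORK_DIR}/wikipediainput -o ${WORK_DIR}/wikipediaVecs \
        -wt tfidf -lnorm -nv -ow -ng 2
    # Option (2): 4-grams -- slower to build, much larger feature set
    mahout seq2sparse -i ${WORK_DIR}/wikipediainput -o ${WORK_DIR}/wikipediaVecs \
        -wt tfidf -lnorm -nv -ow -ng 4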






I am wondering whether there is a way to get a larger dataset that can stress 
the NB training (instead of the labeling/vectorization part), either by 
providing a more inclusive category file or by choosing another dataset.








You could run on the full country set:


https://github.com/apache/mahout/blob/master/examples/src/test/resources/country.txt


By editing line 101 or 107 to read:


    cp $MAHOUT_HOME/examples/src/test/resources/country.txt ${WORK_DIR}/country.txt


However, on the medium dataset this only yields ~38,200 documents, so it still 
probably will not be the size you are looking for. Alternatively, you could 
create your own category.txt file and pass it to the -c argument.

You could also try turning on the -all option, which, as we discussed before, 
will likely skew the categories toward an "unknown" category but will not 
reject any documents.
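
For example, something along these lines (a sketch only; the paths and category 
file name are placeholders, and the flags are the seqwiki options listed further 
down in this thread):

    mahout seqwiki -i ${WORK_DIR}/wikixml/enwiki-pages-articles.xml \
        -o ${WORK_DIR}/wikipediainput \
        -c ${WORK_DIR}/my-categories.txt \
        -all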






With a more inclusive category file, I can potentially get a larger dataset, 
but I don't know how to handle the case where a document has two labels in that 
category file. 






Currently, the WikipediaMapper labels the document with the first matching 
category that it finds, but you can customize this however you'd like.






(2) I am wondering whether I could use the Wikipedia dataset as the input to 
K-means clustering (thus no need to label the data); that way I could get a 
relatively large dataset, and both K-means and NB use the SequenceFile format.






I believe this should work -- you could remove the labeling section (basically 
lines 79-85 of WikipediaMapper.java)



https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/WikipediaMapper.java


and write out something like (K=document_title,V=document) to the sequence 
file. 



and then run this sequence file through seq2sparse and kmeans as is done in the 
cluster-reuters.sh example (starting at line 109):


https://github.com/andrewpalumbo/mahout/blob/master/examples/bin/cluster-reuters.sh
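
Roughly, the clustering path would then look something like the following (a 
sketch modeled on cluster-reuters.sh; the paths, k, distance measure, and 
iteration count are placeholders to adjust):

    mahout seq2sparse -i ${WORK_DIR}/wikipedia-seqfiles -o ${WORK_DIR}/wikipedia-vectors \
        -wt tfidf -lnorm -nv -ow
    mahout kmeans -i ${WORK_DIR}/wikipedia-vectors/tfidf-vectors \
        -c ${WORK_DIR}/wikipedia-kmeans-clusters \
        -o ${WORK_DIR}/wikipedia-kmeans \
        -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
        -x 10 -k 20 -ow --clustering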
    




It seems that I would just need to bypass the labeling part and go directly to 
the vectorization; I am not sure whether that is feasible?




Thanks a lot !



Wei






From:   Andrew Palumbo <[email protected]>

To:     "[email protected]" <[email protected]>

Date:   08/21/2014 02:28 PM

Subject:        RE: any pointer to run wikipedia bayes example








Hello,



Yes, if you work off of the current trunk, you can use the classify-wiki.sh 
example. There is currently no documentation on the Mahout site for this.



You can run this script to build and test an NB classifier for either option 
(1), 10 arbitrary countries, or option (2), 2 countries (United States and 
United Kingdom).



By default the script is set to run on a medium-sized Wikipedia XML dump. To 
run on the full set you'll have to change the download by commenting out line 
78 and uncommenting line 80 [1]. *Be sure to clean your work directory when 
changing datasets -- option (3).*





The step-by-step process for creating a Naive Bayes classifier for the 
Wikipedia XML dump is very similar to creating the 20 Newsgroups classifier. 
The only difference is that instead of running $mahout seqdirectory on the 
unzipped 20 Newsgroups files, you'll run $mahout seqwiki on the unzipped 
Wikipedia XML dump.
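
In other words, only the first step changes -- roughly (a sketch; paths are 
placeholders):

    # 20 Newsgroups: plain-text directory -> sequence file
    mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow
    # Wikipedia: XML dump -> sequence file, driven by a category file
    mahout seqwiki -i ${WORK_DIR}/wikixml/enwiki-pages-articles.xml \
        -o ${WORK_DIR}/wikipediainput -c ${WORK_DIR}/country10.txt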



$ mahout seqwiki invokes WikipediaToSequenceFile.java, which accepts a text 
file of categories [2] and starts an MR job to parse each document in the XML 
file. This process will seek to extract documents whose category matches 
(exactly, if the exactMatchOnly option is set) a line in the category file. If 
no match is found and the -all option is set, the document will be dumped into 
an "unknown" category.

The documents will then be written out as a <Text,Text> sequence file of the 
form (K: /category/document_title, V: document).



There are 3 different example category files available in the 
/examples/src/test/resources directory: country.txt, country10.txt, and 
country2.txt.



The CLI options for seqwiki are as follows:



    -input           (-i)   input pathname String

    -output          (-o)   the output pathname String

    -categories      (-c)   the file containing the Wikipedia categories

    -exactMatchOnly  (-e)   if set, the Wikipedia category must match exactly instead of simply containing the category string

    -all                    if set, select all categories



From there you just need to run seq2sparse, split, trainnb, testnb as in the 
example script.



Especially for the binary classification problem, you should get better results 
using 3- or 4-grams and a low maxDF cutoff like 30.
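
Put together, the downstream steps would look roughly like this (a sketch only; 
-ng 4 and -x 30 reflect the suggestion above, the remaining flag values mirror 
the 20 Newsgroups example, and the options actually used in classify-wiki.sh 
may differ):

    mahout seq2sparse -i ${WORK_DIR}/wikipediainput -o ${WORK_DIR}/wikipediaVecs \
        -wt tfidf -lnorm -nv -ow -ng 4 -x 30
    mahout split -i ${WORK_DIR}/wikipediaVecs/tfidf-vectors \
        --trainingOutput ${WORK_DIR}/train-vectors --testOutput ${WORK_DIR}/test-vectors \
        --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential
    mahout trainnb -i ${WORK_DIR}/train-vectors -o ${WORK_DIR}/model \
        -li ${WORK_DIR}/labelindex -el -ow
    mahout testnb -i ${WORK_DIR}/test-vectors -m ${WORK_DIR}/model \
        -l ${WORK_DIR}/labelindex -o ${WORK_DIR}/testing-results -ow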



[1] https://github.com/apache/mahout/blob/master/examples/bin/classify-wiki.sh

[2] 
https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt





Subject: Re: any pointer to run wikipedia bayes example

To: [email protected]

From: [email protected]

Date: Wed, 20 Aug 2014 09:50:42 -0400





hi, 







After doing a bit more searching, I found 
https://issues.apache.org/jira/browse/MAHOUT-1527



The version of Mahout that I have been working on is Mahout 0.9 (from 
http://mahout.apache.org/general/downloads.html), which I downloaded in April.



Although it is the latest stable release, it doesn't include the patch 
mentioned in https://issues.apache.org/jira/browse/MAHOUT-1527







Then I realized that had I cloned the latest Mahout, I would have gotten the 
classify-wiki.sh script, and could probably start from there.







 Sorry for the spam! 







Thanks,



Wei














From:            Wei Zhang/Watson/IBM@IBMUS



To:              [email protected]



Date:            08/19/2014 06:18 PM



Subject:                 any pointer to run wikipedia bayes example

























Hi,







I have been able to run the Bayesian network 20 newsgroups example provided at 
the Mahout website.

I am interested in running the Wikipedia Bayes example, as it is a much larger 
dataset.

From several Googling attempts, I figured it is a bit different workflow than 
running the 20 newsgroups example -- e.g., I would need to provide a 
categories.txt file, invoke WikipediaXmlSplitter, call wikipediaDataSetCreator, 
etc.

I am wondering whether there is a document somewhere that describes the process 
of running the Wikipedia Bayes example? 
https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html no longer seems to 
work.







Greatly appreciated!







Wei

