> (2) I am wondering whether I could use the Wikipedia dataset as the input to
> K-means clustering (so there is no need to label the data). That would give me
> a relatively large dataset, and both K-means and NB use the SequenceFile format.
> 

Thinking of this again, you could run seqwiki with the -all option set, and 
pass the output of that to seq2sparse.

> and then run this sequence file through seq2sparse and kmeans, as is done in
> the cluster-reuters.sh example (starting at line 109):
> 
> https://github.com/andrewpalumbo/mahout/blob/master/examples/bin/cluster-reuters.sh
> 
> It seems that I would just need to bypass the label-data part and go directly
> to the vectorization. I am not sure whether that is feasible?
> 
> Thanks a lot!
> 
> Wei
> 
> 
> From: Andrew Palumbo <[email protected]>
> To: "[email protected]" <[email protected]>
> Date: 08/21/2014 02:28 PM
> Subject: RE: any pointer to run wikipedia bayes example
> 
> Hello,
> 
> Yes, if you work off of the current trunk, you can use the classify-wiki.sh
> example. There is currently no documentation on the Mahout site for this.
> 
> You can run this script to build and test an NB classifier for option (1) 10
> arbitrary countries or option (2) 2 countries (United States and United
> Kingdom).
> 
> By default the script is set to run on a medium-sized Wikipedia XML dump. To
> run on the full set you'll have to change the download by commenting out line
> 78 and uncommenting line 80 [1]. *Be sure to clean your work directory when
> changing datasets: option (3).*
> 
> The step-by-step process for creating a Naive Bayes classifier for the
> Wikipedia XML dump is very similar to creating the 20 Newsgroups
> classifier. The only difference is that instead of running $ mahout
> seqdirectory on the unzipped 20 Newsgroups file, you'll run $ mahout seqwiki
> on the unzipped Wikipedia XML dump.
> 
> $ mahout seqwiki invokes WikipediaToSequenceFile.java, which accepts a text
> file of categories [2] and starts an MR job to parse each document in the
> XML file. This process seeks to extract documents with a category that
> matches a line in the category file (exactly, if the exactMatchOnly option
> is set). If no match is found and the -all option is set, the document will be
> dumped into an "unknown" category.
> 
> The documents will then be written out as a <Text,Text> sequence file of the
> form (K: /category/document_title, V: document).
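
The matching rule described above can be sketched in a few lines of Python. This is only an illustration of the behaviour as described in this message, not the code in WikipediaToSequenceFile.java. The function names, the lower-casing of categories, and the exact spelling of the fallback label are assumptions made for the example.

```python
from typing import List, Optional


def match_category(doc_categories: List[str], wanted: List[str],
                   exact_match_only: bool = False,
                   keep_all: bool = False) -> Optional[str]:
    """Return the matching category-file line, "unknown", or None (skip).

    doc_categories: categories found in one Wikipedia document.
    wanted: lines of the category file.
    exact_match_only: mirrors the exactMatchOnly option (assumed semantics).
    keep_all: mirrors the -all option (assumed semantics).
    """
    wanted_lc = [w.strip().lower() for w in wanted if w.strip()]
    for cat in doc_categories:
        cat_lc = cat.strip().lower()
        for w in wanted_lc:
            # Exact mode needs equality; otherwise containment is enough.
            if (cat_lc == w) if exact_match_only else (w in cat_lc):
                return w
    # No category matched: keep the document only when keep_all is set.
    return "unknown" if keep_all else None


def sequence_key(category: str, title: str) -> str:
    """Build a key of the form /category/document_title."""
    return "/" + category + "/" + title


if __name__ == "__main__":
    wanted = ["united states", "united kingdom"]
    cats = ["Cities in the United States"]
    print(match_category(cats, wanted))
    print(match_category(cats, wanted, exact_match_only=True))
    print(match_category(cats, wanted, exact_match_only=True, keep_all=True))
```

For clustering, the category part of the key is not used as a label, so under these assumptions running with -all simply keeps every document.
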
> 
> There are 3 different example category files available in the
> /examples/src/test/resources directory: country.txt, country10.txt and
> country2.txt.
> 
> The CLI options for seqwiki are as follows:
> 
>     -input          (-i)    input pathname String
>     -output         (-o)    the output pathname String
>     -categories     (-c)    the file containing the Wikipedia categories
>     -exactMatchOnly (-e)    if set, then the Wikipedia category must match
>                             exactly instead of simply containing the category string
>     -all            (-all)  if set, select all categories
> 
> From there you just need to run seq2sparse, split, trainnb and testnb, as in the
> example script.
> 
> Especially for the binary classification problem, you should get better
> results using 3- or 4-grams and a low maxDF cutoff like 30.
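
To make those two settings concrete, here is a small self-contained sketch of what word n-grams are and what a maximum document-frequency cutoff does. It is not Mahout code: the helper names are made up, and it assumes the cutoff of 30 is a percentage of documents, so that terms appearing in more than 30% of the documents are dropped.

```python
from collections import Counter
from typing import List, Set


def ngrams(tokens: List[str], max_n: int) -> List[str]:
    """All word n-grams of length 1..max_n, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def prune_by_max_df(docs: List[List[str]], max_df_percent: float) -> Set[str]:
    """Keep terms that occur in at most max_df_percent percent of the docs."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # count each term once per document
    limit = len(docs) * max_df_percent / 100.0
    return {term for term, count in df.items() if count <= limit}


if __name__ == "__main__":
    docs = [["the", "cat"], ["the", "dog"], ["the", "bird"], ["the", "fish"]]
    print(ngrams(["united", "states", "senate"], 3))
    print(sorted(prune_by_max_df(docs, 30)))
```

A low cutoff removes terms that show up in most documents of both classes, which carry little signal for a two-class problem.
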
> 
> [1] https://github.com/apache/mahout/blob/master/examples/bin/classify-wiki.sh
> [2] https://github.com/apache/mahout/blob/master/examples/src/test/resources/country10.txt
> 
> Subject: Re: any pointer to run wikipedia bayes example
> To: [email protected]
> From: [email protected]
> Date: Wed, 20 Aug 2014 09:50:42 -0400
> 
> Hi,
> 
> After doing a bit more searching, I found
> https://issues.apache.org/jira/browse/MAHOUT-1527
> 
> The version of Mahout that I have been working on is Mahout 0.9 (from
> http://mahout.apache.org/general/downloads.html), which I downloaded in April.
> 
> Although it is the latest stable release, it doesn't include the patch
> mentioned in https://issues.apache.org/jira/browse/MAHOUT-1527
> 
> Then I realized that, had I cloned the latest Mahout, I would have gotten the
> classify-wiki.sh script, and could probably start from there.
> 
> Sorry for the spam!
> 
> Thanks,
> 
> Wei
> 
> From: Wei Zhang/Watson/IBM@IBMUS
> To: [email protected]
> Date: 08/19/2014 06:18 PM
> Subject: any pointer to run wikipedia bayes example
> 
> Hi,
> 
> I have been able to run the Naive Bayes 20 Newsgroups example provided
> on the Mahout website.
> 
> I am interested in running the Wikipedia Bayes example, as it is a much
> larger dataset.
> 
> From several Google searches, I gather that the workflow is a bit different
> from running the 20 Newsgroups example: e.g., I would need to provide a
> categories.txt file, invoke WikipediaXmlSplitter, call
> wikipediaDataSetCreator, etc.
> 
> Is there a document somewhere that describes the process of running the
> Wikipedia Bayes example?
> https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html no longer
> seems to work.
> 
> Greatly appreciated!
> 
> Wei