I am not restricted to finding contributor location. That was just one
thought that came to my mind.

I would like to know what analysis could be done with Wikipedia.

The Wikipedia data is an XML dump which has been loaded into HDFS, and the
Hive table created over it has two columns.
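For example, assuming the two columns are the page_title and page_body fields
of the wiki_page table described further down in this thread, a few HiveQL
queries along these lines could be a starting point (a rough sketch only, not
tested against the full dump):

-- How many articles were loaded?
SELECT COUNT(*) FROM wiki_page;

-- Articles whose body mentions India (lower() makes the match case-insensitive)
SELECT page_title
FROM wiki_page
WHERE lower(page_body) LIKE '%india%'
LIMIT 20;

-- The ten longest articles by body size
SELECT page_title, length(page_body) AS body_length
FROM wiki_page
ORDER BY body_length DESC
LIMIT 10;

Note that these two columns hold only the article title and plain text, so
contributor and revision details are not in this table; that information sits
in the <contributor> elements of the raw XML dump and would need a separate
parsing step before it can be queried from Hive.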
On 8 Oct 2013 13:57, "Sonal Goyal" <sonalgoy...@gmail.com> wrote:

> Hi Ajeet,
>
> Unfortunately, many of us are not familiar with the Wikipedia format as to
> where the contributor information is coming from. If you could please
> highlight that and let us know where you are stuck with Hive, we could
> throw out some ideas.
>
> Sonal
>
> Best Regards,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
>
> On Tue, Oct 8, 2013 at 6:39 AM, Ajeet S Raina <ajeetra...@gmail.com> wrote:
>
>> Any suggestion??
>> On 7 Oct 2013 11:24, "Ajeet S Raina" <ajeetra...@gmail.com> wrote:
>>
>>> I was just trying to see if some interesting analysis is possible or
>>> not. One thing that came to mind was tracking contributors, and I just
>>> thought about that.
>>>
>>> Is it really possible?
>>> On 7 Oct 2013 11:13, "Ajeet S Raina" <ajeetra...@gmail.com> wrote:
>>>
>>>> I could see that revision history could be the key factor, but I have no
>>>> idea how to go about it. Any suggestions?
>>>> On 7 Oct 2013 10:34, "Sonal Goyal" <sonalgoy...@gmail.com> wrote:
>>>>
>>>>> Sorry, where is the contributor information coming from?
>>>>>
>>>>> Best Regards,
>>>>> Sonal
>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>
>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Oct 3, 2013 at 11:57 AM, Ajeet S Raina
>>>>> <ajeetra...@gmail.com> wrote:
>>>>>
>>>>>>  > Hello,
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > I have Hadoop running on HDFS with Hive installed. I am able to
>>>>>> import the Wikipedia dump into HDFS using the command below:
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > $ hadoop jar out.jar
>>>>>> edu.umd.cloud9.collection.wikipedia.DumpWikipediaToPlainText -input
>>>>>> /home/wikimedia/input/enwiki-latest-pages-articles.xml -output
>>>>>> /home/wikimedia/output/3
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > I am able to run Hive over the Wikipedia dump.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > I have created one sample Hive table based on the small data set I
>>>>>> converted:
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > CREATE EXTERNAL TABLE wiki_page(page_title string, page_body string)
>>>>>> >
>>>>>> > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>>>>> >
>>>>>> > STORED AS TEXTFILE
>>>>>> >
>>>>>> > LOCATION '/home/wikimedia/output/3';
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > It created a record for me as shown below:
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > Davy Jones (musician) Davy Jones (musician)           David Thomas
>>>>>> "Davy" Jones (30 December 1945 – 29 February 2012) was an English
>>>>>> recording artist and actor, best known as a member of The Monkees. Early
>>>>>> lifeDavy Jones was born at 20 Leamington Street, Openshaw, Manchester,
>>>>>> England, on 30 December 1945. At age 11, he began his acting career…
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > My overall objective is to know how many contributors are from
>>>>>> India and China.
>>>>>> >
>>>>>> > Any suggestion how to achieve that?
>>>>>>
>>>>>
>>>>>
>
