> Hello,
>
> I have Hadoop running on HDFS with Hive installed. I am able to import the
> Wikipedia dump (linked below) into HDFS and convert it to plain text with
> the following command:
>
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
>
> $ hadoop jar out.jar \
>     edu.umd.cloud9.collection.wikipedia.DumpWikipediaToPlainText \
>     -input /home/wikimedia/input/enwiki-latest-pages-articles.xml \
>     -output /home/wikimedia/output/3
>
> I am able to run Hive over the converted Wikipedia dump.
>
> I have created a sample Hive table over a small subset of the converted data:
>
> CREATE EXTERNAL TABLE wiki_page(page_title string, page_body string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE
> LOCATION '/home/wikimedia/output/3';
>
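> To sanity-check the table, I pull a single row back out (a minimal query;
> SELECT with LIMIT is standard HiveQL, and wiki_page is the table created
> above):
>
> SELECT * FROM wiki_page LIMIT 1;
>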
> This returned a record as shown below:
>
> Davy Jones (musician) Davy Jones (musician) David Thomas "Davy"
> Jones (30 December 1945 – 29 February 2012) was an English recording
> artist and actor, best known as a member of The Monkees. Early lifeDavy
> Jones was born at 20 Leamington Street, Openshaw, Manchester, England, on
> 30 December 1945. At age 11, he began his acting career…
>
> My overall objective is to find out how many contributors are from India
> and how many are from China.
>
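> One rough idea I had is to count articles whose body mentions each country
> (a sketch only; page_body holds article text, so this counts textual
> mentions, not actual contributors, and this table carries no contributor
> metadata at all):
>
> SELECT SUM(CASE WHEN page_body LIKE '%India%' THEN 1 ELSE 0 END) AS india_mentions,
>        SUM(CASE WHEN page_body LIKE '%China%' THEN 1 ELSE 0 END) AS china_mentions
> FROM wiki_page;
>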
> Any suggestions on how to achieve that?