subject:"Parsing wikipedia xml data in Spark"

Re: Parsing wikipedia xml data in Spark

2014-04-28 Thread Geoffroy Fouquier

We did it using Scala XML with Spark. We start by creating an RDD in which each page is stored as a single line:
- split the XML dump with xml_split
- process each split with a shell script which removes the "xml_split" tag and the siteinfo section, and puts each page on a single line.
- copy resu
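
Once every page sits on its own line, each line can be parsed on its own, which is what makes the data easy to handle as an RDD of strings. A minimal sketch of that per-line step follows. It assumes the standard MediaWiki export layout (a `<page>` element containing `<title>` and `<revision>/<text>`); the object name `WikiPage` and the method `parse` are hypothetical and do not come from the thread.

```scala
import scala.util.Try
import scala.xml.XML

// Hypothetical helper (not from the thread): parses one line that is
// assumed to hold exactly one <page> element from a MediaWiki dump.
object WikiPage {
  // Returns Some((title, wikitext)) when the line is well-formed XML,
  // None when it cannot be parsed. Missing elements yield empty strings.
  def parse(line: String): Option[(String, String)] =
    Try(XML.loadString(line)).toOption.map { page =>
      val title = (page \ "title").text
      val text  = (page \ "revision" \ "text").text
      (title, text)
    }
}
```

In the Spark shell this could presumably be applied with something like `sc.textFile(path).flatMap(line => WikiPage.parse(line))`, where `path` stands for wherever the one-page-per-line files were copied. Returning an `Option` lets `flatMap` silently drop lines that fail to parse instead of failing the whole job.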

Parsing wikipedia xml data in Spark

2014-04-26 Thread Ajay Nair

Is there a way in Spark to parse the Wikipedia XML dump? It seems the Freebase dump is no longer available. Also, does the Spark shell support the XML load-file SAX parser that is present in Scala? Thanks, AJ