You would need an XML input format that splits the file into records
based on start/end tags.
Mahout has an XmlInputFormat implementation you should be able to import:
https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java

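Once you have such a format, you can use Spark's Hadoop API to read each XML
record into a String. Mahout's XmlInputFormat extends TextInputFormat, so the
key type is LongWritable (not NullWritable), and it reads the start/end tags
from the Hadoop configuration keys xmlinput.start and xmlinput.end:

sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], conf)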
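
A fuller sketch of the above, in case it helps. It assumes the Mahout
integration jar is on the classpath, that the records are delimited by
<review> ... </review> as in your sample, and that the input path is passed as
the first argument. The object name ReviewXml and the variable names are just
placeholders, and I have not run this against your data:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver object; the name is illustrative only.
object ReviewXml {
  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("ReviewXml"))
    val path = args(0) // assumed: path to the XML file

    // Copy Spark's Hadoop configuration and tell the input format which
    // tags delimit one record. These are the keys Mahout's XmlInputFormat reads.
    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set("xmlinput.start", "<review>")
    hadoopConf.set("xmlinput.end", "</review>")

    // Each value is the text of one <review>...</review> element; the key is
    // a LongWritable that is not needed here.
    val reviews = sc
      .newAPIHadoopFile(path, classOf[XmlInputFormat],
        classOf[LongWritable], classOf[Text], hadoopConf)
      // Hadoop reuses Text objects, so copy to an immutable String right away.
      .map { case (_, text) => text.toString }

    // Print a few records to check the split; parse each String as needed.
    reviews.take(5).foreach(println)

    sc.stop()
  }
}
```

Each element of the resulting RDD is one complete record as a single String,
so multi-line elements no longer need to be stitched together by hand and
can be handed to whatever XML parser you prefer inside a map.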

Ram


On Sun, Jul 19, 2015 at 10:38 AM, Ashish Soni <asoni.le...@gmail.com> wrote:

> Hi All,
>
> I have an XML file with the same tag repeated multiple times, as below. Please
> suggest what would be the best way to process this data inside Spark as ...
>
> How can I extract each opening and closing tag and process them, or how can I
> combine multiple lines into a single line?
>
> <review>
> </review>
> <review>
> </review>
> ...
> ..
> ..
>
> Thanks,
>