You would need to write an XML input format that can split the XML into records based on start/end tags. Mahout has an XmlInputFormat implementation you should be able to import:
https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
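To make the idea of splitting on start/end tags concrete, here is a minimal sketch in plain Scala. It is not the Mahout code: `TagSplitter` and `splitRecords` are hypothetical names, and it works on an in-memory String rather than a Hadoop input split. It only shows the scanning logic such a format performs, plus one way to collapse a multi-line record onto a single line.

```scala
// Illustrative only: mimics the record-splitting step of a tag-based XML
// input format, on an in-memory String instead of a Hadoop input split.
// TagSplitter and splitRecords are hypothetical names, not part of Mahout or Spark.
object TagSplitter {

  /** Returns every substring that starts at `startTag` and ends at the next `endTag`. */
  def splitRecords(text: String, startTag: String, endTag: String): List[String] = {
    @annotation.tailrec
    def loop(from: Int, acc: List[String]): List[String] = {
      val start = text.indexOf(startTag, from)
      if (start < 0) acc.reverse // no more records
      else {
        val end = text.indexOf(endTag, start + startTag.length)
        if (end < 0) acc.reverse // unterminated record: drop it
        else {
          val stop = end + endTag.length
          loop(stop, text.substring(start, stop) :: acc)
        }
      }
    }
    loop(0, Nil)
  }

  def main(args: Array[String]): Unit = {
    val xml = "<review>\n  good\n</review>\n<review>\n  bad\n</review>\n"
    val records = splitRecords(xml, "<review>", "</review>")
    // Collapse each multi-line record onto a single line before printing.
    records.map(_.replaceAll("\\s+", " ")).foreach(println)
  }
}
```

The Mahout class appears to read its tags from the `xmlinput.start` and `xmlinput.end` Hadoop configuration keys, so those would likely need to be set to `<review>` and `</review>` before reading the file; check the linked source to confirm.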
Once you have such a format, you can use Spark's Hadoop API to read the XML records into Strings:

    sc.newAPIHadoopFile(path, classOf[XmlInputFormat], classOf[LongWritable], classOf[Text])

Ram

On Sun, Jul 19, 2015 at 10:38 AM, Ashish Soni <asoni.le...@gmail.com> wrote:

> Hi All,
>
> I have an XML file with the same tag repeated multiple times, as below. Please
> suggest what would be the best way to process this data inside Spark as ...
>
> How can I extract each open and closing tag and process them, or how can I
> combine multiple lines into a single line?
>
> <review>
> </review>
> <review>
> </review>
> ...
> ..
> ..
>
> Thanks,
>