Another option might be to leverage spark-xml-utils (https://github.com/dmcbeath/spark-xml-utils).
This is a collection of XML utilities that I've recently revamped to make it relatively easy to use XPath, XSLT, or XQuery within the context of a Spark application (or at least I think so). My previous attempt was not overly friendly, but as I've learned more about Spark (and needed easier-to-use XML utilities), I've hopefully made it easier to use and understand. I hope others find it useful.

Back to your problem. Assuming you have a bunch of XML records in an RDD, you should be able to do something like the following to count the number of elements of a particular type. In the example below, I'm counting the number of references in documents. The xmlKeyPair is an RDD of type (String,String) where the first item is the 'key' and the second item is the XML record. The XPath expression identifies the 'reference' element I want to count. (For one way xmlKeyPair itself might be built, see the short sketch after the quoted messages below.)

import com.elsevier.spark_xml_utils.xpath.XPathProcessor
import scala.collection.JavaConverters._
import java.util.HashMap

xmlKeyPair.mapPartitions(recsIter => {
  val xpath = "count(/xocs:doc/xocs:meta/xocs:references/xocs:ref-info)"
  val namespaces = new HashMap[String,String](Map(
    "xocs" -> "http://www.elsevier.com/xml/xocs/dtd"
  ).asJava)
  val proc = XPathProcessor.getInstance(xpath, namespaces)
  recsIter.map(rec => proc.evaluateString(rec._2).toInt)
}).sum

There is more documentation on the spark-xml-utils GitHub site. Let me know if the documentation is not clear or if you have any questions.

Darin.

________________________________
From: Rick Hillegas <rick.hille...@gmail.com>
To: Sonal Goyal <sonalgoy...@gmail.com>
Cc: rakesh sharma <rakeshsharm...@hotmail.com>; user@spark.apache.org
Sent: Monday, August 31, 2015 10:51 AM
Subject: Re: Reading xml in java using spark

Hi Rakesh,

You might also take a look at the Derby code. org.apache.derby.vti.XmlVTI provides a number of static methods for turning an XML resource into a JDBC ResultSet.

Thanks,
-Rick

On 8/31/15 4:44 AM, Sonal Goyal wrote:

I think the Mahout project has an XmlInputFormat which you can leverage.

>On Aug 31, 2015 5:10 PM, "rakesh sharma" <rakeshsharm...@hotmail.com> wrote:
>
>I want to parse an XML file in Spark,
>>but as far as the examples are concerned, it is read as a text file. The mapping to
>>XML would be a tedious job.
>>How can I find the number of elements of a particular type using that? Any
>>help in Java/Scala code is also welcome.
>>
>>thanks
>>rakesh
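P.S. As referenced above, here is a minimal sketch of one way the xmlKeyPair RDD of (String,String) might be built before applying the XPath count. The input path and the use of wholeTextFiles (one reasonably small XML document per file) are assumptions for illustration only; if your records live in a SequenceFile or some other layout, you would read them with the corresponding SparkContext method instead.

import org.apache.spark.SparkContext

// Hypothetical input layout: one XML document per file under /path/to/xml.
// wholeTextFiles returns an RDD[(String, String)] of (filePath, fileContents),
// which matches the (key, xml record) shape the XPath counting example expects.
def buildXmlKeyPair(sc: SparkContext) =
  sc.wholeTextFiles("/path/to/xml")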