Hi Kwon,

I was trying out the Spark XML library. I keep getting errors when it infers the schema. It looks like it cannot infer a schema from single-line XML data.
Sent from Samsung Mobile.

-------- Original message --------
From: Hyukjin Kwon <[email protected]>
Date: 21/08/2016 15:40 (GMT+05:30)
To: Jörn Franke <[email protected]>
Cc: Diwakar Dhanuskodi <[email protected]>, Felix Cheung <[email protected]>, user <[email protected]>
Subject: Re: Best way to read XML data from RDD

Hi Diwakar,

The Spark XML library can take an RDD as its source.

```
import com.databricks.spark.xml.XmlReader

// rdd is an RDD[String] holding XML; each <book> element becomes one row.
val df = new XmlReader()
  .withRowTag("book")
  .xmlRdd(sqlContext, rdd)
```

If performance is critical, I would also recommend taking care over the creation and destruction of the parser.

If the parser is not serializable, you can create it once per partition inside mapPartitions, just like

https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325

I hope this is helpful.

2016-08-20 15:10 GMT+09:00 Jörn Franke <[email protected]>:

I fear the issue is that this will create and destroy an XML parser object 2 million times, which is very inefficient; it does not really look like a parser performance issue. Can't you do something about the format choice? Could you ask your supplier to deliver another format (ideally Avro or something like it)? Otherwise you could create just one XML parser object per node, but sharing it among the parallel tasks on the same node is tricky. The other possibility is simply more hardware ...

On 20 Aug 2016, at 06:41, Diwakar Dhanuskodi <[email protected]> wrote:

Yes. It accepts an XML file as a source but not an RDD. The XML data, embedded inside JSON, is streamed from a Kafka cluster, so I can get it as an RDD. Right now I am using the scala.xml XML.loadString method inside an RDD map function, but I am not happy with the performance: it takes 4 minutes to parse the XML from 2 million messages in a 3-node environment with 100G and 4 CPUs each.

Sent from Samsung Mobile.
-------- Original message --------
From: Felix Cheung <[email protected]>
Date: 20/08/2016 09:49 (GMT+05:30)
To: Diwakar Dhanuskodi <[email protected]>, user <[email protected]>
Cc:
Subject: Re: Best way to read XML data from RDD

Have you tried https://github.com/databricks/spark-xml ?

On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" <[email protected]> wrote:

Hi,

There is an RDD with JSON data. I could read the JSON data using rdd.read.json. The JSON data has XML data in a couple of key-value pairs. What is the best way to read and parse XML from an RDD? Are there any specific XML libraries for Spark? Could anyone help with this?

Thanks.
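
For anyone following the per-partition suggestion in this thread, here is a minimal sketch of how it might look. It is an illustration under assumptions, not the approach actually used by anyone above: it swaps in the JDK's `DocumentBuilder` (which is neither serializable nor thread-safe) in place of scala.xml, and the object name `PerPartitionXmlParsing`, the method `parsePartition`, and the `title` element are all hypothetical. The function is written against a plain `Iterator[String]` so it can be tried without a cluster.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource

object PerPartitionXmlParsing {

  // Parses every XML string of one partition with a single DocumentBuilder,
  // so the parser is created once per partition rather than once per message.
  // Returns the text of the first <title> element (hypothetical field) of each
  // document, or an empty string when the element is missing.
  def parsePartition(messages: Iterator[String]): Iterator[String] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    messages.map { xml =>
      val doc = builder.parse(new InputSource(new StringReader(xml)))
      val titles = doc.getElementsByTagName("title")
      if (titles.getLength > 0) titles.item(0).getTextContent else ""
    }
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for the contents of one partition.
    val sample = Iterator(
      "<book><title>Spark</title></book>",
      "<book><title>XML</title></book>"
    )
    parsePartition(sample).foreach(println)
  }
}
```

With an `RDD[String]` of XML messages (called `xmlRdd` here, also an assumed name), the same function could be passed as `xmlRdd.mapPartitions(PerPartitionXmlParsing.parsePartition)`. Because the builder is constructed inside the function, it is created on the executor and never has to be serialized.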
