Hi Ayan,
I'm afraid I can't help you here. I'm a beginner in Spark with 0
experience in Scala (always used PySpark).
In my case, I had a list of numbers, and the explode function was
exactly what I was looking for.
Sorry about that, hope you can solve it.
Best,
Judit
On 29/06/17 15:38, Talap, Amol wrote:
Thanks Judit, Ayan
Judit,
You almost got it. The explode might help here.
But when I tried I see load() doesn’t like to read from xmlcomment
column on oracle_data.
scala> val xmlDF = sqlContext.sql("SELECT * FROM oracle_data")
17/06/29 18:31:58 INFO parse.ParseDriver: Parsing command: SELECT *
FROM oracle_data
17/06/29 18:31:58 INFO parse.ParseDriver: Parse Completed
…
scala> val xmlDF_flattened =
xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))
<console>:22: error: overloaded method value load with alternatives:
()org.apache.spark.sql.DataFrame <and>
(path: String)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.sql.ColumnName)
val xmlDF_flattened =
xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))
Ayan,
Output of books_inexp.show was as below
title, author
Midnight Rain, Ralls, Kim
Maeve Ascendant, Corets, Eva
Regards,
Amol
*From:*Judit Planas [mailto:judit.pla...@epfl.ch]
*Sent:* Thursday, June 29, 2017 3:46 AM
*To:* user@spark.apache.org
*Subject:* Re: SparkSQL to read XML Blob data to create multiple rows
Hi Amol,
Not sure I understand completely your question, but the SQL function
"explode" may help you:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode
Here you can find a nice example:
https://stackoverflow.com/questions/38210507/explode-in-pyspark
HTH,
Judit
On 29/06/17 09:05, ayan guha wrote:
Hi
Not sure if I follow your issue. Can you please post output of
books_inexp.show()?
On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol
<amol.ta...@capgemini.com <mailto:amol.ta...@capgemini.com>> wrote:
Hi:
We are trying to parse XML data to get below output from given
input sample.
Can someone suggest a way to pass one DFrames output into
load() function or any other alternative to get this output.
Input Data from Oracle Table XMLBlob:
*SequenceID*
*Name*
*City*
*XMLComment*
1
Amol
Kolhapur
<books><Comments><Comment><Title>Title1.1</Title><Description>Description_1.1</Description><Comment><Title>Title1.2</Title><Description>Description_1.2</Description><Comment><Title>Title1.3</Title><Description>Description_1.3</Description></Comment></Comments></books>
2
Suresh
Mumbai
<books><Comments><Comment><Title>Title2</Title><Description>Description_2</Description></Comment></Comments></books>
3
Vishal
Delhi
<books><Comments><Comment><Title>Title3</Title><Description>Description_3</Description></Comment></Comments></books>
4
Swastik
Bangalore
<books><Comments><Comment><Title>Title4</Title><Description>Description_4</Description></Comment></Comments></books>
Output Data Expected using Spark SQL:
*SequenceID*
*Name*
*City*
*Title*
*Description*
1
Amol
Kolhapur
Title1.1
Description_1.1
1
Amol
Kolhapur
Title1.1
Description_1.2
1
Amol
Kolhapur
Title1.3
Description_1.3
2
Suresh
Mumbai
Title2
Description_2
3
Vishal
Delhi
Title3.1
Description_3.1
4
Swastik
Bangalore
Title4
Description_4
I am able to parse single XML using below approach in
spark-shell using example below but how do we apply the same
recursively for all rows ?
https://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html.
val dfX =
sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load("books.xml")
val xData = dfX.registerTempTable("books")
dfX.printSchema()
val books_inexp =sqlContext.sql("select title,author from
books where price<10")
books_inexp.show
Regards,
Amol
This message contains information that may be privileged or
confidential and is the property of the Capgemini Group. It is
intended only for the person to whom it is addressed. If you
are not the intended recipient, you are not authorized to
read, print, retain, copy, disseminate, distribute, or use
this message or any part thereof. If you receive this message
in error, please notify the sender immediately and delete all
copies of this message.
--
Best Regards,
Ayan Guha
This message contains information that may be privileged or
confidential and is the property of the Capgemini Group. It is
intended only for the person to whom it is addressed. If you are not
the intended recipient, you are not authorized to read, print, retain,
copy, disseminate, distribute, or use this message or any part
thereof. If you receive this message in error, please notify the
sender immediately and delete all copies of this message.