Re: SparkSQL to read XML Blob data to create multiple rows

Judit Planas Thu, 29 Jun 2017 06:59:44 -0700

Hi Ayan,

I'm afraid I can't help you here. I'm a beginner in Spark with 0experience in Scala (always used PySpark).

In my case, I had a list of numbers, and the explode function wasexactly what I was looking for.


Sorry about that, hope you can solve it.

Best,
Judit

On 29/06/17 15:38, Talap, Amol wrote:


Thanks Judit, Ayan

Judit,

You almost got it. The explode might help here.

But when I tried I see load() doesn’t like to read from xmlcommentcolumn on oracle_data.


scala> val xmlDF = sqlContext.sql("SELECT * FROM oracle_data")

17/06/29 18:31:58 INFO parse.ParseDriver: Parsing command: SELECT *FROM oracle_data


17/06/29 18:31:58 INFO parse.ParseDriver: Parse Completed

…

scala> val xmlDF_flattened =xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))


<console>:22: error: overloaded method value load with alternatives:

()org.apache.spark.sql.DataFrame <and>

(path: String)org.apache.spark.sql.DataFrame

cannot be applied to (org.apache.spark.sql.ColumnName)

val xmlDF_flattened =xmlDF.withColumn("xmlcomment",explode(sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load($"xmlcomment")))


Ayan,

Output of books_inexp.show was as below
title,                             author
Midnight Rain,            Ralls, Kim
Maeve Ascendant,      Corets, Eva

Regards,

Amol

*From:*Judit Planas [mailto:judit.pla...@epfl.ch]
*Sent:* Thursday, June 29, 2017 3:46 AM
*To:* user@spark.apache.org
*Subject:* Re: SparkSQL to read XML Blob data to create multiple rows

Hi Amol,

Not sure I understand completely your question, but the SQL function"explode" may help you:

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.explode

Here you can find a nice example:
https://stackoverflow.com/questions/38210507/explode-in-pyspark

HTH,
Judit

On 29/06/17 09:05, ayan guha wrote:

    Hi

    Not sure if I follow your issue. Can you please post output of
    books_inexp.show()?

    On Thu, Jun 29, 2017 at 2:30 PM, Talap, Amol
    <amol.ta...@capgemini.com <mailto:amol.ta...@capgemini.com>> wrote:

        Hi:

        We are trying to parse XML data to get below output from given
        input sample.

        Can someone suggest a way to pass one DFrames output into
        load() function or any other alternative to get this output.

        Input Data from Oracle Table XMLBlob:

        *SequenceID*

                

        *Name*

                

        *City*

                

        *XMLComment*

        1

                

        Amol

                

        Kolhapur

                

        
<books><Comments><Comment><Title>Title1.1</Title><Description>Description_1.1</Description><Comment><Title>Title1.2</Title><Description>Description_1.2</Description><Comment><Title>Title1.3</Title><Description>Description_1.3</Description></Comment></Comments></books>

        2

                

        Suresh

                

        Mumbai

                

        
<books><Comments><Comment><Title>Title2</Title><Description>Description_2</Description></Comment></Comments></books>

        3

                

        Vishal

                

        Delhi

                

        
<books><Comments><Comment><Title>Title3</Title><Description>Description_3</Description></Comment></Comments></books>

        4

                

        Swastik

                

        Bangalore

                

        
<books><Comments><Comment><Title>Title4</Title><Description>Description_4</Description></Comment></Comments></books>

        Output Data Expected using Spark SQL:

        *SequenceID*

                

        *Name*

                

        *City*

                

        *Title*

                

        *Description*

        1

                

        Amol

                

        Kolhapur

                

        Title1.1

                

        Description_1.1

        1

                

        Amol

                

        Kolhapur

                

        Title1.1

                

        Description_1.2

        1

                

        Amol

                

        Kolhapur

                

        Title1.3

                

        Description_1.3

        2

                

        Suresh

                

        Mumbai

                

        Title2

                

        Description_2

        3

                

        Vishal

                

        Delhi

                

        Title3.1

                

        Description_3.1

        4

                

        Swastik

                

        Bangalore

                

        Title4

                

        Description_4

        I am able to parse single XML using below approach in
        spark-shell using example below but how do we apply the same
        recursively for all rows ?

        
https://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html.



        val dfX =
        
sqlContext.read.format("com.databricks.spark.xml").option("rowTag","book").load("books.xml")

        val xData = dfX.registerTempTable("books")

        dfX.printSchema()

        val books_inexp =sqlContext.sql("select title,author from
        books where price<10")

        books_inexp.show

        Regards,

        Amol

        This message contains information that may be privileged or
        confidential and is the property of the Capgemini Group. It is
        intended only for the person to whom it is addressed. If you
        are not the intended recipient, you are not authorized to
        read, print, retain, copy, disseminate, distribute, or use
        this message or any part thereof. If you receive this message
        in error, please notify the sender immediately and delete all
        copies of this message.

--

    Best Regards,
    Ayan Guha

This message contains information that may be privileged orconfidential and is the property of the Capgemini Group. It isintended only for the person to whom it is addressed. If you are notthe intended recipient, you are not authorized to read, print, retain,copy, disseminate, distribute, or use this message or any partthereof. If you receive this message in error, please notify thesender immediately and delete all copies of this message.

Re: SparkSQL to read XML Blob data to create multiple rows

Reply via email to