Re: spark-xml can't recognize schema

Dave Moyers Sun, 21 Feb 2016 08:13:00 -0800

Make sure the XML input file is well-formed (check your end tags).
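
One way to check this before handing the file to Spark is to run it through a plain XML parser. Below is a minimal sketch that uses only the JDK's SAX parser. The object name WellFormedCheck and the method firstError are made up for illustration; they are not part of spark-xml.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.SAXParseException
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper, not part of spark-xml.
object WellFormedCheck {

  // Returns None if the input parses as well-formed XML,
  // otherwise the parser's message with line and column.
  def firstError(in: InputStream): Option[String] = {
    val parser = SAXParserFactory.newInstance().newSAXParser()
    try {
      // DefaultHandler ignores all content; only parse errors matter here.
      parser.parse(in, new DefaultHandler)
      None
    } catch {
      case e: SAXParseException =>
        Some(s"line ${e.getLineNumber}, column ${e.getColumnNumber}: ${e.getMessage}")
    }
  }

  // Convenience overload for XML held in a string.
  def firstError(xml: String): Option[String] =
    firstError(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
}
```

For a file on disk, pass a java.io.FileInputStream for the path to firstError. A well-formed file should give None; a mismatched end tag should come back as Some(...) with the position where the parser stopped.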


> On Feb 21, 2016, at 8:14 AM, Prathamesh Dharangutte <pratham.d...@gmail.com> 
> wrote:
> 
> This is the code I am using to parse the XML file:
> 
> 
> 
> import org.apache.spark.{SparkConf,SparkContext}
> import org.apache.spark.sql.{DataFrame,SQLContext}
> import com.databricks.spark.xml
> 
> 
> object XmlProcessing {
> 
> def main(args : Array[String]) = {
> 
>     val conf = new SparkConf()
>         .setAppName("XmlProcessing")
>         .setMaster("local")
> 
>     val sc = new SparkContext(conf)
>     val sqlContext : SQLContext = new org.apache.spark.sql.SQLContext(sc)
>     
>     loadXMLdata(sqlContext)        
>     
>     }
> 
> def loadXMLdata(sqlContext : SQLContext) = {
> 
>     // Read each <book> element as one row and let spark-xml infer the schema.
>     val df : DataFrame = sqlContext.read
>         .format("com.databricks.spark.xml")
>         .option("rowTag","book")
>         .load("/home/prathamsh/Workspace/Xml/datafiles/sample.xml")    
>         
>     df.printSchema()
>     
>     
>     }
> 
> }
> 
> 
> 
> 
> 
> 
>> On Sun, Feb 21, 2016 at 7:10 PM, Sebastian Piu <sebastian....@gmail.com> 
>> wrote:
>> Can you paste the code you are using?
>> 
>> 
>>> On Sun, 21 Feb 2016, 13:19 Prathamesh Dharangutte <pratham.d...@gmail.com> 
>>> wrote:
>>> I am trying to parse an XML file using spark-xml, but for some reason when I 
>>> print the schema it shows only root instead of the hierarchy. I am using 
>>> SQLContext to read the data, following this video:
>>> https://www.youtube.com/watch?v=NemEp53yGbI
>>> 
>>> The structure of the XML file is roughly as follows:
>>> 
>>> <books>
>>>   <book>
>>>      <name></name>
>>>      <price></price>
>>>      <orderId></orderId>
>>>   </book>  
>>>    <book>
>>>        <!-- Some more data -->
>>>    </book>
>>> </books>
>>> 
>>> For some books there are multiple orders, i.e. a large number of them, while 
>>> for others the order occurs just once, as an empty element. I use "book" as 
>>> the "rowTag" option. How do I proceed, or is there another way to tackle 
>>> this problem? Help would be much appreciated. Thank you.
>
