Hello Experts,

I am using the Spark XML package to parse XML. The exception below is thrown
when I try to select a tag that sits at array-of-arrays depth, i.e. in this
case subordinate_clerk.xxxx.duty.name.

The issue is reproducible with the sample XML below:

<emplist>
  <emp>
   <manager>
    <id>1</id>
    <name>mgr1</name>
    <dateOfJoin>2005-07-31</dateOfJoin>
    <subordinates>
      <subordinate_clerk>
        <cid>2</cid>
        <cname>clerk2</cname>
        <dateOfJoin>2005-07-31</dateOfJoin>
      </subordinate_clerk>
      <subordinate_clerk>
        <cid>3</cid>
        <cname>clerk3</cname>
        <dateOfJoin>2005-07-31</dateOfJoin>
      </subordinate_clerk>
    </subordinates>
   </manager>
  </emp>
  <emp>
   <manager>
    <id>11</id>
    <name>mgr11</name>
    <subordinates>
      <subordinate_clerk>
        <cid>12</cid>
        <cname>clerk12</cname>
        <duties>
          <duty>
            <name>first duty</name>
          </duty>
          <duty>
            <name>second duty</name>
          </duty>
        </duties>
      </subordinate_clerk>
    </subordinates>
   </manager>
  </emp>
</emplist>      


scala> df.select("manager.subordinates.subordinate_clerk.duties.duty.name").show

Exception is:
 org.apache.spark.sql.AnalysisException: cannot resolve 'manager.subordinates.subordinate_clerk.duties.duty[name]' due to data type mismatch: argument 2 requires integral type, however, 'name' is of string type.;
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
        at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
        at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
        at scala.collection.AbstractIterator.to(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
        at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:321)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:332)
        ... more




scala> df.printSchema
root
 |-- manager: struct (nullable = true)
 |    |-- dateOfJoin: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- subordinates: struct (nullable = true)
 |    |    |-- subordinate_clerk: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- cid: long (nullable = true)
 |    |    |    |    |-- cname: string (nullable = true)
 |    |    |    |    |-- dateOfJoin: string (nullable = true)
 |    |    |    |    |-- duties: struct (nullable = true)
 |    |    |    |    |    |-- duty: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- name: string (nullable = true)
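
From the schema, subordinate_clerk is an array of structs and duty is another
array nested inside each element, so the failing select has to traverse two
array levels at once. Would something like the following explode-based
approach be the expected workaround? This is only a sketch I put together and
I have not verified it avoids the analyzer error on 1.6:

```scala
import org.apache.spark.sql.functions.explode

// Flatten the outer array first: one row per subordinate_clerk element,
// so only the inner duty array remains when selecting the field.
val clerks = df.select(
  explode(df("manager.subordinates.subordinate_clerk")).as("clerk"))

// clerk.duties.duty is now a single array of structs; selecting .name
// should yield an array<string> column per clerk row.
clerks.select("clerk.duties.duty.name").show()
```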



Versions info:
Spark - 1.6.0
Scala - 2.10.5
Spark XML - com.databricks:spark-xml_2.10:0.3.3

Please let me know if there is a solution or workaround for this.

Thanks,
Sreekanth
