here is an example:

    df1 = df0.select(explode("manager.subordinates.subordinate_clerk.duties").alias("duties_flat"),
                     col("duties_flat.duty.name").alias("duty_name"))

this is in pyspark; I may have some part of this wrong, didn't test it, but something similar.
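A fuller sketch of the same idea, equally untested: the schema quoted below has two array levels (subordinate_clerk and duty), so the explode has to happen in two separate selects (Spark allows only one generator per select, and an alias defined in a select cannot be referenced inside that same select). The names clerk, duty_struct, and duty_name are made up for illustration:

    from pyspark.sql.functions import col, explode

    # Flatten the outer array: one row per subordinate clerk.
    clerks = df0.select(
        explode(col("manager.subordinates.subordinate_clerk")).alias("clerk"))

    # Flatten the inner array: one row per duty, then pull out its name.
    duties = clerks.select(explode(col("clerk.duties.duty")).alias("duty_struct"))
    duty_names = duties.select(col("duty_struct.name").alias("duty_name"))

Note that explode drops rows whose array is null, so clerks with no <duties> element (clerk2 and clerk3 in the sample below) simply disappear from duty_names.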
On Wed, Aug 31, 2016 at 5:54 PM, <srikanth.je...@gmail.com> wrote:

> How do we explode nested arrays?
>
> Thanks,
> Sreekanth Jella
>
> *From:* Peyman Mohajerian <mohaj...@gmail.com>
> *Sent:* Wednesday, August 31, 2016 7:41 PM
> *To:* srikanth.je...@gmail.com
> *Cc:* user@spark.apache.org
> *Subject:* Re: AnalysisException exception while parsing XML
>
> Once you get to the 'Array' type, you have to use explode; you cannot do
> the same traversal.
>
> On Wed, Aug 31, 2016 at 2:19 PM, <srikanth.je...@gmail.com> wrote:
>
> Hello Experts,
>
> I am using the Spark XML package to parse XML. The exception below is
> thrown when trying to *parse a tag which exists at array-of-arrays depth*,
> i.e. in this case subordinate_clerk.xxxx.duty.name
>
> With the sample XML below, the issue is reproducible:
>
> <emplist>
>   <emp>
>     <manager>
>       <id>1</id>
>       <name>mgr1</name>
>       <dateOfJoin>2005-07-31</dateOfJoin>
>       <subordinates>
>         <subordinate_clerk>
>           <cid>2</cid>
>           <cname>clerk2</cname>
>           <dateOfJoin>2005-07-31</dateOfJoin>
>         </subordinate_clerk>
>         <subordinate_clerk>
>           <cid>3</cid>
>           <cname>clerk3</cname>
>           <dateOfJoin>2005-07-31</dateOfJoin>
>         </subordinate_clerk>
>       </subordinates>
>     </manager>
>   </emp>
>   <emp>
>     <manager>
>       <id>11</id>
>       <name>mgr11</name>
>       <subordinates>
>         <subordinate_clerk>
>           <cid>12</cid>
>           <cname>clerk12</cname>
>           <duties>
>             <duty>
>               <name>first duty</name>
>             </duty>
>             <duty>
>               <name>second duty</name>
>             </duty>
>           </duties>
>         </subordinate_clerk>
>       </subordinates>
>     </manager>
>   </emp>
> </emplist>
>
> scala> df.select("manager.subordinates.subordinate_clerk.duties.duty.name").show
>
> Exception is:
>
> org.apache.spark.sql.AnalysisException: cannot resolve
> 'manager.subordinates.subordinate_clerk.duties.duty[name]' due to data type
> mismatch: argument 2 requires integral type, however, 'name' is of string
> type.;
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:321)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:332)
>   ... more
>
> scala> df.printSchema
> root
>  |-- manager: struct (nullable = true)
>  |    |-- dateOfJoin: string (nullable = true)
>  |    |-- id: long (nullable = true)
>  |    |-- name: string (nullable = true)
>  |    |-- subordinates: struct (nullable = true)
>  |    |    |-- subordinate_clerk: array (nullable = true)
>  |    |    |    |-- element: struct (containsNull = true)
>  |    |    |    |    |-- cid: long (nullable = true)
>  |    |    |    |    |-- cname: string (nullable = true)
>  |    |    |    |    |-- dateOfJoin: string (nullable = true)
>  |    |    |    |    |-- duties: struct (nullable = true)
>  |    |    |    |    |    |-- duty: array (nullable = true)
>  |    |    |    |    |    |    |-- element: struct (containsNull = true)
>  |    |    |    |    |    |    |    |-- name: string (nullable = true)
>
> Versions info:
> Spark - 1.6.0
> Scala - 2.10.5
> Spark XML - com.databricks:spark-xml_2.10:0.3.3
>
> Please let me know if there is a solution or workaround for this.
>
> Thanks,
> Sreekanth
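For completeness, a hedged sketch of how the sample above could be loaded so that df ends up with the quoted schema. This assumes Spark 1.6 with spark-xml 0.3.3 on the classpath (e.g. --packages com.databricks:spark-xml_2.10:0.3.3); the file name emplist.xml and the variable names are illustrative, not from the thread:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # sc: an existing SparkContext

    # spark-xml turns each <emp> element into one row.
    df = (sqlContext.read
          .format("com.databricks.spark.xml")
          .option("rowTag", "emp")
          .load("emplist.xml"))

    df.printSchema()  # should match the schema quoted above

From there, dot notation alone cannot traverse two array levels, which is what the AnalysisException is complaining about; the two-step explode sketched at the top of the thread is the usual workaround.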