After playing around with this a little more, I discovered that:

1. If test.json contains something like {"values":[null,1,2,3]}, the schema auto-determined by HiveContext.jsonFile() will have "element: integer (containsNull = true)", and then SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course makes sense but doesn't really help).

2. If I specify the schema myself, e.g., sqlContext.jsonFile("test.json", StructType(Seq(StructField("values", ArrayType(IntegerType, true), true)))), that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto() work (see the sketch below), though as I mentioned before, this is less than ideal.
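For reference, here is roughly what the workaround in point 2 looks like end to end in the shell. This is just a minimal sketch assuming the same single-column test.json; the import path is the catalyst package that the stack trace below points at:

scala> import org.apache.spark.sql.catalyst.types._
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> // Mark the array elements as potentially containing nulls so the
scala> // SchemaRDD's schema matches the schema the Hive table ends up with.
scala> val schema = StructType(Seq(StructField("values", ArrayType(IntegerType, true), true)))
scala> val test = sqlContext.jsonFile("test.json", schema)
scala> test.saveAsTable("test")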
Why don't saveAsTable/insertInto work when the containsNull properties don't match? I can understand how inserting data with containsNull=true into a column where containsNull=false might fail, but I think the other way around (which is the case here) should work.

~ Jonathan

On 11/26/14, 5:23 PM, "Kelly, Jonathan" <jonat...@amazon.com> wrote:

>I've noticed some strange behavior when I try to use
>SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON
>file that contains elements with nested arrays. For example, with a file
>test.json that contains the single line:
>
>  {"values":[1,2,3]}
>
>and with code like the following:
>
>scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>scala> val test = sqlContext.jsonFile("test.json")
>scala> test.saveAsTable("test")
>
>it creates the table but fails when inserting the data into it. Here's
>the exception:
>
>scala.MatchError: ArrayType(IntegerType,true) (of class
>org.apache.spark.sql.catalyst.types.ArrayType)
>  at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
>  at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
>  at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
>  at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
>  at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
>  at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
>  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
>  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>  at org.apache.spark.scheduler.Task.run(Task.scala:54)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
>
>I'm guessing that this is due to the slight difference in the schemas of
>these tables:
>
>scala> test.printSchema
>root
> |-- values: array (nullable = true)
> |    |-- element: integer (containsNull = false)
>
>scala> sqlContext.table("test").printSchema
>root
> |-- values: array (nullable = true)
> |    |-- element: integer (containsNull = true)
>
>If I reload the file using the schema that was created for the Hive
>table and then try inserting the data into the table, it works:
>
>scala> sqlContext.jsonFile("file:///home/hadoop/test.json",
>         sqlContext.table("test").schema).insertInto("test")
>scala> sqlContext.sql("select * from test").collect().foreach(println)
>[ArrayBuffer(1, 2, 3)]
>
>Does this mean that there is a bug with how the schema is being
>automatically determined when you use HiveContext.jsonFile() for JSON
>files that contain nested arrays? (i.e., should containsNull be true for
>the array elements?) Or is there a bug with how the Hive table is
>created from the SchemaRDD?
>(i.e., should containsNull in fact be false?) I can probably get around
>this by defining the schema myself rather than using auto-detection, but
>for now I'd like to use auto-detection.
>
>By the way, I'm using Spark 1.1.0.
>
>Thanks,
>Jonathan
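Incidentally, if re-reading the JSON file just to pick up the table's schema (as in the quoted workaround above) turns out to be expensive, something like the following might also work. This is an untested sketch: applySchema() is the SQLContext method in 1.1 for attaching a StructType to an RDD of Rows, and whether re-applying the table's schema this way avoids the MatchError is an assumption on my part:

scala> // Re-apply the Hive table's schema (element containsNull = true) to the
scala> // SchemaRDD that jsonFile() already produced, then insert into the table.
scala> val coerced = sqlContext.applySchema(test, sqlContext.table("test").schema)
scala> coerced.insertInto("test")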