After playing around with this a little more, I discovered that:

1. If test.json contains something like {"values":[null,1,2,3]}, the schema auto-determined by HiveContext.jsonFile() will have "element: integer (containsNull = true)", and then SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course makes sense but doesn't really help).

2. If I specify the schema myself, e.g., sqlContext.jsonFile("test.json", StructType(Seq(StructField("values", ArrayType(IntegerType, true), true)))), that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto() work (see the sketch below), though as I mentioned before, this is less than ideal.
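For reference, here is roughly what the workaround in point 2 looks like end to end in the shell. This is just a minimal sketch assuming the same single-column test.json; the import path is the catalyst package that the stack trace below points at:

scala> import org.apache.spark.sql.catalyst.types._
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> // Mark the array elements as potentially containing nulls so the
scala> // SchemaRDD's schema matches the schema the Hive table ends up with.
scala> val schema = StructType(Seq(StructField("values", ArrayType(IntegerType, true), true)))
scala> val test = sqlContext.jsonFile("test.json", schema)
scala> test.saveAsTable("test")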
Why don't saveAsTable/insertInto work when the containsNull properties don't match? I can understand how inserting data with containsNull=true into a column where containsNull=false might fail, but I think the other way around (which is the case here) should work.

~ Jonathan

On 11/26/14, 5:23 PM, "Kelly, Jonathan" <jonat...@amazon.com> wrote:

>I've noticed some strange behavior when I try to use
>SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON
>file that contains elements with nested arrays. For example, with a file
>test.json that contains the single line:
>
>  {"values":[1,2,3]}
>
>and with code like the following:
>
>scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>scala> val test = sqlContext.jsonFile("test.json")
>scala> test.saveAsTable("test")
>
>it creates the table but fails when inserting the data into it. Here's
>the exception:
>
>scala.MatchError: ArrayType(IntegerType,true) (of class
>org.apache.spark.sql.catalyst.types.ArrayType)
>  at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
>  at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
>  at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
>  at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
>  at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
>  at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
>  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
>  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>  at org.apache.spark.scheduler.Task.run(Task.scala:54)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
>
>I'm guessing that this is due to the slight difference in the schemas of
>these tables:
>
>scala> test.printSchema
>root
> |-- values: array (nullable = true)
> |    |-- element: integer (containsNull = false)
>
>scala> sqlContext.table("test").printSchema
>root
> |-- values: array (nullable = true)
> |    |-- element: integer (containsNull = true)
>
>If I reload the file using the schema that was created for the Hive
>table and then try inserting the data into the table, it works:
>
>scala> sqlContext.jsonFile("file:///home/hadoop/test.json",
>         sqlContext.table("test").schema).insertInto("test")
>scala> sqlContext.sql("select * from test").collect().foreach(println)
>[ArrayBuffer(1, 2, 3)]
>
>Does this mean that there is a bug with how the schema is being
>automatically determined when you use HiveContext.jsonFile() for JSON
>files that contain nested arrays? (i.e., should containsNull be true for
>the array elements?) Or is there a bug with how the Hive table is
>created from the SchemaRDD?
>(i.e., should containsNull in fact be false?) I can probably get around
>this by defining the schema myself rather than using auto-detection, but
>for now I'd like to use auto-detection.
>
>By the way, I'm using Spark 1.1.0.
>
>Thanks,
>Jonathan
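Incidentally, if re-reading the JSON file just to pick up the table's schema (as in the quoted workaround above) turns out to be expensive, something like the following might also work. This is an untested sketch: applySchema() is the SQLContext method in 1.1 for attaching a StructType to an RDD of Rows, and whether re-applying the table's schema this way avoids the MatchError is an assumption on my part:

scala> // Re-apply the Hive table's schema (element containsNull = true) to the
scala> // SchemaRDD that jsonFile() already produced, then insert into the table.
scala> val coerced = sqlContext.applySchema(test, sqlContext.table("test").schema)
scala> coerced.insertInto("test")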