Yeah, only a few hours after I sent my message I saw some correspondence on 
this other thread: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-complex-types-like-map-lt-string-map-lt-string-int-gt-gt-in-spark-sql-td19603.html, which describes the exact same issue.  Glad to find that this should be fixed in 1.2.0!  I'll give that a try later.

Thanks a lot,
Jonathan

From: Yin Huai <huaiyin....@gmail.com>
Date: Thursday, November 27, 2014 at 4:37 PM
To: Jonathan Kelly <jonat...@amazon.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded 
from a JSON file using schema auto-detection

Hello Jonathan,

There was a bug regarding casting data types before inserting into a Hive table. Hive does not have the notion of "containsNull" for array values, so for a Hive table, containsNull will always be true for an array, and we should ignore this field for Hive. This issue has been fixed by https://issues.apache.org/jira/browse/SPARK-4245, which will be released with 1.2.
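
If you are curious, the idea behind the fix is roughly the following (a simplified sketch, not the actual SPARK-4245 patch; Spark 1.1.x type imports assumed, where the type classes come in via org.apache.spark.sql._):

import org.apache.spark.sql._

// Relax every ArrayType in the source schema to containsNull = true before
// casting into a Hive table, since Hive cannot represent containsNull = false.
def relaxForHive(dt: DataType): DataType = dt match {
  case ArrayType(elementType, _) =>
    ArrayType(relaxForHive(elementType), containsNull = true)
  case StructType(fields) =>
    StructType(fields.map(f =>
      StructField(f.name, relaxForHive(f.dataType), f.nullable)))
  case other => other
}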

Thanks,

Yin

On Wed, Nov 26, 2014 at 9:01 PM, Kelly, Jonathan <jonat...@amazon.com> wrote:
After playing around with this a little more, I discovered that:

1. If test.json contains something like {"values":[null,1,2,3]}, the schema auto-determined by SchemaRDD.jsonFile() will have "element: integer (containsNull = true)", and then SchemaRDD.saveAsTable()/SchemaRDD.insertInto() will work (which of course makes sense but doesn't really help).
2. If I specify the schema myself (e.g., sqlContext.jsonFile("test.json", StructType(Seq(StructField("values", ArrayType(IntegerType, true), true))))), that also makes SchemaRDD.saveAsTable()/SchemaRDD.insertInto() work, though as I mentioned before, this is less than ideal. (A self-contained version of this is shown below.)
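
For completeness, here is a self-contained version of option 2 (a minimal sketch assuming the Spark 1.1.x API, where the type classes are pulled in via org.apache.spark.sql._):

import org.apache.spark.sql._

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Declaring containsNull = true up front makes the SchemaRDD's schema match
// what the Hive table will report, so the insert no longer hits the MatchError.
val schema = StructType(Seq(
  StructField("values", ArrayType(IntegerType, containsNull = true), nullable = true)))
val test = sqlContext.jsonFile("test.json", schema)
test.saveAsTable("test")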

Why don't saveAsTable/insertInto work when the containsNull properties
don't match?  I can understand how inserting data with containsNull=true
into a column where containsNull=false might fail, but I think the other
way around (which is the case here) should work.

~ Jonathan


On 11/26/14, 5:23 PM, "Kelly, Jonathan" <jonat...@amazon.com> wrote:

>I've noticed some strange behavior when I try to use
>SchemaRDD.saveAsTable() with a SchemaRDD that I've loaded from a JSON file
>that contains elements with nested arrays.  For example, with a file
>test.json that contains the single line:
>
>       {"values":[1,2,3]}
>
>and with code like the following:
>
>scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
>scala> val test = sqlContext.jsonFile("test.json")
>scala> test.saveAsTable("test")
>
>it creates the table but fails when inserting the data into it.  Here's
>the exception:
>
>scala.MatchError: ArrayType(IntegerType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
>       at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
>       at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
>       at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
>       at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
>       at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
>       at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
>       at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>       at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>       at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
>       at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
>       at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>       at org.apache.spark.scheduler.Task.run(Task.scala:54)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
>
>I'm guessing that this is due to the slight difference in the schemas of
>these tables:
>
>scala> test.printSchema
>root
> |-- values: array (nullable = true)
> |    |-- element: integer (containsNull = false)
>
>
>scala> sqlContext.table("test").printSchema
>root
> |-- values: array (nullable = true)
> |    |-- element: integer (containsNull = true)
>
>If I reload the file using the schema that was created for the Hive table
>then try inserting the data into the table, it works:
>
>scala> sqlContext.jsonFile("file:///home/hadoop/test.json",
>sqlContext.table("test").schema).insertInto("test")
>scala> sqlContext.sql("select * from test").collect().foreach(println)
>[ArrayBuffer(1, 2, 3)]
>
>Does this mean that there is a bug with how the schema is being
>automatically determined when you use HiveContext.jsonFile() for JSON
>files that contain nested arrays?  (i.e., should containsNull be true for
>the array elements?)  Or is there a bug with how the Hive table is created
>from the SchemaRDD?  (i.e., should containsNull in fact be false?)  I can
>probably get around this by defining the schema myself rather than using
>auto-detection, but for now I'd like to use auto-detection.
>
>By the way, I'm using Spark 1.1.0.
>
>Thanks,
>Jonathan
>

