Hi Cheng, thank you very much for helping me finally figure out the secret
behind this magic...

Actually, we defined this external table with:
    SID STRING
    REQUEST_ID STRING
    TIMES_DQ TIMESTAMP
    TOTAL_PRICE FLOAT
    ...
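For context, here is a rough sketch of the DDL (runnable from the spark
shell); the SerDe class (the commonly used com.bizo.hive.serde.csv.CSVSerde)
and the location below are illustrative assumptions, not our exact statement:

    // sketch only: SerDe class and location are illustrative assumptions
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql("""
      CREATE EXTERNAL TABLE ext_fullorders (
        sid         STRING,
        request_id  STRING,
        times_dq    TIMESTAMP,
        total_price FLOAT
      )
      ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
      LOCATION '/data/ext_fullorders'
    """)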

using "desc table ext_fullorders" it is only shown as
[# col_name             data_type               comment             ]
...
[times_dq               string                  from deserializer   ]
[total_price            string                  from deserializer   ]
...
because, as you said, the CSVSerde sets all field object inspectors to
javaStringObjectInspector, which is why every column carries the comment
"from deserializer".
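Roughly what happens inside the SerDe's initialize(), sketched in Scala (not
the exact CSVSerde source, but the same idea):

    import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory}
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory
    import scala.collection.JavaConverters._

    // every column gets javaStringObjectInspector, whatever its declared type is
    val columnNames = Seq("sid", "request_id", "times_dq", "total_price")
    val columnOIs = columnNames
      .map(_ => PrimitiveObjectInspectorFactory.javaStringObjectInspector: ObjectInspector)
      .asJava
    val inspector =
      ObjectInspectorFactory.getStandardStructObjectInspector(columnNames.asJava, columnOIs)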

But the StorageDescriptor still holds the real user-defined types; using
"desc extended ext_fullorders" we can see that its sd:StorageDescriptor
contains:
FieldSchema(name:times_dq, type:timestamp, comment:null),
FieldSchema(name:total_price, type:float, comment:null)
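The same thing can be seen programmatically through the metastore client
(sketch, assuming the table lives in the "default" database):

    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    import scala.collection.JavaConverters._

    val client = new HiveMetaStoreClient(new HiveConf())
    val sd = client.getTable("default", "ext_fullorders").getSd
    // prints the user-declared types (timestamp, float, ...), not the
    // strings the SerDe will actually hand back at read time
    sd.getCols.asScala.foreach(fs => println(s"${fs.getName}: ${fs.getType}"))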

and Spark's HiveContext reads the schema info from this StorageDescriptor:
https://github.com/apache/spark/blob/7e191fe29bb09a8560cd75d453c4f7f662dff406/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L316

So, in the SchemaRDD, the fields in each Row were filled with strings (via
fillObject, all values were retrieved from the CSVSerDe with
javaStringObjectInspector),

but Spark believes some of them are float or timestamp (the schema info
comes from sd:StorageDescriptor).
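The mismatch is easy to see from the spark shell (a sketch; the output
comments show what we observe):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val rdd = hiveContext.sql("SELECT times_dq, total_price FROM ext_fullorders LIMIT 1")
    // printSchema() reports the metastore types...
    rdd.printSchema()
    // root
    //  |-- times_dq: timestamp (nullable = true)
    //  |-- total_price: float (nullable = true)
    // ...but the runtime value is a plain String
    println(rdd.first()(1).getClass)  // class java.lang.String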

crazy...

And sorry for the update over the weekend...

A little more about how I found this problem and why it is troublesome for us.

We use the new Spark Thrift Server; querying normal managed Hive tables
works fine,

but when we try to access external tables with a custom SerDe such as this
CSVSerDe, we get a ClassCastException like:
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Float

The reason is here:
https://github.com/apache/spark/blob/d94a44d7caaf3fe7559d9ad7b10872fa16cf81ca/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/SparkSQLOperationManager.scala#L104-L105

Here Spark's Thrift Server tries to get a float value from the SparkRow,
because in the schema info (sd:StorageDescriptor) this column is a float,
but in the SparkRow the field was actually filled with a string value...
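The failure itself is trivial to reproduce in isolation, since a typed
getter boils down to a cast (sketch):

    // what a typed getter effectively does: cast the underlying value
    val value: Any = "12.34"            // a String, as fillObject stored it
    val f = value.asInstanceOf[Float]   // java.lang.ClassCastException:
                                        // java.lang.String cannot be cast to java.lang.Float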


