Yes, the root cause is that the output ObjectInspector in the SerDe 
implementation doesn't reflect the real TypeInfo.

Hive actually provides an API for exactly this mapping: 
TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(TypeInfo).

You probably need to update the code at 
https://github.com/ogrodnek/csv-serde/blob/master/src/main/java/com/bizo/hive/serde/csv/CSVSerde.java#L60.
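
For illustration, a minimal sketch of what that change might look like, 
assuming the declared column names and types arrive through the standard table 
properties (serdeConstants.LIST_COLUMNS / serdeConstants.LIST_COLUMN_TYPES); 
this is a paraphrase, not the verbatim CSVSerde source:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.hive.serde.serdeConstants;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
    import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;

    // Inside initialize(Configuration conf, Properties tbl):
    List<String> columnNames =
        Arrays.asList(tbl.getProperty(serdeConstants.LIST_COLUMNS).split(","));
    List<TypeInfo> columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(
        tbl.getProperty(serdeConstants.LIST_COLUMN_TYPES));

    // Build one inspector per declared type (string, float, timestamp, ...)
    // instead of javaStringObjectInspector for every column.
    List<ObjectInspector> columnOIs =
        new ArrayList<ObjectInspector>(columnTypes.size());
    for (TypeInfo typeInfo : columnTypes) {
      columnOIs.add(
          TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(typeInfo));
    }

    inspector = ObjectInspectorFactory
        .getStandardStructObjectInspector(columnNames, columnOIs);

Note that deserialize() would then also have to return values of the matching 
Java types (Float, Timestamp, ...) rather than raw Strings, otherwise the 
inspectors and the actual row objects would disagree again.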

-----Original Message-----
From: chutium [mailto:teng....@gmail.com] 
Sent: Monday, September 01, 2014 2:58 AM
To: d...@spark.incubator.apache.org
Subject: Re: HiveContext, schemaRDD.printSchema get different dataTypes, 
feature or a bug? really strange and surprised...

Hi Cheng, thank you very much for helping me finally uncover the secret behind 
this magic...

actually we defined this external table with
    SID STRING
    REQUEST_ID STRING
    TIMES_DQ TIMESTAMP
    TOTAL_PRICE FLOAT
    ...

using "desc table ext_fullorders" it is only shown as
[# col_name             data_type               comment             ]
...
[times_dq               string                  from deserializer   ]
[total_price            string                  from deserializer   ]
...
because, as you said, CSVSerde sets all field object inspectors to 
javaStringObjectInspector, which is why those columns carry the "from 
deserializer" comment (see the sketch just below).
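
(For context, the pattern in CSVSerde's initialize() looks roughly like this; 
paraphrased, not the verbatim source:)

    // Every column gets a string inspector, regardless of its declared type:
    List<ObjectInspector> columnOIs = new ArrayList<ObjectInspector>(numCols);
    for (int i = 0; i < numCols; i++) {
      columnOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    }
    inspector = ObjectInspectorFactory
        .getStandardStructObjectInspector(columnNames, columnOIs);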

But the StorageDescriptor holds the real user-defined types: using "desc 
extended ext_fullorders" we can see that its sd:StorageDescriptor contains:
FieldSchema(name:times_dq, type:timestamp, comment:null), 
FieldSchema(name:total_price, type:float, comment:null)

And Spark's HiveContext reads the schema info from this StorageDescriptor:
https://github.com/apache/spark/blob/7e191fe29bb09a8560cd75d453c4f7f662dff406/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L316

So in the SchemaRDD, the fields of each Row were filled with strings (via 
fillObject, all values were retrieved from CSVSerde with 
javaStringObjectInspector),

but Spark believes some of them are floats or timestamps (the schema info was 
taken from sd:StorageDescriptor).

crazy...

And sorry for the update over the weekend...

A little more about how I found this problem and why it is a problem for us:

We use the new Spark Thrift server; querying normal managed Hive tables works 
fine,

but when we try to access external tables with a custom SerDe such as this 
CSVSerde, we get a ClassCastException like:
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float

The reason is here:
https://github.com/apache/spark/blob/d94a44d7caaf3fe7559d9ad7b10872fa16cf81ca/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/server/SparkSQLOperationManager.scala#L104-L105

Here Spark's Thrift server tries to read a float value from the SparkRow, 
because in the schema info (sd:StorageDescriptor) this column is a float, but 
the field in the SparkRow was actually filled with a string value...
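
In other words, the failure boils down to something like this (illustrative 
only, with a hypothetical value):

    public class CastMismatchDemo {
      public static void main(String[] args) {
        // The schema (from sd:StorageDescriptor) says float, so the Thrift
        // server casts the cell to Float, but the SerDe actually produced
        // a String via javaStringObjectInspector:
        Object cell = "12.34";
        float price = (Float) cell;  // throws java.lang.ClassCastException
        System.out.println(price);
      }
    }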






---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
