Is there some kind of automatic data type conversion or detection in HiveContext? All columns of one of our tables are defined as string in the Hive metastore,
but one column, total_price, with values like 123.45, gets recognized as Float in HiveContext. Is this a feature or a bug? It really surprised me. How is it implemented, and if it is a feature, can I turn it off? I want a SchemaRDD with exactly the data types defined in the Hive metastore. I know the total_price column should contain float values, but that is not guaranteed: what happens if there is a broken line somewhere in my huge CSV file, or if some total_price looks like 9,123.45 or $123.45?

==============================================================

An example from our environment:

MapR v3 cluster, newest Spark GitHub master, cloned yesterday
built with: sbt/sbt -Dhadoop.version=1.0.3-mapr-3.0.3 -Phive assembly
hive-site.xml configured

==============================================================

spark-shell session:

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use our_live_db")
hiveContext.sql("desc formatted et_fullorders").collect.foreach(println)
...
...
14/08/26 15:47:09 INFO SparkContext: Job finished: collect at SparkPlan.scala:85, took 0.0305408 s
[# col_name            data_type    comment           ]
[                                                     ]
[sid                   string       from deserializer ]
[request_id            string       from deserializer ]
[*times_dq             string*      from deserializer ]
[*total_price          string*      from deserializer ]
[order_id              string       from deserializer ]
[                                                     ]
[# Partition Information                              ]
[# col_name            data_type    comment           ]
[                                                     ]
[wt_date               string       None              ]
[country               string       None              ]
[                                                     ]
[# Detailed Table Information                         ]
[Database:             our_live_db                    ]
[Owner:                client02                       ]
[CreateTime:           Fri Jan 31 12:23:40 CET 2014   ]
[LastAccessTime:       UNKNOWN                        ]
[Protect Mode:         None                           ]
[Retention:            0                              ]
[Location:             maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders ]
[Table Type:           EXTERNAL_TABLE                 ]
[Table Parameters:                                    ]
[       EXTERNAL                TRUE                  ]
[       transient_lastDdlTime   1391167420            ]
[                                                     ]
[# Storage Information                                ]
[SerDe Library:        com.bizo.hive.serde.csv.CSVSerde                            ]
[InputFormat:          org.apache.hadoop.mapred.TextInputFormat                    ]
[OutputFormat:         org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  ]
[Compressed:           No                             ]
[Num Buckets:          -1                             ]
[Bucket Columns:       []                             ]
[Sort Columns:         []                             ]
[Storage Desc Params:                                 ]
[       separatorChar           ;                     ]
[       serialization.format    1                     ]

Then I create a SchemaRDD from this table:

val result = hiveContext.sql("select sid, order_id, total_price, times_dq from et_fullorders where wt_date='2014-04-14' and country='uk' limit 5")

OK, now printSchema:

scala> result.printSchema
root
 |-- sid: string (nullable = true)
 |-- order_id: string (nullable = true)
 |-- *total_price: float* (nullable = true)
 |-- *times_dq: timestamp* (nullable = true)

total_price was STRING in the metastore but is FLOAT in the SchemaRDD, and times_dq is now TIMESTAMP. Really strange and surprising.
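For now, a workaround I am considering (untested, only a sketch) is to cast the affected columns back to string in the query itself, hoping the SchemaRDD then carries the types the metastore declares:

// hypothetical workaround: force the columns back to string via CAST in HiveQL
val asStrings = hiveContext.sql("select sid, order_id, cast(total_price as string) as total_price, cast(times_dq as string) as times_dq from et_fullorders where wt_date='2014-04-14' and country='uk' limit 5")
asStrings.printSchema   // hoping all four columns now come back as string

No idea yet whether this actually bypasses whatever does the detection, or whether the cast happens after the (possibly broken) conversion.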
Even stranger:

scala> result.map(row => row.getString(2)).collect.foreach(println)

works and prints

240.00
45.83
21.67
95.83
120.83

but

scala> result.map(row => row.getFloat(2)).collect.foreach(println)

fails with

14/08/26 16:01:24 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 8)
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Float
        at scala.runtime.BoxesRunTime.unboxToFloat(BoxesRunTime.java:114)

So the schema claims float, but at runtime the values are apparently still strings.

==============================================================

By the way, the files in this external table are gzipped CSV files:

14/08/26 15:49:56 INFO HadoopRDD: Input split: maprfs:/mapr/cluster01.xxx.net/common/external_tables/et_fullorders/wt_date=2014-04-14/country=uk/getFullOrders_2014-04-14.csv.gz:0+16990

and the data in them:

scala> result.collect.foreach(println)
[5000000001402123123,12344000123454,240.00,2014-04-14 00:03:49.082000]
[5000000001402110123,12344000123455,45.83,2014-04-14 00:04:13.639000]
[5000000001402129123,12344000123458,21.67,2014-04-14 00:09:12.276000]
[5000000001402092123,12344000132457,95.83,2014-04-14 00:09:42.228000]
[5000000001402135123,12344000123460,120.83,2014-04-14 00:12:44.742000]

We use this CSVSerDe:
https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar

Maybe that is the reason? But then why are the 1st and 2nd columns not recognized as bigint or double or something?

Thanks for any ideas.
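P.S. Until this is cleared up, I may just avoid the typed getters entirely and parse the value myself, along these lines (only a sketch, assuming row(2) can come back as either a String or a Float):

// defensive read of column 2: accept whatever runtime type the Row holds,
// and parse strings myself so broken lines or values like "$123.45" or
// "9,123.45" don't kill the whole job
val prices = result.map { row =>
  row(2) match {
    case f: Float  => Some(f)
    case s: String => scala.util.Try(s.replaceAll("[$,]", "").toFloat).toOption
    case _         => None
  }
}
prices.collect.foreach(println)

It works, but it feels like I am just papering over whatever the real behavior is, so I would still like to understand where the float/timestamp types come from.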