I am getting an exception when joining two tables with Amazon's Hive 0.8.1 on Amazon EMR, and I've run out of ideas on how to fix it.

The query is something along the lines of:

    Q1: SELECT count(*) FROM t1 x JOIN t2 y ON (x.id = y.x_id);

which ends up throwing an exception like this in some of the mappers:

    java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"T1C1":"t2c1\u0001t2c2\u0001t2c3\u0001t2c4\u0001t2c5\u0001t2c6\u0001t2c7\u0001t2c8\u0001t2c8\u0001t2c9\u0001t2c10\u0001t2c11\u0001t2c12\u0001null\u0001null\u0001null\u0001null\u0001t2c18","T1C2":null,"T1C3":null,"T1C4":null,"T1C5":null}
    ... ... ...
    Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
        at org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyIntObjectInspector.get(LazyIntObjectInspector.java:38)
    ... ... ...

where TxCy is the name of the yth column in the xth table, and txcy is the value of the yth column in the xth table for this row.

It looks like the deserializer for table 2 is getting an incorrectly formatted row from table 1 and is not splitting it on \u0001 as appears to be intended. Self joins and selecting from either table separately work fine, and all rows are deserialized correctly in those cases.

Here are the schemas for the tables:

    CREATE EXTERNAL TABLE t1 (
      c1 STRING,
      ...
      c5 STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3n://t1/';

    CREATE EXTERNAL TABLE t2 (
      c1 STRING,
      ...
      c18 STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 's3n://t2/';

Finally, changing Q1 above to use a map-side join:

    Q2: SELECT /*+ MAPJOIN(x) */ count(*) FROM t1 x JOIN t2 y ON (x.id = y.x_id);

prevents the exception from occurring at all.

Is this a known bug in Apache Hive 0.8.1 or Amazon's 0.8.1 version? If so, is there a fix or non-mapside workaround?

Thanks,
Anthony