I am getting an exception when joining two tables with Amazon's Hive
0.8.1 on Amazon EMR, and I've run out of ideas on how to fix it.

The query is along the lines of:

Q1: SELECT count(*) FROM t1 x JOIN t2 y ON (x.id = y.x_id);

It ends up throwing an exception like this in some of the mappers:

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row
{"T1C1":"t2c1\u0001t2c2\u0001t2c3\u0001t2c4\u0001t2c5\u0001t2c6\u0001t2c7\u0001t2c8\u0001t2c8\u0001t2c9\u0001t2c10\u0001t2c11\u0001t2c12\u0001null\u0001null\u0001null\u0001null\u0001t2c18","T1C2":null,"T1C3":null,"T1C4":null,"T1C5":null}
...
...
...
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
        at org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyIntObjectInspector.get(LazyIntObjectInspector.java:38)
...
...

... where TxCy is the name of the yth column in the xth table, and
txcy is the value of the yth column in the xth table for this row.

It looks like the deserializer for table 2 is getting an incorrectly
formatted row from table 1 and is not splitting it on \u0001 as
appears to be intended.
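
To illustrate the symptom as I read it from the error row, here is a small Python sketch. It is not Hive code and does not reproduce the SerDe; split_row is a made-up helper, and the values are just the placeholders from the row above. It shows that a line whose fields are joined with \u0001 (Ctrl-A, Hive's default field delimiter) collapses into a single column followed by nulls when it is split on tab with a 5-column layout, but yields 18 fields when split on \u0001.

```python
# Illustrative model only: a plain string split, not Hive's deserializer.

CTRL_A = "\u0001"  # separator visible inside the T1C1 value in the error row


def split_row(line, delimiter, num_columns):
    """Split a line on `delimiter`, then truncate or pad with None to `num_columns`."""
    fields = line.split(delimiter)[:num_columns]
    return fields + [None] * (num_columns - len(fields))


# Placeholder values copied from the row in the error message.
row_values = [
    "t2c1", "t2c2", "t2c3", "t2c4", "t2c5", "t2c6", "t2c7", "t2c8", "t2c8",
    "t2c9", "t2c10", "t2c11", "t2c12", "null", "null", "null", "null", "t2c18",
]
ctrl_a_line = CTRL_A.join(row_values)

# Parsed with a tab delimiter and 5 columns: everything lands in column 1.
as_five_tab_columns = split_row(ctrl_a_line, "\t", 5)

# Parsed with the \u0001 delimiter and 18 columns: the fields separate cleanly.
as_eighteen_ctrl_a_columns = split_row(ctrl_a_line, CTRL_A, 18)

print(as_five_tab_columns)
print(as_eighteen_ctrl_a_columns)
```

The first result has the same shape as the row in the exception: one long string in the first column and nulls in the remaining four.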

Self joins and selecting from either table separately work fine, and
all rows are deserialized correctly in those cases.

Here are the schemas for the tables:

CREATE EXTERNAL TABLE t1 (
c1 STRING,
...
c5 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://t1/';

CREATE EXTERNAL TABLE t2 (
c1 STRING,
...
c18 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://t2/';

Finally, changing Q1 above to use a map-side join:

Q2: SELECT /*+ MAPJOIN(x) */ count(*) FROM t1 x JOIN t2 y ON (x.id = y.x_id);

prevents the exception from occurring at all.

Is this a known bug in Apache Hive 0.8.1 or in Amazon's build of 0.8.1?
If so, is there a fix, or a workaround that does not rely on a map-side join?

Thanks,
Anthony
