Hi,

We are working with Hive on our Hadoop cluster.
We are now facing an issue when joining a large partition with more than 2^31 rows.

When the partition has more than 2147483648 rows (even at 2147483649) the output
of the join is a single row.
When the partition has fewer than 2147483648 rows (even at 2147483647) the output
is correct.
Our test case:

Create a table with 2147483649 rows in a partition with the value "1", and join
this table to another table that has a single row and a single column with the value "1",
joining on the partition key.
Then delete 2 rows and run the same join again.
1st run: only a single row is produced.
2nd run: 2147483647 rows, which is correct.
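
For reference, this is roughly the setup on our side (the column types and the load step are only a sketch; just the table and column names match the real query below):

create table max_sint_rows (s1 string) partitioned by (p1 string);
create table small_table (p1 string);
-- small_table holds a single row whose p1 value is '1'
-- the max_sint_rows partition p1='1' is loaded with 2147483649 rows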
The query we run to test the case is:

create table output_rows_over as
select a.s1
from  max_sint_rows a join small_table b
on (a.p1=b.p1);
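
As a sanity check, the symptom can also be confirmed with a plain row count on the output table (nothing assumed here beyond the table created above):

select count(*) from output_rows_over;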

With more than 2^31 rows we got the following in the reducer log:

2013-05-27 21:51:14,186 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:1
With fewer than 2^31 rows we got the following in the reducer log:

2013-05-27 23:43:14,681 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:2147483647


Has anyone faced this issue?
Does Hive have a workaround for it?
I have huge partitions I need to work with, and right now I cannot use Hive for them.

Thanks,


Gabi Kazav
Infrastructure Team Leader, Pursway.com

