Hi,

We are working with a Hive DB on our Hadoop cluster. We are now facing an issue when joining a big partition with more than 2^31 rows.
When the partition has more than 2147483648 rows (even 2147483649), the output of the join is a single row. When the partition has fewer than 2147483648 rows (even 2147483647), the output is correct.

Our test case: create a table with 2147483649 rows in a partition with the value "1", then join this table to another table that has a single row and a single column with the value "1", joining on the partition key. Afterwards delete 2 rows from the partition and run the same join. (A sketch of the table setup is at the end of this mail.)

1st run: only a single row is created.
2nd run: 2147483647 rows.

The query we run to test the case is:

create table output_rows_over as
select a.s1
from max_sint_rows a
join small_table b on (a.p1 = b.p1);

With more than 2^31 rows we get the following in the reducer log:

2013-05-27 21:51:14,186 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:1

With fewer than 2^31 rows we get the following in the reducer log:

2013-05-27 23:43:14,681 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:2147483647

Has anyone faced this issue? Does Hive have a workaround for it? I have huge partitions I need to work on and I cannot use Hive for them as it stands.

Thanks,
Gabi Kazav
Infrastructure Team Leader, Pursway.com
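P.S. For completeness, here is a minimal sketch of the table setup behind the query above. The table and column names (max_sint_rows, small_table, s1, p1) come from the query; the column types, the staging table, and the load steps are simplified assumptions rather than our exact DDL:

-- big table: one data column, partitioned by the join key; STRING types are assumed
create table max_sint_rows (s1 string)
partitioned by (p1 string);

-- small table: a single column holding the join key value "1"
create table small_table (p1 string);

-- fill the big partition with 2147483649 generated rows
-- (staging_over_2g is a placeholder name for our generated data set)
insert overwrite table max_sint_rows partition (p1='1')
select s1 from staging_over_2g;

-- load exactly one row with the value "1" into small_table,
-- e.g. with LOAD DATA from a one-line file

The CTAS join shown above is then run once against the full partition and again after removing 2 rows from it.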