Hi,

I took it one step further and dropped the partition:

hive> desc max_int;
OK
s1      int
Time taken: 0.238 seconds
hive> desc small_table;
OK
s1      int
Time taken: 0.187 seconds

max_int has 2^31+1 records, all of them the number 1:

hive> select * from max_int limit 5;
OK
1
1
1
1
1
Time taken: 0.813 seconds

small_table has 1 record, the number 1:

hive> select * from small_table;
OK
1
Time taken: 0.122 seconds

I am running:

create table output_rows as
select a.s1
from max_int a join small_table b
on (a.s1=b.s1);

and I get only 1 row at the end:

hive> select count (*) from output_rows;
OK
1
Time taken: 14.892 seconds

If I fill the table max_int with 2^31-1 records of the int 1, I get the correct result (output_rows2 is the table I created with the same join query on the 2^31-1 record table):

hive> select count (*) from output_rows2;
OK
2147483647
Time taken: 472.449 seconds

-----Original Message-----
From: John Meagher [mailto:john.meag...@gmail.com]
Sent: Thursday, May 30, 2013 12:06 AM
To: user@hive.apache.org
Subject: Re: Hive - max rows limit (int limit = 2^31). need Help (looks liek a bug)

What is the data type of the p1 column? I've used Hive with partitions containing far more than 2 billion rows without having any problems like this.

On Wed, May 29, 2013 at 2:41 PM, Gabi Kazav <gabi.ka...@pursway.com> wrote:
> Hi,
>
> We are working on a Hive DB with our Hadoop cluster.
>
> We are now facing an issue when joining a big partition with more than 2^31 rows.
>
> When the partition has more than 2147483648 rows (even 2147483649), the output of the join is a single row.
> When the partition has fewer than 2147483648 rows (even 2147483647), the output is correct.
>
> Our test case:
>
> Create a table with 2147483649 rows in a partition with the value "1", and join this table to another table with a single row and a single column with the value "1" on the partition_key.
> Later, delete 2 rows and run the same join.
> 1st: only a single row is created
> 2nd: 2147483647 rows
>
> The query we run to test the case is:
>
> create table output_rows_over as
> select a.s1
> from max_sint_rows a join small_table b
> on (a.p1=b.p1);
>
> With more than 2^31 rows we got the following in the reducer log:
>
> 2013-05-27 21:51:14,186 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:1
>
> With fewer than 2^31 rows we got the following in the reducer log:
>
> 2013-05-27 23:43:14,681 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:2147483647
>
> Has anyone faced this issue?
> Does Hive have a workaround for it?
> I have huge partitions I need to work on and I cannot use Hive for that.
>
> Thanks,
>
> Gabi Kazav
> Infrastructure Team Leader, Pursway.com
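
For reference, here is a minimal sketch of why 2^31 looks like a suspicious threshold. It only illustrates how a signed 32-bit integer (such as a Java int) wraps around once a value passes 2147483647. The helper name to_int32 is hypothetical and is not Hive code, and whether the join path really keeps a count in a 32-bit int is an assumption on my part. A plain wraparound would also give a negative number rather than the single row we see, so this shows the threshold, not the exact mechanism.

```python
# Illustration only: wraps a Python int the way a signed 32-bit integer
# (e.g. a Java `int`) would. `to_int32` is a hypothetical helper, not Hive code.

INT32_MAX = 2**31 - 1   # 2147483647, the largest signed 32-bit value


def to_int32(n: int) -> int:
    """Return n reduced to the signed 32-bit two's-complement range."""
    n &= 0xFFFFFFFF                      # keep only the low 32 bits
    return n - 2**32 if n > INT32_MAX else n


if __name__ == "__main__":
    # The three row counts around the threshold discussed in this thread.
    for rows in (2**31 - 1, 2**31, 2**31 + 1):
        print(rows, "->", to_int32(rows))
```

The 2^31-1 case is the last row count that still fits in a signed 32-bit value, which matches the largest table size that joined correctly in the tests above.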