Hi,
I took it one step further and dropped the partition:

hive> desc max_int;
OK
s1      int
Time taken: 0.238 seconds

hive> desc small_table;
OK
s1      int
Time taken: 0.187 seconds

max_int has 2^31+1 records, all of them the number 1:
hive> select * from max_int limit 5;
OK
1
1
1
1
1
Time taken: 0.813 seconds

small_table has 1 record, the number 1:
hive> select * from small_table;
OK
1
Time taken: 0.122 seconds

I am running:
create table output_rows as select a.s1 from max_int a join small_table b on 
(a.s1=b.s1);

and I get only 1 row in the output:
hive> select count (*) from output_rows;                                        
OK
1
Time taken: 14.892 seconds

If I fill max_int with 2^31-1 records of the int 1 instead, I get the
correct result (output_rows2 is the table I created with the same join
query on the 2^31-1 record table):
hive> select count (*) from output_rows2;
OK
2147483647
Time taken: 472.449 seconds
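
For reference, the two row counts in this test sit on either side of the largest value a signed 32-bit integer can hold (2^31-1 = 2147483647). The sketch below, written in Java (the language Hive is implemented in), only illustrates that arithmetic. The class name IntBoundary and its helper methods are hypothetical, and nothing in this thread confirms that a 32-bit counter inside Hive is the actual cause of the single-row output.

```java
// Illustration only: how the row counts from this thread relate to the
// signed 32-bit int range. Not taken from Hive's source code.
public class IntBoundary {

    // True if the row count can be stored in a Java int without loss.
    static boolean fitsInInt(long rows) {
        return rows >= Integer.MIN_VALUE && rows <= Integer.MAX_VALUE;
    }

    // What the value becomes if it is narrowed to an int anyway
    // (the high 32 bits are discarded).
    static int narrow(long rows) {
        return (int) rows;
    }

    public static void main(String[] args) {
        long twoPow31 = 1L << 31; // 2147483648
        long[] counts = { twoPow31 - 1, twoPow31, twoPow31 + 1 };
        for (long rows : counts) {
            System.out.println(rows
                    + " fitsInInt=" + fitsInInt(rows)
                    + " narrowed=" + narrow(rows));
        }
    }
}
```

Only the 2^31-1 case fits in an int, which matches the one case that returned the correct count above. The narrowed values for the larger counts are negative rather than 1, so a plain counter overflow would not by itself explain the single output row.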




-----Original Message-----
From: John Meagher [mailto:john.meag...@gmail.com] 
Sent: Thursday, May 30, 2013 12:06 AM
To: user@hive.apache.org
Subject: Re: Hive - max rows limit (int limit = 2^31). need Help (looks liek a 
bug)

What is the data type of the p1 column?  I've used hive with partitions 
containing far above 2 billion rows without having any problems like this.

On Wed, May 29, 2013 at 2:41 PM, Gabi Kazav <gabi.ka...@pursway.com> wrote:
> Hi,
>
>
>
> We are working on hive DB with our Hadoop cluster.
>
> We are now facing an issue when joining a big partition with more than 
> 2^31 rows.
>
> When the partition has more than 2147483648 rows (even 2147483649) the 
> output of the join is a single row.
> When the partition has fewer than 2147483648 rows (even 2147483647) 
> the output is correct.
>
> Our test case:
>
> create a table with 2147483649 rows in a partition with the value : 
> "1" , join this table to another table with a single row,single column 
> with the value "1" on the partition_key.
> later delete 2 rows and run the same join.
> 1st : only a single row is created
> 2nd : 2147483647 rows
>
> the query we run to test the case is:
>
>
>
> create table output_rows_over as
> select a.s1
> from max_sint_rows a join small_table b
> on (a.p1=b.p1);
>
>
>
> on more than 2^31 rows we got the following on reducer log:
>
> 2013-05-27 21:51:14,186 INFO
> org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:1
>
> On less than 2^31 rows we got the following reducer log:
>
> 2013-05-27 23:43:14,681 INFO
> org.apache.hadoop.hive.ql.exec.FileSinkOperator:
> TABLE_ID_1_ROWCOUNT:2147483647
>
>
>
>
>
> Anyone faced this issue?
>
> Does Hive have a workaround for that?
>
> I have huge partitions I need to work on and I cannot use hive for that..
>
>
>
> Thanks,
>
>
>
>
>
> Gabi Kazav
>
> Infrastructure Team Leader, Pursway.com
>
>
>
>

 
 


