In addition, put the smaller table on the left-hand side of a JOIN:

SELECT ... FROM small_table JOIN large_table ON ...
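
Hive streams the right-most (last) table in the join through the reducers and tries to buffer the earlier ones in memory, so keeping the big table last eases memory pressure. If the smaller table fits in memory, a map-side join hint may also help; a rough sketch below (table and column names are just placeholders):

-- MAPJOIN loads small_table into memory on each mapper, so the join is
-- done map-side and no reduce phase is needed for it.
SELECT /*+ MAPJOIN(small_table) */ large_table.col1, small_table.col2
FROM small_table JOIN large_table ON (small_table.key = large_table.key);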


From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Wednesday, March 16, 2011 11:43 AM
To: user@hive.apache.org
Subject: Re: Hadoop error 2 while joining two large tables

Hey hadoop n00b
    I second Mark's thought. But you can definitely try reframing your query to 
get things rolling. I'm not sure of your exact Hive query, but from my 
experience with joins on huge tables (record counts in the range of hundreds of 
millions), you should specify the join conditions in the JOIN ON clause rather 
than putting all the conditions in WHERE.

Say you have a query like this:
SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b WHERE
a.Column4 = b.Column1 AND a.Column2 = b.Column4 AND a.Column3 > b.Column2;

You can definitely reframe this query as:
SELECT a.Column1, a.Column2, b.Column1 FROM Table1 a JOIN Table2 b
ON (a.Column4 = b.Column1 AND a.Column2 = b.Column4) WHERE a.Column3 > b.Column2;

From my understanding, Hive supports only equijoins, so you can't put the 
inequality condition inside JOIN ON; the inequality has to go in WHERE. This 
approach worked for me when I ran into a similar situation some time ago. Try 
this out, hope it helps.

Regards
Bejoy.K.S


________________________________
From: "Sunderlin, Mark" <mark.sunder...@teamaol.com>
To: "user@hive.apache.org" <user@hive.apache.org>
Sent: Wed, March 16, 2011 11:22:09 PM
Subject: RE: Hadoop error 2 while joining two large tables


hadoop n00b asks, “Is adding more nodes the solution to such a problem?”

Whatever other answers you get, you should append “… and add more nodes.” 
Adding more nodes is never a bad thing ;-)

---
Mark E. Sunderlin
Solutions Architect |AOL Data Warehouse
P: 703-256-6935 | C: 540-327-6222
AIM: MESunderlin
22000 AOL Way | Dulles, VA | 20166


From: hadoop n00b [mailto:new2h...@gmail.com]
Sent: Wednesday, March 16, 2011 3:33 AM
To: user@hive.apache.org
Subject: Fwd: Hadoop error 2 while joining two large tables

Hello,

I am trying to execute a query that joins two large tables (3 million and 20 
million records). I am getting Hadoop error code 2 during execution. This 
happens mainly while the reducers are running; sometimes the reducers reach 
100% and then the error appears. The logs mention running out of heap space 
and the GC overhead limit being exceeded.

I am running a 6-node cluster with a child JVM memory of 1 GB.
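
For reference, the child heap is set along these lines (writing this from memory; the exact property name depends on the Hadoop version):

-- set in the Hive session (or the equivalent in mapred-site.xml);
-- 1 GB heap per map/reduce child JVM
SET mapred.child.java.opts=-Xmx1024m;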

Are there any parameters I could tweak to make this query run? Is adding more 
nodes the solution to such a problem?

Thanks!
