In addition, put the smaller table on the left-hand side of a JOIN: SELECT ... FROM small_table JOIN large_table ON ...
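A concrete sketch of that ordering, using hypothetical table and column names (roughly the 3M-row / 20M-row shapes discussed below), might look like:

    -- hypothetical names; the smaller (~3M-row) table goes on the left,
    -- the larger (~20M-row) table on the right so Hive streams it
    -- instead of buffering it in the reducers
    SELECT s.id, s.name, l.amount
    FROM small_table s
    JOIN large_table l
      ON (s.id = l.small_id);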
From: Bejoy Ks [mailto:bejoy...@yahoo.com]
Sent: Wednesday, March 16, 2011 11:43 AM
To: user@hive.apache.org
Subject: Re: Hadoop error 2 while joining two large tables

Hey hadoop n00b

I second Mark's thought, but you can definitely try reframing your query to get things rolling. I'm not sure about your Hive query, but from my experience with joins on huge tables (record counts in the range of hundreds of millions), you should put the join conditions in the JOIN ON clause rather than specifying all conditions in WHERE.

Say you have a query like this:

    SELECT a.Column1, a.Column2, b.Column1
    FROM Table1 a JOIN Table2 b
    WHERE a.Column4 = b.Column1
      AND a.Column2 = b.Column4
      AND a.Column3 > b.Column2;

You can reframe it as:

    SELECT a.Column1, a.Column2, b.Column1
    FROM Table1 a JOIN Table2 b
      ON (a.Column4 = b.Column1 AND a.Column2 = b.Column4)
    WHERE a.Column3 > b.Column2;

From my understanding, Hive supports only equijoins, so the inequality condition can't go inside JOIN ON; it has to stay in WHERE. This approach worked for me when I ran into a similar situation some time ago.

Try this out, hope it helps.

Regards
Bejoy.K.S

________________________________
From: "Sunderlin, Mark" <mark.sunder...@teamaol.com>
To: "user@hive.apache.org" <user@hive.apache.org>
Sent: Wed, March 16, 2011 11:22:09 PM
Subject: RE: Hadoop error 2 while joining two large tables

hadoop n00b asks, "Is adding more nodes the solution to such problem?"

Whatever other answers you get, you should append "... and add more nodes." More nodes is never a bad thing ;-)

---
Mark E. Sunderlin
Solutions Architect | AOL Data Warehouse
P: 703-256-6935 | C: 540-327-6222
AIM: MESunderlin
22000 AOL Way | Dulles, VA | 20166

From: hadoop n00b [mailto:new2h...@gmail.com]
Sent: Wednesday, March 16, 2011 3:33 AM
To: user@hive.apache.org
Subject: Fwd: Hadoop error 2 while joining two large tables

Hello,

I am trying to execute a query that joins two large tables (3 million and 20 million records) and I am getting Hadoop error code 2 during execution. This happens mainly while the reducers are running; sometimes the reducers reach 100% and then the error appears. The logs mention running out of heap space and "GC overhead limit exceeded". I am running a 6-node cluster with 1 GB of child JVM memory. Are there any parameters I could tweak to make the job run? Is adding more nodes the solution to such a problem?

Thanks!
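For reference, the 1 GB child JVM heap mentioned above is set through job configuration, and a few related settings are what usually get adjusted first for heap-space failures. A minimal, illustrative sketch, assuming a Hadoop 0.20-era cluster and the Hive CLI's SET syntax (the values are placeholders, not recommendations from this thread):

    -- give each map/reduce child JVM more heap (illustrative value)
    SET mapred.child.java.opts=-Xmx2048m;
    -- spread the reduce-side join across more reducers (illustrative value)
    SET mapred.reduce.tasks=32;
    -- or let Hive derive the reducer count from the input volume
    SET hive.exec.reducers.bytes.per.reducer=256000000;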