Hi all, I have two tables:
tbl1: 81m rows tbl2: 4m rows tbl1 is partitioned on one column and tbl2 has none. I'm attempting the following query: SELECT tbl1.col_pk, tbl2.col1, tbl2.col2, SUM(tbl1.col4), SUM(tbl1.col5), SUM(tbl1.col4+col5) FROM tbl2 JOIN tbl1 ON (tbl1.col_pk=tbl2.col_pk) WHERE tbl1.partitioned_col IN ('2011','2012','2013') GROUP BY tbl1.col_pk, tbl2.col1, tbl2.col2; I get this error: OutOfMemoryError: GC overhead limit exceeded So, I followed the suggestion at the end of the error output (Currently hive.map.aggr.hash.percentmemory is set to 0.5. Try setting it to a lower value. i.e 'set hive.map.aggr.hash.percentmemory = 0.25;') through several iterations, eventually getting my hive.map.aggr.hash.percentmemory setting down to something like .0165 and it still failed. I did some searching and found some convoluted recommendations of what to try next. Some mentioned upping my heap size, some mentioned re-writing my query, etc. I upped my Hadoop maximum Java heap size to 4096mb ,re-ran, and got the same results. Currently, some relevant settings are: NameNode Heap Size: 4096mb DataNode maximum Java heap size: 4096mb Hadoop maximum Java heap size: 4096mb Java Options for MapReduce tasks: 768mb I have 16 map slots and 8 reduce slots available (5 node cluster, 4 data and one name) Thanks in advance for the help, Nick