Hi all,

I have two tables:

tbl1: 81m rows
tbl2: 4m rows

tbl1 is partitioned on one column and tbl2 has none.

I'm attempting the following query:

SELECT
tbl1.col_pk,
tbl2.col1,
tbl2.col2,
SUM(tbl1.col4),
SUM(tbl1.col5),
SUM(tbl1.col4+col5)
FROM tbl2
JOIN tbl1 ON (tbl1.col_pk=tbl2.col_pk)
WHERE tbl1.partitioned_col IN ('2011','2012','2013')
GROUP BY
tbl1.col_pk,
tbl2.col1,
tbl2.col2;

I get this error:

OutOfMemoryError: GC overhead limit exceeded

So, I followed the suggestion at the end of the error output (Currently 
hive.map.aggr.hash.percentmemory is set to 0.5. Try setting it to a lower 
value. i.e 'set hive.map.aggr.hash.percentmemory = 0.25;') through several 
iterations, eventually getting my hive.map.aggr.hash.percentmemory setting down 
to something like .0165 and it still failed.

I did some searching and found some convoluted recommendations of what to try 
next. Some mentioned upping my heap size, some mentioned re-writing my query, 
etc. I upped my Hadoop maximum Java heap size to 4096mb ,re-ran, and got the 
same results.

Currently, some relevant settings are:

NameNode Heap Size: 4096mb
DataNode maximum Java heap size: 4096mb
Hadoop maximum Java heap size: 4096mb
Java Options for MapReduce tasks: 768mb

I have 16 map slots and 8 reduce slots available (5 node cluster, 4 data and 
one name)

Thanks in advance for the help,
Nick

Reply via email to