We already filter out nulls before the tables are filled, but as Paul was saying, this will probably still leave a skew in the keys. I'm running some queries on the keys to see if that's the case; I do expect large differences in the distribution of some of them. I'm also looking at running "set hive.optimize.skewjoin=true;" before the query to see if that helps. Will try that later.
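
Roughly the kind of query I'm running to check the key distribution; the table and column names (fact_a, join_key) are just placeholders for ours:

select join_key, count(*) as cnt
from fact_a
group by join_key
order by cnt desc
limit 50;

And the skew join settings I intend to try before the query (the 100000 threshold is just the Hive default, not something I've tuned):

set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;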

On 02/23/2011 05:25 AM, Mapred Learn wrote:
Oops I meant nulls.

Sent from my iPhone

On Feb 22, 2011, at 8:22 PM, Mapred Learn <mapred.le...@gmail.com> wrote:

Check if you can filter non-nulls. That might help.
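
Something along these lines, with placeholder table and key names:

select a.*, b.*
from (select * from table_a where join_key is not null) a
join (select * from table_b where join_key is not null) b
on (a.join_key = b.join_key);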

Sent from my iPhone

On Feb 22, 2011, at 12:46 AM, Bennie Schut <bsc...@ebuddy.com> wrote:

I've just set "hive.exec.reducers.bytes.per.reducer" as low as 100k, which
caused this job to run with 999 reducers. I still have 5 tasks failing with an
OutOfMemory.

We have jvm reuse set to 8, but dropping it to 1 seems to greatly reduce this
problem:
set mapred.job.reuse.jvm.num.tasks = 1;
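
Put together, the session settings I've been experimenting with before the
join look like this (the 100k value is deliberately extreme):

set hive.exec.reducers.bytes.per.reducer=100000;
set mapred.job.reuse.jvm.num.tasks=1;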

It still puzzles me how it can run out of memory. It seems like some of the
reducers get a disproportionately large share of the work.


On 02/18/2011 10:53 AM, Bennie Schut wrote:
When we try to join two large tables, some of the reducers stop with an
OutOfMemory exception.

Error: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)


When looking at garbage collection for these reduce tasks, they are
continually doing full garbage collections, like this:
2011-02-17T14:36:08.295+0100: 1250.547: [Full GC [PSYoungGen:
111055K->53659K(233024K)] [ParOldGen: 698410K->698410K(699072K)]
809466K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1496600
secs] [Times: user=1.08 sys=0.00, real=0.15 secs]
2011-02-17T14:36:08.600+0100: 1250.851: [Full GC [PSYoungGen:
111057K->53660K(233024K)] [ParOldGen: 698410K->698410K(699072K)]
809468K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1360010
secs] [Times: user=1.00 sys=0.01, real=0.13 secs]
2011-02-17T14:36:08.915+0100: 1251.167: [Full GC [PSYoungGen:
111058K->53659K(233024K)] [ParOldGen: 698410K->698410K(699072K)]
809468K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1325960
secs] [Times: user=0.94 sys=0.00, real=0.14 secs]
2011-02-17T14:36:09.205+0100: 1251.457: [Full GC [PSYoungGen:
111055K->53659K(233024K)] [ParOldGen: 698410K->698410K(699072K)]
809466K->752070K(932096K) [PSPermGen: 14450K->14450K(21248K)], 0.1301610
secs] [Times: user=0.99 sys=0.00, real=0.13 secs]


We have "mapred.child.java.opts" set to "-Xmx1024M -XX:+UseCompressedOops
-XX:+UseParallelOldGC -XX:+UseNUMA -Djava.net.preferIPv4Stack=true
-verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails
-Xloggc:/opt/hadoop/logs/task_@tas...@.gc.log".

I've been reducing the "hive.exec.reducers.bytes.per.reducer" parameter to as
low as 200M, but I still get the OutOfMemory errors. I would have expected
this to reduce the amount of data sent to each reducer and thus prevent the
OutOfMemory errors.
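
Since the stack trace points at shuffleInMemory, the other knobs I can think
of are the reducer heap and the in-memory shuffle buffer fraction; the values
below are only guesses I haven't verified yet:

set mapred.child.java.opts=-Xmx2048M -XX:+UseParallelOldGC;
set mapred.job.shuffle.input.buffer.percent=0.50;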

Any ideas on why this happens?

I'm using a trunk build from around 2011-02-03
