Hi Mingxi,

>So why, when map outputs are huge, is the reducer not able to copy them?

The Reducer will copy the Map outputs into its in-memory buffer. When the
Reducer JVM does not have enough memory to accommodate them, the copy fails
with an OutOfMemoryError.
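As a rough illustration (a sketch only, not the actual ReduceTask code; the
property mapred.job.shuffle.input.buffer.percent and its 0.70 default are from
the 0.20/1.x shuffle), the in-memory copy budget is tied to the reducer heap
roughly like this:

import org.apache.hadoop.mapred.JobConf;

public class ShuffleBudgetSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Fraction of the reducer heap the shuffle may use for in-memory map-output copies.
    float bufferPercent =
        conf.getFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
    // The reducer heap itself is fixed by mapred.child.java.opts (-Xmx...).
    long reducerHeap = Runtime.getRuntime().maxMemory();
    long inMemCopyBudget = (long) (reducerHeap * bufferPercent);
    System.out.println("In-memory shuffle budget (bytes): " + inMemCopyBudget);
    // Copies that exceed this budget spill to disk; if the copier tries to hold
    // too much in memory at once, shuffleInMemory() fails with
    // java.lang.OutOfMemoryError, as in the stack trace quoted below.
  }
}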

>Can you please explain the function of mapred.child.java.opts?
>How does it relate to the copy phase?

Map and Reduce tasks run in separate child JVMs that are launched by the
TaskTrackers. When a TaskTracker launches a Map or Reduce JVM, it passes the
value of mapred.child.java.opts as the JVM arguments for that new child JVM.
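As an illustration, the child command line is assembled from that property
roughly like this (a simplified sketch, not the actual TaskRunner code;
-Xmx200m is the shipped default for mapred.child.java.opts):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapred.JobConf;

public class ChildJvmArgsSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // JVM options for the child task JVM; Hadoop's shipped default is -Xmx200m.
    String javaOpts = conf.get("mapred.child.java.opts", "-Xmx200m");

    List<String> cmd = new ArrayList<String>();
    cmd.add(new File(System.getProperty("java.home"), "bin/java").getPath());
    for (String opt : javaOpts.split(" ")) {
      cmd.add(opt);                               // e.g. -Xmx200m caps the child heap
    }
    cmd.add("org.apache.hadoop.mapred.Child");    // the class that runs the task
    System.out.println(cmd);
  }
}

So whatever -Xmx you put there is also the ceiling for the reducer's in-memory
copy buffer.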

Regards,
Ravi Teja
________________________________________
From: Mingxi Wu [mingxi...@turn.com]
Sent: 01 December 2011 12:37:54
To: common-dev@hadoop.apache.org
Subject: RE: Hadoop - non disk based sorting?

Thanks, Ravi.

So why, when map outputs are huge, is the reducer not able to copy them?

Can you please explain the function of mapred.child.java.opts?
How does it relate to the copy phase?

Thank you,

Mingxi

-----Original Message-----
From: Ravi teja ch n v [mailto:raviteja.c...@huawei.com]
Sent: Tuesday, November 29, 2011 9:46 PM
To: common-dev@hadoop.apache.org
Subject: RE: Hadoop - non disk based sorting?

Hi Mingxi,

From your stack trace, I understand that the OutOfMemoryError actually
occurred while copying the Map outputs, not while sorting them.

Since your Map outputs are huge and your reducer does not have enough heap
memory to hold them, you hit this problem.
When you increased the number of reducers to 200, your Map outputs were
partitioned among 200 reducers, so each reducer fetched less data and you did
not hit the problem.
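To put rough numbers on it (assuming a fairly even partitioning of your 5
billion records): with 100 reducers each one fetches on the order of 50
million records, while with 200 reducers each fetches on the order of 25
million, so each reducer's in-memory copy buffer has to hold only about half
as much at a time.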

By raising the maximum heap of your reducer JVMs with mapred.child.java.opts,
you can get past this problem.
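For example (a minimal sketch using the old JobConf API; the 1 GB value is
only an illustration, size it to your tasks and cluster):

import org.apache.hadoop.mapred.JobConf;

public class RaiseChildHeap {
  public static void main(String[] args) {
    JobConf conf = new JobConf(RaiseChildHeap.class);
    // Give each child task JVM (Maps and Reduces alike) a 1 GB heap.
    conf.set("mapred.child.java.opts", "-Xmx1024m");
    // ... configure mapper, reducer, and input/output paths as usual,
    // then submit with JobClient.runJob(conf).
  }
}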

Regards,
Ravi Teja


________________________________________
From: Mingxi Wu [mingxi...@turn.com]
Sent: 30 November 2011 05:14:49
To: common-dev@hadoop.apache.org
Subject: Hadoop - non disk based sorting?

Hi,

I have a question regarding the shuffle phase of reducer.

It appears that when the map output is large (in my case, 5 billion records),
I get an out-of-memory error like the one below.

Error: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1592)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1452)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)

However, I thought the shuffle phase uses a disk-based sort, which is not
constrained by memory.
So why does one run into this OutOfMemoryError? After I increased the number
of reducers from 100 to 200, the problem went away.

Any input regarding this memory issue would be appreciated!

Thanks,

Mingxi
