Hi Carp,
 Your assumption is right that this is a per-map-task setting.
However, this buffer stores the map task's output key-value pairs, not its
input, so the optimal value depends on how much data your map tasks generate.
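
For instance (illustrative numbers only): a map task reading a 64 MB split
but emitting records at three times the input size produces roughly 192 MB
of output, which overflows the default 100 MB buffer and forces multiple
spills to disk.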

If your output per map task is greater than io.sort.mb, here are some rules
of thumb that could work for you:

1) Increase the max heap of your map tasks to use RAM better, but without
hitting swap.
2) Set io.sort.mb to ~70% of that heap.
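
For example, in mapred-site.xml (a sketch only; the numbers assume roughly
512 MB of RAM is available per map slot, so adjust for your nodes):

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>350</value>  <!-- ~70% of the 512 MB child heap -->
  </property>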

Overall, incurring extra "spills" (because io.sort.mb is too small) carries a
much smaller performance penalty than risking swapping (by setting io.sort.mb
and the heap too large), so err on the smaller side.
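
To check whether your maps are in fact spilling more than once, you can
compare two framework counters, for example via the job CLI (the counter
group name below is from the 0.20 API and may differ in your version):

  hadoop job -counter <job-id> 'org.apache.hadoop.mapred.Task$Counter' MAP_OUTPUT_RECORDS
  hadoop job -counter <job-id> 'org.apache.hadoop.mapred.Task$Counter' SPILLED_RECORDS

If SPILLED_RECORDS is significantly larger than MAP_OUTPUT_RECORDS, the map
side is spilling and re-merging multiple times, and a larger io.sort.mb may
help.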

Cheers,
Sriguru

>-----Original Message-----
>From: 李钰 [mailto:car...@gmail.com]
>Sent: Wednesday, June 23, 2010 12:27 PM
>To: common-dev@hadoop.apache.org
>Subject: Questions about recommendation value of the "io.sort.mb"
>parameter
>
>Dear all,
>
>Here I've got a question about the "io.sort.mb" parameter. We can find
>material from Yahoo! or Cloudera which recommends setting this value to
>200 if the job scale is large, but I'm confused about this. As I
>understand it, the tasktracker launches a child JVM for each task, and
>"io.sort.mb" is the size of the in-memory buffer inside one map task's
>child JVM. The default value of 100MB should be large enough, because
>the input split of one map task is usually 64MB, the same as the block
>size we usually set. Then why is the recommended "io.sort.mb" 200MB for
>large jobs (and it really works)? How could the job size affect the
>procedure? Is there any fault in my understanding? Any
>comment/suggestion will be highly valued; thanks in advance.
>
>Best Regards,
>Carp
