Dear all,

I have a question about the "io.sort.mb" parameter. There is material from Yahoo! and Cloudera that recommends setting this value to 200 when the job is large, but I'm confused by this.

As I understand it, the tasktracker launches a child JVM for each task, and "io.sort.mb" sets the size of the in-memory sort buffer inside one map task's child JVM. The default value of 100 MB should therefore be large enough, because the input split of one map task is usually 64 MB, the same as the block size we usually configure. So why is 200 MB recommended for large jobs (and why does it actually help)? How can the overall job size affect what happens inside a single map task? Is there a flaw in my understanding?

Any comments or suggestions would be highly appreciated; thanks in advance.
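For reference, this is roughly how I am setting the parameter in my job driver. It is only a minimal sketch using the old-style JobConf API; the class name and the spill-percent line are just illustrative, not part of the recommendation I'm asking about:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapred.JobConf;

    public class SortBufferDemo {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            JobConf job = new JobConf(conf, SortBufferDemo.class);

            // In-memory sort buffer for each map task's child JVM, in MB.
            // Default is 100; the Yahoo!/Cloudera material suggests 200 for large jobs.
            job.setInt("io.sort.mb", 200);

            // Leaving the spill threshold at its default of 0.80, i.e. the
            // buffer starts spilling to disk when it is 80% full.
            job.setFloat("io.sort.spill.percent", 0.80f);

            // ... set mapper/reducer, input/output paths, then submit the job ...
        }
    }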
Best Regards,
Carp