If the number of maps is reduced, the size of each individual map output is likely to increase. A couple of possible issues come to mind immediately:

1. The number of spills in each map might go up, which incurs extra cost during the merge.
2. While the reduces would pull in more data per fetch (which is good), a reducer might no longer be able to hold a map output in memory and would have to shuffle it to disk instead.

JVM reuse should help, but if the individual task completion time is very high, there might not be any discernible performance gain.

Jothi

On 6/11/09 11:36 PM, "Tarandeep Singh" <[email protected]> wrote:

> Hi,
>
> I am trying to understand the effects of increasing block size or minimum
> split size. If I increase them, then a mapper will process more data,
> effectively reducing the number of mappers that will be spawned. As there is
> an overhead in starting mappers, so this seems good.
>
> However, If I increase their values too much, what negative effects will
> come up? Put in other words, how to compute what is the best number of
> mappers to start for processing a given size data on a cluster.
>
> For calculations, let us assume- 100G of data, 4 machines (dual core).
>
> Also if I set the reuse jvm flag to -1, will it make a difference?
>
> Thanks,
> Tarandeep
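
To put rough numbers on the trade-off discussed in this thread, here is a small back-of-the-envelope sketch in Python for the 100G, 4 dual-core machine example. It is only an illustration, not a sizing formula from Hadoop itself. The function names are hypothetical, and several inputs are assumptions that do not come from the thread: one map slot per core (8 slots in total), a map output roughly as large as the input split, and a 100 MB in-memory sort buffer per map.

```python
# Rough estimate of how split size affects the number of map tasks,
# the number of scheduling "waves", and the number of spills per map.
# All helper names and default values here are illustrative assumptions.

MB = 1024 ** 2
GB = 1024 ** 3


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def estimate_map_tasks(input_bytes, split_bytes):
    """One map task per input split."""
    return ceil_div(input_bytes, split_bytes)


def estimate_waves(num_maps, map_slots):
    """Number of rounds needed to run all maps on the available slots."""
    return ceil_div(num_maps, map_slots)


def estimate_spills(map_output_bytes, sort_buffer_bytes):
    """Very rough spill count: one spill each time the sort buffer fills."""
    return ceil_div(map_output_bytes, sort_buffer_bytes)


if __name__ == "__main__":
    input_size = 100 * GB        # from the thread: 100G of data
    map_slots = 4 * 2            # assumption: 4 dual-core machines, 1 slot per core
    sort_buffer = 100 * MB       # assumption: per-map in-memory sort buffer

    for split_mb in (64, 128, 256, 512, 1024):
        split = split_mb * MB
        maps = estimate_map_tasks(input_size, split)
        waves = estimate_waves(maps, map_slots)
        # assumption: map output is about the same size as its input split
        spills = estimate_spills(split, sort_buffer)
        print(f"split={split_mb:>4} MB  maps={maps:>4}  "
              f"waves={waves:>3}  spills/map~{spills}")
```

Under these assumptions, larger splits cut the number of maps and waves (less startup overhead), while the estimated spills per map grow, which is the extra merge cost mentioned above. The real numbers depend on how much output each map actually produces and on the cluster's buffer and slot settings, so treat the output as a starting point for experiments rather than an answer.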
