Thanks Jothi...

-Tarandeep

On Fri, Jun 12, 2009 at 4:35 AM, Jothi Padmanabhan <[email protected]> wrote:

> If the number of maps is reduced, it is possible that the size of
> individual map outputs might increase. A couple of possible issues come to
> mind immediately:
> 1. Number of spills in the map might be more. This might incur extra cost
> during merging.
> 2. Also, while the reduces might pull in more data per fetch (which is
> good), it might also result in a state where the reducer is not able to
> store the map output in memory but needs to shuffle it to disk.
>
> JVM reuse should help, but if the individual task completion time is very
> high, there might not be any discernible performance gain.
>
> Jothi
>
>
> On 6/11/09 11:36 PM, "Tarandeep Singh" <[email protected]> wrote:
>
> > Hi,
> >
> > I am trying to understand the effects of increasing block size or minimum
> > split size. If I increase them, then a mapper will process more data,
> > effectively reducing the number of mappers that will be spawned. As there
> > is an overhead in starting mappers, this seems good.
> >
> > However, if I increase their values too much, what negative effects will
> > come up? Put in other words, how do I compute the best number of
> > mappers to start for processing a given amount of data on a cluster?
> >
> > For calculations, let us assume 100G of data, 4 machines (dual core).
> >
> > Also, if I set the reuse jvm flag to -1, will it make a difference?
> >
> > Thanks,
> > Tarandeep
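
For the 100G / 4 machine example in the quoted question, a rough back-of-the-envelope sketch of how split size changes the number of map tasks might look like the following. It is only an estimate under stated assumptions: the input is splittable, each split becomes one map task, and each dual-core machine runs 2 concurrent map tasks (that slot count is an assumption, not something stated in the thread). The function name is hypothetical, and the sketch does not model the spill or shuffle costs Jothi describes.

```python
def estimate_map_waves(input_bytes, split_bytes, nodes, map_slots_per_node):
    """Return (number of map tasks, number of scheduling waves).

    Assumes one map task per split and that all map slots are free.
    """
    # Ceiling division with integers: a partial last split still needs a task.
    num_maps = -(-input_bytes // split_bytes)
    total_slots = nodes * map_slots_per_node
    # A "wave" is one round in which every slot runs one map task.
    waves = -(-num_maps // total_slots)
    return num_maps, waves


MB = 1024 ** 2
GB = 1024 ** 3

# 100G of data on 4 machines; 2 map slots per machine is an assumption.
for split_mb in (64, 128, 256, 512):
    maps, waves = estimate_map_waves(100 * GB, split_mb * MB,
                                     nodes=4, map_slots_per_node=2)
    print(f"split={split_mb} MB -> {maps} maps, {waves} waves")
```

With 64 MB splits this gives 1600 maps in 200 waves; at 512 MB it drops to 200 maps in 25 waves. Fewer, larger maps save task start-up overhead, but each map then produces more output, which is where the extra spills and on-disk shuffle mentioned above can start to cost more than the start-up time saved.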
