If the number of maps is reduced, each individual map output is likely to
be larger. A couple of possible issues come to mind immediately:
1. Each map may spill to disk more often, which adds extra cost when the
spills are merged.
2. While the reduces will pull in more data per fetch (which is good), a
map output may become too large for the reducer to hold in memory, so it
has to be shuffled to disk instead.
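
To put rough numbers on the two points above, here is a back-of-the-envelope sketch for the 100G example. Everything in it is an assumption for illustration: the function name `estimate` is made up, `io_sort_mb=100` and `spill_percent=0.80` are meant to stand in for io.sort.mb and io.sort.spill.percent at what I believe are their defaults, `output_ratio=1.0` assumes the map output is as large as its input, and `map_slots=8` assumes 2 map slots on each of the 4 machines. It also ignores the space the sort buffer reserves for record metadata, so real spill counts can be higher.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

def estimate(total_bytes, split_bytes, io_sort_mb=100, spill_percent=0.80,
             output_ratio=1.0, map_slots=8):
    """Rough per-job numbers for a given split size (illustrative only)."""
    # One map task per split.
    maps = math.ceil(total_bytes / split_bytes)
    # Number of rounds needed to run all maps on the available slots.
    waves = math.ceil(maps / map_slots)
    # Assumed map output size, and the buffer fill level that triggers a spill.
    map_output = split_bytes * output_ratio
    usable_buffer = io_sort_mb * MB * spill_percent
    spills = max(1, math.ceil(map_output / usable_buffer))
    return {"maps": maps, "waves": waves, "spills_per_map": spills}

for split_mb in (64, 512):
    print(split_mb, estimate(100 * GB, split_mb * MB))
```

Under these assumptions a 64 MB split gives 1600 maps with a single spill each, while a 512 MB split gives 200 maps with about 7 spills each, and each of those map outputs is also 8 times larger for the reducers to fetch and hold.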
 
JVM reuse should help, but if the individual task completion time is very
high, there might not be any discernible performance gain.
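
A quick way to see why is to look at how much of a task's time goes to starting its JVM. This is only a sketch: the helper `startup_overhead` is made up, and the 1 second JVM startup cost is a placeholder, not a measured value.

```python
def startup_overhead(startup_s, task_s):
    """Fraction of a task's wall-clock time spent starting its JVM."""
    return startup_s / (startup_s + task_s)

# Assumed 1 s JVM startup, for a short task and a long task.
for task_s in (9, 299):
    print(task_s, round(startup_overhead(1, task_s), 4))
```

With a 9 second task the startup is 10% of the total, so reuse is worth having; with a 299 second task it is about 0.3%, so there is little left for reuse to save.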

Jothi


On 6/11/09 11:36 PM, "Tarandeep Singh" <[email protected]> wrote:

> Hi,
> 
> I am trying to understand the effects of increasing block size or minimum
> split size. If I increase them, then a mapper will process more data,
> effectively reducing the number of mappers that will be spawned. As there is
> an overhead in starting mappers, this seems good.
> 
> However, if I increase their values too much, what negative effects will
> come up? In other words, how do I compute the best number of
> mappers to start for processing a given amount of data on a cluster?
> 
> For calculations, let us assume 100G of data and 4 machines (dual core).
> 
> Also if I set the reuse jvm flag to -1, will it make a difference?
> 
> Thanks,
> Tarandeep
