[ https://issues.apache.org/jira/browse/HADOOP-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Allen Wittenauer resolved HADOOP-2560. -------------------------------------- Resolution: Duplicate This appears to predate MFIF/CFIF, as introduced by HADOOP-4565 which appears to fix the issue. I'm going to close this out as resolved as a result. > Processing multiple input splits per mapper task > ------------------------------------------------ > > Key: HADOOP-2560 > URL: https://issues.apache.org/jira/browse/HADOOP-2560 > Project: Hadoop Common > Issue Type: Bug > Reporter: Runping Qi > Assignee: dhruba borthakur > Attachments: multipleSplitsPerMapper.patch > > > Currently, an input split contains a consecutive chunk of input file, which > by default, corresponding to a DFS block. > This may lead to a large number of mapper tasks if the input data is large. > This leads to the following problems: > 1. Shuffling cost: since the framework has to move M * R map output segments > to the nodes running reducers, > larger M means larger shuffling cost. > 2. High JVM initialization overhead > 3. Disk fragmentation: larger number of map output files means lower read > throughput for accessing them. > Ideally, you want to keep the number of mappers to no more than 16 times the > number of nodes in the cluster. > To achive that, we can increase the input split size. However, if a split > span over more than one dfs block, > you lose the data locality scheduling benefits. > One way to address this problem is to combine multiple input blocks with the > same rack into one split. > If in average we combine B blocks into one split, then we will reduce the > number of mappers by a factor of B. > Since all the blocks for one mapper share a rack, thus we can benefit from > rack-aware scheduling. > Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)