[ https://issues.apache.org/jira/browse/HADOOP-4010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Douglas reopened HADOOP-4010:
-----------------------------------

> Changing LineRecordReader algo so that it does not need to skip backwards in
> the stream
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4010
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4010
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 0.19.0
>            Reporter: Abdul Qadeer
>            Assignee: Abdul Qadeer
>             Fix For: 0.21.0
>
>         Attachments: 4010-mapreduce.patch, Hadoop-4010.patch,
> Hadoop-4010_version2.patch, Hadoop-4010_version3.patch
>
>
> The current LineRecordReader algorithm has to move backwards in the stream
> (in its constructor) to position itself correctly. It moves back one byte
> from the start of its split, tries to read a record (i.e. a line), and throws
> that record away, because that line is certain to be handled by some other
> mapper. This algorithm is difficult and inefficient to use on a compressed
> stream, where data reaches the LineRecordReader through a codec. (In the
> current implementation, Hadoop does not split a compressed file; it makes a
> single split from the start to the end of the file, so only one mapper
> handles it. We are currently working on BZip2 codecs that make splitting
> possible in Hadoop, so this proposed change will allow plain and compressed
> streams to be handled uniformly.)
>
> In the new algorithm, each mapper always skips its first line, because that
> line is certain to have been read by some other mapper. Each mapper must
> therefore finish reading at a record boundary that is always beyond its upper
> split limit. With this change, LineRecordReader no longer needs to move
> backwards in the stream.
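
For illustration only, the new algorithm can be sketched in a few lines of Python. This is not the code from the attached patches: the function name `read_split`, the in-memory `io.BytesIO` stream, and the `\n` delimiter are all assumptions made for the example. The sketch also assumes that the split starting at offset 0 keeps its first line, since no earlier split exists to read it, and that a reader keeps reading as long as the next line begins at or before its upper limit.

```python
import io
from typing import List


def read_split(data: bytes, start: int, end: int) -> List[bytes]:
    """Return the lines owned by the split [start, end) of `data`.

    Hypothetical helper: after the initial seek to `start`, the stream is
    only ever read forwards.
    """
    stream = io.BytesIO(data)
    stream.seek(start)

    # Every split except the first discards its first (possibly partial)
    # line; the reader of the previous split is responsible for it.
    if start != 0:
        stream.readline()

    records = []
    # Keep reading while the next line begins at or before the upper limit,
    # so the last line read ends beyond `end`.
    while stream.tell() <= end:
        line = stream.readline()
        if not line:  # end of stream
            break
        records.append(line.rstrip(b"\n"))
    return records


data = b"alpha\nbeta\ngamma\ndelta\n"
splits = [(0, 8), (8, 16), (16, len(data))]
per_split = [read_split(data, s, e) for s, e in splits]
```

In this example the second split starts in the middle of `beta`, so its reader discards the remainder of that line, while the first reader runs past offset 8 to finish `beta`. Every line is read exactly once across the three splits, and no reader ever seeks backwards.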