I don't understand the internal logic of FileSplit and the Mapper. As I understand it, FileInputFormat is the class that actually takes care of splitting the file, so it is reasonable that one large file is split into 5 smaller parts, each part less than 2 GB (since we specified numberOfSplit as 5).
However, the FileSplit boundaries are "rough" edges, so does mapper 1, which takes split 1 as input, omit the incomplete record at the end of split 1, while mapper 2 continues reading that incomplete part and joins it with the remainder in split 2? Take this as an example. The original file is:

1::122::5::838985046 (CRLF)
1::185::5::838983525 (CRLF)
1::231::5::838983392 (CRLF)

Assume the number of splits is 2; then the above content is divided into two parts:

Split 1: 1::122::5::838985046 (CRLF) 1::185::5::8
Split 2: 38983525 (CRLF) 1::231::5::838983392 (CRLF)

Afterwards, mapper 1 takes split 1 as input, but after consuming the line 1::122::5::838985046 it finds that the remainder is not a complete record, so mapper 1 bypasses it, while mapper 2 reads that incomplete part and prepends it to the first line of split 2 to compose a valid record. Is that correct? If it is, which class implements this logic?

BR/anderson

-----Original Message-----
From: Aaron Kimball [mailto:[email protected]]
Sent: Thursday, June 11, 2009 11:49 AM
To: [email protected]
Subject: Re: Large size Text file split

The FileSplit boundaries are "rough" edges -- the mapper responsible for the previous split will continue until it finds a full record, and the next mapper will read ahead and only start on the first record boundary after the byte offset.

- Aaron

On Wed, Jun 10, 2009 at 7:53 PM, Wenrui Guo <[email protected]> wrote:
> I think the default TextInputFormat can meet my requirement. However,
> even though the JavaDoc of TextInputFormat says it divides the input
> file into text lines terminated by CRLF, I'd like to know what will
> happen if the FileSplit size is not a multiple of the line length.
>
> BR/anderson
>
> -----Original Message-----
> From: jason hadoop [mailto:[email protected]]
> Sent: Wednesday, June 10, 2009 8:39 PM
> To: [email protected]
> Subject: Re: Large size Text file split
>
> There is always NLineInputFormat. You specify the number of lines per
> split.
> The key is the position of the line start in the file; the value is
> the line itself.
> The parameter mapred.line.input.format.linespermap controls the
> number of lines per split.
>
> On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi <
> [email protected]> wrote:
>
> > On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo
> > <[email protected]> wrote:
> >
> > > Hi, all
> > >
> > > I have a large CSV file (larger than 10 GB), and I'd like to use
> > > a certain InputFormat to split it into smaller parts so that each
> > > mapper can deal with a piece of the CSV file. However, as far as
> > > I know, FileInputFormat only cares about the byte size of the
> > > file; that is, it can divide the CSV file into many parts, and
> > > some part may not be a well-formed CSV file.
> > > For example, a line of the CSV file might not be terminated with
> > > CRLF, or some text might be trimmed.
> > >
> > > How can I ensure each FileSplit is a smaller valid CSV file using
> > > a proper InputFormat?
> > >
> > > BR/anderson
> >
> > If all you care about is the splits occurring at line boundaries,
> > then TextInputFormat will work.
> >
> > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html
> >
> > If not, I guess you can write your own InputFormat class.
> >
> > --
> > Harish Mallipeddi
> > http://blog.poundbang.in
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals
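To close the loop on the question at the top: in Hadoop this boundary handling is implemented by LineRecordReader, the record reader that TextInputFormat hands to each mapper. Note that, per Aaron's answer, the direction is the opposite of the guess in the question: the mapper for the *previous* split reads past its split end to finish the straddling record, and the *next* mapper skips the partial bytes at its start. Below is a minimal, Hadoop-free sketch of those two rules; the class and method names (SplitReaderSketch, readSplit) are mine, not Hadoop's, and real LineRecordReader is considerably more careful (compressed input, CR-only line endings, buffered reads, etc.).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitReaderSketch {
    // Return the complete lines "owned" by the byte range [start, end).
    static List<String> readSplit(byte[] data, long start, long end) {
        List<String> lines = new ArrayList<>();
        int pos = (int) start;
        // Rule 1: unless we are at the very beginning of the file, the
        // previous mapper owns the line straddling our start, so skip
        // forward to the first newline and begin just after it.
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline itself
        }
        // Rule 2: emit lines as long as each line *starts* before the
        // split end; the last line is allowed to read past `end`.
        while (pos < data.length && pos < end) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            // trim() drops the '\r' of a CRLF line ending
            lines.add(new String(data, lineStart, pos - lineStart,
                                 StandardCharsets.UTF_8).trim());
            pos++; // consume the '\n'
        }
        return lines;
    }

    public static void main(String[] args) {
        String file = "1::122::5::838985046\r\n"
                    + "1::185::5::838983525\r\n"
                    + "1::231::5::838983392\r\n";
        byte[] data = file.getBytes(StandardCharsets.UTF_8);
        long mid = 33; // an arbitrary boundary in the middle of record 2
        System.out.println(readSplit(data, 0, mid));
        // prints [1::122::5::838985046, 1::185::5::838983525]
        System.out.println(readSplit(data, mid, data.length));
        // prints [1::231::5::838983392]
    }
}
```

With the sample file from the question and a boundary dropped mid-way through the second record, mapper 1 emits records 1 and 2 (reading past its end) and mapper 2 emits only record 3 (skipping the partial prefix), so every record is read exactly once and no mapper emits a torn record.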
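On the NLineInputFormat suggestion: a minimal job-setup sketch using the old mapred API this thread is discussing. NLineInputFormat and the mapred.line.input.format.linespermap parameter come from the thread itself; the job name, paths, and line count are placeholder assumptions, and this is a config fragment rather than something runnable outside a Hadoop installation.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineJobSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(NLineJobSketch.class);
        conf.setJobName("nline-demo");

        // Each split contains (up to) this many input lines, so every
        // mapper processes a fixed number of records regardless of
        // record length -- splits always fall on line boundaries.
        conf.setInputFormat(NLineInputFormat.class);
        conf.setInt("mapred.line.input.format.linespermap", 100000);

        // Key is the byte offset of the line start; value is the line.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```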
