A FileSplit is merely a description of split boundaries, e.g., "bytes 0 to 9999" and "bytes 10000 to 19999". The Mapper then interprets the boundaries described by a FileSplit in a way that makes sense at the data level; the FileSplit does not physically contain the data to be mapped over.
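As a rough illustration of the point above: a FileSplit boils down to a (path, start offset, length) triple. The class below is an illustrative stand-in of my own, not Hadoop's actual FileSplit (the real one also records which hosts hold the relevant block replicas, for scheduling):

```java
/**
 * Illustrative stand-in for what a FileSplit describes: no data at all,
 * just a file path plus a byte range. (Hadoop's real FileSplit also
 * carries the list of hosts holding the block replicas.)
 */
public class SplitDescription {
    final String path;
    final long start;
    final long length;

    SplitDescription(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }

    @Override
    public String toString() {
        return path + " [bytes " + start + " to " + (start + length - 1) + "]";
    }

    public static void main(String[] args) {
        // Two adjacent 10,000-byte splits over the same file, as in the
        // example above. Note nothing here opens or reads the file.
        System.out.println(new SplitDescription("/data/ratings.csv", 0, 10000));
        System.out.println(new SplitDescription("/data/ratings.csv", 10000, 10000));
    }
}
```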
So mapper 1 will open the file via the InputFormat and start reading at byte 0, and stop reading when it gets to its "final record," which is defined as the first record that ends after byte 9999. If it has to read through byte 10020, that's OK; the stream used to read the bytes from the file will not "cut off" at 9999. Mapper 2 starts reading at byte 10000. It finds the first newline at byte 10020, so the first "real" record it processes starts at byte 10021.

- Aaron

On Wed, Jun 10, 2009 at 9:02 PM, Wenrui Guo <[email protected]> wrote:

> I don't understand the internal logic of the FileSplit and Mapper.
>
> By my understanding, FileInputFormat is the actual class that takes
> care of the file splitting. So it's reasonable if one large file is
> split into 5 smaller parts, each part less than 2 GB (since we specify
> the numberOfSplit is 5).
>
> However, the FileSplit has rough edges, so mapper 1, which takes split
> 1 as input, omits the incomplete record at the end of split 1, and
> mapper 2 continues to read that incomplete part and adds the remaining
> part from split 2?
>
> Take this as an example. The original file is:
>
> 1::122::5::838985046 (CRLF)
> 1::185::5::838983525 (CRLF)
> 1::231::5::838983392 (CRLF)
>
> Assume the number of splits is 2; then the above content is divided
> into two parts:
>
> Split 1:
> 1::122::5::838985046 (CRLF)
> 1::185::5::8
>
> Split 2:
> 38983525 (CRLF)
> 1::231::5::838983392 (CRLF)
>
> Afterwards, mapper 1 takes split 1 as input, but after eating the line
> 1::122::5::838985046, it finds that the remaining part is not a
> complete record, so mapper 1 bypasses it, while mapper 2 reads it and
> puts it ahead of the first line of split 2 to compose a valid record.
>
> Is that correct? If it is, which class implements the above logic?
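The boundary handling Aaron describes is implemented in Hadoop by LineRecordReader, the reader TextInputFormat uses. Below is a minimal, dependency-free sketch of that logic with illustrative names of my own (it is not Hadoop's code); for simplicity it treats a bare \n as the line terminator, whereas the real LineRecordReader also handles \r\n:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified sketch of how Hadoop's LineRecordReader (used by
 * TextInputFormat) turns a byte-range split into whole lines.
 * Class and method names here are illustrative, not Hadoop's own.
 */
public class SplitLineReader {

    /** Returns the lines a mapper assigned the byte range [start, end) would process. */
    public static List<String> readLines(byte[] data, long start, long end) {
        List<String> lines = new ArrayList<>();
        int pos = (int) start;
        if (start != 0) {
            // Back up one byte and discard everything through the first
            // newline: a record that straddles the boundary belongs to
            // the *previous* split, so skip the partial line.
            pos = (int) start - 1;
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline
        }
        // Keep reading whole lines as long as the line *starts* before
        // the split boundary; the final line may run past `end`.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = ("1::122::5::838985046\n"
                     + "1::185::5::838983525\n"
                     + "1::231::5::838983392\n").getBytes(StandardCharsets.UTF_8);
        // Two splits that cut the middle record in half, as in the example.
        System.out.println(readLines(data, 0, 32));   // first two records
        System.out.println(readLines(data, 32, 63));  // last record only
    }
}
```

Running this on the thread's three-record example shows the middle record going, whole, to the mapper for the first split, while the second split's mapper skips ahead to the record boundary.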
>
> BR/anderson
>
> -----Original Message-----
> From: Aaron Kimball [mailto:[email protected]]
> Sent: Thursday, June 11, 2009 11:49 AM
> To: [email protected]
> Subject: Re: Large size Text file split
>
> The FileSplit boundaries are "rough" edges -- the mapper responsible
> for the previous split will continue until it finds a full record, and
> the next mapper will read ahead and only start on the first record
> boundary after the byte offset.
> - Aaron
>
> On Wed, Jun 10, 2009 at 7:53 PM, Wenrui Guo <[email protected]> wrote:
>
> > I think the default TextInputFormat can meet my requirement. However,
> > even though the JavaDoc of TextInputFormat says it divides the input
> > file into text lines ending with CRLF, I'd like to know: if the
> > FileSplit size is not N times the line length, what will happen
> > eventually?
> >
> > BR/anderson
> >
> > -----Original Message-----
> > From: jason hadoop [mailto:[email protected]]
> > Sent: Wednesday, June 10, 2009 8:39 PM
> > To: [email protected]
> > Subject: Re: Large size Text file split
> >
> > There is always NLineInputFormat. You specify the number of lines per
> > split. The key is the position of the line start in the file; the
> > value is the line itself. The parameter
> > mapred.line.input.format.linespermap controls the number of lines per
> > split.
> >
> > On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi
> > <[email protected]> wrote:
> >
> > > On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo <[email protected]> wrote:
> > >
> > > > Hi, all
> > > >
> > > > I have a large CSV file (larger than 10 GB). I'd like to use a
> > > > certain InputFormat to split it into smaller parts so that each
> > > > mapper can deal with a piece of the CSV file. However, as far as
> > > > I know, FileInputFormat only cares about the byte size of the
> > > > file; that is, the class may divide the CSV file into many parts,
> > > > and some parts may not be well-formed CSV.
> > > > For example, one line of the CSV file is not terminated with
> > > > CRLF, or maybe some text is trimmed.
> > > >
> > > > How do I ensure each FileSplit is a smaller valid CSV file, using
> > > > a proper InputFormat?
> > > >
> > > > BR/anderson
> > >
> > > If all you care about is the splits occurring at line boundaries,
> > > then TextInputFormat will work.
> > >
> > > http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html
> > >
> > > If not, I guess you can write your own InputFormat class.
> > >
> > > --
> > > Harish Mallipeddi
> > > http://blog.poundbang.in
> >
> > --
> > Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> > http://www.apress.com/book/view/9781430219422
> > www.prohadoopbook.com a community for Hadoop Professionals
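For the NLineInputFormat route mentioned in the thread, a job would be configured roughly as follows. This is a sketch against the old (mapred) API the thread discusses, not compiled here since it needs the Hadoop jars on the classpath; the property name comes from jason's message above:

```java
// Sketch: configure a job to hand each mapper a fixed number of lines,
// using the old org.apache.hadoop.mapred API. Requires the Hadoop jars.
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

JobConf conf = new JobConf();
conf.setInputFormat(NLineInputFormat.class);
// N lines of input per split (and hence per map task):
conf.setInt("mapred.line.input.format.linespermap", 1000);
```

The trade-off versus plain TextInputFormat is that NLineInputFormat gives exact per-mapper line counts but must scan the input to find line boundaries when computing splits.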
