Re: Large size Text file split

jason hadoop Wed, 10 Jun 2009 05:39:45 -0700

There is always NLineInputFormat, where you specify the number of lines per
split. The key is the byte offset of the start of the line in the file, and
the value is the line itself.
The parameter mapred.line.input.format.linespermap controls the number of
lines per split.
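
To make the key/value layout concrete, here is a rough stand-alone sketch in
Python of what N-lines-per-split looks like. It is only a simulation of the
idea, not Hadoop code: the function name nline_splits and the sample data are
made up for illustration, and lines_per_split stands in for the value you
would put in mapred.line.input.format.linespermap.

```python
def nline_splits(data: bytes, lines_per_split: int):
    """Group lines into splits of N lines; each record is (byte offset, line)."""
    if lines_per_split < 1:
        raise ValueError("lines_per_split must be >= 1")
    splits, current, offset = [], [], 0
    for raw in data.splitlines(keepends=True):
        # Key: byte position where the line starts.
        # Value: the line text without its CR/LF terminator.
        current.append((offset, raw.rstrip(b"\r\n").decode("utf-8")))
        offset += len(raw)
        if len(current) == lines_per_split:
            splits.append(current)
            current = []
    if current:
        splits.append(current)  # the last split may hold fewer than N lines
    return splits


# Hypothetical sample: a tiny CSV with CRLF line endings.
sample = b"id,name\r\n1,ann\r\n2,bob\r\n3,cy\r\n4,dee\r\n"
for number, split in enumerate(nline_splits(sample, 2)):
    print(number, split)
```

Every split holds whole lines only, so no record is cut in half. Note that
only the first split contains the header line.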


On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi <
[email protected]> wrote:

> On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo <[email protected]>
> wrote:
>
> > Hi, all
> >
> > I have a large CSV file (larger than 10 GB), and I'd like to use a suitable
> > InputFormat to split it into smaller parts so that each Mapper can deal with
> > a piece of the CSV file. However, as far as I know, FileInputFormat only
> > cares about the byte size of the file; that is, the class can divide the CSV
> > file into many parts, and some parts may not be well-formed CSV files.
> > For example, a line of the CSV file may not be terminated with CRLF, or
> > some text may be truncated.
> >
> > How can I ensure that each FileSplit is a smaller valid CSV file using a
> > proper InputFormat?
> >
> > BR/anderson
> >
>
> If all you care about is the splits occurring at line boundaries, then
> TextInputFormat will work.
>
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html
>
> If not, I guess you can write your own InputFormat class.
>
> --
> Harish Mallipeddi
> http://blog.poundbang.in
>
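
On the quoted point that TextInputFormat gives you splits at line boundaries:
the splits are still computed by byte size, but the reader for each split
skips a partial first line and reads past the end of its byte range to finish
its last line. Below is a simplified Python sketch of that convention. It is
an illustration under my own assumptions, not Hadoop's actual code; the name
read_split and the sample data are hypothetical.

```python
def read_split(data: bytes, start: int, end: int):
    """Return the complete lines whose first byte falls in [start, end)."""
    pos = start
    if start != 0:
        # Back up one byte and skip through the next newline, so a line that
        # begins exactly at `start` is kept and a partial line is dropped
        # (the previous split is responsible for finishing it).
        newline = data.find(b"\n", start - 1)
        pos = len(data) if newline == -1 else newline + 1
    lines = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline + 1
        # May read beyond `end` so the last line is never truncated.
        lines.append(data[pos:stop].rstrip(b"\r\n").decode("utf-8"))
        pos = stop
    return lines


# Hypothetical sample: 18 bytes cut into byte ranges of at most 7 bytes.
data = b"a,1\r\nbb,2\r\nccc,3\r\n"
for start in range(0, len(data), 7):
    end = min(start + 7, len(data))
    print((start, end), read_split(data, start, end))
```

Each line is returned by exactly one split, even though the byte ranges cut
through the middle of lines. A split can also come back empty when its whole
range lies inside a line that an earlier split already finished.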



-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals
