Thanks Xiangrui. This file already exists w/o escapes. I could probably preprocess it to add the escaping.
On Fri, Sep 12, 2014 at 9:38 PM, Xiangrui Meng <men...@gmail.com> wrote:
> I wrote an input format for Redshift tables unloaded with the UNLOAD
> ESCAPE option: https://github.com/mengxr/redshift-input-format , which
> can recognize multi-line records.
>
> Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and
> the delimiter character. You can apply the same escaping before
> calling saveAsTextFile, then use the input format to load them back.
>
> Xiangrui
>
> On Fri, Sep 12, 2014 at 7:43 PM, Mohit Jaggi <mohitja...@gmail.com> wrote:
> > Folks,
> > I think this might be due to the default TextInputFormat in Hadoop. Any
> > pointers to solutions much appreciated.
> >
> >> More powerfully, you can define your own InputFormat implementations to
> >> format the input to your programs however you want. For example, the
> >> default TextInputFormat reads lines of text files. The key it emits for
> >> each record is the byte offset of the line read (as a LongWritable), and
> >> the value is the contents of the line up to the terminating '\n'
> >> character (as a Text object). If you have multi-line records each
> >> separated by a '$' character, you could write your own InputFormat that
> >> parses files into records split on this character instead.
> >
> > Thanks,
> > Mohit
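For reference, the escaping convention Xiangrui describes can be sketched in plain Python. This is only an illustration of the scheme (backslash before any in-field `\\`, `\r`, `\n`, or delimiter), not the API of the linked input format; the function names, the `|` field delimiter, and the newline record separator are assumptions for the example.

```python
def escape_field(field, delimiter="|"):
    """Put a backslash before any in-field backslash, CR, LF, or delimiter,
    so a record with embedded newlines fits on one physical line."""
    return "".join(
        "\\" + ch if ch in ("\\", "\r", "\n", delimiter) else ch
        for ch in field
    )

def parse(data, delimiter="|"):
    """Parse escaped text back into records of fields.

    A backslash makes the next character literal; an *unescaped*
    delimiter ends a field and an *unescaped* '\n' ends a record --
    this is how multi-line records can be recognized after the fact.
    """
    records, fields, cur = [], [], []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch == "\\" and i + 1 < len(data):
            cur.append(data[i + 1])  # escaped char, taken literally
            i += 2
            continue
        if ch == delimiter:
            fields.append("".join(cur)); cur = []
        elif ch == "\n":
            fields.append("".join(cur)); cur = []
            records.append(fields); fields = []
        else:
            cur.append(ch)
        i += 1
    if cur or fields:  # final record without trailing newline
        fields.append("".join(cur))
        records.append(fields)
    return records

if __name__ == "__main__":
    rows = [["a", "multi\nline value", "pipe | inside"],
            ["back\\slash", "plain", "x"]]
    # Escape each field, join fields with '|' and records with '\n';
    # this is the text you would hand to saveAsTextFile.
    text = "\n".join("|".join(escape_field(f) for f in r) for r in rows)
    assert parse(text) == rows  # lossless round trip
```

Applying `escape_field` to each field before `saveAsTextFile` (e.g. via a `map` over the RDD) should make the output safe for a reader that splits on unescaped delimiters, as the linked input format does.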