Re: sc.textFile problem due to newlines within a CSV record

2014-09-13 Thread Mohit Jaggi
Thanks Xiangrui. This file already exists w/o escapes. I could probably try to preprocess it and add the escaping. On Fri, Sep 12, 2014 at 9:38 PM, Xiangrui Meng wrote: > I wrote an input format for Redshift's tables unloaded via UNLOAD with the > ESCAPE option: https://github.com/mengxr/redshift-input-f

Re: sc.textFile problem due to newlines within a CSV record

2014-09-12 Thread Xiangrui Meng
I wrote an input format for Redshift's tables unloaded via UNLOAD with the ESCAPE option: https://github.com/mengxr/redshift-input-format, which can recognize multi-line records. Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and the delimiter character. You can apply the same escaping b
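
The escaping rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in redshift-input-format: the function names `escape_field` and `split_records` are hypothetical, and the `|` delimiter is an assumption made for the example.

```python
def escape_field(field, delimiter="|"):
    """Put a backslash before each backslash, CR, LF, and delimiter in a field."""
    out = []
    for ch in field:
        if ch in ("\\", "\r", "\n", delimiter):
            out.append("\\")
        out.append(ch)
    return "".join(out)


def split_records(text, delimiter="|"):
    """Split escaped text into records, each a list of unescaped fields.

    An unescaped newline ends a record and an unescaped delimiter ends a
    field. A character preceded by a backslash is taken literally.
    """
    records, fields, current = [], [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Escaped character: keep it as data and skip the backslash.
            current.append(text[i + 1])
            i += 2
            continue
        if ch == delimiter:
            fields.append("".join(current))
            current = []
        elif ch == "\n":
            fields.append("".join(current))
            current = []
            records.append(fields)
            fields = []
        else:
            current.append(ch)
        i += 1
    if current or fields:
        # Final record without a trailing newline.
        fields.append("".join(current))
        records.append(fields)
    return records


# Hypothetical sample rows: one field has an embedded newline, another
# has an embedded delimiter and backslash.
rows = [["1", "line one\nline two"], ["2", "a|b\\c"]]
text = "\n".join("|".join(escape_field(f) for f in row) for row in rows) + "\n"
recovered = split_records(text)
```

Once the in-record newlines are escaped, a reader can tell a record boundary from field data by checking whether the newline is preceded by a backslash, which is what a plain line-based split cannot do.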

sc.textFile problem due to newlines within a CSV record

2014-09-12 Thread Mohit Jaggi
Folks, I think this might be due to the default TextInputFormat in Hadoop. Any pointers to solutions much appreciated. >> More powerfully, you can define your own *InputFormat* implementations to format the input to your programs however you want. For example, the default TextInputFormat reads line
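
A small Python sketch of the symptom, outside Spark and Hadoop entirely. It assumes the embedded newline sits inside a double-quoted CSV field, which may not match the file in question; the sample data is made up. Splitting on line boundaries, as a line-oriented reader does, cuts the record in two, while a CSV-aware parser keeps it whole.

```python
import csv
import io

# Hypothetical sample: the second record has a newline inside a quoted field.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Line-oriented view: every newline is treated as a record boundary.
naive = data.splitlines()

# CSV-aware view: the quoted newline stays inside the field.
parsed = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive))   # number of physical lines
print(len(parsed))  # number of logical records
```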