Thanks Antoine. Yes, I have newlines_in_values set to false, and the other
options also look correct.
However, I do have rows with fewer columns than the column types specified
in the convert options. I have my own invalid_row_handler that currently
skips these rows (see the sketch below).
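For reference, this is roughly how I set that up. A minimal sketch; the
helper name MakeParseOptions is just for illustration:

#include <arrow/csv/options.h>

arrow::csv::ParseOptions MakeParseOptions() {
  auto parse_options = arrow::csv::ParseOptions::Defaults();
  parse_options.newlines_in_values = false;
  // Skip any row whose column count doesn't match the expected one;
  // row.expected_columns and row.actual_columns are available here.
  parse_options.invalid_row_handler =
      [](const arrow::csv::InvalidRow& row) {
        return arrow::csv::InvalidRowResult::Skip;
      };
  return parse_options;
}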
It looks like the parser does a quick first pass that splits the input into
blocks on newline separators and then parses those blocks in parallel. This
seems similar to Spark's CSV multiline option. To rule this case out, I set
newlines_in_values to false and also verified that none of the fields
contain embedded newlines.
I do have my own InputStream that feeds data into the Arrow streaming
reader. I checked the bytes returned by Read() in this stream; they arrive
in order, and the calls are serialized by mutexes as well.
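The stream is shaped roughly like this. A minimal sketch of my setup (the
class and member names are for illustration only; the copy from the other
library is elided):

#include <mutex>

#include <arrow/buffer.h>
#include <arrow/io/interfaces.h>

class MyInputStream : public arrow::io::InputStream {
 public:
  arrow::Result<int64_t> Read(int64_t nbytes, void* out) override {
    std::lock_guard<std::mutex> lock(mutex_);  // serialize concurrent reads
    int64_t n = 0;
    // ... copy up to nbytes from the other library into `out` and
    //     set n to the number of bytes actually produced ...
    pos_ += n;
    return n;
  }

  arrow::Result<std::shared_ptr<arrow::Buffer>> Read(int64_t nbytes) override {
    ARROW_ASSIGN_OR_RAISE(auto buffer, arrow::AllocateResizableBuffer(nbytes));
    ARROW_ASSIGN_OR_RAISE(int64_t n, Read(nbytes, buffer->mutable_data()));
    ARROW_RETURN_NOT_OK(buffer->Resize(n));
    return std::shared_ptr<arrow::Buffer>(std::move(buffer));
  }

  arrow::Result<int64_t> Tell() const override { return pos_; }

  arrow::Status Close() override {
    closed_ = true;
    return arrow::Status::OK();
  }

  bool closed() const override { return closed_; }

 private:
  std::mutex mutex_;
  int64_t pos_ = 0;
  bool closed_ = false;
};

It is handed to the streaming reader like this (again a sketch, reusing
the MakeParseOptions helper from above):

#include <arrow/csv/reader.h>

auto input = std::make_shared<MyInputStream>();
auto maybe_reader = arrow::csv::StreamingReader::Make(
    arrow::io::default_io_context(), input,
    arrow::csv::ReadOptions::Defaults(), MakeParseOptions(),
    arrow::csv::ConvertOptions::Defaults());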
Thanks
HK

On Mon, Mar 7, 2022 at 11:31 AM Antoine Pitrou <[email protected]> wrote:

>
> Hi HK,
>
> On Mon, 7 Mar 2022 10:16:07 -0800
> HK Verma <[email protected]> wrote:
> > I am integrating Arrow with another C++ library. For this, I wrote an
> > input stream which feeds CSV data into the streaming reader. It fails
> > for very large files with error messages like "CSV parser got out of
> > sync with chunker".
>
> This probably means your CSV data embeds newlines in values, hence the
> naive (but extremely fast) CSV chunking doesn't correspond to the
> actual CSV boundaries as detected by the full-blown CSV parser.
>
> Can you set the relevant option to true and try again?
>
> https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow3csv12ParseOptions18newlines_in_valuesE
>
> Regards
>
> Antoine.
