Thanks Antoine. Yes, I have newlines_in_values set to false, and the other options look OK as well. However, I do have rows with fewer columns than the number of types specified in ConvertOptions::column_types; I have my own invalid_row_handler that currently skips these rows.

It looks like the parser does a quick pass that splits the input into blocks on the newline separator and then parses those blocks in parallel, similar to Spark's CSV multiLine option. To rule this out, I set newlines_in_values to false and also verified that none of the fields contain embedded newlines.

I also have my own InputStream that feeds data into the Arrow streaming reader. I checked the bytes returned by its Read calls: they arrive in order, and the calls are serialized by a mutex.
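For reference, here is roughly how my options and reader are wired up (a simplified sketch; the column names and types are placeholders rather than my real schema, and "input" stands in for my custom InputStream implementation):

#include <iostream>
#include <memory>

#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>

// Sketch of my setup; "input" is my custom InputStream and the
// column names/types below are placeholders.
arrow::Status RunCsvReader(std::shared_ptr<arrow::io::InputStream> input) {
  auto read_options = arrow::csv::ReadOptions::Defaults();

  auto parse_options = arrow::csv::ParseOptions::Defaults();
  parse_options.newlines_in_values = false;  // as mentioned above

  // Skip rows whose column count doesn't match the expected schema.
  parse_options.invalid_row_handler =
      [](const arrow::csv::InvalidRow& row) {
        std::cerr << "skipping row " << row.number << ": expected "
                  << row.expected_columns << " columns, got "
                  << row.actual_columns << "\n";
        return arrow::csv::InvalidRowResult::Skip;
      };

  auto convert_options = arrow::csv::ConvertOptions::Defaults();
  convert_options.column_types["id"] = arrow::int64();   // placeholder
  convert_options.column_types["name"] = arrow::utf8();  // placeholder

  ARROW_ASSIGN_OR_RAISE(
      auto reader,
      arrow::csv::StreamingReader::Make(arrow::io::default_io_context(),
                                        std::move(input), read_options,
                                        parse_options, convert_options));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // ... consume batch ...
  }
  return arrow::Status::OK();
}

Thanks,
HK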
On Mon, Mar 7, 2022 at 11:31 AM Antoine Pitrou <[email protected]> wrote:
>
> Hi HK,
>
> On Mon, 7 Mar 2022 10:16:07 -0800
> HK Verma <[email protected]> wrote:
> > I am integrating Arrow with another C++ library. For this, I wrote an input
> > stream which feeds CSV data into the streaming reader. It fails for very
> > large files with the error messages like - "CSV parser got out of sync with
> > chunker".
>
> This probably means your CSV data embeds newlines in values, hence the
> naive (but extremely fast) CSV chunking doesn't correspond to the
> actual CSV boundaries as detected by the full blown CSV parser.
>
> Can you set the relevant option to true and try again?
>
> https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow3csv12ParseOptions18newlines_in_valuesE
>
> Regards
>
> Antoine.
