connec opened a new issue, #11472:
URL: https://github.com/apache/datafusion/issues/11472

   ### Is your feature request related to a problem or challenge?
   
   I'm trying to read CSVs that include newlines in (quoted) values.
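
For illustration, here is a minimal sketch of the kind of input in question. It uses Python's standard `csv` module purely as a stand-in for a quote-aware parser; the sample data and variable names are made up for this example and do not come from DataFusion or `arrow-csv`.

```python
import csv
import io

# Header plus two data rows; the `comment` field of the first data row
# contains a newline inside a quoted value.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Splitting on line breaks alone sees four physical lines.
physical_lines = data.splitlines()

# A quote-aware parser sees three logical records (header + two rows).
records = list(csv.reader(io.StringIO(data)))

print(len(physical_lines), len(records))
```

Any reader that treats every line break as a record terminator would cut the first data row in half.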
   
   ### Describe the solution you'd like
   
   Some googling revealed that this isn't currently supported by the 
`arrow-csv` crate, whereas that functionality does exist in the C++ 
([`ParseOptions::newlines_in_values`](https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow3csv12ParseOptions18newlines_in_valuesE))
 and Python 
([`ParseOptions.newlines_in_values`](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html#pyarrow.csv.ParseOptions.newlines_in_values))
 implementations.
   
   Ideally, a `newlines_in_values` field could be added to 
[`datafusion::common::config::CsvOptions`](https://docs.rs/datafusion/latest/datafusion/common/config/struct.CsvOptions.html)
 to support this functionality.
   
   Note that the Python docs call out the performance implications of this:
   
   > Setting this to True reduces the performance of multi-threaded CSV reading.
   
   I haven't dug into the implementation, but I imagine it becomes harder to 
find the right split point for multi-threaded reading (though, it seems not 
dissimilar to finding the prev/next linebreak, so perhaps not 
insurmountable...).
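
To make the split-point concern concrete, here is a rough sketch of a quote-aware boundary search. It is only an assumption about how such a search could look, not how any Arrow implementation does it: it assumes `"` is the quote character, that embedded quotes are escaped by doubling, and the helper names `record_boundaries` and `split_point` are hypothetical.

```python
def record_boundaries(data: bytes) -> list:
    """Return offsets just past each newline that lies outside quotes."""
    boundaries = []
    in_quotes = False
    for i, byte in enumerate(data):
        if byte == 0x22:  # '"' toggles the quoted state; an escaped "" toggles twice
            in_quotes = not in_quotes
        elif byte == 0x0A and not in_quotes:  # '\n' outside quotes ends a record
            boundaries.append(i + 1)
    return boundaries


def split_point(data: bytes, target: int) -> int:
    """Pick the first record boundary at or after `target`, else len(data)."""
    for offset in record_boundaries(data):
        if offset >= target:
            return offset
    return len(data)


data = b'a,b\n1,"x\ny"\n2,z\n'

# Naive approach: cut just after the next newline at or after offset 8.
naive = data.find(b"\n", 8) + 1

# Quote-aware approach: skip the newline that sits inside the quoted value.
aware = split_point(data, 8)

print(naive, aware)
```

In this sketch the quote state at an arbitrary offset is only known after scanning from the start of the data, which may be one reason the option costs parallelism: a worker cannot simply jump to its offset and look for the nearest line break.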
   
   ### Describe alternatives you've considered
   
   The only alternative I can see would be to preprocess the CSV before feeding 
it into DataFusion. I haven't explored this option as I imagine it would take a 
lot of DataFusion plumbing, and it seems valuable to have parity with other arrow CSV packages 
(C++ and Python, at least).
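
For reference, a preprocessing step along these lines might look like the sketch below. It is an assumption about one possible workaround, not something I have tried against DataFusion: the helper name `flatten_newlines` and the choice of a space as the placeholder are hypothetical, and the rewrite is lossy because the original line breaks inside values are discarded.

```python
import csv
import io


def flatten_newlines(text: str, placeholder: str = " ") -> str:
    """Rewrite CSV text so that no field contains a line break."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        # Replace CRLF first so a single placeholder is emitted per line break.
        writer.writerow(
            [f.replace("\r\n", placeholder).replace("\n", placeholder) for f in row]
        )
    return out.getvalue()


raw = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'
cleaned = flatten_newlines(raw)
print(cleaned)
```

After this pass every record occupies exactly one physical line, so a reader that splits on line breaks can consume the output, at the cost of an extra full pass over the file.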
   
   ### Additional context
   
   I was originally planning to report this against the `arrow-rs` repository, 
but since my use case is with `datafusion` I decided to report it here. Let me 
know if this issue would be more appropriate there and I can move/copy it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

