kolulu23 opened a new issue, #13087:
URL: https://github.com/apache/datafusion/issues/13087
### Describe the bug
`CsvFormat::infer_schema` returns an `UnequalLengths` error even though the
quote and escape characters are set in its options.
This would surprise users, because `SessionContext::register_csv` accepts
`CsvReadOptions`, but `infer_schema` does not fully use them.
### To Reproduce
For this csv file `test.csv`:
```csv
c1,c2,c3,c4
2166.105475712115,")8P~f(Je/+\",@pV<",g$vGzWhTxeZzXc!{,0
```
Note that the `c2` value is quoted with `"` and contains an escaped quote
(`\"`) followed by a comma.
This test would fail:
```rust
#[cfg(test)]
mod test {
    use datafusion::prelude::{CsvReadOptions, SessionContext};

    #[tokio::test]
    async fn infer_schema_failure() {
        let ctx = SessionContext::new();
        // Register the file without an explicit schema, so the schema is inferred.
        let r = ctx
            .register_csv(
                "test",
                "test.csv",
                CsvReadOptions::new()
                    .has_header(true)
                    .quote(b'"')
                    .escape(b'\\'),
            )
            .await;
        // Fails today: schema inference reports unequal record lengths.
        assert!(r.is_ok());
    }
}
```
The error is `Encountered unequal lengths between records on CSV file whilst
inferring schema. Expected 4 records, found 5 records`.
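
To illustrate where the fifth field likely comes from, here is a minimal sketch of a field counter. It is a hypothetical, simplified splitter written for this issue (`count_fields` is not part of `csv`, `arrow-csv`, or DataFusion), and it assumes the escape byte is only honored inside quoted fields. Without an escape character, the `"` after `\` closes the quoted field early, so the following comma starts an extra field.

```rust
/// Hypothetical, simplified field counter for a single CSV line.
/// This is NOT the parser used by the `csv` crate; it only mimics the
/// quote/escape rules needed to show the 4-vs-5 field difference.
fn count_fields(line: &str, delimiter: u8, quote: u8, escape: Option<u8>) -> usize {
    let mut fields = 1;
    let mut in_quotes = false;
    let mut at_field_start = true;
    let mut bytes = line.bytes();
    while let Some(b) = bytes.next() {
        if in_quotes {
            if Some(b) == escape {
                // Skip the escaped byte so an escaped quote does not close the field.
                bytes.next();
            } else if b == quote {
                in_quotes = false;
            }
        } else if b == quote && at_field_start {
            // A quote only opens a quoted field at the start of a field.
            in_quotes = true;
            at_field_start = false;
        } else if b == delimiter {
            fields += 1;
            at_field_start = true;
        } else {
            at_field_start = false;
        }
    }
    fields
}

fn main() {
    let header = "c1,c2,c3,c4";
    let row = r#"2166.105475712115,")8P~f(Je/+\",@pV<",g$vGzWhTxeZzXc!{,0"#;

    // The header always has four fields.
    assert_eq!(count_fields(header, b',', b'"', None), 4);
    // Escape ignored: the quoted field ends at `\"`, giving five fields.
    assert_eq!(count_fields(row, b',', b'"', None), 5);
    // Escape honored: `\"` stays inside the quoted field, giving four fields.
    assert_eq!(count_fields(row, b',', b'"', Some(b'\\')), 4);
}
```

The counts line up with the reported message (4 expected from the header, 5 found in the data row), which is consistent with the escape setting not reaching the reader used for inference.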
### Expected behavior
`register_csv` should not return `Err`, because `CsvReadOptions` specifies
the header, quote, and escape character.
The underlying CSV reader should use these options when inferring the schema.
### Additional context
If a schema that matches `test.csv` is provided to `CsvReadOptions`,
the test passes and the CSV table can be used.
After some debugging, I found that `CsvFormat::infer_schema_from_stream`
creates its `arrow::csv::reader::Format` without applying the quote and
escape settings from `CsvFormat`, which seems odd to me.
https://github.com/apache/datafusion/blob/f2da32b3bde851c34e9df0a2f4c174a5392f8897/datafusion/core/src/datasource/file_format/csv.rs#L440-L456
I dug further into the `arrow-csv` and `csv` crates, and the quote and
escape options are all there. I think that if the right options were passed
through, `infer_schema` would be easier to use.