timvw opened a new issue, #14016:
URL: https://github.com/apache/datafusion/issues/14016
### Describe the bug
Schema inference through ListingTableConfig no longer works for a compressed
JSON file.
With datafusion 35 and 36 the expected schema is inferred.
With datafusion 37, 38 and 39 we see an error: ArrowError(JsonError("Failed
to read JSON record: stream did not contain valid UTF-8"), None)
With datafusion 40+ the error goes away, but no schema is inferred
### To Reproduce
```rust
let ctx = SessionContext::new();
// The file can be found here:
// https://github.com/timvw/arrow-testing/blob/master/data/json/ndjson-sample.json.gz
let data_path = "/somewhere/testing/data/json/ndjson-sample.json.gz";
let table_path = ListingTableUrl::parse(&data_path)?;
let config = ListingTableConfig::new(table_path);
let config_with_opts = config.infer_options(&ctx.state()).await?;
let config_with_schema = config_with_opts.infer_schema(&ctx.state()).await?;
```
### Expected behavior
The schema is inferred as in earlier versions
### Additional context
Initial investigation shows that the infer_options method of
ListingTableConfig loses information:
- file_extension is inferred to be "json" (instead of "json.gz" as in the past)
-> no files will be found in infer_schema
- file_format is created without capturing the (potential) compression type
-> trying to read the file (without a codec) results in the error mentioned above
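
To illustrate the two points above, here is a minimal, standard-library-only sketch of splitting a file name into a format extension and an optional compression suffix while keeping both. It is not DataFusion's implementation: `split_extension`, `listing_extension`, and the list of compression suffixes are hypothetical names and values chosen for this example only.

```rust
/// Hypothetical helper (not a DataFusion API): split a file name into its
/// format extension and an optional compression suffix.
fn split_extension(file_name: &str) -> Option<(&str, Option<&str>)> {
    // Suffixes treated as compression codecs in this sketch only.
    const COMPRESSION_SUFFIXES: [&str; 4] = ["gz", "bz2", "xz", "zst"];

    // Peel off the last dot-separated component.
    let (stem, last) = file_name.rsplit_once('.')?;
    if COMPRESSION_SUFFIXES.contains(&last) {
        // The last component is a codec; the format extension precedes it.
        let (_, format) = stem.rsplit_once('.')?;
        Some((format, Some(last)))
    } else {
        Some((last, None))
    }
}

/// Hypothetical helper: the extension to filter on when listing files.
/// The compression suffix is kept so that compressed files still match.
fn listing_extension(format: &str, compression: Option<&str>) -> String {
    match compression {
        Some(codec) => format!(".{format}.{codec}"),
        None => format!(".{format}"),
    }
}

fn main() {
    let name = "ndjson-sample.json.gz";
    let (format, compression) =
        split_extension(name).expect("file name has an extension");
    let ext = listing_extension(format, compression);
    println!("format={format} compression={compression:?} listing extension={ext}");

    // Filtering on the full extension matches the file ...
    assert!(name.ends_with(&ext));
    // ... whereas filtering on the format extension alone does not.
    assert!(!name.ends_with(".json"));
}
```

The point of the sketch is only that the compression suffix has to survive option inference twice: once in the extension used to list files, and once as the codec handed to the file format.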