Blizzara opened a new pull request, #13750: URL: https://github.com/apache/datafusion/pull/13750
## Which issue does this PR close?

Ignore empty files in ListingTable. Input datasets sometimes contain empty (0-byte) files, and treating them like normal files fails, for example when reading Parquet metadata.

Closes https://github.com/apache/datafusion/issues/13737.

## Rationale for this change

Empty files cannot contribute any rows to the table; they can only break reads, so ignoring them is pretty much strictly better. This also aligns with other engines, e.g. [Spark](https://github.com/apache/spark/blob/b2c8b3069ef4f5288a5964af0da6f6b23a769e6b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L82C9-L82C23).

## What changes are included in this PR?

Ignore empty files in ListingTable when listing files, with or without partition filters, as well as when inferring the schema.

One thing I'm not sure about: `pruned_partition_list` also seems to be used when writing to a table, in https://github.com/apache/datafusion/blob/28e4c64dc738227cd6a4cdf7db48685338582c04/datafusion/core/src/datasource/listing/table.rs#L1001. Is it a problem to ignore empty files there?

## Are these changes tested?

Added an empty file to the existing tests for `pruned_partition_list`, and added new tests for `list_partitions`. `table.rs` didn't seem to have any tests for schema inference, so I didn't add any for it.

## Are there any user-facing changes?

Reading ListingTables whose input contains empty files now succeeds.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
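For illustration only, the idea of skipping empty files during listing can be sketched as a simple size filter. This is a minimal sketch and not the code in this PR: `ListedFile` and `ignore_empty_files` are hypothetical names standing in for whatever listing entry type and filtering step the real implementation uses.

```rust
/// Hypothetical stand-in for one entry returned by a file listing.
/// This is NOT DataFusion's actual type; it only carries what the sketch needs.
#[derive(Debug, Clone, PartialEq)]
struct ListedFile {
    path: String,
    size: u64, // size in bytes as reported by the listing
}

/// Drop zero-byte files so later steps (for example reading Parquet
/// metadata or inferring a schema) never see them.
/// Hypothetical helper, named for illustration only.
fn ignore_empty_files(files: Vec<ListedFile>) -> Vec<ListedFile> {
    files.into_iter().filter(|f| f.size > 0).collect()
}
```

Given a listing with one 1024-byte file and one 0-byte file, only the non-empty entry would remain after the filter.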
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org