Blizzara opened a new pull request, #13750:
URL: https://github.com/apache/datafusion/pull/13750

   ## Which issue does this PR close?
   
   Ignore empty files in ListingTable. Input datasets sometimes contain empty 
(0-byte) files, and treating them like normal files fails, e.g. when reading 
Parquet metadata. 
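
To illustrate why a 0-byte file cannot be read as Parquet: a Parquet file ends with a 4-byte little-endian metadata length followed by the 4-byte magic `PAR1`, so any reader has to find at least those 8 trailing bytes. The sketch below is a simplified, std-only check; `footer_metadata_len` is a hypothetical helper for illustration, not the `parquet` crate's API and not code from this PR.

```rust
/// Magic bytes that terminate every Parquet file.
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";
/// Footer tail: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_TAIL_LEN: usize = 8;

/// Hypothetical helper: returns the metadata length recorded in the
/// footer tail, or an error if the bytes cannot hold a Parquet footer.
fn footer_metadata_len(file: &[u8]) -> Result<u32, String> {
    // A 0-byte file fails here: there is no footer tail to read.
    if file.len() < FOOTER_TAIL_LEN {
        return Err(format!(
            "file of {} bytes is too small to hold a Parquet footer",
            file.len()
        ));
    }
    let tail = &file[file.len() - FOOTER_TAIL_LEN..];
    if tail[4..] != PARQUET_MAGIC[..] {
        return Err("missing trailing PAR1 magic".to_string());
    }
    Ok(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]))
}
```

With this check, an empty byte slice is rejected before any metadata decoding is attempted, which mirrors the kind of failure described above.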
   
   Closes https://github.com/apache/datafusion/issues/13737.
   
   ## Rationale for this change
   
   Empty files cannot contribute anything to the table other than breaking 
things, so ignoring them is pretty much strictly better. This also aligns with other engines, e.g. 
[Spark](https://github.com/apache/spark/blob/b2c8b3069ef4f5288a5964af0da6f6b23a769e6b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L82C9-L82C23).
   
   ## What changes are included in this PR?
   
   Ignore empty files in ListingTable when listing files, with or without 
partition filters, as well as when inferring the schema.
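
As a rough sketch of the idea (not the actual diff): dropping empty files amounts to filtering the listing on size before the entries reach schema inference or scan planning. `FileMeta` and `non_empty_files` below are hypothetical stand-ins for illustration; the real code works on object-store listing results inside ListingTable.

```rust
/// Hypothetical stand-in for the per-file metadata a listing returns.
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    path: String,
    size: u64,
}

/// Keep only files that can contribute data, i.e. drop 0-byte files.
fn non_empty_files(listing: Vec<FileMeta>) -> Vec<FileMeta> {
    listing.into_iter().filter(|f| f.size > 0).collect()
}
```

Applying the same filter on every listing path (with partition filters, without them, and during schema inference) keeps the behavior consistent regardless of how the table is accessed.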
   
   One thing I'm not sure about: `pruned_partition_list` also seems to be 
used when writing to a table, in 
https://github.com/apache/datafusion/blob/28e4c64dc738227cd6a4cdf7db48685338582c04/datafusion/core/src/datasource/listing/table.rs#L1001.
 Is it a problem to ignore empty files there?
   
   ## Are these changes tested?
   
   Added an empty file to the existing tests for `pruned_partition_list`, and added 
new tests for `list_partitions`. `table.rs` didn't seem to have any tests for 
schema inference, so I didn't add any for it.
   
   ## Are there any user-facing changes?
   
   Reading input ListingTables containing empty files now succeeds.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

