suremarc commented on issue #10336:
URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2780386970

   Also, there are two other issues I'd like to call out:
   
   ## Unit tests for 
`FileScanConfig::split_groups_by_statistics_with_target_partitions`
   There are some table-driven unit tests for 
`FileScanConfig::split_groups_by_statistics` but they don't cover the new 
`FileScanConfig::split_groups_by_statistics_with_target_partitions` method 
which is the one that is actually called after #15473. I think we could extend 
the test in the following manner:
   
   1. Run `FileScanConfig::split_groups_by_statistics`
   2. Count the number of file groups produced, call it `N`. This is the 
minimum number of file groups needed
   3. Run `FileScanConfig::split_groups_by_statistics_with_target_partitions` 
with `N` as the target partitions. This should produce the exact same result
   
   ## Inexact statistics
   
   After rereading the code, I do not think the `MinMaxStatistics` data 
structure actually differentiates between exact and inexact statistics. 
Technically this is fine as long as the min/max values are treated as 
conservative estimates, which I believe is how they are used in the 
`FileScanConfig`-level statistics with all of the built-in file formats. 
However, it seems _possible_ that someone could implement their own 
`DataSource`/`FileSource` that doesn't respect this. 
   
   TBH I am tempted to consider such a case a bug on the user's side, but it is 
technically within the semantics of `Statistics` as of today. So we would need 
to wait for migration to StatisticsV2 (migrating from `Precision` to 
`Distribution`)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to