suremarc commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2780386970
Also, there are two other issues I'd like to call out: ## Unit tests for `FileScanConfig::split_groups_by_statistics_with_target_partitions` There are some table-driven unit tests for `FileScanConfig::split_groups_by_statistics` but they don't cover the new `FileScanConfig::split_groups_by_statistics_with_target_partitions` method which is the one that is actually called after #15473. I think we could extend the test in the following manner: 1. Run `FileScanConfig::split_groups_by_statistics` 2. Count the number of file groups produced, call it `N`. This is the minimum number of file groups needed 3. Run `FileScanConfig::split_groups_by_statistics_with_target_partitions` with `N` as the target partitions. This should produce the exact same result ## Inexact statistics After rereading the code, I do not think the `MinMaxStatistics` data structure actually differentiates between exact and inexact statistics. Technically this is fine as long as the min/max values are treated as conservative estimates, which I believe is how they are used in the `FileScanConfig`-level statistics with all of the built-in file formats. However, it seems _possible_ that someone could implement their own `DataSource`/`FileSource` that doesn't respect this. TBH I am tempted to consider such a case a bug on the user's side, but it is technically within the semantics of `Statistics` as of today. So we would need to wait for migration to StatisticsV2 (migrating from `Precision` to `Distribution`) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org