adriangb commented on PR #17273: URL: https://github.com/apache/datafusion/pull/17273#issuecomment-3215075113
Okay this mostly works other than the fact that `split_groups_by_statistics_with_target_partitions` does not respect target partitions and instead tries to force files into groups so that the ranges don't overlap. @alamb this seems like an over-specialization to the influx use case, but maybe it's a wider spread use case than I am considering. It's not clear to me the tradeoffs of forcing that behavior vs. alternative (e.g. making a single group where all of the files are sorted but have overlapping ranges). This test illustrates what happens: ``` copy (select i from generate_series(1, 10) as t(i)) to 'data/sort/t10.parquet'; copy (select i from generate_series(9, 20) as t(i)) to 'data/sort/t9.parquet'; copy (select i from generate_series(19, 30) as t(i)) to 'data/sort/t8.parquet'; copy (select i from generate_series(29, 40) as t(i)) to 'data/sort/t7.parquet'; copy (select i from generate_series(39, 50) as t(i)) to 'data/sort/t6.parquet'; copy (select i from generate_series(49, 60) as t(i)) to 'data/sort/t5.parquet'; copy (select i from generate_series(59, 70) as t(i)) to 'data/sort/t4.parquet'; copy (select i from generate_series(69, 80) as t(i)) to 'data/sort/t3.parquet'; copy (select i from generate_series(79, 90) as t(i)) to 'data/sort/t2.parquet'; copy (select i from generate_series(89, 100) as t(i)) to 'data/sort/t1.parquet'; copy (select i from generate_series(99, 110) as t(i)) to 'data/sort/t0.parquet'; SET datafusion.execution.target_partitions = '1'; SET datafusion.execution.split_file_groups_by_statistics = 'true'; create external table t stored as parquet location 'data/sort/'; explain analyze select * from t order by i asc limit 1; ``` 1. `split_groups_by_statistics_with_target_partitions` will split these up into 2 groups (roughly `[(1,10), (19, 30), ...], [(9, 20), (29, 40), ...]`) : https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/datasource/src/file_scan_config.rs#L857-L874 2. Because we have 2 groups and 1 target partition this line gets hit and we fall back to ordering by file path: https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/core/src/datasource/listing/table.rs#L1224 That said we (Pydantic) don't use that function (nor do we use ListingTable) so just the refactor to make the sorting information available to TableProvider is good enough for us. To get that optimization to everyone using ListingTable / `datafusion-cli` I think we'll have to have a think about how hardcoded the assumption of a desire for non-overlapping ranges is. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org