Re: [PR] [DRAFT]: refactor TableProvider::scan to push down sorting [datafusion]

via GitHub Fri, 22 Aug 2025 10:21:18 -0700


adriangb commented on PR #17273:
URL: https://github.com/apache/datafusion/pull/17273#issuecomment-3215075113


   Okay this mostly works other than the fact that 
`split_groups_by_statistics_with_target_partitions` does not respect target 
partitions and instead tries to force files into groups so that the ranges 
don't overlap. @alamb this seems like an over-specialization to the influx use 
case, but maybe it's a wider spread use case than I am considering. It's not 
clear to me the tradeoffs of forcing that behavior vs. alternative (e.g. making 
a single group where all of the files are sorted but have overlapping ranges).
   
   This test illustrates what happens:
   
   ```
   copy (select i from generate_series(1, 10) as t(i)) to 
'data/sort/t10.parquet';
   copy (select i from generate_series(9, 20) as t(i)) to 
'data/sort/t9.parquet';
   copy (select i from generate_series(19, 30) as t(i)) to 
'data/sort/t8.parquet';
   copy (select i from generate_series(29, 40) as t(i)) to 
'data/sort/t7.parquet';
   copy (select i from generate_series(39, 50) as t(i)) to 
'data/sort/t6.parquet';
   copy (select i from generate_series(49, 60) as t(i)) to 
'data/sort/t5.parquet';
   copy (select i from generate_series(59, 70) as t(i)) to 
'data/sort/t4.parquet';
   copy (select i from generate_series(69, 80) as t(i)) to 
'data/sort/t3.parquet';
   copy (select i from generate_series(79, 90) as t(i)) to 
'data/sort/t2.parquet';
   copy (select i from generate_series(89, 100) as t(i)) to 
'data/sort/t1.parquet';
   copy (select i from generate_series(99, 110) as t(i)) to 
'data/sort/t0.parquet';
   SET datafusion.execution.target_partitions = '1';
   SET datafusion.execution.split_file_groups_by_statistics = 'true';
   create external table t stored as parquet location 'data/sort/';
   explain analyze
   select *
   from t
   order by i asc
   limit 1;
   ```
   
   1. `split_groups_by_statistics_with_target_partitions` will split these up 
into 2 groups (roughly `[(1,10), (19, 30), ...], [(9, 20), (29, 40), ...]`) : 
https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/datasource/src/file_scan_config.rs#L857-L874
   2. Because we have 2 groups and 1 target partition this line gets hit and we 
fall back to ordering by file path: 
https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/core/src/datasource/listing/table.rs#L1224
   
   
   That said we (Pydantic) don't use that function (nor do we use ListingTable) 
so just the refactor to make the sorting information available to TableProvider 
is good enough for us. To get that optimization to everyone using ListingTable 
/ `datafusion-cli` I think we'll have to have a think about how hardcoded the 
assumption of a desire for non-overlapping ranges is.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] [DRAFT]: refactor TableProvider::scan to push down sorting [datafusion]

Reply via email to