Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]

via GitHub Thu, 27 Mar 2025 06:41:32 -0700


suremarc commented on issue #10336:
URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2758082825


   Leaving some thoughts here, since I was asked in
[another issue](https://github.com/apache/datafusion/issues/15191#issuecomment-2756831956)
what it would take to turn this feature on and I don't want to take over
that thread:
   
   1. As pointed out in
[this comment](https://github.com/apache/datafusion/issues/10336#issuecomment-2246022064),
the current implementation is too aggressive about limiting parallelism, so
we will need to modify the first-fit algorithm to distribute files into at
least k groups. One possible approach: initialize k empty groups, then have
the first-fit algorithm use a priority queue that prioritizes the smallest
groups.
   2. I am worried about the performance of this feature for large
ListingTables with tens or hundreds of thousands of files. Collecting
statistics for that many files and sorting by the minimum values is going to
take up a measurable amount of planning time at that scale, so we may cause a
performance regression. IMO, we need to set up a benchmark on a large
ListingTable to quantify the impact.
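
   To make point 1 concrete, here is a rough, standalone sketch of the "k empty groups plus priority queue" idea. It is not the existing DataFusion code: `FileStats` and `split_into_groups` are hypothetical names, statistics are reduced to a single `i64` min/max pair, "smallest group" is assumed to mean fewest files, and a file is assumed to fit a group only when its min is strictly greater than the max of the group's last file.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for a file's min/max statistics on the sort column.
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    name: &'static str,
    min: i64,
    max: i64,
}

/// First fit over files sorted by min, starting from `k` empty groups and
/// always trying the smallest group first. Each group stays ordered and
/// non-overlapping; an extra group is opened only when no group fits.
fn split_into_groups(mut files: Vec<FileStats>, k: usize) -> Vec<Vec<FileStats>> {
    files.sort_by_key(|f| f.min);
    let mut groups: Vec<Vec<FileStats>> = vec![Vec::new(); k];
    // Min-heap keyed on (number of files in group, group index).
    let mut heap: BinaryHeap<Reverse<(usize, usize)>> =
        (0..k).map(|i| Reverse((0, i))).collect();

    for file in files {
        let mut skipped = Vec::new();
        let mut chosen = None;
        // Pop groups from smallest to largest until one can take the file.
        while let Some(Reverse((len, idx))) = heap.pop() {
            let fits = groups[idx].last().map_or(true, |last| last.max < file.min);
            if fits {
                chosen = Some(idx);
                break;
            }
            skipped.push(Reverse((len, idx)));
        }
        // No existing group fits: open a new one (assumed fallback).
        let idx = chosen.unwrap_or_else(|| {
            groups.push(Vec::new());
            groups.len() - 1
        });
        groups[idx].push(file);
        heap.push(Reverse((groups[idx].len(), idx)));
        // Return the groups that did not fit to the queue unchanged.
        heap.extend(skipped);
    }
    groups
}
```

   For files a [0, 10], b [11, 20], c [21, 30], d [5, 15] and k = 2, this produces the groups [a, b] and [d, c], whereas plain first fit would pack a, b, c into one group and leave d alone. If there are fewer than k files, some groups stay empty. Popping and re-pushing groups that do not fit costs up to O(k log k) per file in the worst case, so this sketch does nothing to address the planning-time concern in point 2.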


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

