suremarc commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2758082825
Leaving some thoughts here as I was asked in [another issue](https://github.com/apache/datafusion/issues/15191#issuecomment-2756831956) about what it would take to turn this feature on, and I don't want to take over that thread.

1. As pointed out in [comment](https://github.com/apache/datafusion/issues/10336#issuecomment-2246022064), the current impl is too aggressive about limiting parallelism, so we will need to modify the first fit algorithm to distribute into at least k groups. This can maybe be done if we initialize k empty groups and then use some sort of priority queue in the first fit algorithm that prioritizes the smallest groups.
2. I am worried about the performance of this feature for large ListingTables with tens or hundreds of thousands of files. Collecting statistics for that many files and sorting by the minimum values is going to take up a measurable amount of planning time at that scale, so we may cause a performance regression. IMO, we need to set up a benchmark on a large ListingTable to quantify the impact.
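
To make point 1 concrete, here is a rough sketch of the idea rather than a patch against the actual DataFusion code. `FileRange` and `distribute_into_groups` are hypothetical names made up for illustration, the sort key is simplified to a single `i64` column, and "smallest group" is taken to mean fewest files (total bytes or row count would work the same way). A file is assumed to fit a group only if its min is strictly greater than the max of the last file in that group.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for one file's min/max statistics on the sort column.
#[derive(Debug, Clone, PartialEq)]
pub struct FileRange {
    pub name: String,
    pub min: i64,
    pub max: i64,
}

/// First fit over `k` pre-created groups, trying the smallest group first.
/// Within every returned group the files are non-overlapping and ordered.
/// More than `k` groups are returned if the overlaps require it; groups may
/// stay empty when there are fewer than `k` files.
pub fn distribute_into_groups(mut files: Vec<FileRange>, k: usize) -> Vec<Vec<FileRange>> {
    // Same precondition as plain first fit: process files by ascending min.
    files.sort_by_key(|f| f.min);

    let mut groups: Vec<Vec<FileRange>> = (0..k).map(|_| Vec::new()).collect();
    // Min-heap of (number of files in group, group index).
    let mut heap: BinaryHeap<Reverse<(usize, usize)>> =
        (0..k).map(|i| Reverse((0, i))).collect();

    for file in files {
        let mut skipped = Vec::new();
        let mut target = None;

        // Pop groups from smallest to largest until one can take the file.
        while let Some(Reverse((len, idx))) = heap.pop() {
            let fits = groups[idx].last().map_or(true, |last| last.max < file.min);
            if fits {
                target = Some(idx);
                break;
            }
            skipped.push(Reverse((len, idx)));
        }

        // No existing group fits: open a new one, as plain first fit does.
        let idx = match target {
            Some(idx) => idx,
            None => {
                groups.push(Vec::new());
                groups.len() - 1
            }
        };

        groups[idx].push(file);
        // Re-insert the chosen group with its new size, plus the skipped ones.
        heap.push(Reverse((groups[idx].len(), idx)));
        heap.extend(skipped);
    }

    groups
}
```

With four non-overlapping files and `k = 2`, plain first fit would put everything into a single group, while this version alternates between the two groups. The pop-and-restore loop can still touch every group for a single file in the worst case, so this only shows the shape of the approach and says nothing about the planning-time concern in point 2.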