Dandandan opened a new issue, #20529: URL: https://github.com/apache/datafusion/issues/20529
### Is your feature request related to a problem or challenge? Current parelllization of Parquet scan is bounded by the thread that has the most data / is the slowest to execute, which means in the case of data skew (driven by either larger partitions or less selective filters during pruning / filter pushdown..., variable object store latency), the parallelism will be significantly limited. ### Describe the solution you'd like We can change the strategy by morsel-driven parallelism like described in https://db.in.tum.de/~leis/papers/morsels.pdf. Doing so is faster for a lot of queries, when there is an amount of skew (such as clickbench) and we have enough row filters to spread out the work. For clickbench_partitioned / clickbench_pushdown it seems up to ~2x as fast for some queries, on a 10 core machine. ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
