Dandandan opened a new issue, #20529:
URL: https://github.com/apache/datafusion/issues/20529

   ### Is your feature request related to a problem or challenge?
   
   Parallelism in the current Parquet scan is bounded by the thread that has the 
most data or is the slowest to execute. In the case of data skew (driven by, 
for example, larger partitions, less selective filters during pruning / filter 
pushdown, or variable object store latency), the effective parallelism is 
significantly limited.
   
   
   
   ### Describe the solution you'd like
   
   We can change the strategy to morsel-driven parallelism, as described in 
https://db.in.tum.de/~leis/papers/morsels.pdf.
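
   To illustrate the difference, here is a minimal simulation sketch. It is not 
DataFusion code: the function names and the cost numbers are hypothetical, and 
"cost" is an abstract unit of work rather than a measurement. It compares a 
scan where each worker is pinned to one partition with a scan where idle 
workers pull the next morsel from a shared queue.

```python
import heapq


def static_makespan(partitions):
    """Each worker is pinned to one partition.

    The scan finishes only when the most heavily loaded worker finishes.
    """
    return max(sum(costs) for costs in partitions)


def morsel_makespan(morsels, workers):
    """Workers pull morsels from a shared queue as soon as they are idle.

    Simulated greedily: the worker that becomes free first takes the
    next morsel in queue order.
    """
    finish_times = [0] * workers  # min-heap of per-worker finish times
    heapq.heapify(finish_times)
    for cost in morsels:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Hypothetical, skewed workload: one large partition and three small ones.
partitions = [
    [10, 10, 10, 10, 10, 10, 10, 10],  # total cost 80
    [5, 5],                            # total cost 10
    [5, 5],                            # total cost 10
    [5, 5],                            # total cost 10
]

# The same work, flattened into one shared queue of morsels.
morsels = [cost for costs in partitions for cost in costs]

if __name__ == "__main__":
    print("static partitioning:", static_makespan(partitions))  # 80
    print("morsel-driven:", morsel_makespan(morsels, workers=4))  # 30
```

   In this made-up example the static plan takes 80 units because three workers 
sit idle after 10 units, while the shared queue finishes in 30 units, close to 
the ideal 110 / 4 = 27.5. Real gains depend on morsel granularity and 
scheduling overhead, which this sketch ignores.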
   
   Doing so is faster for many queries when there is some amount of skew 
(as in clickbench) and we have enough row filters to spread out the work.
   For clickbench_partitioned / clickbench_pushdown it appears to be up to ~2x 
as fast for some queries on a 10-core machine.
   
   
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

