Re: [PR] Introduce morsel-driven Parquet scan [datafusion]

via GitHub Tue, 24 Feb 2026 16:10:01 -0800


Dandandan commented on PR #20481:
URL: https://github.com/apache/datafusion/pull/20481#issuecomment-3955464814


   > I am not quite sure this is entirely the same as "morsel driven 
parallelism" as described in Viktor's paper. I personally might propose calling 
this feature "work stealing scans" or "dynamic scheduling" rather than morsel 
driven parallelism as that
   
   Good to hear your opinion on this! I agree it doesn't implement everything 
from the paper.
   I was reading the paper more in depth in recent days and concluded that the 
essence boils down to those things:
   
   * thread per core (in our case already true when setting `target_partitions` 
to cpu cores)
   * morsel = a large enough unit of work, large enough for communication 
overhead to be low, small enough to introduce enough parallelism
   * work stealing / a queue with spreading "morsels" on the threads
   
   I tried to see if there's _much_ more to it - the main thing I can think of 
is that the way of executing a plan _seems_ still "volcano style" as an 
`ExecutionPlan::execute` and not a higher order "dispatcher"/"scheduler" takes 
care of it (which would probably still have some benefit, maybe for better 
resource usage in hash repartition ).
   
   Other optimizations/operators in the paper are I think ~90% there already.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Introduce morsel-driven Parquet scan [datafusion]

Reply via email to