crepererum commented on issue #14287:
URL: https://github.com/apache/datafusion/issues/14287#issuecomment-2615351468

   IIRC the `RepartitionExec` has the following requirements:
   
   - **performance:** "one big lock" is probably a no-go, esp. in a wide MPMC 
setup.
   - **feed ALL consumers:** Some downstream operators assume that they can 
more or less poll all consumer channels. At least we've seen dead-locks in the 
past when `RepartitionExec` assumes that all consumers are polled at the same 
rate. Hence you MUST feed empty consumers.
   - **skewed inputs:** Inputs may be skewed, i.e. provide different data rates 
and lengths.
   - **limited buffering:** A simple implementation that we've once had just 
had a tokio task per input and polled them until they are empty. That's NOT 
gonna work in many production systems since you're going to blow up memory, 
esp. when the outputs are eventually used with LIMIT clauses and/or slow 
processing nodes. This is the reason for the current ["distributor 
channels"](https://github.com/apache/datafusion/blob/0eebc0c7c0ffcd1514f5c6d0f8e2b6d0c69a07f5/datafusion/physical-plan/src/repartition/distributor_channels.rs#L18-L39)
 construct.
   
   The work stealing approach sounds reasonable but is also somewhat a hack. I 
think if you don't know the output polling rate, then distributing data to the 
different outputs at a fixed rate (that's what round-robin does) isn't a great 
idea. I think finding a fast MPMC channel w/ a fixed capacity (to implement 
_limited buffering_) might be good.
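
To make the "fixed-capacity MPMC channel" idea concrete, here is a minimal sketch using only the Rust standard library. It is not how `RepartitionExec` or the distributor channels work. The function name `shared_bounded_queue_demo` and all of its parameters are hypothetical, and the MPMC behaviour is emulated by putting the receiver of a bounded `sync_channel` behind a `Mutex`, which is exactly the kind of lock the performance requirement above warns about. It only illustrates the semantics: producers block once `capacity` items are buffered, and each consumer pulls at its own rate instead of being handed items at a fixed round-robin rate.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical helper (illustration only): `producers` threads each push
/// `items_per_producer` items into ONE bounded queue that `consumers` threads
/// pull from. Returns the number of items each consumer received.
fn shared_bounded_queue_demo(
    producers: usize,
    items_per_producer: usize,
    consumers: usize,
    capacity: usize,
) -> Vec<usize> {
    // `sync_channel` is bounded: `send` blocks once `capacity` items are queued,
    // which is the "limited buffering" property.
    let (tx, rx) = sync_channel::<usize>(capacity);
    // std's channel is MPSC; sharing the receiver behind a mutex emulates MPMC.
    let rx: Arc<Mutex<Receiver<usize>>> = Arc::new(Mutex::new(rx));

    let producer_handles: Vec<_> = (0..producers)
        .map(|p| {
            let tx = tx.clone();
            thread::spawn(move || {
                for i in 0..items_per_producer {
                    tx.send(p * items_per_producer + i)
                        .expect("receiver dropped while producers were running");
                }
            })
        })
        .collect();
    // Drop the original sender so the channel closes once all producers finish.
    drop(tx);

    let consumer_handles: Vec<_> = (0..consumers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut received = 0usize;
                loop {
                    // The guard is a temporary dropped at the end of this
                    // statement, so the lock is held while waiting for one item.
                    let item = rx.lock().unwrap().recv();
                    match item {
                        Ok(_) => received += 1, // a real consumer would process it here
                        Err(_) => break,        // channel closed and fully drained
                    }
                }
                received
            })
        })
        .collect();

    for handle in producer_handles {
        handle.join().unwrap();
    }
    consumer_handles
        .into_iter()
        .map(|handle| handle.join().unwrap())
        .collect()
}
```

Calling `shared_bounded_queue_demo(3, 100, 4, 2)` returns four per-consumer counts that sum to 300; how the items split between consumers depends on scheduling. Note that a single shared queue only fits the case where any output may take any batch (as with round-robin); it does not cover hash partitioning, where each batch has a fixed destination.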


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

