zhuqi-lucas commented on PR #16322:
URL: https://github.com/apache/datafusion/pull/16322#issuecomment-2961564872

   > > Previously, as soon as one stream returned Pending, the merge would 
short-circuit and return Pending, minimizing work per cycle. With the new 
approach, we now poll all N streams once before giving up, which can add extra 
CPU cost, especially when you have high parallelism
   > 
   > @zhuqi-lucas this is a really interesting topic and a tricky tradeoff. The 
previous code would indeed do one poll per cycle and then do a self-wake yield. 
The new code polls each stream once and then yields without a self-wake. In terms 
of the amount of work done there is no difference, so no extra CPU cost. If 
anything, dropping the self-wake yield removes pure waste. Unless I'm missing 
something, total elapsed time should be less.
   > 
   > But the tradeoff is that we're going to extend the elapsed time per cycle. 
As a consequence it may take a bit longer to yield to the caller, and then the 
cancellation problem rears its ugly head again.
   > 
   > If this actually matters or not is entirely dependent on the streams being 
polled. SortPreservingMergeExec for instance sets things up to use 
`RecordBatchReceiverStream` instances as children. Polling those repeatedly 
until the first record batch arrives is quite wasteful because you're really 
just spinning in place. Checking each receiver in a loop is going to be super 
quick, and yielding in between every partition is not useful at all. If the 
child streams are blocking though, then it's a different matter. You probably 
don't want to actually drive the progress of each stream synchronously in a 
loop.
   > 
   > So... it's a complicated balancing act. Would love to hear how others look 
at this problem. PR #16319 ties into this. It's an experiment I'm doing to see 
if we can avoid exposing the potentially blocking portion of streams to the 
caller so that the problem described above kind of disappears. It's not yet 
clear if this can be achieved without a performance penalty.
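   
   To make the self-wake discussion above concrete, here is a minimal sketch of the two yield styles. The function names and the counting waker are illustrative only, not DataFusion or futures code:
   
   ```rust
   use std::sync::atomic::{AtomicUsize, Ordering};
   use std::sync::Arc;
   use std::task::{Context, Poll, Wake, Waker};
   
   /// Test waker that counts how many times the task gets rescheduled.
   struct CountingWaker(AtomicUsize);
   
   impl Wake for CountingWaker {
       fn wake(self: Arc<Self>) {
           self.0.fetch_add(1, Ordering::SeqCst);
       }
   }
   
   /// Old style: poll one child, and if it is pending, wake ourselves so the
   /// executor re-enters immediately. The wake/poll round trip is pure overhead.
   fn poll_with_self_wake(cx: &mut Context<'_>) -> Poll<()> {
       cx.waker().wake_by_ref();
       Poll::Pending
   }
   
   /// New style: poll every child once, and if all are pending, just yield.
   /// The wakers registered with the children reschedule us on real progress.
   fn poll_without_self_wake(_cx: &mut Context<'_>) -> Poll<()> {
       Poll::Pending
   }
   
   fn main() {
       let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
       let waker = Waker::from(counter.clone());
       let mut cx = Context::from_waker(&waker);
   
       assert_eq!(poll_with_self_wake(&mut cx), Poll::Pending);
       assert_eq!(counter.0.load(Ordering::SeqCst), 1); // immediate re-poll scheduled
   
       assert_eq!(poll_without_self_wake(&mut cx), Poll::Pending);
       assert_eq!(counter.0.load(Ordering::SeqCst), 1); // no spurious wake-up
   }
   ```
   
   The wake count is the whole difference: the self-wake variant burns an executor cycle even though no child made progress, while the plain yield stays parked until a child's waker fires.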
   
   Thank you @pepijnve, one possible trade-off solution may be:
   
   **Trade-off strategy: batch polling**
   
   An approach that sits between the two extremes is batch polling. Instead of 
polling all streams every time, divide the N child streams into several groups 
(e.g., 3 groups).
   
   On each poll cycle, poll only one group of streams. On the next cycle, move 
on to the next group, and so on.
   
   This bounds the per-cycle polling cost while still making steady progress 
across all streams over time. That matters especially in high-parallelism 
scenarios, where polling all streams on every wake-up is unnecessarily 
expensive.
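   
   The grouped polling described above can be sketched as follows. This is a minimal illustration under stated assumptions: `BatchPoller` and the closure-based mock sources are hypothetical stand-ins, not DataFusion or futures APIs:
   
   ```rust
   use std::task::Poll;
   
   /// Hypothetical sketch: round-robin batch polling over N child sources.
   /// Each cycle polls only one group of `group_size` sources, then advances.
   /// Assumes a non-empty source list and `group_size > 0`.
   struct BatchPoller<S> {
       sources: Vec<S>,
       group_size: usize,
       next_group: usize, // index of the group to poll on the next cycle
   }
   
   impl<S, T> BatchPoller<S>
   where
       S: FnMut() -> Poll<T>,
   {
       fn new(sources: Vec<S>, group_size: usize) -> Self {
           Self { sources, group_size, next_group: 0 }
       }
   
       /// Poll one group of sources; return the first ready value, if any.
       /// Returns Pending after checking at most `group_size` sources,
       /// bounding per-cycle cost regardless of N.
       fn poll_cycle(&mut self) -> Poll<T> {
           let n = self.sources.len();
           let start = self.next_group * self.group_size;
           let end = (start + self.group_size).min(n);
           // Advance to the next group (wrapping) for the following cycle.
           let groups = (n + self.group_size - 1) / self.group_size;
           self.next_group = (self.next_group + 1) % groups;
           for i in start..end {
               if let Poll::Ready(v) = (self.sources[i])() {
                   return Poll::Ready(v);
               }
           }
           Poll::Pending
       }
   }
   
   fn main() {
       // Six mock sources; only source 4 is ready.
       let sources: Vec<Box<dyn FnMut() -> Poll<usize>>> = (0..6)
           .map(|i| {
               Box::new(move || if i == 4 { Poll::Ready(i) } else { Poll::Pending })
                   as Box<dyn FnMut() -> Poll<usize>>
           })
           .collect();
       let mut poller = BatchPoller::new(sources, 2);
       // Cycle 1 polls sources 0..2, cycle 2 polls 2..4: both pending.
       assert_eq!(poller.poll_cycle(), Poll::Pending);
       assert_eq!(poller.poll_cycle(), Poll::Pending);
       // Cycle 3 polls sources 4..6 and finds source 4 ready.
       assert_eq!(poller.poll_cycle(), Poll::Ready(4));
   }
   ```
   
   One caveat this sketch makes visible: with `group_size = 2` it takes three cycles to notice that source 4 was ready all along, so group size trades per-cycle cost against worst-case latency to reach a ready stream.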
   
   And I believe our TPC-H benchmarks are not enough to cover this; we may need 
to mock those cases... but that's really hard.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
