pepijnve commented on PR #16319: URL: https://github.com/apache/datafusion/pull/16319#issuecomment-2959103958
> In what situations would these changes lead to better performance? I.e. why is query 28 ~1.10x faster?

The jury is still out on whether it makes sense or not, but I can explain my theoretical reasoning. Apologies up front if I'm writing too pedantically; I'm just trying to explain things as clearly as I can. I'm not a database guy, so this may sound hopelessly naive.

The first observation is that yielding all the way to the runtime in Tokio requires stack unwinding. That's the nature of stackless tasks. The deeper your call stack is, the more call frames you need to unwind and the more calls you'll need to redo to get back to the yield point. I've been trying to find information on whether the Rust compiler does some magic to avoid this, but as far as I can tell that's not the case. I did find hints that it optimizes nested async function calls, but it will not do so for nested `dyn Stream` `poll_next` calls. That makes sense; an AOT compiler will typically not be able to optimize across virtual function calls. The consequence is that yielding to the runtime can have a non-trivial cost. The other PR you're reviewing is an extreme example of that.

The second observation is that DataFusion's volcano model naturally leads to fairly deep call stacks. The tree of execution plans results in a tree of streams, and a parent stream's `poll_next` will often directly call `poll_next` on a child. If you get one of these deep call stacks, yielding from the deepest point potentially means unwinding the whole thing and coming back. This is already mitigated a bit when volcano-breaking operators like repartition are present in the plan; the deepest call stacks are seen when running with `target_partitions = 1`.

Third, pipeline-breaking operators are intrinsically two-phase: first they collect, then they emit. There's a gray area of course, but I'm talking about the classic ones like single aggregation.
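To make the nesting concrete, here's a minimal synchronous sketch. The `PullStream` trait is made up for illustration (a stand-in for `Stream::poll_next`, not DataFusion's or futures' actual API): every `Poll::Pending` from the leaf unwinds through all the pass-through frames, and the next poll re-descends the whole chain.

```rust
use std::task::Poll;

// Hypothetical simplified pull interface standing in for Stream::poll_next.
trait PullStream {
    fn poll_next(&mut self) -> Poll<Option<i32>>;
}

// Leaf source that returns Pending before each result, mimicking a source
// that is not always ready.
struct Source {
    remaining: i32,
    pending_next: bool,
}

impl PullStream for Source {
    fn poll_next(&mut self) -> Poll<Option<i32>> {
        if self.pending_next {
            self.pending_next = false;
            return Poll::Pending; // yield point: the whole call stack unwinds
        }
        self.pending_next = true;
        if self.remaining == 0 {
            Poll::Ready(None)
        } else {
            self.remaining -= 1;
            Poll::Ready(Some(self.remaining))
        }
    }
}

// Pass-through operator: its poll_next directly calls the child's poll_next,
// adding one call frame. Pending propagates up through every frame.
struct PassThrough<S: PullStream> {
    child: S,
}

impl<S: PullStream> PullStream for PassThrough<S> {
    fn poll_next(&mut self) -> Poll<Option<i32>> {
        self.child.poll_next()
    }
}

// Drive a 6-deep chain to completion, counting how often the whole stack is
// descended and unwound (one round trip per Pending plus one per result).
fn run() -> (Vec<i32>, usize) {
    let mut stream = PassThrough {
        child: PassThrough {
            child: PassThrough {
                child: PassThrough {
                    child: PassThrough {
                        child: PassThrough {
                            child: Source { remaining: 3, pending_next: true },
                        },
                    },
                },
            },
        },
    };
    let mut polls = 0;
    let mut values = Vec::new();
    loop {
        polls += 1;
        match stream.poll_next() {
            Poll::Pending => continue, // in Tokio: unwind to the executor, redo later
            Poll::Ready(Some(v)) => values.push(v),
            Poll::Ready(None) => return (values, polls),
        }
    }
}

fn main() {
    let (values, polls) = run();
    println!("values={:?}, polls={}", values, polls);
}
```

Here half of the 8 polls are Pendings that each pay for a full descent and unwind of the 7-frame chain; with a real plan the frames are `dyn Stream` calls the compiler can't flatten.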
While a pipeline-breaking stream is in its collect phase, it can be 100% sure that it will not have any new data for its poll_next caller until that phase completes. There's really not much point in telling the caller `Poll::Pending` over and over again, because that leads to busy waiting. But you do still want to yield to the runtime periodically so as not to squat the Tokio executor threads. So there are situations with potentially long phases where any yield to the caller is redundant (there's no new info), but you still need to yield for cooperative scheduling.

Combining all that, I think you're looking for deep query plans with nested pipeline breakers. In a different PR someone pointed me to this query: https://github.com/apache/datafusion/blob/bf7859e5d9dbdc260674f5333a5cafa9c6e7bc12/datafusion/sqllogictest/test_files/window.slt#L3020 The nested sorts in the physical plan are something of a worst-case scenario. At the deepest sort you have a 12-level-deep call stack that gets reactivated on every yield. If instead we chop this into 6 chained spawned tasks, you get 6 much shallower call stacks. Of those tasks only one will be active; the others will be inert until there's actually something to do.

A second factor can be the data source. Filesystem streams tend to be always ready; others may not be. The more the source returns Pending, the more you'll see the overhead described above show up, I think.

All of this assumes, of course, that going up and back down the call stack has a non-trivial cost. Perhaps it's not significant enough to be measurable. I'm still figuring out how to best profile this stuff, so I'm afraid I don't have anything more fact-based to give you yet.

Besides performance there's a small aesthetic aspect to this. I find that a stream that responds with `pending <wake> ready ready ready none` is more elegant than `pending pending pending pending ... pending ready ready ready none`. The first one abstracts what's going on underneath better than the latter. But I understand that raw performance trumps aesthetics here.
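The collect-phase shape can be shown with a toy two-phase aggregation (again a made-up `PullStream` trait standing in for `Stream::poll_next`, not DataFusion code): a sum that forwards every child `Pending` to its caller produces exactly the long `pending pending ... ready` run, and each of those Pendings carries no information for the caller.

```rust
use std::task::Poll;

// Hypothetical simplified pull interface standing in for Stream::poll_next.
trait PullStream {
    fn poll_next(&mut self) -> Poll<Option<i32>>;
}

// Source that alternates Pending / Ready, mimicking I/O that isn't always ready.
struct Source {
    remaining: i32,
    pending_next: bool,
}

impl PullStream for Source {
    fn poll_next(&mut self) -> Poll<Option<i32>> {
        if self.pending_next {
            self.pending_next = false;
            return Poll::Pending;
        }
        self.pending_next = true;
        if self.remaining == 0 {
            Poll::Ready(None)
        } else {
            self.remaining -= 1;
            Poll::Ready(Some(1))
        }
    }
}

// Two-phase pipeline breaker: sum everything (collect), then emit the total
// once. During collect it knows it has nothing for its caller, yet it must
// surface every child Pending, so the caller sees a long run of Pendings
// before the single Ready.
struct Sum<S: PullStream> {
    child: S,
    acc: i32,
    done_collecting: bool,
}

impl<S: PullStream> PullStream for Sum<S> {
    fn poll_next(&mut self) -> Poll<Option<i32>> {
        if self.done_collecting {
            return Poll::Ready(None);
        }
        loop {
            match self.child.poll_next() {
                Poll::Pending => return Poll::Pending, // redundant info for the caller
                Poll::Ready(Some(v)) => self.acc += v,
                Poll::Ready(None) => {
                    self.done_collecting = true;
                    return Poll::Ready(Some(self.acc));
                }
            }
        }
    }
}

// Count how many caller-visible Pendings precede the single useful result.
fn run() -> (i32, usize) {
    let mut sum = Sum {
        child: Source { remaining: 5, pending_next: true },
        acc: 0,
        done_collecting: false,
    };
    let mut caller_pendings = 0;
    loop {
        match sum.poll_next() {
            Poll::Pending => caller_pendings += 1,
            Poll::Ready(Some(total)) => return (total, caller_pendings),
            Poll::Ready(None) => unreachable!("emit phase yields the total first"),
        }
    }
}

fn main() {
    let (total, pendings) = run();
    println!("total={}, caller saw {} redundant Pendings", total, pendings);
}
```

In a spawned-task design the collect loop would live in its own task and the caller's stream would surface at most one Pending (with a wake when the total is ready) instead of one per source Pending.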
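The "chop it into chained spawned tasks" shape can be sketched with std threads and channels standing in for Tokio tasks and async channels (illustrative only; the real change would spawn Tokio tasks):

```rust
use std::sync::mpsc;
use std::thread;

// Six chained stages, each running in its own "task" with a shallow call
// stack, connected by channels. An idle stage blocks on its input channel
// instead of being re-polled through a deep poll_next chain; only the stage
// that actually has data to process runs.
fn run() -> Vec<i32> {
    let (src_tx, mut prev_rx) = mpsc::channel::<i32>();

    let mut handles = Vec::new();
    for _stage in 0..6 {
        let (tx, rx) = mpsc::channel::<i32>();
        let input = prev_rx;
        handles.push(thread::spawn(move || {
            for v in input {
                tx.send(v + 1).unwrap(); // stand-in for per-stage work (e.g. a sort)
            }
            // dropping tx closes the downstream channel, propagating end-of-stream
        }));
        prev_rx = rx;
    }

    for v in 0..3 {
        src_tx.send(v).unwrap();
    }
    drop(src_tx); // close the source so the chain can drain and finish

    let out: Vec<i32> = prev_rx.iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    out
}

fn main() {
    println!("{:?}", run()); // each value crossed 6 shallow stages
}
```

The point is the stack shape, not the threading: no single call ever spans more than one stage, so a yield anywhere only unwinds that stage's few frames.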