pepijnve commented on PR #16319:
URL: https://github.com/apache/datafusion/pull/16319#issuecomment-2959103958

   > In what situations would these changes lead to better performance? I.e. 
why is query 28 ~1.10x faster?
   
   The jury is still out on whether it makes sense or not. I can explain my 
theoretical reasoning. Apologies up front if I'm being too pedantic; I'm just 
trying to explain things as clearly as I can. I'm not a database person, so 
this may sound hopelessly naive.
   
   The first observation is that yielding all the way to the runtime in Tokio 
requires stack unwinding. That's the nature of stackless tasks: the deeper your 
call stack is, the more call frames you need to unwind, and the more calls 
you'll need to redo to get back to the yield point. I've been trying to find 
information on whether the Rust compiler does some magic to avoid this, but as 
far as I can tell that's not the case. I did find hints that it optimizes 
nested async function calls, but it will not do so for nested `dyn Stream` 
`poll_next` calls. That makes sense; an AOT compiler will typically not be able 
to optimize across virtual function calls.
   The consequence is that yielding to the runtime can have a non-trivial cost. 
The other PR you're reviewing is an extreme example of that.
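   To make the cost model concrete, here's a minimal sketch of the traversal 
(not DataFusion's actual types; `PollNext`, `Source`, and `Passthrough` are 
illustrative stand-ins for a boxed stream's `poll_next`). Every `Pending` that 
originates at the leaf is observed at the top only after re-entering one frame 
per layer, so the number of frames touched grows as depth × yields:

   ```rust
   use std::cell::Cell;
   use std::rc::Rc;
   use std::task::Poll;

   // Illustrative stand-in for Stream::poll_next: dynamically dispatched,
   // like a boxed stream, so calls can't be flattened across layers.
   trait PollNext {
       fn poll_next(&mut self) -> Poll<Option<i32>>;
   }

   // A leaf that reports Pending `pending_left` times before producing a value.
   struct Source {
       pending_left: u32,
   }

   impl PollNext for Source {
       fn poll_next(&mut self) -> Poll<Option<i32>> {
           if self.pending_left > 0 {
               self.pending_left -= 1;
               Poll::Pending
           } else {
               Poll::Ready(Some(42))
           }
       }
   }

   // A pass-through operator: each poll re-enters this frame and forwards to
   // the child, so one Pending at the leaf costs one call per layer.
   struct Passthrough {
       child: Box<dyn PollNext>,
       frames: Rc<Cell<u64>>,
   }

   impl PollNext for Passthrough {
       fn poll_next(&mut self) -> Poll<Option<i32>> {
           self.frames.set(self.frames.get() + 1);
           self.child.poll_next()
       }
   }

   // Count how many wrapper frames get entered before the value arrives.
   fn frames_entered(depth: u32, pendings: u32) -> u64 {
       let frames = Rc::new(Cell::new(0u64));
       let mut top: Box<dyn PollNext> = Box::new(Source { pending_left: pendings });
       for _ in 0..depth {
           top = Box::new(Passthrough { child: top, frames: frames.clone() });
       }
       // Drive to completion; every Pending unwinds and later re-walks all layers.
       while top.poll_next().is_pending() {}
       frames.get()
   }

   fn main() {
       // (pendings + 1) polls from the top, each traversing `depth` frames.
       println!("{}", frames_entered(12, 100)); // 101 * 12 = 1212
   }
   ```

   Halving the depth halves the re-walked frames for the same number of 
yields, which is the knob the rest of this comment is about.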
   
   The second observation is that DataFusion's volcano model naturally leads 
to fairly deep call stacks. The tree of execution plans results in a tree of 
streams, and a parent stream's `poll_next` will often directly call `poll_next` 
on a child. If you get one of these deep call stacks, yielding from the deepest 
point potentially means unwinding the whole thing and coming back. This is 
already mitigated somewhat when volcano-breaking operators like repartition are 
present in the plan. The deepest call stacks are seen when running with 
`target_partitions = 1`.
   
   Third, pipeline-breaking operators are intrinsically two-phase: first they 
collect, then they emit. There's a gray area of course, but I'm talking about 
the classic ones like a single aggregation. While a pipeline-breaking stream is 
in its collect phase, it can be 100% sure that it will not have any new data 
for its `poll_next` caller until that phase completes. There's really not much 
point in telling the caller `Poll::Pending` over and over again, because that 
leads to busy waiting. But you do still want to yield to the runtime 
periodically so as not to squat on the Tokio executor threads.
   So there are situations with potentially long phases where any yield to the 
caller is redundant (there's no new info), but you still need to yield for 
cooperative scheduling.
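   A sketch of what I mean by the collect phase not surfacing `Pending` 
(again illustrative names, not DataFusion's real operators; a real version 
would still yield to the runtime periodically, e.g. via `tokio::task::yield_now`). 
The aggregate drains its input in an internal loop and only returns `Pending` 
when the input itself is pending, so a caller over an always-ready input never 
sees a redundant `Pending`:

   ```rust
   use std::task::Poll;

   // Illustrative stand-in for Stream::poll_next.
   trait PollNext {
       fn poll_next(&mut self) -> Poll<Option<i64>>;
   }

   // An always-ready input, like a filesystem-backed source often is.
   struct ReadySource {
       remaining: Vec<i64>,
   }

   impl PollNext for ReadySource {
       fn poll_next(&mut self) -> Poll<Option<i64>> {
           Poll::Ready(self.remaining.pop())
       }
   }

   enum Phase {
       Collect,
       Emit,
       Done,
   }

   // A toy pipeline breaker: sums everything, then emits one value.
   struct SumAggregate {
       input: Box<dyn PollNext>,
       phase: Phase,
       acc: i64,
   }

   impl PollNext for SumAggregate {
       fn poll_next(&mut self) -> Poll<Option<i64>> {
           loop {
               match self.phase {
                   Phase::Collect => match self.input.poll_next() {
                       // Fold the value in and keep looping instead of
                       // returning Pending for every batch we consume.
                       Poll::Ready(Some(v)) => self.acc += v,
                       // Input exhausted: the collect phase is over.
                       Poll::Ready(None) => self.phase = Phase::Emit,
                       // Only a genuinely pending input reaches the caller.
                       Poll::Pending => return Poll::Pending,
                   },
                   Phase::Emit => {
                       self.phase = Phase::Done;
                       return Poll::Ready(Some(self.acc));
                   }
                   Phase::Done => return Poll::Ready(None),
               }
           }
       }
   }

   // Drive the aggregate and count how many Pendings the caller observes.
   fn drive(values: Vec<i64>) -> (i64, u32) {
       let mut agg = SumAggregate {
           input: Box::new(ReadySource { remaining: values }),
           phase: Phase::Collect,
           acc: 0,
       };
       let (mut total, mut pendings_seen) = (0, 0);
       loop {
           match agg.poll_next() {
               Poll::Ready(Some(v)) => total += v,
               Poll::Ready(None) => break,
               Poll::Pending => pendings_seen += 1,
           }
       }
       (total, pendings_seen)
   }

   fn main() {
       let (sum, pendings) = drive(vec![1, 2, 3, 4]);
       println!("sum={sum} pendings_seen_by_caller={pendings}"); // sum=10, 0 Pendings
   }
   ```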
   
   Combining all that, I think you're looking for deep query plans with nested 
pipeline breakers. In a different PR someone pointed me to this query: 
https://github.com/apache/datafusion/blob/bf7859e5d9dbdc260674f5333a5cafa9c6e7bc12/datafusion/sqllogictest/test_files/window.slt#L3020
   The nested sorts in the physical plan are something of a worst-case 
scenario. At the deepest sort you have a 12-level-deep call stack that gets 
reactivated on every yield. If instead we chop this into 6 chained spawned 
tasks, you get 6 much shallower call stacks. Of those tasks only one will be 
active at a time; the others will be inert until there's actually something to 
do.
   
   A second factor can be the data source. Filesystem streams tend to be 
always ready; others may not be. The more often the source returns 
`Poll::Pending`, the more you'll see the overhead described above show up, I 
think.
   
   All of this assumes of course that going up and back down the call stack 
has a non-trivial cost. Perhaps it's not significant enough to be measurable. 
I'm still figuring out how to best profile this stuff, so I'm afraid I don't 
have anything more fact-based to give you yet.
   
   Besides performance, there's a small aesthetic aspect to this. I find that 
a stream that responds with
   `pending <wake> ready ready ready none`
   is more elegant than
   `pending pending pending pending ... pending ready ready ready none`
   The first one abstracts what's going on underneath better than the latter. 
But I understand that raw performance trumps aesthetics here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

