pepijnve commented on PR #16196:
URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2950129956

   One performance aspect I've been looking at is the cost of yielding. There's 
no magic as far as I can tell. Returning Pending simply leads to a full 
unwind of the call stack, because the return value bubbles all the way up to 
the tokio executor. Resuming is then a full descent, through ordinary function 
calls, back to the point where you left off.
   That would suggest it's most interesting to do a cooperative yield from as 
shallow a point as possible in the call tree rather than from the deepest 
possible point so that you keep the roundtrip to the executor and back as short 
as possible.
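
   To make the roundtrip cost concrete, here is a minimal std-only sketch 
(no tokio, no DataFusion). `Level`, `YieldingLeaf` and `count_poll_calls` are 
hypothetical names invented for this illustration. `Level` stands in for one 
operator in the plan tree, and the driver loop is a stand-in for the executor: 
it just re-polls the root until it is ready.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; the driver loop below simply re-polls the root.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Deepest point of the tree: returns Pending `yields` times, then Ready.
struct YieldingLeaf {
    yields: usize,
    polls: Rc<Cell<usize>>,
}

impl Future for YieldingLeaf {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        self.polls.set(self.polls.get() + 1);
        if self.yields == 0 {
            return Poll::Ready(());
        }
        self.yields -= 1;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Hypothetical stand-in for one operator: forwards poll to its child.
struct Level {
    inner: Pin<Box<dyn Future<Output = ()>>>,
    polls: Rc<Cell<usize>>,
}

impl Future for Level {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        self.polls.set(self.polls.get() + 1);
        self.inner.as_mut().poll(cx)
    }
}

/// Wraps a yielding leaf in `depth` levels, drives it to completion from the
/// root, and returns the total number of poll calls made at all levels.
fn count_poll_calls(depth: usize, yields: usize) -> usize {
    let polls = Rc::new(Cell::new(0));
    let mut fut: Pin<Box<dyn Future<Output = ()>>> = Box::pin(YieldingLeaf {
        yields,
        polls: polls.clone(),
    });
    for _ in 0..depth {
        fut = Box::pin(Level { inner: fut, polls: polls.clone() });
    }
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    // Every Pending returns to this loop; every resume re-enters from the root.
    while fut.as_mut().poll(&mut cx).is_pending() {}
    polls.get()
}

fn main() {
    // 3 yields below 10 levels: (10 + 1) * (3 + 1) = 44 poll calls.
    println!("deep:    {}", count_poll_calls(10, 3));
    // Same 3 yields with nothing above the yield point: 4 poll calls.
    println!("shallow: {}", count_poll_calls(0, 3));
}
```

   In this toy model the number of poll calls is (depth + 1) * (yields + 1), 
so the cost per yield grows linearly with how deep the yield point sits.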
   
   Running with target_partitions = 1 shows that, for queries like the deeply 
nested window/sort query you linked to, @ozankabak, the call stack can get pretty 
deep. It's essentially proportional to the depth of the plan tree.
   
   To mitigate this, would it make sense for pipeline breaking operators to run 
their pipeline breaking portion in a SpawnedTask instead of as a child? I'm 
thinking of the sort phase of sort, the build phase of join, etc. Regardless of 
how or where you inject Pending, that seems beneficial to keep the call stack 
that needs to be unwound shallow.
   
   Note that this same argument does suggest it could be more interesting to do 
the cooperative yield where the looping is happening rather than where the data 
is produced. The loop is the shallowest point, especially if you spawn a task, 
since that gets you a new root, while the producer is the deepest point.
   
   Cutting the call stack using spawned tasks may also mitigate the deeply 
nested query concern regarding checking for yielding at multiple levels. The 
yield is never going to go beyond the scope of a single task.
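
   A rough std-only sketch of why a spawned task cuts the stack. This is a toy 
round-robin executor, not tokio and not DataFusion's SpawnedTask. Every name in 
it (`Flag`, `Done`, `Frame`, `Work`, `Join`, `run`, `frames_inline`, 
`frames_spawned`) is hypothetical. It counts how many operator frames get 
traversed when the yielding work runs as a child, versus as its own root task 
that the parent only waits on.

```rust
use std::cell::{Cell, RefCell};
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

type BoxFut = Pin<Box<dyn Future<Output = ()>>>;

/// Per-task wake flag; the toy executor polls a task only when it is set.
struct Flag(AtomicBool);
impl Wake for Flag {
    fn wake(self: Arc<Self>) { self.0.store(true, Ordering::SeqCst); }
}

/// Completion signal shared between a "spawned" task and its parent.
#[derive(Default)]
struct Done { finished: Cell<bool>, waiter: RefCell<Option<Waker>> }

/// One operator frame: counts the traversal, then forwards to its child.
struct Frame { inner: BoxFut, frames: Rc<Cell<usize>> }
impl Future for Frame {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        self.frames.set(self.frames.get() + 1);
        self.inner.as_mut().poll(cx)
    }
}

/// Yields `n` times (waking itself), then signals `done` if there is one.
struct Work { n: usize, done: Option<Rc<Done>> }
impl Future for Work {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.n > 0 {
            self.n -= 1;
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }
        if let Some(d) = &self.done {
            d.finished.set(true);
            if let Some(w) = d.waiter.borrow_mut().take() { w.wake(); }
        }
        Poll::Ready(())
    }
}

/// Parent side: stays pending until the spawned task has finished.
struct Join(Rc<Done>);
impl Future for Join {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0.finished.get() { return Poll::Ready(()); }
        *self.0.waiter.borrow_mut() = Some(cx.waker().clone());
        Poll::Pending
    }
}

fn wrap(mut fut: BoxFut, depth: usize, frames: &Rc<Cell<usize>>) -> BoxFut {
    for _ in 0..depth {
        fut = Box::pin(Frame { inner: fut, frames: frames.clone() });
    }
    fut
}

/// Runs the tasks round-robin, polling only those whose flag was set.
fn run(tasks: Vec<BoxFut>) {
    let mut tasks: Vec<(Option<BoxFut>, Arc<Flag>)> = tasks
        .into_iter()
        .map(|t| (Some(t), Arc::new(Flag(AtomicBool::new(true)))))
        .collect();
    while tasks.iter().any(|(t, _)| t.is_some()) {
        for (slot, flag) in tasks.iter_mut() {
            if !flag.0.swap(false, Ordering::SeqCst) { continue; }
            let waker = Waker::from(Arc::clone(flag));
            let mut cx = Context::from_waker(&waker);
            let ready = match slot.as_mut() {
                Some(fut) => fut.as_mut().poll(&mut cx).is_ready(),
                None => false,
            };
            if ready { *slot = None; }
        }
    }
}

/// Work runs as a child: every yield crosses parent + child frames.
fn frames_inline(parent: usize, child: usize, yields: usize) -> usize {
    let frames = Rc::new(Cell::new(0));
    let work: BoxFut = Box::pin(Work { n: yields, done: None });
    run(vec![wrap(work, parent + child, &frames)]);
    frames.get()
}

/// Work runs as its own root: a yield only crosses the child frames.
fn frames_spawned(parent: usize, child: usize, yields: usize) -> usize {
    let frames = Rc::new(Cell::new(0));
    let done = Rc::new(Done::default());
    let work: BoxFut = Box::pin(Work { n: yields, done: Some(done.clone()) });
    let join: BoxFut = Box::pin(Join(done));
    run(vec![wrap(join, parent, &frames), wrap(work, child, &frames)]);
    frames.get()
}

fn main() {
    println!("inline:  {}", frames_inline(10, 10, 100));  // 2020
    println!("spawned: {}", frames_spawned(10, 10, 100)); // 1030
}
```

   In this model the parent frames are traversed only twice in the spawned 
case (once to register interest, once on completion), so the per-yield cost no 
longer depends on what sits above the task. It ignores real scheduling and 
cross-thread handoff costs, which would need measuring separately.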


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

