pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2950129956
One performance aspect I've been looking at is the cost of yielding. There's no magic as far as I can tell. Returning a `Pending` simply leads to a full unwind of the call stack, since the return bubbles all the way up to the tokio executor, followed by a full descent via function calls back to the point where you left off. That suggests it's most interesting to do a cooperative yield from as shallow a point in the call tree as possible rather than from the deepest possible point, so that the round trip to the executor and back stays as short as possible.

Running with `target_partitions = 1` shows that for queries like the deeply nested window/sort query you linked to, @ozankabak, the call stack can get pretty deep. It's essentially proportional to the depth of the plan tree.

To mitigate this, would it make sense for pipeline-breaking operators to run their pipeline-breaking portion in a `SpawnedTask` instead of as a child? I'm thinking of the sort phase of sort, the build phase of join, etc. Regardless of how or where you inject `Pending`, it seems beneficial to keep the call stack that needs to be unwound shallow.

Note that this same argument does suggest it could be more interesting to do the cooperative yield where the looping is happening rather than where the data is produced. The loop is the shallowest point, definitely if you spawn a task since that gets you a new root, while the producer is the deepest point.

Cutting the call stack using spawned tasks may also mitigate the deeply nested query concern regarding checking for yielding at multiple levels. The yield is never going to go beyond the scope of a single task.
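To make the unwind-and-descend cost concrete, here is a minimal std-only sketch. It uses neither tokio nor DataFusion, and `Layer`, `YieldOnce`, `NoopWaker` and `layer_polls_for_one_yield` are hypothetical names made up for illustration. Each `Layer` stands in for one operator level in the plan tree, and the busy loop is a stand-in for the executor. It counts how many `poll` calls on the layers a single `Pending` from the leaf costs.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; the loop below re-polls unconditionally.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Leaf future (the "producer") that returns `Pending` exactly once.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Hypothetical stand-in for one operator level: counts its own polls
/// and forwards to its child.
struct Layer {
    inner: Pin<Box<dyn Future<Output = ()>>>,
    polls: Arc<AtomicUsize>,
}

impl Future for Layer {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        self.polls.fetch_add(1, Ordering::Relaxed);
        self.inner.as_mut().poll(cx)
    }
}

/// Wraps a yield-once leaf in `depth` layers, drives it to completion
/// from the root, and returns the total number of layer polls.
fn layer_polls_for_one_yield(depth: usize) -> usize {
    let polls = Arc::new(AtomicUsize::new(0));
    let mut fut: Pin<Box<dyn Future<Output = ()>>> =
        Box::pin(YieldOnce { yielded: false });
    for _ in 0..depth {
        fut = Box::pin(Layer { inner: fut, polls: polls.clone() });
    }

    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    // Minimal "executor": every re-poll starts again at the root.
    while fut.as_mut().poll(&mut cx).is_pending() {}

    polls.load(Ordering::Relaxed)
}

fn main() {
    for depth in [1, 10, 100] {
        println!("depth {depth}: {} layer polls", layer_polls_for_one_yield(depth));
    }
}
```

In this toy model a single yield costs `2 * depth` layer polls: one pass down to the leaf that ends in `Pending`, and one pass down again on the re-poll. A spawned task would correspond to starting a new root with a smaller `depth`. Real operators do more work per poll, so this only illustrates the shape of the cost and is not a benchmark.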