pepijnve commented on PR #16319:
URL: https://github.com/apache/datafusion/pull/16319#issuecomment-2959103958

   > In what situations would these changes lead to better performance? I.e. 
why is query 28 ~1.10x faster?
   
   The jury is still out on whether it makes sense or not. I can explain my 
theoretical reasoning. Apologies up front if I'm being too pedantic; I'm just 
trying to explain things as clearly as I can. I'm not a database person, so 
this may sound hopelessly naive.
   
   The first observation is that yielding all the way to the runtime in Tokio 
requires stack unwinding. That's the nature of stackless tasks: the deeper your 
call stack is, the more call frames you need to unwind, and the more calls 
you'll need to redo to get back to the yield point. I've been trying to find 
information on whether the Rust compiler does some magic to avoid this, but as 
far as I can tell that's not the case. I did find hints that it optimizes 
nested async function calls, but it will not do so for nested `dyn Stream` 
`poll_next` calls. That makes sense; an AOT compiler will typically not be able 
to optimize across virtual function calls.
   The consequence is that yielding to the runtime can have a non-trivial cost. 
The other PR you're reviewing is an extreme example of that.
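   To make the cost model concrete, here's a minimal sketch of the traversal 
(not DataFusion's actual types; `PollNext`, `Source`, and `Passthrough` are 
illustrative stand-ins for a boxed stream's `poll_next`). Every `Pending` that 
originates at the leaf is observed at the top only after re-entering one frame 
per layer, so the number of frames touched grows as depth × yields:

   ```rust
   use std::cell::Cell;
   use std::rc::Rc;
   use std::task::Poll;

   // Illustrative stand-in for Stream::poll_next: dynamically dispatched,
   // like a boxed stream, so calls can't be flattened across layers.
   trait PollNext {
       fn poll_next(&mut self) -> Poll<Option<i32>>;
   }

   // A leaf that reports Pending `pending_left` times before producing a value.
   struct Source {
       pending_left: u32,
   }

   impl PollNext for Source {
       fn poll_next(&mut self) -> Poll<Option<i32>> {
           if self.pending_left > 0 {
               self.pending_left -= 1;
               Poll::Pending
           } else {
               Poll::Ready(Some(42))
           }
       }
   }

   // A pass-through operator: each poll re-enters this frame and forwards to
   // the child, so one Pending at the leaf costs one call per layer.
   struct Passthrough {
       child: Box<dyn PollNext>,
       frames: Rc<Cell<u64>>,
   }

   impl PollNext for Passthrough {
       fn poll_next(&mut self) -> Poll<Option<i32>> {
           self.frames.set(self.frames.get() + 1);
           self.child.poll_next()
       }
   }

   // Count how many wrapper frames get entered before the value arrives.
   fn frames_entered(depth: u32, pendings: u32) -> u64 {
       let frames = Rc::new(Cell::new(0u64));
       let mut top: Box<dyn PollNext> = Box::new(Source { pending_left: pendings });
       for _ in 0..depth {
           top = Box::new(Passthrough { child: top, frames: frames.clone() });
       }
       // Drive to completion; every Pending unwinds and later re-walks all layers.
       while top.poll_next().is_pending() {}
       frames.get()
   }

   fn main() {
       // (pendings + 1) polls from the top, each traversing `depth` frames.
       println!("{}", frames_entered(12, 100)); // 101 * 12 = 1212
   }
   ```

   Halving the depth halves the re-walked frames for the same number of 
yields, which is the knob the rest of this comment is about.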
   
   The second observation is that DataFusion's volcano model naturally leads 
to fairly deep call stacks. The tree of execution plans results in a tree of 
streams, and a parent stream's `poll_next` will often directly call `poll_next` 
on a child. If you get one of these deep call stacks, yielding from the deepest 
point potentially means unwinding the whole thing and coming back. This is 
already mitigated somewhat when volcano-breaking operators like repartition are 
present in the plan. The deepest call stacks are seen when running with 
`target_partitions = 1`.
   
   Third, pipeline-breaking operators are intrinsically two-phase: first they 
collect, then they emit. There's a gray area of course, but I'm talking about 
the classic ones like a single aggregation. While a pipeline-breaking stream is 
in its collect phase, it can be 100% sure that it will not have any new data 
for its `poll_next` caller until that phase completes. There's really not much 
point in telling the caller `Poll::Pending` over and over again, because that 
leads to busy waiting. But you do still want to yield to the runtime 
periodically so as not to squat on the Tokio executor threads.
   So there are situations with potentially long phases where any yield to the 
caller is redundant (there's no new info), but you still need to yield for 
cooperative scheduling.
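   A sketch of what I mean by the collect phase not surfacing `Pending` 
(again illustrative names, not DataFusion's real operators; a real version 
would still yield to the runtime periodically, e.g. via `tokio::task::yield_now`). 
The aggregate drains its input in an internal loop and only returns `Pending` 
when the input itself is pending, so a caller over an always-ready input never 
sees a redundant `Pending`:

   ```rust
   use std::task::Poll;

   // Illustrative stand-in for Stream::poll_next.
   trait PollNext {
       fn poll_next(&mut self) -> Poll<Option<i64>>;
   }

   // An always-ready input, like a filesystem-backed source often is.
   struct ReadySource {
       remaining: Vec<i64>,
   }

   impl PollNext for ReadySource {
       fn poll_next(&mut self) -> Poll<Option<i64>> {
           Poll::Ready(self.remaining.pop())
       }
   }

   enum Phase {
       Collect,
       Emit,
       Done,
   }

   // A toy pipeline breaker: sums everything, then emits one value.
   struct SumAggregate {
       input: Box<dyn PollNext>,
       phase: Phase,
       acc: i64,
   }

   impl PollNext for SumAggregate {
       fn poll_next(&mut self) -> Poll<Option<i64>> {
           loop {
               match self.phase {
                   Phase::Collect => match self.input.poll_next() {
                       // Fold the value in and keep looping instead of
                       // returning Pending for every batch we consume.
                       Poll::Ready(Some(v)) => self.acc += v,
                       // Input exhausted: the collect phase is over.
                       Poll::Ready(None) => self.phase = Phase::Emit,
                       // Only a genuinely pending input reaches the caller.
                       Poll::Pending => return Poll::Pending,
                   },
                   Phase::Emit => {
                       self.phase = Phase::Done;
                       return Poll::Ready(Some(self.acc));
                   }
                   Phase::Done => return Poll::Ready(None),
               }
           }
       }
   }

   // Drive the aggregate and count how many Pendings the caller observes.
   fn drive(values: Vec<i64>) -> (i64, u32) {
       let mut agg = SumAggregate {
           input: Box::new(ReadySource { remaining: values }),
           phase: Phase::Collect,
           acc: 0,
       };
       let (mut total, mut pendings_seen) = (0, 0);
       loop {
           match agg.poll_next() {
               Poll::Ready(Some(v)) => total += v,
               Poll::Ready(None) => break,
               Poll::Pending => pendings_seen += 1,
           }
       }
       (total, pendings_seen)
   }

   fn main() {
       let (sum, pendings) = drive(vec![1, 2, 3, 4]);
       println!("sum={sum} pendings_seen_by_caller={pendings}"); // sum=10, 0 Pendings
   }
   ```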
   
   Combining all that, I think you're looking for deep query plans with nested 
pipeline breakers. In a different PR someone pointed me to this query: 
https://github.com/apache/datafusion/blob/bf7859e5d9dbdc260674f5333a5cafa9c6e7bc12/datafusion/sqllogictest/test_files/window.slt#L3020
   The nested sorts in the physical plan are something of a worst-case 
scenario. At the deepest sort you have a 12-level-deep call stack that gets 
reactivated on every yield. If instead we chop this into 6 chained spawned 
tasks, you get 6 much shallower call stacks. Of those tasks only one will be 
active at a time; the others will be inert until there's actually something to 
do.
   
   A second factor can be the data source. Filesystem streams tend to be 
always ready; others may not be. The more often the source returns 
`Poll::Pending`, the more you'll see the overhead described above show up, I 
think.
   
   All of this assumes of course that going up and back down the call stack 
has a non-trivial cost. Perhaps it's not significant enough to be measurable. 
I'm still figuring out how to best profile this stuff, so I'm afraid I don't 
have anything more fact-based to give you yet.
   
   Besides performance, there's a small aesthetic aspect to this. I find that 
a stream that responds with
   `pending <wake> ready ready ready none`
   is more elegant than
   `pending pending pending pending ... pending ready ready ready none`
   The first one abstracts what's going on underneath better than the latter. 
But I understand that raw performance trumps aesthetics here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

