pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2922223237
> yes, it sounds better. It will bring less friction and will be more inclusive Think about this a bit more, it's an interesting design question whose responsibility it is to yield. While the cancellation problem is solvable by yielding at the producer side (i.e. FileStream (or a wrapper around it)), it's more clear/obvious why this is being done (I think) at the consumer side where the loop calling `poll_next` is located. Returning a Pending from the producer is also not a guarantee that cancellation will work correctly. As an example, my colleague implemented an execution plan that polls a number of input streams in a round robin fashion. If a stream returns Pending, rather than returning Pending itself right away it will first check if any of the other streams has data available. I don't think there's anything inherently wrong with that, but the effect is that any intermediate Pending coming from a FileStream would just get gobbled up. The downside is indeed that every implementation that tries to completely drain one or more of its input streams (aggregation, the build side of joins, etc) needs to ensure it yields every now and then. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org