pepijnve opened a new issue, #16318: URL: https://github.com/apache/datafusion/issues/16318
### Is your feature request related to a problem or challenge?

When a query pipeline contains one or more pipeline blockers, the query spends an extended period of time in the blocking phase before it starts to emit data. During this time, polling the query stream can either block or return `Pending`. Today quite a few pipeline-blocking operators block. While this is efficient in terms of polling, it prevents the query from being cancelled.

The two proposed solutions to this cancellation problem, PR #16196 and #16301, fix it by ensuring that query pipelines yield often enough to give the caller the opportunity to cancel. A side effect of these changes (for both PR variants) is that the caller now receives `Pending` results much more frequently, effectively forcing it into a busy polling loop. The `Pending`s do not arrive at such a fast rate that this is overly problematic, but it is still rather wasteful.

As an illustration, I added a wrapper stream in the CLI that prints the poll results over time: `P` (`Pending`), `B` (`Ready(Some(Ok(_)))`), and `<EOS>` (`Ready(None)`). Using a larger version of the test CSV file, `main` at the time of writing shows:

```
> select a from annotated_data_infinite2 order by b desc limit 10;
PB<EOS>
```

With the changes from the referenced PRs, you get this instead:

```
> select a from annotated_data_infinite2 order by b desc limit 10;
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPB<EOS>
```

For complex queries with many nested pipeline blockers, the polling call stack can get pretty deep, making the waste larger. The `Pending` results typically originate from the deepest point of the call stack. Every time this happens, the call stack is unwound and then has to be rebuilt on the next poll.
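To make the difference in poll counts concrete, here is a minimal sketch using only the Rust standard library. It is not DataFusion code and not how either PR is implemented. `count_polls`, `YieldEachStep`, and `WakeWhenDone` are hypothetical names invented for this illustration, and it assumes that a cooperative yield amounts to waking the current task and returning `Pending`.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread;
use std::time::Duration;

/// Waker that records the wake-up and unparks the polling thread.
struct ThreadWaker {
    thread: thread::Thread,
    woken: AtomicBool,
}

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.woken.store(true, Ordering::SeqCst);
        self.thread.unpark();
    }
}

/// Hypothetical mini executor: drives `fut` to completion on the current
/// thread and returns its output plus the number of `poll` calls needed.
fn count_polls<F: Future>(fut: F) -> (F::Output, usize) {
    let mut fut = Box::pin(fut);
    let state = Arc::new(ThreadWaker {
        thread: thread::current(),
        woken: AtomicBool::new(false),
    });
    let waker = Waker::from(state.clone());
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return (out, polls);
        }
        // Sleep until woken; the flag guards against spurious unparks.
        while !state.woken.swap(false, Ordering::SeqCst) {
            thread::park();
        }
    }
}

/// Assumed model of the yielding approach: one unit of work per poll,
/// then wake ourselves and return `Pending` so the caller can cancel.
struct YieldEachStep {
    remaining: usize,
}

impl Future for YieldEachStep {
    type Output = &'static str;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.remaining == 0 {
            return Poll::Ready("batch");
        }
        self.remaining -= 1;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Model of the wake-once behaviour: hand the long-running work to
/// another thread, return `Pending` once, wake the caller when done.
/// Simplification: the waker from the first poll is never refreshed.
struct WakeWhenDone {
    started: bool,
    done: Arc<AtomicBool>,
}

impl Future for WakeWhenDone {
    type Output = &'static str;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.done.load(Ordering::SeqCst) {
            return Poll::Ready("batch");
        }
        if !self.started {
            self.started = true;
            let done = self.done.clone();
            let waker = cx.waker().clone();
            thread::spawn(move || {
                // Stand-in for the blocking phase of the query.
                thread::sleep(Duration::from_millis(10));
                done.store(true, Ordering::SeqCst);
                waker.wake();
            });
        }
        Poll::Pending
    }
}
```

With this sketch, `count_polls(YieldEachStep { remaining: 100 })` needs 101 polls, while `count_polls(WakeWhenDone { started: false, done: Arc::new(AtomicBool::new(false)) })` needs 2. Both remain cancellable by dropping the future between polls, although in this sketch the spawned thread is not stopped by the drop.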
In an ideal world we would only return `Pending` once and wake the root caller when there is actually work to be done.

### Describe the solution you'd like

If a query is performing a long-running task internally, it should return `Pending` once and wake the caller only when data is actually available.

### Describe alternatives you've considered

None yet.

### Additional context

_No response_