pepijnve opened a new issue, #16318:
URL: https://github.com/apache/datafusion/issues/16318

   ### Is your feature request related to a problem or challenge?
   
   When a query pipeline contains one or more pipeline blockers, the query will 
spend an extended period of time in the blocking phase of the query before it 
starts to emit data. During this time, polling the query stream can either 
block or return `Pending`.
   
   Today quite a few pipeline-blocking operators will block rather than return 
`Pending`. While efficient in terms of polling, this prevents the query from 
being cancelled. The two proposed solutions to this cancellation problem, PR 
#16196 and PR #16301, fix this by ensuring query pipelines yield often enough 
to give the caller the opportunity to cancel.
   A side effect of these changes (for both PR variants) is that the caller 
now receives `Pending` results much more frequently, effectively forcing the 
caller into a busy polling loop. The `Pending`s do not arrive at such a fast 
rate that this is overly problematic, but it is still rather wasteful.
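
   To make the side effect concrete, here is a minimal, dependency-free sketch 
of a cooperatively yielding operator. It is not the code from either PR: 
`YieldingWork`, its `budget` field, `NoopWake`, and `count_pending` are 
hypothetical names, and the "work" is just a counter. It assumes `budget` is 
greater than zero.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical blocking operator: processes `remaining` units of work and
/// yields back to the caller after every `budget` units (assumed > 0).
struct YieldingWork {
    remaining: usize,
    budget: usize,
}

impl Future for YieldingWork {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        // Stand-in for real work: consume at most `budget` units per poll.
        let step = self.budget.min(self.remaining);
        self.remaining -= step;
        if self.remaining == 0 {
            return Poll::Ready(());
        }
        // Yield: ask to be polled again immediately and return `Pending`,
        // which gives the caller a chance to cancel between polls.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Waker that does nothing; the driver below simply polls in a loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Busy-polls the operator to completion and counts the `Pending` results.
fn count_pending(total: usize, budget: usize) -> usize {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut work = YieldingWork { remaining: total, budget };
    let mut pending = 0;
    while Pin::new(&mut work).poll(&mut cx).is_pending() {
        pending += 1;
    }
    pending
}
```

   With a budget as large as the total work the caller never sees `Pending`; 
the smaller the budget, the more often the caller is sent back around its 
polling loop without receiving any data.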
   
   As an illustration of this, I added a wrapper stream in the CLI that prints 
out the poll results over time: `P` (`Pending`), `B` (`Ready(Some(Ok(_)))`), 
and `<EOS>` (`Ready(None)`). Using a larger version of the test CSV file, 
`main` at the time of writing shows the following.
   
   ```
   > select a from annotated_data_infinite2 order by b desc limit 10;
   PB<EOS>
   ```
   
   With the changes from the referenced PRs you get this instead.
   
   ```
   > select a from annotated_data_infinite2 order by b desc limit 10;
   
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPB<EOS>
   ```
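
   For reference, a wrapper of this kind can be sketched roughly as follows. 
This is not the actual CLI change. To avoid external crates it defines its own 
minimal `PollStream` trait as a stand-in for `futures::Stream`, collects the 
markers into a `String` instead of printing them, and treats any 
`Ready(Some(_))` as `B`. `TraceStream`, `Blocker`, `NoopWake`, and 
`trace_polls` are all hypothetical names.

```rust
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Minimal stand-in for `futures::Stream`, so the sketch needs no crates.
trait PollStream {
    type Item;
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>>;
}

/// Wraps a stream and records one marker per poll result.
struct TraceStream<S> {
    inner: S,
    trace: String,
}

impl<S: PollStream + Unpin> PollStream for TraceStream<S> {
    type Item = S::Item;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let this = &mut *self;
        let result = Pin::new(&mut this.inner).poll_next(cx);
        this.trace.push_str(match &result {
            Poll::Pending => "P",
            Poll::Ready(Some(_)) => "B",
            Poll::Ready(None) => "<EOS>",
        });
        result
    }
}

/// Hypothetical source: yields `pending` times, emits one item, then ends.
struct Blocker {
    pending: usize,
    emitted: bool,
}

impl PollStream for Blocker {
    type Item = u32;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<u32>> {
        if self.pending > 0 {
            self.pending -= 1;
            cx.waker().wake_by_ref(); // ask to be polled again right away
            return Poll::Pending;
        }
        if !self.emitted {
            self.emitted = true;
            return Poll::Ready(Some(42));
        }
        Poll::Ready(None)
    }
}

/// Waker that does nothing; the driver below polls in a plain loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Busy-polls the wrapped stream until end of stream and returns the trace.
fn trace_polls(pending: usize) -> String {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut stream = TraceStream {
        inner: Blocker { pending, emitted: false },
        trace: String::new(),
    };
    while !matches!(Pin::new(&mut stream).poll_next(&mut cx), Poll::Ready(None)) {}
    stream.trace
}
```

   In this sketch `trace_polls(1)` reproduces the shape of the first trace, 
and larger `pending` values produce the long runs of `P` seen in the second.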
   
   For complex queries with many nested pipeline blockers, the polling call 
stack can get pretty deep, making the waste larger. The `Pending` results 
typically originate from the deepest point of the call stack. Every time this 
happens, the call stack is unwound and then has to be rebuilt on the next poll.
   
   In an ideal world we would only return `Pending` once and wake the root 
caller when there is actually work to be done.
   
   ### Describe the solution you'd like
   
   If a query is performing a long-running task internally, it should return 
`Pending` once and wake the caller only when data is actually available.
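
   One way to picture the desired behaviour is the sketch below. It is only an 
illustration under assumptions, not a proposal for how DataFusion should 
implement this: the long-running phase runs on a plain `std::thread`, and the 
names `Offloaded`, `Shared`, `ThreadWaker`, and `block_on_counting` are 
hypothetical. The future stores the caller's `Waker` when it returns 
`Pending`, and the worker calls `wake` only after the result is in place.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread;

/// State shared between the polling side and the worker thread.
struct Shared {
    result: Option<u64>,
    waker: Option<Waker>,
}

/// Hypothetical future whose long-running phase runs on a separate thread.
struct Offloaded {
    shared: Arc<Mutex<Shared>>,
}

impl Offloaded {
    fn spawn(work: impl FnOnce() -> u64 + Send + 'static) -> Self {
        let shared = Arc::new(Mutex::new(Shared { result: None, waker: None }));
        let worker = Arc::clone(&shared);
        thread::spawn(move || {
            let value = work();
            // Publish the result, then release the lock before waking.
            let waker = {
                let mut guard = worker.lock().unwrap();
                guard.result = Some(value);
                guard.waker.take()
            };
            // Wake the caller only now that data is actually available.
            if let Some(w) = waker {
                w.wake();
            }
        });
        Offloaded { shared }
    }
}

impl Future for Offloaded {
    type Output = u64;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        let mut guard = self.shared.lock().unwrap();
        match guard.result.take() {
            Some(value) => Poll::Ready(value),
            None => {
                // Remember who to wake; no self-wake, so no busy polling.
                guard.waker = Some(cx.waker().clone());
                Poll::Pending
            }
        }
    }
}

/// Waker that unparks the driving thread and records that a wake happened.
struct ThreadWaker {
    thread: thread::Thread,
    woken: AtomicBool,
}

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.woken.store(true, Ordering::Release);
        self.thread.unpark();
    }
}

/// Tiny executor: re-polls only after a real wake-up. Returns the output
/// together with the number of `Pending` results it observed.
fn block_on_counting<F: Future + Unpin>(mut fut: F) -> (F::Output, usize) {
    let wake = Arc::new(ThreadWaker {
        thread: thread::current(),
        woken: AtomicBool::new(false),
    });
    let waker = Waker::from(Arc::clone(&wake));
    let mut cx = Context::from_waker(&waker);
    let mut pending = 0;
    loop {
        match Pin::new(&mut fut).poll(&mut cx) {
            Poll::Ready(value) => return (value, pending),
            Poll::Pending => {
                pending += 1;
                // Guard against spurious unparks: sleep until `wake` ran.
                while !wake.woken.swap(false, Ordering::Acquire) {
                    thread::park();
                }
            }
        }
    }
}
```

   Driving `Offloaded::spawn(...)` with `block_on_counting` observes at most 
one `Pending`, however long the work takes, because the driver is only woken 
after the result has been stored.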
   
   ### Describe alternatives you've considered
   
   None yet.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

