pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2945139509
Changing hats to DataFusion user mode, where I need to make sure that the end users of our system can press 'cancel' at any time and have that work as expected. From that perspective, here's a possibly useful test case (and maybe an illustration of a more general problem): suppose you have a query like `select sum(size) as sum from t group by name order by sum` that produces a large number of distinct groups. The query plan for this today is:

```
SortPreservingMergeExec: [sum@0 ASC NULLS LAST]
  SortExec: expr=[sum@0 ASC NULLS LAST], preserve_partitioning=[true]
    ProjectionExec: expr=[sum(t.size)@1 as sum]
      AggregateExec: mode=FinalPartitioned, gby=[name@0 as name], aggr=[sum(t.size)]
        CoalesceBatchesExec: target_batch_size=8192
          RepartitionExec: partitioning=Hash([name@0], 10), input_partitions=10
            AggregateExec: mode=Partial, gby=[name@0 as name], aggr=[sum(t.size)]
              RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=1
                DataSourceExec: file_groups={1 group: [[<file name>]]}, projection=[name, size], file_type=...
```

If I'm reading the code correctly, once the `FinalPartitioned` aggregation has drained the original input it may switch over to reading back spill files. At that point the original input (and the yield exec wrapping it) are taken out of the picture. Unless I'm mistaken, once the query hits that phase it may not be interruptible again unless some yield guarantee is injected again.

I don't have a good idea for how to write a practical test case for this though. You would have to drive a sufficiently large query all the way to this point to be able to observe the behavior.

I wonder if this illustrates that only analyzing the static picture of the query at planning time is insufficient, because it does not (and probably cannot) take the dynamic behavior of the query into account. The actual tree of streams, and the points where you might need yield wrappers, can change as the query is executing.
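To make the concern above concrete, here is a minimal, std-only Rust sketch of the situation. Everything in it is hypothetical and is not DataFusion's API: `PollSource` stands in for a record batch stream, `YieldEvery` for a yield wrapper injected at planning time, and `TwoPhase` for an operator that replaces its input with a spill reader at run time. As a simplification, the operator passes rows through instead of aggregating them. `drive` counts how often the operator returns `Pending`, which is the only moment a cancellation could take effect.

```rust
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for a record batch stream (not DataFusion's trait).
trait PollSource {
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<u64>>;
}

/// A source that never has to wait, like an in-memory input or a fast spill reader.
struct AlwaysReady {
    remaining: u64,
}

impl PollSource for AlwaysReady {
    fn poll_next(&mut self, _cx: &mut Context<'_>) -> Poll<Option<u64>> {
        if self.remaining == 0 {
            return Poll::Ready(None);
        }
        self.remaining -= 1;
        Poll::Ready(Some(self.remaining))
    }
}

/// Assumed yield wrapper: returns `Pending` after `budget` consecutive ready items.
struct YieldEvery<S> {
    inner: S,
    budget: u32,
    used: u32,
}

impl<S: PollSource> PollSource for YieldEvery<S> {
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<u64>> {
        if self.used >= self.budget {
            self.used = 0;
            cx.waker().wake_by_ref(); // ask to be polled again right away
            return Poll::Pending;
        }
        let item = self.inner.poll_next(cx);
        if let Poll::Ready(Some(_)) = item {
            self.used += 1;
        }
        item
    }
}

/// Operator that drains `input`, then switches to `spill`, a stream that only
/// comes into play at run time and that the planner never saw.
struct TwoPhase {
    input: Option<Box<dyn PollSource>>,
    spill: Box<dyn PollSource>,
}

impl PollSource for TwoPhase {
    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<u64>> {
        if let Some(input) = self.input.as_mut() {
            match input.poll_next(cx) {
                Poll::Ready(None) => self.input = None, // phase 1 done, drop the wrapped input
                other => return other,
            }
        }
        self.spill.poll_next(cx)
    }
}

const BUDGET: u32 = 10;

/// Builds the operator; `wrap_spill` decides whether the spill reader also yields.
fn two_phase(input_rows: u64, spill_rows: u64, wrap_spill: bool) -> TwoPhase {
    let input = YieldEvery { inner: AlwaysReady { remaining: input_rows }, budget: BUDGET, used: 0 };
    let spill: Box<dyn PollSource> = if wrap_spill {
        Box::new(YieldEvery { inner: AlwaysReady { remaining: spill_rows }, budget: BUDGET, used: 0 })
    } else {
        Box::new(AlwaysReady { remaining: spill_rows })
    };
    TwoPhase { input: Some(Box::new(input)), spill }
}

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls a source to completion and returns (items seen, times it returned `Pending`).
fn drive(mut source: impl PollSource) -> (u64, u64) {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let (mut items, mut yields) = (0, 0);
    loop {
        match source.poll_next(&mut cx) {
            Poll::Ready(Some(_)) => items += 1,
            Poll::Ready(None) => return (items, yields),
            Poll::Pending => yields += 1,
        }
    }
}
```

With 100 input rows and 100 spill rows, `drive(two_phase(100, 100, false))` sees yield points only while the wrapped input is being drained. Once the operator switches to the unwrapped spill reader, no further `Pending` is returned. Passing `true` re-injects the wrapper for the second phase and restores the yield points, which is the kind of dynamic re-wrapping a purely static plan analysis would miss.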