pepijnve commented on PR #16196:
URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2945139509

   Switching hats to DataFusion user mode, where I need to make sure that the 
end users of our system can press 'cancel' at any time and have it work as expected.
   
   From that perspective, here's a possibly useful test case (and maybe an 
illustration of a more general problem): suppose you have a query like 
`select sum(size) as sum from t group by name order by sum` that produces a 
large number of distinct groups. The query plan for this today is:
   
   ```
   SortPreservingMergeExec: [sum@0 ASC NULLS LAST]
     SortExec: expr=[sum@0 ASC NULLS LAST], preserve_partitioning=[true]
       ProjectionExec: expr=[sum(t.size)@1 as sum]
         AggregateExec: mode=FinalPartitioned, gby=[name@0 as name], aggr=[sum(t.size)]
           CoalesceBatchesExec: target_batch_size=8192
             RepartitionExec: partitioning=Hash([name@0], 10), input_partitions=10
               AggregateExec: mode=Partial, gby=[name@0 as name], aggr=[sum(t.size)]
                 RepartitionExec: partitioning=RoundRobinBatch(10), input_partitions=1
                   DataSourceExec: file_groups={1 group: [[<file name>]]}, projection=[name, size], file_type=...
   ```
   
   If I'm reading the code correctly, once the `FinalPartitioned` aggregation 
has drained the original input it may switch over to reading back spill files. 
At that point the original input (and the yield exec wrapping it) are taken out 
of the picture. Unless I'm mistaken, once the query hits that phase it may no 
longer be interruptible unless some yield guarantee is injected again.
   I don't have a good idea for how to write a practical test case for this 
though. You would have to drive a sufficiently large query all the way to this 
point to be able to observe the behavior.
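
To make the concern concrete, here is a toy model of it using only the Rust standard library. It is a sketch of the pattern, not DataFusion code: `ToyStream`, `Source`, `YieldEvery`, `ToyAggregate` and `drive` are all hypothetical names, and the "spill" is just a counter. The assumption being illustrated is that the operator only returns `Poll::Pending` when its input does, so once it stops polling the input it stops yielding too.

```rust
use std::task::Poll;

/// Hypothetical stand-in for a stream of rows (no waker, no async runtime).
trait ToyStream {
    fn poll_next(&mut self) -> Poll<Option<u64>>;
}

/// A source that is always ready; it never returns Pending by itself.
struct Source {
    remaining: u64,
}

impl ToyStream for Source {
    fn poll_next(&mut self) -> Poll<Option<u64>> {
        if self.remaining == 0 {
            Poll::Ready(None)
        } else {
            self.remaining -= 1;
            Poll::Ready(Some(1))
        }
    }
}

/// Hypothetical yield wrapper: forces one Pending after every `budget` inner polls.
struct YieldEvery<S> {
    inner: S,
    budget: u32,
    used: u32,
}

impl<S: ToyStream> ToyStream for YieldEvery<S> {
    fn poll_next(&mut self) -> Poll<Option<u64>> {
        if self.used >= self.budget {
            self.used = 0;
            return Poll::Pending;
        }
        self.used += 1;
        self.inner.poll_next()
    }
}

/// Toy two-phase operator: first drains `input`, then replays what it "spilled".
struct ToyAggregate<S> {
    input: S,
    input_done: bool,
    spilled: u64,
}

impl<S: ToyStream> ToyAggregate<S> {
    /// Returns the poll result and the number of rows handled in this one call.
    fn poll(&mut self) -> (Poll<u64>, u64) {
        let mut work: u64 = 0;
        // Phase 1: cooperative only because the wrapped input yields.
        while !self.input_done {
            match self.input.poll_next() {
                Poll::Pending => return (Poll::Pending, work),
                Poll::Ready(Some(v)) => {
                    self.spilled += v;
                    work += 1;
                }
                Poll::Ready(None) => self.input_done = true,
            }
        }
        // Phase 2: the input (and its yield wrapper) is no longer polled,
        // so nothing in this loop ever returns Pending.
        let mut total = 0;
        for _ in 0..self.spilled {
            total += 1;
            work += 1;
        }
        (Poll::Ready(total), work)
    }
}

/// Polls to completion; returns (number of Pending results, max rows handled in one poll).
fn drive(rows: u64, budget: u32) -> (u64, u64) {
    let input = YieldEvery { inner: Source { remaining: rows }, budget, used: 0 };
    let mut agg = ToyAggregate { input, input_done: false, spilled: 0 };
    let mut pendings: u64 = 0;
    let mut max_work: u64 = 0;
    loop {
        let (result, work) = agg.poll();
        max_work = max_work.max(work);
        match result {
            Poll::Pending => pendings += 1,
            Poll::Ready(_) => return (pendings, max_work),
        }
    }
}

fn main() {
    let (pendings, max_work) = drive(1000, 100);
    println!("pendings = {pendings}, max rows in a single poll = {max_work}");
}
```

With 1000 rows and a budget of 100, the input phase yields 10 times and never handles more than 100 rows per poll, while the replay phase handles all 1000 rows inside a single poll. In this model, a cancel request arriving during the replay phase would not be noticed until that poll returns.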
   
   I wonder if this illustrates that only analyzing the static picture of the 
query at planning time is insufficient because it does not (and probably 
cannot) take the dynamic behavior of the query into account. The actual tree of 
streams and the points where you might need yield wrappers can change as the 
query is executing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

