alamb commented on issue #14231:
URL: https://github.com/apache/datafusion/issues/14231#issuecomment-2610159122

   The description of the issue on this ticket makes sense to me
   
   > AggregateExec sets its required_input_ordering based solely on its 
group-by expressions without checking any configuration like 
prefer_existing_sort. This effectively makes the ordering a hard requirement.
   
   Yes I agree this is non ideal
   
   > While AggregateExec benefits from receiving ordered input, adding a 
SortExec in this context can incur a significant performance cost, negating any 
benefits of preserving the order.
   
   100% agree
   
   > A straightforward approach could involve AggregateExec respecting the 
prefer_existing_sort configuration before adding ordering requirements. 
However, this introduces challenges:
   
   Agree. Looking at the code, it seems to me like `input_order_mode` and 
`sort_requirement` are inter-related concepts but that is not currently 
reflected in the code
   
   
https://github.com/apache/datafusion/blob/5edb27678d1b97e74bb83d185166980c901a9b06/datafusion/physical-plan/src/aggregates/mod.rs#L491-L500
   
   What about the following:
   1. If the input is **already** sorted by the appropriate group keys, set 
`input_order_mode` appropriately *and* set the required input sort order (to 
match the existing sort order)
   2. If the input is *not* already sorted on the appropriate group keys, then 
set `input_order_mode` to `InputOrderMode::Linear` and set 
required_sort_expression to None (aka don't resort the data)
   
   This would mean:
   1. The optimizer can take advantage of *existing* sort orders
   2. Will not resort input if it isn't already sorted
   
   The downside is that the optimizer would not resort data even when it might 
be beneficial (e.g. if the data could be sorted cheaply with a prefix and then 
use a smaller hash table).  However making that optimization work requires a 
choice between partial sort + ordered group by is better than a hash group by. 
   
   Our optimizer today has no cost model / framework to evaluate such tradeoffs 
(and there are known challenges with cost estimate based optimizers anyways)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to