alamb commented on issue #14231: URL: https://github.com/apache/datafusion/issues/14231#issuecomment-2610159122
The description of the issue on this ticket makes sense to me > AggregateExec sets its required_input_ordering based solely on its group-by expressions without checking any configuration like prefer_existing_sort. This effectively makes the ordering a hard requirement. Yes I agree this is non ideal > While AggregateExec benefits from receiving ordered input, adding a SortExec in this context can incur a significant performance cost, negating any benefits of preserving the order. 100% agree > A straightforward approach could involve AggregateExec respecting the prefer_existing_sort configuration before adding ordering requirements. However, this introduces challenges: Agree. Looking at the code, it seems to me like `input_order_mode` and `sort_requirement` are inter-related concepts but that is not currently reflected in the code https://github.com/apache/datafusion/blob/5edb27678d1b97e74bb83d185166980c901a9b06/datafusion/physical-plan/src/aggregates/mod.rs#L491-L500 What about the following: 1. If the input is **already** sorted by the appropriate group keys, set `input_order_mode` appropriately *and* set the required input sort order (to match the existing sort order) 2. If the input is *not* already sorted on the appropriate group keys, then set `input_order_mode` to `InputOrderMode::Linear` and set required_sort_expression to None (aka don't resort the data) This would mean: 1. The optimizer can take advantage of *existing* sort orders 2. Will not resort input if it isn't already sorted The downside is that the optimizer would not resort data even when it might be beneficial (e.g. if the data could be sorted cheaply with a prefix and then use a smaller hash table). However making that optimization work requires a choice between partial sort + ordered group by is better than a hash group by. Our optimizer today has no cost model / framework to evaluate such tradeoffs (and there are known challenges with cost estimate based optimizers anyways) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org