asolimando commented on issue #18628: URL: https://github.com/apache/datafusion/issues/18628#issuecomment-3526607780
NDV (number of distinct values) is also commonly used for aggregation pushdown, because it helps estimate how the number of groups relates to the number of tuples. The extremes are grouping over a primary key (`#tuples = #groups`) and grouping over a single-valued column (`#groups = 1`). Roughly, `#groups(c) = NDV(c)`, where `c` is a column; a multi-column group-by needs some form of interpolation, starting from the individual NDVs of the columns involved.

https://github.com/apache/datafusion/pull/11627 introduced a runtime optimization to this effect: it skips the partial aggregation when it doesn't see enough reduction, since otherwise you would end up doing the same work twice. It would indeed be nice to have the planner make use of NDV to improve the plan.

(This seems more an enhancement than a bug: it's a missed opportunity, but it doesn't affect correctness.)
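To make the relationship between NDV and group count concrete, here is a minimal sketch of how a planner could estimate the number of groups and decide whether a partial aggregation is likely to pay off. It is not DataFusion code: the function names, the independence assumption behind multiplying per-column NDVs, and the `max_ratio` threshold of 0.8 are all illustrative assumptions.

```python
from math import prod


def estimate_groups(ndvs, row_count):
    """Estimate the number of groups for a GROUP BY (hypothetical helper).

    ndvs: per-column NDV estimates for the grouping columns.
    row_count: estimated number of input tuples.

    Assumes the columns are independent, so the product of NDVs is an
    upper bound; the result is capped at row_count because there can
    never be more groups than tuples.
    """
    if row_count == 0:
        return 0
    # A column cannot have more distinct values than there are rows.
    capped = [min(n, row_count) for n in ndvs]
    # prod([]) == 1, which matches a global aggregate (one group).
    return min(prod(capped), row_count)


def partial_aggregation_pays_off(ndvs, row_count, max_ratio=0.8):
    """Return True when the estimated groups/tuples ratio is below max_ratio.

    max_ratio is an assumed tuning knob, not a documented planner setting.
    """
    if row_count == 0:
        return False
    return estimate_groups(ndvs, row_count) / row_count < max_ratio


if __name__ == "__main__":
    rows = 1000
    print(estimate_groups([1000], rows))    # grouping over a primary key
    print(estimate_groups([1], rows))       # grouping over a single-valued column
    print(estimate_groups([10, 20], rows))  # two columns, assumed independent
    print(partial_aggregation_pays_off([1000], rows))
```

Because the product of NDVs is only an upper bound, correlated columns would produce fewer groups than this estimate, so a real implementation would likely want a more careful interpolation.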
