ch-sc opened a new issue, #14237: URL: https://github.com/apache/datafusion/issues/14237
### Is your feature request related to a problem or challenge?

Today, statistics of filter predicates are based on interval arithmetic invoked by `PhysicalExpr::evaluate_bounds()`. This works fine for numerical data. However, many expressions and data types are not supported by interval arithmetic, so proper selectivity estimation is not available for them.

I noticed there have been many discussions regarding statistics in the project lately. Work by folks from Synnada and others is currently in progress. If you feel this issue is already addressed, please let me know; I'd be happy to help with open tasks then.

### Describe the solution you'd like

1. Add support for some missing pieces in interval arithmetic, e.g., temporal data.
2. Add `PhysicalExpr::evaluate_statistics()` to calculate expression-level statistics. This was already proposed by others. My suggestion is the following signature:

```rust
fn evaluate_statistics(&self, input_statistics: &Statistics) -> Result<ExpressionStatistics>
```

I think this should return a new expression-level statistics struct, which could look like this:

```rust
pub struct ExpressionStatistics {
    /// Number of null values
    pub null_count: Precision<usize>,
    /// Number of output rows (cardinality)
    pub num_rows: Precision<ScalarValue>,
    /// Total number of input rows
    pub total_rows: Precision<ScalarValue>,
    /// Number of distinct values
    pub distinct_count: Precision<usize>,
}
```

With `evaluate_statistics()` we add support for filter expressions such as string comparisons, `InList`, `LikeExpr`, or binary operators like `IS_DISTINCT_FROM` and `IS_NOT_DISTINCT_FROM`. This could be an iterative effort where we start with a few expression types and take it from there. Selectivity calculation is then trivial: `num_rows / total_rows`. We can utilise `evaluate_bounds()` for supported expressions.
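To make the `num_rows / total_rows` idea concrete, here is a minimal, self-contained sketch. This is not DataFusion code: `Precision` here is a simplified stand-in for the real enum, and the row counts are plain `usize` rather than `Precision<ScalarValue>`.

```rust
// Simplified stand-in for DataFusion's `Precision` enum.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

// Trimmed-down version of the proposed `ExpressionStatistics`,
// keeping only the two fields needed for selectivity.
struct ExpressionStatistics {
    num_rows: Precision<usize>,
    total_rows: Precision<usize>,
}

impl ExpressionStatistics {
    /// Selectivity = num_rows / total_rows, when both counts are known.
    fn selectivity(&self) -> Option<f64> {
        let value = |p: &Precision<usize>| match p {
            Precision::Exact(v) | Precision::Inexact(v) => Some(*v),
            Precision::Absent => None,
        };
        match (value(&self.num_rows), value(&self.total_rows)) {
            (Some(n), Some(t)) if t > 0 => Some(n as f64 / t as f64),
            _ => None,
        }
    }
}

fn main() {
    let stats = ExpressionStatistics {
        num_rows: Precision::Inexact(250),
        total_rows: Precision::Exact(1000),
    };
    // 250 of 1000 rows pass the predicate -> selectivity 0.25.
    assert_eq!(stats.selectivity(), Some(0.25));
    println!("selectivity = {:?}", stats.selectivity());
}
```

Returning `Option` (or `Precision::Absent`) instead of a bare `f64` keeps the "we simply don't know" case explicit, which matters when an expression type has no statistics support yet.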
For example, from `2 * A > B` we get its target boundaries and calculate the selectivity as is done in `analysis::calculate_selectivity()`:

```rust
fn calculate_selectivity(
    target_boundaries: &[ExprBoundaries],
    initial_boundaries: &[ExprBoundaries],
) -> f64 {
    // Since the intervals are assumed uniform and the values
    // are not correlated, we need to multiply the selectivities
    // of multiple columns to get the overall selectivity.
    initial_boundaries
        .iter()
        .zip(target_boundaries.iter())
        .fold(1.0, |acc, (initial, target)| {
            acc * cardinality_ratio(&initial.interval, &target.interval)
        })
}
```

This naive approach assumes uniformly distributed data. Heuristics, such as various distribution types, could be added to `ExpressionStatistics` too. For the sake of simplicity I will not address this here.

Happy to receive some feedback 🙂

### Describe alternatives you've considered

_No response_

### Additional context

Short disclaimer: I work for Coralogix, like some other DataFusion contributors.

cc: @thinkharderdev
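As a closing illustration, here is a self-contained sketch of the cardinality-ratio math above. It uses a hypothetical numeric `Interval` in place of DataFusion's interval type, and a simplified `cardinality_ratio` that just compares interval widths; the real helper is more careful about data types and open/closed bounds.

```rust
// Hypothetical numeric interval, standing in for DataFusion's `Interval`.
#[derive(Clone, Copy)]
struct Interval {
    lower: f64,
    upper: f64,
}

// Fraction of the initial interval's width kept by the target interval
// (simplified stand-in for the upstream `cardinality_ratio`).
fn cardinality_ratio(initial: &Interval, target: &Interval) -> f64 {
    let initial_width = initial.upper - initial.lower;
    let target_width = target.upper - target.lower;
    if initial_width <= 0.0 {
        1.0
    } else {
        (target_width / initial_width).clamp(0.0, 1.0)
    }
}

/// Multiply per-column ratios, assuming uniform, uncorrelated columns,
/// mirroring the fold in `calculate_selectivity` above.
fn calculate_selectivity(target: &[Interval], initial: &[Interval]) -> f64 {
    initial
        .iter()
        .zip(target.iter())
        .fold(1.0, |acc, (i, t)| acc * cardinality_ratio(i, t))
}

fn main() {
    // Column A in [0, 100] shrinks to [0, 25] -> ratio 0.25.
    // Column B in [0, 10] shrinks to [0, 5]  -> ratio 0.5.
    // Combined selectivity: 0.25 * 0.5 = 0.125.
    let initial = [
        Interval { lower: 0.0, upper: 100.0 },
        Interval { lower: 0.0, upper: 10.0 },
    ];
    let target = [
        Interval { lower: 0.0, upper: 25.0 },
        Interval { lower: 0.0, upper: 5.0 },
    ];
    let sel = calculate_selectivity(&target, &initial);
    assert!((sel - 0.125).abs() < 1e-9);
    println!("selectivity = {}", sel);
}
```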