ch-sc opened a new issue, #14237:
URL: https://github.com/apache/datafusion/issues/14237

   ### Is your feature request related to a problem or challenge?
   
   Today, statistics of filter predicates are based on interval arithmetic invoked by `PhysicalExpr::evaluate_bounds()`. This works fine for numerical data. However, many expressions and data types are not supported by interval arithmetic, so proper selectivity estimation is not available for such expressions.
   
   I noticed there have been many discussions about statistics in the project lately. Work by folks from Synnada and others is currently in progress. If you feel this issue is already addressed, please let me know; in that case I'd like to offer help with open tasks instead.
   
   
   ### Describe the solution you'd like
   
   1. Add support for what is still missing in interval arithmetic, e.g., temporal data.
   
   2. Add `PhysicalExpr::evaluate_statistics()` to calculate expression-level statistics. This has already been proposed by others.
   
   My suggestion is the following signature:
   ```rust
   fn evaluate_statistics(&self, input_statistics: &Statistics) -> Result<ExpressionStatistics>
   ```
   
   I think this should return a new expression-level statistics struct, which could look like this:
   
   ```rust
   pub struct ExpressionStatistics {
       /// Number of null values
       pub null_count: Precision<usize>,
       /// Number of output rows (cardinality)
       pub num_rows: Precision<ScalarValue>,
       /// Total number of input rows
       pub total_rows: Precision<ScalarValue>,
       /// Number of distinct values
       pub distinct_count: Precision<usize>,
   }
   ```
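
   To make the shape of this API more concrete, here is a purely illustrative sketch (not existing DataFusion code) of how a bare column reference could populate `ExpressionStatistics` from the input `Statistics`. The `Statistics`, `ColumnStatistics` and `Precision` types are the existing ones from `datafusion_common`; the mapping and helper below are assumptions:

   ```rust
   use datafusion_common::stats::Precision;
   use datafusion_common::{ScalarValue, Statistics};

   /// Hypothetical example: a bare column reference filters nothing, so its
   /// output cardinality equals the input cardinality, and null/distinct
   /// counts can be copied straight from the column statistics.
   fn column_expression_statistics(
       col_index: usize,
       input: &Statistics,
   ) -> ExpressionStatistics {
       let col = &input.column_statistics[col_index];
       let rows = to_scalar_rows(&input.num_rows);
       ExpressionStatistics {
           null_count: col.null_count.clone(),
           num_rows: rows.clone(),
           total_rows: rows,
           distinct_count: col.distinct_count.clone(),
       }
   }

   /// Helper (assumption): lift a `Precision<usize>` row count into the
   /// `Precision<ScalarValue>` representation used by `ExpressionStatistics`.
   fn to_scalar_rows(rows: &Precision<usize>) -> Precision<ScalarValue> {
       match rows {
           Precision::Exact(n) => Precision::Exact(ScalarValue::UInt64(Some(*n as u64))),
           Precision::Inexact(n) => Precision::Inexact(ScalarValue::UInt64(Some(*n as u64))),
           Precision::Absent => Precision::Absent,
       }
   }
   ```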
   
   With `evaluate_statistics()` we can add support for filter expressions such as string comparisons, `InList`, `LikeExpr`, or binary operators like `IS_DISTINCT_FROM` and `IS_NOT_DISTINCT_FROM`. This can be an iterative effort where we start with a few expression types and take it from there; a rough sketch of one such estimate is shown below.
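
   As an example of an expression that interval arithmetic cannot handle, here is a hedged sketch of an `InList`-style cardinality estimate based on the column's distinct count, assuming uniformly distributed values. The function and its parameters are made up for illustration:

   ```rust
   use datafusion_common::stats::Precision;
   use datafusion_common::ScalarValue;

   /// Hypothetical sketch: estimate how many rows `col IN (v1, v2, ...)` keeps.
   /// With `ndv` distinct values spread uniformly over `rows` input rows, each
   /// listed value is expected to match roughly `rows / ndv` rows.
   fn in_list_num_rows(
       list_len: usize,
       input_rows: &Precision<usize>,
       distinct: &Precision<usize>,
   ) -> Precision<ScalarValue> {
       match (input_rows, distinct) {
           (
               Precision::Exact(rows) | Precision::Inexact(rows),
               Precision::Exact(ndv) | Precision::Inexact(ndv),
           ) if *ndv > 0 => {
               let per_value = *rows as f64 / *ndv as f64;
               let estimate = per_value * list_len.min(*ndv) as f64;
               // The result is always an estimate, hence `Inexact`.
               Precision::Inexact(ScalarValue::Float64(Some(estimate)))
           }
           _ => Precision::Absent,
       }
   }
   ```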
   
   Selectivity calculation is trivial: `num_rows/total_rows`.
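
   For example, a small helper (names are illustrative, not an existing API) could turn the proposed struct into the selectivity factor used by the planner, falling back to a default when statistics are absent:

   ```rust
   use datafusion_common::stats::Precision;
   use datafusion_common::ScalarValue;

   /// Hypothetical helper: selectivity = num_rows / total_rows, with a
   /// made-up default when either side is unknown.
   fn selectivity(stats: &ExpressionStatistics) -> f64 {
       const DEFAULT_SELECTIVITY: f64 = 0.2; // assumption, not a DataFusion constant
       match (as_f64(&stats.num_rows), as_f64(&stats.total_rows)) {
           (Some(rows), Some(total)) if total > 0.0 => (rows / total).clamp(0.0, 1.0),
           _ => DEFAULT_SELECTIVITY,
       }
   }

   /// Helper (assumption): extract a numeric row count from `Precision<ScalarValue>`,
   /// treating exact and inexact values alike and ignoring `Absent`.
   fn as_f64(p: &Precision<ScalarValue>) -> Option<f64> {
       match p {
           Precision::Exact(v) | Precision::Inexact(v) => match v {
               ScalarValue::UInt64(Some(n)) => Some(*n as f64),
               ScalarValue::Float64(Some(f)) => Some(*f),
               _ => None,
           },
           Precision::Absent => None,
       }
   }
   ```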
   
   We can utilise `evaluate_bounds()` for supported expressions. For example, from `2*A > B` we get its target boundaries and calculate the selectivity as is already done in `analysis::calculate_selectivity()`:
   
   ```rust
   fn calculate_selectivity(
       target_boundaries: &[ExprBoundaries],
       initial_boundaries: &[ExprBoundaries],
   ) -> f64 {
       // Since the intervals are assumed uniform and the values
       // are not correlated, we need to multiply the selectivities
       // of multiple columns to get the overall selectivity.
       initial_boundaries
           .iter()
           .zip(target_boundaries.iter())
           .fold(1.0, |acc, (initial, target)| {
               acc * cardinality_ratio(&initial.interval, &target.interval)
           })
   }
   ```
   This naive approach assumes uniformly distributed data. Heuristics, such as different distribution types, could also be added to `ExpressionStatistics`. For the sake of simplicity I will not address that here.
   
   
   Happy to receive some feedback 🙂 
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   Short disclaimer: I work for Coralogix, like some other DataFusion contributors.
   
   cc: @thinkharderdev

