kosiew commented on code in PR #20789:
URL: https://github.com/apache/datafusion/pull/20789#discussion_r2964834917


##########
datafusion/physical-expr/src/analysis.rs:
##########
@@ -277,8 +278,18 @@ fn calculate_selectivity(
     let mut acc: f64 = 1.0;
     for (initial, target) in initial_boundaries.iter().zip(target_boundaries) {
         match (initial.interval.as_ref(), target.interval.as_ref()) {
-            (Some(initial), Some(target)) => {
-                acc *= cardinality_ratio(initial, target);
+            (Some(initial_interval), Some(target_interval)) => {
+                // If it is equality predicate, calculate selectivity as `1 / distinct_count`

Review Comment:
   I think this new `1 / distinct_count` branch is a little too broad as 
written. Right now it fires whenever the pruned interval collapses to a single 
value, but that is not quite the same thing as proving we have an equality 
filter.
   
   For example, if the incoming stats already describe a singleton interval, or 
if a conjunction of inequalities narrows the range to one point without 
actually adding any selectivity beyond the existing stats, we would still scale 
by `1 / distinct_count` here, even though no rows were filtered out, and end up 
under-estimating the row count.
   
   Could we tighten this so the shortcut only applies when we really learned 
something equality-specific? One possible way would be to compare against 
`initial_interval` and only use this path when the initial interval was not 
already that same singleton. A regression test around an already-singleton 
input would also be really helpful here.
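
   To make the suggested guard concrete, here is a minimal sketch. It does not use the real DataFusion types: `Range`, `newly_collapsed`, and `column_selectivity` are hypothetical names, `Range` is a closed integer range standing in for `Interval`, and the width ratio stands in for `cardinality_ratio`. It assumes `lower <= upper` and that the target range is contained in the initial one.

```rust
/// Simplified stand-in for an interval: a closed integer range.
/// Hypothetical type for illustration, not DataFusion's `Interval`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Range {
    lower: i64,
    upper: i64,
}

impl Range {
    /// True when the range contains exactly one value.
    fn is_singleton(&self) -> bool {
        self.lower == self.upper
    }
}

/// True only when pruning newly collapsed the range to one point:
/// the target is a singleton and the initial range was not already
/// that same singleton.
fn newly_collapsed(initial: &Range, target: &Range) -> bool {
    target.is_singleton() && initial != target
}

/// Selectivity contribution of a single column (hypothetical helper).
fn column_selectivity(initial: &Range, target: &Range, distinct_count: Option<u64>) -> f64 {
    match distinct_count {
        // Equality-style shortcut: only when we actually learned something.
        Some(ndv) if ndv > 0 && newly_collapsed(initial, target) => 1.0 / ndv as f64,
        // Fallback: ratio of range widths, standing in for `cardinality_ratio`.
        _ => {
            let initial_width = (initial.upper - initial.lower + 1) as f64;
            let target_width = (target.upper - target.lower + 1) as f64;
            target_width / initial_width
        }
    }
}
```

   With this guard, an input that is already the singleton `[5, 5]` keeps a selectivity of 1.0 instead of being scaled down by `1 / distinct_count`, which is the regression case mentioned above.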



##########
datafusion/physical-expr/src/analysis.rs:
##########
@@ -277,8 +278,18 @@ fn calculate_selectivity(
     let mut acc: f64 = 1.0;
     for (initial, target) in initial_boundaries.iter().zip(target_boundaries) {
         match (initial.interval.as_ref(), target.interval.as_ref()) {
-            (Some(initial), Some(target)) => {
-                acc *= cardinality_ratio(initial, target);
+            (Some(initial_interval), Some(target_interval)) => {
+                // If it is equality predicate, calculate selectivity as `1 / distinct_count`

Review Comment:
   Small readability suggestion: would it make sense to move this 
singleton-selectivity logic into a tiny helper, maybe something like 
`singleton_selectivity(initial_interval, target_interval, distinct_count)`?
   
   I think that would make `calculate_selectivity` a bit easier to scan, and it 
would give the equality-vs-singleton rules a single place to live once the 
edge-case handling is tightened up.
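
   One possible shape for such a helper is sketched below, purely as an illustration. `Bounds`, `width_ratio`, and `calculate_selectivity_sketch` are hypothetical stand-ins (for `Interval`, `cardinality_ratio`, and the real loop), and the `Option<f64>` return type is an assumption: `None` means "the shortcut does not apply, fall back to the ratio".

```rust
/// Hypothetical closed integer range standing in for an interval.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounds {
    lower: i64,
    upper: i64,
}

/// Returns `Some(1 / distinct_count)` only when the target collapsed to a
/// single value that the initial bounds did not already describe.
/// Returns `None` when the shortcut does not apply.
fn singleton_selectivity(
    initial_interval: &Bounds,
    target_interval: &Bounds,
    distinct_count: Option<u64>,
) -> Option<f64> {
    let is_singleton = target_interval.lower == target_interval.upper;
    if !is_singleton || initial_interval == target_interval {
        return None;
    }
    // A missing or zero distinct count also disables the shortcut.
    distinct_count.filter(|ndv| *ndv > 0).map(|ndv| 1.0 / ndv as f64)
}

/// Width ratio used as the fallback (stand-in for `cardinality_ratio`).
fn width_ratio(initial: &Bounds, target: &Bounds) -> f64 {
    let initial_width = (initial.upper - initial.lower + 1) as f64;
    let target_width = (target.upper - target.lower + 1) as f64;
    target_width / initial_width
}

/// The main loop then reads as "try the shortcut, else use the ratio".
/// Each tuple is (initial bounds, target bounds, distinct count).
fn calculate_selectivity_sketch(columns: &[(Bounds, Bounds, Option<u64>)]) -> f64 {
    let mut acc = 1.0;
    for (initial, target, distinct_count) in columns {
        acc *= singleton_selectivity(initial, target, *distinct_count)
            .unwrap_or_else(|| width_ratio(initial, target));
    }
    acc
}
```

   Keeping the decision inside one function means any later tightening of the equality-vs-singleton rules changes only `singleton_selectivity`, not the loop.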



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

