kosiew commented on code in PR #20789:
URL: https://github.com/apache/datafusion/pull/20789#discussion_r2964834917
##########
datafusion/physical-expr/src/analysis.rs:
##########
@@ -277,8 +278,18 @@ fn calculate_selectivity(
let mut acc: f64 = 1.0;
for (initial, target) in initial_boundaries.iter().zip(target_boundaries) {
match (initial.interval.as_ref(), target.interval.as_ref()) {
- (Some(initial), Some(target)) => {
- acc *= cardinality_ratio(initial, target);
+ (Some(initial_interval), Some(target_interval)) => {
+ // If it is equality predicate, calculate selectivity as `1 / distinct_count`
Review Comment:
I think this new `1 / distinct_count` branch is a little too broad as
written. Right now it fires whenever the pruned interval collapses to a single
value, but that is not quite the same thing as proving we have an equality
filter.
For example, if the incoming stats already describe a singleton interval, or
if a conjunction of inequalities narrows the range to one point without
actually adding any selectivity beyond the existing stats, we would still scale
by `1 / NDV` here and end up under-estimating the row count.
Could we tighten this so the shortcut only applies when we really learned
something equality-specific? One possible way would be to compare against
`initial_interval` and only use this path when the initial interval was not
already that same singleton. A regression test around an already-singleton
input would also be really helpful here.
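To make the suggested guard concrete, here is a rough sketch. It uses a toy integer interval rather than DataFusion's real `Interval` type, and `ToyInterval` and `newly_singleton` are hypothetical names for illustration only, not existing APIs:

```rust
/// Toy closed interval over integers; a hypothetical stand-in for
/// DataFusion's `Interval`, used only to illustrate the check.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ToyInterval {
    lower: i64,
    upper: i64,
}

impl ToyInterval {
    /// True when the interval contains exactly one value.
    fn is_singleton(&self) -> bool {
        self.lower == self.upper
    }
}

/// Returns true only when pruning narrowed the range down to a single point
/// that the incoming stats were not already pinned to.
fn newly_singleton(initial: &ToyInterval, target: &ToyInterval) -> bool {
    target.is_singleton() && initial != target
}
```

With a check like this, an already-singleton input such as `[5, 5]` pruned to `[5, 5]` would skip the `1 / NDV` shortcut, which is the case the regression test should cover.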
##########
datafusion/physical-expr/src/analysis.rs:
##########
@@ -277,8 +278,18 @@ fn calculate_selectivity(
let mut acc: f64 = 1.0;
for (initial, target) in initial_boundaries.iter().zip(target_boundaries) {
match (initial.interval.as_ref(), target.interval.as_ref()) {
- (Some(initial), Some(target)) => {
- acc *= cardinality_ratio(initial, target);
+ (Some(initial_interval), Some(target_interval)) => {
+ // If it is equality predicate, calculate selectivity as `1 / distinct_count`
Review Comment:
Small readability suggestion: would it make sense to move this
singleton-selectivity logic into a tiny helper, maybe something like
`singleton_selectivity(initial_interval, target_interval, distinct_count)`?
I think that would make `calculate_selectivity` a bit easier to scan, and it
would give the equality-vs-singleton rules a single place to live once the
edge-case handling is tightened up.
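As a sketch of the shape such a helper could take: the `Bounds` alias and the `Option<usize>` distinct count below are assumptions made for illustration, and the real signature would use the actual interval and statistics types from the PR.

```rust
/// Hypothetical stand-in for an interval: inclusive (lower, upper) bounds.
type Bounds = (i64, i64);

/// Returns `Some(1 / distinct_count)` only when `target` is a single point
/// that `initial` was not already pinned to and a non-zero distinct count is
/// known. Returns `None` otherwise, so the caller can fall back to the
/// generic ratio.
fn singleton_selectivity(
    initial: Bounds,
    target: Bounds,
    distinct_count: Option<usize>,
) -> Option<f64> {
    let target_is_singleton = target.0 == target.1;
    if !target_is_singleton || initial == target {
        return None;
    }
    match distinct_count {
        Some(ndv) if ndv > 0 => Some(1.0 / ndv as f64),
        _ => None,
    }
}
```

Returning an `Option` would let `calculate_selectivity` keep a single fallback to `cardinality_ratio` when the helper yields `None`, though the exact wiring is up to you.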
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]