findepi commented on code in PR #17476: URL: https://github.com/apache/datafusion/pull/17476#discussion_r2335808457
########## datafusion/physical-plan/src/joins/utils.rs: ########## @@ -563,11 +563,14 @@ fn estimate_inner_join_cardinality( .iter() .zip(right_stats.column_statistics.iter()) { - // Break if any of statistics bounds are undefined - if left_stat.min_value.get_value().is_none() - || left_stat.max_value.get_value().is_none() - || right_stat.min_value.get_value().is_none() - || right_stat.max_value.get_value().is_none() + // Break if we don't have enough information to calculate a distinct count + // If distinct_count isn't provided directly, we need min and max to be provided + if (left_stat.distinct_count.get_value().is_none() + && (left_stat.min_value.get_value().is_none() + || left_stat.max_value.get_value().is_none())) + || (right_stat.distinct_count.get_value().is_none() + && (right_stat.min_value.get_value().is_none() + || right_stat.max_value.get_value().is_none())) Review Comment: Would this be easier to read? ```suggestion let has_enough_info = |stat: &ColumnStatistics| { stat.distinct_count.get_value().is_some() || (stat.min_value.get_value().is_some() && stat.max_value.get_value().is_some()) }; if !has_enough_info(left_stat) || !has_enough_info(right_stat) ``` ########## datafusion/physical-plan/src/joins/utils.rs: ########## @@ -2016,20 +2019,20 @@ mod tests { ), // When we have distinct count. ( - (10, Inexact(1), Inexact(10), Inexact(10), Absent), - (10, Inexact(1), Inexact(10), Inexact(10), Absent), Review Comment: Why change existing test cases? Maybe let's just add new ones? ########## datafusion/physical-plan/src/joins/utils.rs: ########## @@ -2071,18 +2074,13 @@ mod tests { ), // No min or max (or both). ( - (10, Absent, Absent, Inexact(3), Absent), - (10, Absent, Absent, Inexact(3), Absent), - None, - ), - ( - (10, Inexact(2), Absent, Inexact(3), Absent), - (10, Absent, Inexact(5), Inexact(3), Absent), + (10, Absent, Absent, Absent, Absent), + (10, Absent, Absent, Absent, Absent), Review Comment: This became exact same as "No column level stats." input (one above). Let's not take changing inputs in existing test cases lightly. If they are mislabelled, perhaps let's just fix the labels -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org