adriangb commented on code in PR #14297:
URL: https://github.com/apache/datafusion/pull/14297#discussion_r1929591473


##########
datafusion/physical-optimizer/src/pruning.rs:
##########
@@ -819,16 +813,24 @@ impl RequiredColumns {
     /// statistics column, while keeping track that a reference to the 
statistics
     /// column is required
     ///
-    /// for example, an expression like `col("foo") > 5`, when called
+    /// For example, an expression like `col("foo") > 5`, when called
     /// with Max would result in an expression like `col("foo_max") >
-    /// 5` with the appropriate entry noted in self.columns
+    /// 5` with the appropriate entry noted in self.columns.
+    ///
+    /// Along with a reference to the min/max/null_count/row_count column
+    /// an expression for `{stats_col} IS NULL` is returned that should be 
`OR`ed
+    /// with any expressions applied to the column itself such that if any 
value in `{stats_col}`
+    /// is `null` we return `true` and not `null` because `null` is falsy, but 
we'd need to scan that
+    /// container, causing confusion and preventing optimizations because of 
the inverted logic!
+    ///
+    /// Thus for the example above we'd arrive at the final expression 
`foo_max is null or foo_max > 5`.
     fn stat_column_expr(
         &mut self,
         column: &phys_expr::Column,
         column_expr: &Arc<dyn PhysicalExpr>,
         field: &Field,
         stat_type: StatisticsType,
-    ) -> Result<Arc<dyn PhysicalExpr>> {
+    ) -> Result<(Arc<dyn PhysicalExpr>, Arc<dyn PhysicalExpr>)> {

Review Comment:
   The idea here is to force callers to do something with the null check 
expression, or explcitly discard it.
   This let me update all call sites with confidence of not having missed one, 
in addition to nicely abstracting away the logic.



##########
datafusion/physical-optimizer/src/pruning.rs:
##########
@@ -258,9 +258,7 @@ pub trait PruningStatistics {
 ///
 /// * Rows that evaluate to `false` are not returned (“filtered out” or 
“pruned” or “skipped”).
 ///
-/// * Rows that evaluate to `NULL` are **NOT** returned (also “filtered out”).
-///   Note: *this treatment of `NULL` is **DIFFERENT** than how `NULL` is 
treated
-///   in the rewritten predicate described below.*
+/// * No rows should evaluate to `null`, even if statistics are missing (the 
statistics are themselves `null`).

Review Comment:
   I have much more docstrings to update, including examples, but I will hold 
off on that until I get feedback on the rest of the PR.



##########
datafusion/physical-optimizer/src/pruning.rs:
##########
@@ -2785,7 +2893,7 @@ mod tests {
             vec![lit(1), lit(2), lit(3)],
             false,
         ));
-        let expected_expr = "c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 
1 AND 1 <= c1_max@1 OR c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 2 AND 
2 <= c1_max@1 OR c1_null_count@2 != c1_row_count@3 AND c1_min@0 <= 3 AND 3 <= 
c1_max@1";
+        let expected_expr = "(c1_null_count@2 IS NULL OR c1_row_count@3 IS 
NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 
1) AND (c1_max@1 IS NULL OR 1 <= c1_max@1) OR (c1_null_count@2 IS NULL OR 
c1_row_count@3 IS NULL OR c1_null_count@2 != c1_row_count@3) AND (c1_min@0 IS 
NULL OR c1_min@0 <= 2) AND (c1_max@1 IS NULL OR 2 <= c1_max@1) OR 
(c1_null_count@2 IS NULL OR c1_row_count@3 IS NULL OR c1_null_count@2 != 
c1_row_count@3) AND (c1_min@0 IS NULL OR c1_min@0 <= 3) AND (c1_max@1 IS NULL 
OR 3 <= c1_max@1)";

Review Comment:
   These expressions are long and gross... I wonder if we shouldn't do some 
combination of formatting the sql and using insta for snapshotting? Or can we 
serde dump the expressions and pretty format the json?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to