UBarney commented on code in PR #14567:
URL: https://github.com/apache/datafusion/pull/14567#discussion_r1952755081


##########
datafusion/physical-optimizer/src/pruning.rs:
##########
@@ -1710,6 +1711,76 @@ fn build_like_match(
     Some(combined)
 }
 
+// For predicate `col NOT LIKE 'const_prefix%'`, we rewrite it as `(col_min NOT LIKE 'const_prefix%' OR col_max NOT LIKE 'const_prefix%')`. If both col_min and col_max have the prefix const_prefix, we skip the entire row group (as we can be certain that all data in this row group has the prefix const_prefix).
+fn build_not_like_match(
+    expr_builder: &mut PruningExpressionBuilder<'_>,
+) -> Result<Arc<dyn PhysicalExpr>> {
+    // col NOT LIKE 'const_prefix%' -> !(col_min LIKE 'const_prefix%' && col_max LIKE 'const_prefix%') -> (col_min NOT LIKE 'const_prefix%' || col_max NOT LIKE 'const_prefix%')

Review Comment:
   > this is not true either
   
   Why? Does it prune row groups that contain data that does not match the 
pattern? Or is it just inefficient?  
   
   First, let's clarify one point: if a row group contains any data that does 
not match the pattern, then this row group **must not** be pruned. The expected 
behavior is to return all data that does not match. If such a row group gets 
pruned, those rows are silently dropped from the query result (see [this 
PR](https://github.com/apache/datafusion/pull/561)).  
   
   ```
   For `col NOT LIKE 'const_prefix%'`:  
   It applies the condition:  
   `col_max < 'const_prefix' OR col_min >= 'const_prefiy'`  
   (Note that the last character of the rightmost constant is different.)  
   ```
   
   I think this mistakenly prunes row groups that contain data not matching the 
pattern.  
   
   Consider this case: `col NOT LIKE 'const_prefix%'`. The row group might 
include values such as `["aaa", "b", "const_prefix"]`, so `col_min` is `"aaa"` 
and `col_max` is `"const_prefix"`. Both halves of `col_max < 'const_prefix' OR 
col_min >= 'const_prefiy'` are then false, so the condition evaluates to 
`false`, causing the row group to be pruned. However, this is incorrect because 
`"aaa"` does not match the pattern and should be included in the result.
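   
   To make the counterexample easy to check, here is a minimal standalone 
sketch. The function names `keep_by_range` and `keep_by_min_max_prefix` are 
hypothetical and are not DataFusion APIs. Plain byte-wise `&str` comparison and 
`starts_with` stand in for the physical expressions, and the pattern is assumed 
to be a literal prefix followed by a single trailing `%`.
   
   ```rust
   /// The range-based rule quoted above: keep the row group only if
   /// `col_max < prefix OR col_min >= next_prefix`.
   /// (Hypothetical helper, for illustration only.)
   fn keep_by_range(col_min: &str, col_max: &str, prefix: &str, next_prefix: &str) -> bool {
       col_max < prefix || col_min >= next_prefix
   }

   /// The rule from the code comment in this PR: keep the row group unless
   /// both col_min and col_max start with the prefix.
   /// (Hypothetical helper, for illustration only.)
   fn keep_by_min_max_prefix(col_min: &str, col_max: &str, prefix: &str) -> bool {
       !col_min.starts_with(prefix) || !col_max.starts_with(prefix)
   }

   fn main() {
       let values = ["aaa", "b", "const_prefix"];
       let col_min = *values.iter().min().unwrap();
       let col_max = *values.iter().max().unwrap();
       let prefix = "const_prefix";

       // Ground truth: does any row satisfy `col NOT LIKE 'const_prefix%'`?
       let any_row_passes = values.iter().any(|v| !v.starts_with(prefix));

       println!("min = {col_min:?}, max = {col_max:?}");
       println!("some row passes the filter: {any_row_passes}");
       println!(
           "range rule keeps row group:   {}",
           keep_by_range(col_min, col_max, prefix, "const_prefiy")
       );
       println!(
           "min/max prefix rule keeps it: {}",
           keep_by_min_max_prefix(col_min, col_max, prefix)
       );
   }
   ```
   
   For this row group the range rule returns `false` (prune) even though 
`"aaa"` passes the filter, while the min/max prefix rule returns `true` (keep).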
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

