UBarney commented on code in PR #14567: URL: https://github.com/apache/datafusion/pull/14567#discussion_r1952755081
##########
datafusion/physical-optimizer/src/pruning.rs:
##########
@@ -1710,6 +1711,76 @@ fn build_like_match(
     Some(combined)
 }
 
+// For predicate `col NOT LIKE 'const_prefix%'`, we rewrite it as `(col_min NOT LIKE 'const_prefix%' OR col_max NOT LIKE 'const_prefix%')`. If both col_min and col_max have the prefix const_prefix, we skip the entire row group (as we can be certain that all data in this row group has the prefix const_prefix).
+fn build_not_like_match(
+    expr_builder: &mut PruningExpressionBuilder<'_>,
+) -> Result<Arc<dyn PhysicalExpr>> {
+    // col NOT LIKE 'const_prefix%' -> !(col_min LIKE 'const_prefix%' && col_max LIKE 'const_prefix%') -> (col_min NOT LIKE 'const_prefix%' || col_max NOT LIKE 'const_prefix%')

Review Comment:
   > this is not true either
   
   Why? Does it prune row groups that contain data that does not match the pattern? Or is it just inefficient?
   
   First, let's clarify one point: if a row group contains any data that does not match the pattern, then this row group **must not** be pruned. The expected behavior is to return all data that does not match. If the row group gets pruned, data loss will occur (see [this PR](https://github.com/apache/datafusion/pull/561)).
   
   ```
   For `col NOT LIKE 'const_prefix%'`:
   It applies the condition: `col_max < 'const_prefix' OR col_min >= 'const_prefiy'`
   (Note that the last character of the rightmost constant is different.)
   ```
   
   I think this mistakenly prunes row groups that contain data not matching the pattern.
   
   Consider this case: `col NOT LIKE 'const_prefix%'`. The row group might include values such as `["aaa", "b", "const_prefix"]`. The condition `col_max < 'const_prefix' OR col_min >= 'const_prefiy'` evaluates to `false`, causing the row group to be pruned. However, this is incorrect because `"aaa"` does not match the pattern and should be included in the result.
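To make the counterexample concrete, here is a minimal standalone sketch (plain Rust, not DataFusion code) that evaluates both pruning conditions on the row group above. The helper names `keep_by_prefix_rule` and `keep_by_range_rule` are hypothetical and exist only for this illustration. It assumes byte-wise string ordering and a pattern whose only wildcard is the trailing `%`.

```rust
/// Hypothetical helper mirroring the rewrite in this PR:
/// `col_min NOT LIKE 'prefix%' OR col_max NOT LIKE 'prefix%'`.
/// Returns true if the row group must be kept (not pruned).
fn keep_by_prefix_rule(col_min: &str, col_max: &str, prefix: &str) -> bool {
    !col_min.starts_with(prefix) || !col_max.starts_with(prefix)
}

/// Hypothetical helper mirroring the alternative condition:
/// `col_max < lower OR col_min >= upper`, where `upper` is the prefix
/// with its last character incremented.
/// Returns true if the row group must be kept (not pruned).
fn keep_by_range_rule(col_min: &str, col_max: &str, lower: &str, upper: &str) -> bool {
    col_max < lower || col_min >= upper
}

fn main() {
    let prefix = "const_prefix";
    let upper = "const_prefiy";
    let row_group = ["aaa", "b", "const_prefix"];

    // Statistics that a Parquet-style row group would expose.
    let col_min = *row_group.iter().min().unwrap();
    let col_max = *row_group.iter().max().unwrap();

    // Rows the query `col NOT LIKE 'const_prefix%'` has to return.
    let expected: Vec<&str> = row_group
        .iter()
        .copied()
        .filter(|v| !v.starts_with(prefix))
        .collect();

    let keep_prefix = keep_by_prefix_rule(col_min, col_max, prefix);
    let keep_range = keep_by_range_rule(col_min, col_max, prefix, upper);

    println!("min = {col_min:?}, max = {col_max:?}");
    println!("rows that must be returned: {expected:?}");
    println!("prefix rule keeps row group: {keep_prefix}");
    println!("range rule keeps row group:  {keep_range}");

    // The row group holds non-matching rows, so a correct rule must keep it.
    assert!(!expected.is_empty());
    assert!(keep_prefix);
    // The range rule evaluates to false here, i.e. it would prune the group.
    assert!(!keep_range);
}
```

The prefix rule only prunes when both `col_min` and `col_max` start with the prefix, which under lexicographic ordering implies that every value in between does too. The range rule instead prunes whenever the `[col_min, col_max]` interval merely overlaps the prefix range, which is why it drops `"aaa"` and `"b"` in this case.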