adriangb commented on code in PR #14567: URL: https://github.com/apache/datafusion/pull/14567#discussion_r1949194705
########## datafusion/physical-optimizer/src/pruning.rs: ########## @@ -1710,6 +1717,56 @@ fn build_like_match( Some(combined) } +// For predicate `col NOT LIKE 'foo%'`, we rewrite it as `(col_min NOT LIKE 'foo%' OR col_max NOT LIKE 'foo%')`. If both col_min and col_max have the prefix foo, we skip the entire row group (as we can be certain that all data in this row group has the prefix foo). +fn build_not_like_match( + expr_builder: &mut PruningExpressionBuilder<'_>, +) -> Option<Arc<dyn PhysicalExpr>> { + // col NOT LIKE 'prefix%' -> !(col_min LIKE 'prefix%' && col_max LIKE 'prefix%') -> (col_min NOT LIKE 'prefix%' || col_max NOT LIKE 'prefix%') + + let min_column_expr = expr_builder.min_column_expr().ok()?; + let max_column_expr = expr_builder.max_column_expr().ok()?; + + let scalar_expr = expr_builder.scalar_expr(); + + let pattern = extract_string_literal(scalar_expr)?; + + let chars: Vec<char> = pattern.chars().collect(); + for i in 0..chars.len() - 1 { + // Check if current char is a wildcard and is not escaped with backslash + if (chars[i] == '%' || chars[i] == '_') && (i == 0 || chars[i - 1] != '\\') { + // Example: For pattern "foo%bar", the row group might include values like + // ["foobar", "food", "foodbar"], making it unsafe to prune. Review Comment: Could you expand upon the examples here? My naive gut instinct (which I think is wrong, I tried before 😓) is that it should be able to truncate `foo%bar` to `foo%`. Maybe an example that can show what would go wrong if you tried that? ########## datafusion/core/tests/fuzz_cases/pruning.rs: ########## @@ -110,13 +110,27 @@ async fn test_utf8_not_like_prefix() { .await; } +#[tokio::test] +async fn test_utf8_not_like_ecsape() { + Utf8Test::new(|value| col("a").not_like(lit(format!("\\%{}%", value)))) + .run() + .await; +} Review Comment: Love these fuzz tests! These 3 lines of code give me great confidence that this PR does the right thing 😄 ########## datafusion/physical-optimizer/src/pruning.rs: ########## @@ -1590,6 +1590,13 @@ fn build_statistics_expr( )), )) } + Operator::NotLikeMatch => { + build_not_like_match(expr_builder).ok_or_else(|| { + plan_datafusion_err!( + "The NOT LIKE expression with wildcards is only supported at the end of the pattern" Review Comment: Just a general note, not necessarily for this PR: I do think we should somehow make these errors (also for the existing LIKE case) a bit better if there is any other reason they might be rejected? Or if this is the only reason, a more generic error might be better? I realize these errors likely never surface to users but I think it'd be unfortunate to see an error like this one when the actual reason it wasn't supported has nothing to do with it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org