alamb commented on code in PR #18789:
URL: https://github.com/apache/datafusion/pull/18789#discussion_r2561161409
##########
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:
##########
@@ -1956,6 +1962,36 @@ impl<S: SimplifyInfo> TreeNodeRewriter for Simplifier<'_, S> {
}))
}
+ // =======================================
Review Comment:
> Probably we can refactor all the operators like +, -, < to use ScalarUDF
> interface? Fundamentally they're all nodes in the expression tree.
> The refactoring process can get a bit painful, but it's appealing to see a
> unified interface for all the expressions in the long term.

Yes, in the long term making all operators use ScalarUDF might make sense.
However, it would be pretty painful, as `Operator` is passed around in many
different places and we would have to extract it from all of them.
I have a suggestion for a slightly different API following ClickHouse's
model (`preimage`) that I think might make more sense for this particular case.
##########
datafusion/sqllogictest/test_files/udf_preimage.slt:
##########
@@ -0,0 +1,170 @@
+# Licensed to the Apache Software Foundation (ASF) under one
Review Comment:
I haven't reviewed the test coverage yet, but can we please make sure we have
ported all the corresponding tests from DuckDB (both the initial PR and the fix):
- https://github.com/duckdb/duckdb/pull/18457
- https://github.com/duckdb/duckdb/pull/19628
We can probably copy most of the tests from the following files (they use
sqllogictests as well):
https://github.com/duckdb/duckdb/blob/main/test/optimizer/date_trunc_simplification_icu.test
https://github.com/duckdb/duckdb/blob/main/test/optimizer/date_trunc_simplification.test
##########
datafusion/optimizer/src/simplify_expressions/udf_preimage.rs:
##########
@@ -0,0 +1,407 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::str::FromStr;
+
+use arrow::compute::kernels::cast_utils::IntervalUnit;
+use datafusion_common::{internal_err, tree_node::Transformed, Result, ScalarValue};
+use datafusion_expr::{
+ and, expr::ScalarFunction, lit, or, simplify::SimplifyInfo, BinaryExpr, Expr,
+ Operator, ScalarUDFImpl,
+};
+use datafusion_functions::datetime::date_part::DatePartFunc;
+
+pub(super) fn preimage_in_comparison_for_binary(
+ info: &dyn SimplifyInfo,
+ udf_expr: Expr,
+ literal: Expr,
+ op: Operator,
+) -> Result<Transformed<Expr>> {
+ let (func, args, lit_value) = match (udf_expr, literal) {
+ (
+ Expr::ScalarFunction(ScalarFunction { func, args }),
+ Expr::Literal(lit_value, _),
+ ) => (func, args, lit_value),
+ _ => return internal_err!("Expect date_part expr and literal"),
+ };
+ let expr = Box::new(args[1].clone());
Review Comment:
You can assume that by this time the arguments will have been properly
coerced and type checked.
##########
datafusion/expr/src/udf.rs:
##########
@@ -696,6 +697,35 @@ pub trait ScalarUDFImpl: Debug + DynEq + DynHash + Send + Sync {
Ok(ExprSimplifyResult::Original(args))
}
+ /// Attempts to convert a literal value to the corresponding datatype
+ /// of a column expression so that a **preimage** can be computed for
+ /// pruning comparison predicates.
+ ///
+ /// This is used during predicate-pushdown optimization
+ /// (see `datafusion-optimizer-udf_preimage::preimage_in_comparison_for_binary`)
+ ///
+ /// Currently is only implemented by:
+ /// - `date_part(YEAR, expr)`
+ ///
+ /// # Arguments:
+ /// * `lit_value`: The literal `&ScalarValue` used in comparison
+ /// * `target_type`: The datatype of the column expression inside the function
+ /// * `op`: The comparison `Operator` (e.g. `=`, `<`, `>=`).
+ ///
+ /// # Returns
+ ///
+ /// Returns a `ScalarValue` converted to the appropriate target type if a
+ /// preimage cast is supported for the given function/operator combination;
+ /// otherwise returns `None`.
+ fn preimage_cast(
Review Comment:
I dug around in ClickHouse and here is their corresponding API:
https://github.com/ClickHouse/ClickHouse/blob/9423252332fca10974a77c51273cf722f562bf4f/src/Functions/IFunction.h#L305-L308
And here is the implementation for the date/time transforms
https://github.com/ClickHouse/ClickHouse/blob/9423252332fca10974a77c51273cf722f562bf4f/src/Functions/DateTimeTransforms.h#L1406
https://github.com/ClickHouse/ClickHouse/blob/9423252332fca10974a77c51273cf722f562bf4f/src/Functions/DateTimeTransforms.h#L2286
And here is how it is used:
https://github.com/ClickHouse/ClickHouse/blob/9423252332fca10974a77c51273cf722f562bf4f/src/Analyzer/Passes/OptimizeDateOrDateTimeConverterWithPreimagePass.cpp#L126-L139
So in other words, perhaps we could make this API return an
[`Interval`](https://docs.rs/datafusion/latest/datafusion/logical_expr/interval_arithmetic/struct.Interval.html)
rather than doing the entire rewrite inside the function.
Something like this perhaps:
```rust
fn preimage(&self, args: &[Expr]) -> Option<Interval> {
None
}
```
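As a rough illustration of what such a `preimage` could return for `date_part('year', col) = 2024` on a `Date32` column, here is a minimal std-only sketch. `ClosedRange` and `year_preimage` are hypothetical names standing in for `Interval` and the trait method (they are not DataFusion APIs), the 1..=9999 year limit is an assumption, and bounds are expressed as days since 1970-01-01:

```rust
/// Hypothetical stand-in for DataFusion's `Interval`: a closed range
/// `[lower, upper]` of Date32 values (days since 1970-01-01).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ClosedRange {
    lower: i64,
    upper: i64,
}

/// Days since 1970-01-01 for a proleptic Gregorian date
/// (the well-known `days_from_civil` algorithm).
fn days_from_civil(year: i64, month: i64, day: i64) -> i64 {
    // Treat March as the first month so the leap day falls at year end.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let year_of_era = y.rem_euclid(400); // [0, 399]
    let shifted_month = (month + 9) % 12; // March = 0, ..., February = 11
    let day_of_year = (153 * shifted_month + 2) / 5 + day - 1; // [0, 365]
    let day_of_era =
        year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
    era * 146_097 + day_of_era - 719_468
}

/// Hypothetical preimage of `date_part('year', d) = year`: every date `d`
/// from January 1st through December 31st of that year.
fn year_preimage(year: i64) -> Option<ClosedRange> {
    // Assumption: only years 1..=9999 are supported; anything else
    // reports "no preimage" so the caller leaves the expression alone.
    if !(1..=9999).contains(&year) {
        return None;
    }
    Some(ClosedRange {
        lower: days_from_civil(year, 1, 1),
        upper: days_from_civil(year + 1, 1, 1) - 1,
    })
}
```

With a closed range like this, the simplifier only needs the two bounds and can build the comparison on the bare column itself. Timestamps would probably want a half-open upper bound instead.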
##########
datafusion/optimizer/src/simplify_expressions/udf_preimage.rs:
##########
@@ -0,0 +1,407 @@
+use std::str::FromStr;
+
+use arrow::compute::kernels::cast_utils::IntervalUnit;
+use datafusion_common::{internal_err, tree_node::Transformed, Result, ScalarValue};
+use datafusion_expr::{
+ and, expr::ScalarFunction, lit, or, simplify::SimplifyInfo, BinaryExpr, Expr,
+ Operator, ScalarUDFImpl,
+};
+use datafusion_functions::datetime::date_part::DatePartFunc;
+
+pub(super) fn preimage_in_comparison_for_binary(
+ info: &dyn SimplifyInfo,
+ udf_expr: Expr,
+ literal: Expr,
+ op: Operator,
+) -> Result<Transformed<Expr>> {
+ let (func, args, lit_value) = match (udf_expr, literal) {
+ (
+ Expr::ScalarFunction(ScalarFunction { func, args }),
+ Expr::Literal(lit_value, _),
+ ) => (func, args, lit_value),
+ _ => return internal_err!("Expect date_part expr and literal"),
+ };
+ let expr = Box::new(args[1].clone());
+
+ let Ok(expr_type) = info.get_data_type(&expr) else {
+ return internal_err!("Can't get the data type of the expr {:?}", &expr);
+ };
+
+ let preimage_func = match func.name() {
+ "date_part" => DatePartFunc::new(),
+ _ => return internal_err!("Preimage is not supported for {:?}", func.name()),
+ };
+
+ let rewritten_expr = match op {
+ Operator::Lt | Operator::GtEq => {
+ let v = match preimage_func.preimage_cast(&lit_value, &expr_type, op) {
+ Some(v) => v,
+ None => {
+ return internal_err!("Could not cast literal to the column type")
+ }
+ };
Review Comment:
Yeah, I think you could avoid doing so by following the pattern in DuckDB or
ClickHouse.
I think it goes something like
```
x = <expr>
```
If you can get a
[pre-image](https://en.wikipedia.org/wiki/Image_(mathematics)#Image_of_a_function)
for `<expr>` (that is, you know the single contiguous range of `x` values for
which the comparison holds), then you can rewrite this to
```
x >= preimage_min && x <= preimage_max
```
And similarly for other operators 🤔
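
To make "similarly for other operators" concrete, here is a small std-only sketch of the full operator table. It assumes the function is monotonically non-decreasing and that the literal's preimage is one closed range `[min, max]`. `Op`, `Pred`, `compare` and `rewrite` are hypothetical names for illustration, not DataFusion types:

```rust
/// Comparison operators that can appear in `f(x) <op> literal`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Eq,
    NotEq,
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// A tiny predicate over the bare column `x` (hypothetical, for illustration).
#[derive(Debug, PartialEq)]
enum Pred {
    Cmp(Op, i64),
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

/// Evaluates `lhs <op> rhs`.
fn compare(lhs: i64, op: Op, rhs: i64) -> bool {
    match op {
        Op::Eq => lhs == rhs,
        Op::NotEq => lhs != rhs,
        Op::Lt => lhs < rhs,
        Op::LtEq => lhs <= rhs,
        Op::Gt => lhs > rhs,
        Op::GtEq => lhs >= rhs,
    }
}

impl Pred {
    /// Evaluates the predicate for a concrete column value `x`.
    fn eval(&self, x: i64) -> bool {
        match self {
            Pred::Cmp(op, bound) => compare(x, *op, *bound),
            Pred::And(l, r) => l.eval(x) && r.eval(x),
            Pred::Or(l, r) => l.eval(x) || r.eval(x),
        }
    }
}

/// Rewrites `f(x) <op> v` into a predicate on `x` alone, given the closed
/// preimage `[min, max]` of `v` under a monotonically non-decreasing `f`.
fn rewrite(op: Op, min: i64, max: i64) -> Pred {
    let cmp = |op, bound| Box::new(Pred::Cmp(op, bound));
    match op {
        Op::Eq => Pred::And(cmp(Op::GtEq, min), cmp(Op::LtEq, max)),
        Op::NotEq => Pred::Or(cmp(Op::Lt, min), cmp(Op::Gt, max)),
        Op::Lt => Pred::Cmp(Op::Lt, min),
        Op::LtEq => Pred::Cmp(Op::LtEq, max),
        Op::Gt => Pred::Cmp(Op::Gt, max),
        Op::GtEq => Pred::Cmp(Op::GtEq, min),
    }
}
```

For example, with `f(x) = x / 10` the preimage of `3` is `[30, 39]`, so `f(x) < 3` becomes `x < 30` and `f(x) <= 3` becomes `x <= 39`. NULL handling and literals with an empty preimage are left out of this sketch.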